Chapter 15: Parallel Computing


Briefly, parallel computing is a method of running code on multiple processors (or multiple cores of the same processor) at the same time. In general, this is a difficult task depending on where data is stored and retrieved. The Julia Documentation on parallel computing is a good place to start.

Let’s start with something relatively simple. Consider the following code:

function count_heads(n::Int)
    c::Int = 0
    for i = 1:n
        c += rand(Bool)
    end
    c
end

which mimics flipping a coin n times. We can simulate flipping 10 billion coins and finding the fraction that is heads by the following:

@time count_heads(10^10)/10^10

which returns

29.597828 seconds (6.25 k allocations: 263.473 KB)

0.5000025478

or about 30 seconds.

Running parallel code

We now wish to take advantage of the fact that today’s processors often have multiple cores and support multithreading. There are some helpful things in the Distributed package:

using Distributed

Let’s say that we wish to run it simultaneously on 2 cores (even if we know we have more). If we type

addprocs(2)

this would then give julia 3 “processors” in total: the original process plus the 2 worker processes just added.
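
We can verify that the workers were added with nprocs and workers, both part of the Distributed package. In a fresh session, after the call above this should show 3 processes with workers 2 and 3:

nprocs()     # total number of processes, here 3
workers()    # IDs of the worker processes, here [2, 3]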

There are a couple of ways to now run code. The first gives us more control, while the second allows for some easy code writing.

Assigning code to workers

First, we need to make sure that the function count_heads is available to each worker process. To do this, we will preface the function definition with the @everywhere macro:

@everywhere function count_heads(n::Int)
    c::Int = 0
    for i = 1:n
        c += rand(Bool)
    end
    c
end

Any function that you will need on multiple cores will need to be prefaced with @everywhere.

We can now send code to different cores. Type:

a = @spawn count_heads(5*10^9);
b = @spawn count_heads(5*10^9)

and you should notice that it doesn’t return the number of heads as expected, and it returns almost instantaneously. Don’t worry right now about the type of object that is returned.

Even though it returns quickly, the code is already running, and you can see this with your machine’s process manager (Activity Monitor on macOS).

To get the results, we type

fetch(a)+fetch(b)

and it will return the number of heads flipped.
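
Dividing by the total number of flips again gives the fraction of heads (fetching an already-fetched result just returns the cached value):

(fetch(a) + fetch(b)) / 10^10   # should be very close to 0.5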

To test the timing, put all three of these in a block:

a= @spawn count_heads(5*10^9);
b= @spawn count_heads(5*10^9);
@time fetch(a)+fetch(b)

and it should take about half the time from above.

Exercise

  1. Add two more cores to julia with the addprocs command.
  2. Rerun the @everywhere function count_heads code above.
  3. Create 4 spawning lines. Call them a, b, c and d and use 25*10^8 for each (note that 2.5*10^9 is a Float64, which count_heads will not accept).
  4. Time the sum of the four. (A sketch of one possible solution follows this list.)
  5. Note: if you truly have 4 cores, then you will see a further halving of the time. If not, you will probably only see the same results.
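
For reference, here is a sketch of one possible solution, assuming count_heads has been redefined with @everywhere after the new workers were added:

addprocs(2)    # two more workers, for 4 in total

a = @spawn count_heads(25*10^8);
b = @spawn count_heads(25*10^8);
c = @spawn count_heads(25*10^8);
d = @spawn count_heads(25*10^8);
@time fetch(a) + fetch(b) + fetch(c) + fetch(d)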

Adding all of the cores

Since most computers have multiple cores, this is helpful, but what about adding all of the cores? If we just call

addprocs()

then it adds one worker per available core (logical CPU thread), and it returns an array of the new worker IDs.
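
To confirm, nworkers (also part of Distributed) reports how many worker processes are running:

nworkers()   # e.g. 8 on a 4-core machine with hyperthreading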

Parallel for loops

One issue with what we did above is that we have to think about spawning to individual processors or cores. That is fairly annoying. Fortunately, another helpful feature of julia is that of a parallel for loop. Try the following code:

@time let
  nheads = @distributed (+) for i = 1:10^10
    Int(rand(Bool))
  end
end

where julia distributes the iterations of the for loop across the available processors. The time should be a fraction of that of the original head counter, where the fraction is roughly one over the number of available processors or cores.

Writing a parallel card simulator

For this, we need to reload the PlayingCards module:

include("PlayingCards.jl")
using .PlayingCards, Random

(or you may need to put in a different path to your module)

If we have the following code to simulate a large number of hands:

function count_hands(trials::Int, f::Function)
    local deck = map(Card, 1:52)
    local numhands = 0
    for i = 1:trials
        shuffle!(deck)
        h = Hand(deck[1:5])
        if f(h)
            numhands += 1
        end
    end
    numhands
end

And let’s just check to make sure it works:

@time count_hands(10_000_000,is_full_house)

and after about 10 seconds, this returns 14438. (As a rough check, the theoretical probability of a full house is 3744/2598960 ≈ 0.00144, so about 14406 full houses are expected in 10 million deals.)

We now wish to write a parallelized version of this in which we will replace the for loop with a distributed for loop.
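
One caveat: each worker also needs the PlayingCards module (and Random for shuffle!). A minimal way to arrange this, assuming the same file path as above, is to repeat the loading with @everywhere:

@everywhere include("PlayingCards.jl")
@everywhere using .PlayingCards, Random

With that in place, the parallel version is: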

function para_count_hands(trials::Integer, f::Function)
  local deck = map(Card, 1:52)
  # shuffle the deck and test the top five cards with f
  function check(f::Function)
    shuffle!(deck)
    f(Hand(deck[1:5]))
  end
  @distributed (+) for i = 1:trials
    Int(check(f))
  end
end

and running this:

@time fh = para_count_hands(10_000_000,is_full_house)

and this returns:

2.121102 seconds (111.70 k allocations: 5.521 MiB, 0.28% gc time)

14390

which cuts the time by a factor of almost 5.

A Parallel Map Function

In Chapter 5, we saw the map function, which is the standard way to apply the same function to each element of an array, returning the array of results. If the function applied to each element is costly and you have multiple processors/cores to work with, doing this in parallel can be helpful.
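
As a quick reminder, here is map applied serially to a small array:

map(x -> x^2, [1, 2, 3])   # returns [1, 4, 9]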

We demonstrate how to do this with just the coin flip right now using the pmap function.

The following array will determine the number of coin flips to do:

num_coins = 1_000_000_000*ones(Int64,12)

so each element is 1 billion. We wish to run the coin flipper on each element of the array. Don’t forget that the count_heads function must have been declared with @everywhere. The following runs the function.

@time pmap(count_heads,num_coins)

and the resulting time is:

5.779404 seconds (1.03 k allocations: 83.984 KiB)

In comparison, if we run the regular map function:

@time map(count_heads,num_coins)

the result is

22.679107 seconds (78.88 k allocations: 3.957 MiB, 0.03% gc time)

again, about 4 times slower.

Shared arrays

The last example shows that we may have to do something on an array. For simplicity, let’s say we have a fairly large array:

arr = rand(1:100,100_000_000);

A common thing to do is to smooth an array (as is often done to images) by taking a windowed mean, which means that every element is replaced by the mean of the points around it in some window. We first define the windowed mean by the following function:

function window_mean(arr::Array{Int64,1},i::Integer,width::Integer)
  window = max(1,i-width):min(i+width,length(arr))
  sum(arr[window])/(last(window)-first(window)+1)
end

which first determines the window (being careful at the beginning and end of the array) and then calculates the mean.
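
For example (the values depend on the random array), the window is truncated at the edges:

window_mean(arr, 1, 100)    # mean of arr[1:101]; the window is cut off at the left edge
window_mean(arr, 500, 100)  # mean of arr[400:600], a full window of 201 elements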

Then if we have a new array of zeros:

smoothed_array = zeros(length(arr))

we fill the new array with the windowed mean:

@time let
  for i=1:length(arr)
    smoothed_array[i]=window_mean(arr,i,100)
  end
end

This fills the array with the smoothed version in about 43 seconds.

Note: the astute reader is probably thinking that map is the right way to do this; however, we can’t use map here because the window_mean function doesn’t just use a single element of the array, it needs the entire array.

A natural way to speed this up is to send pieces to individual processors/cores and work on them in parallel. One problem is that we would have to send the entire array to each worker, and that is expensive. Since a 64-bit integer is 8 bytes, the array of 100 million elements takes about 800 MB of memory, which is reasonably expensive to pass around. Instead, we are going to use a shared array from the SharedArrays package, which you will need to add.

using SharedArrays
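
As a preview of where this is heading, here is a minimal sketch of the shared-array version of the smoothing loop (the names sarr and smoothed are mine, and this is one possible arrangement, not the only one). Note that window_mean is redefined with @everywhere and loosened to an AbstractVector so that it accepts a SharedArray:

@everywhere using SharedArrays

@everywhere function window_mean(arr::AbstractVector, i::Integer, width::Integer)
  window = max(1,i-width):min(i+width,length(arr))
  sum(arr[window])/(last(window)-first(window)+1)
end

# copy the data into a SharedArray so that every worker can read it
# without receiving its own 800 MB copy
sarr = SharedArray{Int64}(length(arr))
sarr .= arr

smoothed = SharedArray{Float64}(length(arr))

# @sync makes @time wait until all of the workers have finished
@time @sync @distributed for i = 1:length(sarr)
  smoothed[i] = window_mean(sarr, i, 100)
end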