
Chapter 29 Parallel Computing

Briefly, parallel computing is a method of running code on multiple processors (or multiple cores of the same processor) at the same time. In general, this is a difficult task, since care must be taken with where data is stored and retrieved. The Julia Documentation on parallel computing is a good place to start.
Let’s start with something relatively simple. Consider the following code:
function countHeads(n::Int)
  c::Int = 0          # running count of heads
  for i=1:n
    c += rand(Bool)   # rand(Bool) is true (counts as 1) about half the time
  end
  c
end
which mimics flipping a coin n times. We can simulate flipping 2 billion coins and finding the fraction that is heads by the following:
@time countHeads(2*10^9)/(2*10^9)
which returns 0.5000087195 in 2.548122 seconds.
And remember, since we are using random numbers, if you run this code you will get different results, but the number should be close to the expected value of 0.5.

Section 29.1 Running parallel code

We now wish to take advantage of the fact that today's processors often have multiple cores and support multithreading. There are some helpful things in the Distributed package:
using Distributed
Unless you specify otherwise, Julia will start up using only 1 process. You can see this with nprocs(), which will return 1. Let's say that we wish to run our code simultaneously on 2 cores (even if we know we have more). If we type
addprocs(1)
this then allows Julia to use up to 2 processes. There are a couple of ways to now run code. The first gives us more control, while the second allows for some easy code writing. Checking with nprocs() now shows that there are 2 processes available.
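For example, a quick before-and-after check might look like the following (the worker IDs you see may differ):
using Distributed
nprocs()      # 1: just the master process
addprocs(1)   # returns the IDs of the new workers, e.g. [2]
nprocs()      # 2: the master plus one worker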

Subsection 29.1.1 Assigning code to workers

First, we need to make sure that the function countHeads is available to each core of the machine. To do this, we will start the function with the @everywhere macro:
@everywhere function countHeads(n::Int)
  c::Int = 0
  for i=1:n
    c += rand(Bool)
  end
  c
end
Any function that you will need on multiple cores will need to be prefaced with the macro @everywhere. After this, we can now send code to different cores. Type:
a = @spawn countHeads(10^9)
b = @spawn countHeads(10^9)
and you should notice that it doesn't return the number of heads as expected; instead it returns almost instantaneously. Don't worry right now about the type of object that is returned.
Even though it returns quickly, the code is already running, and you can see this with your machine's process manager (Activity Monitor on macOS). To get the results, we type
fetch(a)+fetch(b)
and it will return the number of heads flipped. To test the timing, put all three of these in a block:
a = @spawn countHeads(10^9)
b = @spawn countHeads(10^9)
@time fetch(a)+fetch(b)
and the results show that it took 2.469003 seconds.
This is perhaps a bit disappointing: if we use two cores, we might expect the time to be cut in half. It's not quite that simple, since there is some overhead for using multiple cores. However, if you use more cores, as in the exercise below, you'll probably get better results, and for longer processes it will pay off.
Check Your Understanding 29.1.
Perform the following tasks:
  1. Add two more cores to Julia with the addprocs command.
  2. Rerun the @everywhere function countHeads block of code above.
  3. Create 4 spawning lines. Call them a, b, c and d and use 5*10^8 for each.
  4. Time the sum of the four.
  5. Note: if you truly have 4 cores, then you will see a further halving of the time. If not, you will probably only see the same results.

Section 29.2 Managing Processes

Julia has a few commands that allow us to manage how many processes are used. These include addprocs seen above, but also procs and rmprocs. We will see these in action in this section.

Subsection 29.2.1 The procs command

Entering procs() shows all of the current processes running in the kernel. If you did the exercise above, you should see 4 processes, and the output may look like:
4-element Vector{Int64}:
  1
  2
  3
  4
which shows that there are 4 processes available and they are numbered 1 through 4.
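Relatedly, the workers() function lists only the worker processes, leaving out the master (process 1) once workers exist. A quick comparison:
procs()     # all process IDs, e.g. [1, 2, 3, 4]
workers()   # only the worker IDs, e.g. [2, 3, 4]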

Subsection 29.2.2 Adding all of the cores

Since most computers have multiple cores, this is helpful, but what about adding all of the cores? If we just call
addprocs()
then it adds a worker for every available core and returns an array of the IDs of the new workers.
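For example, in a fresh session on a machine with 8 cores, one might see something like the following (the exact IDs depend on your machine):
addprocs()   # might return [2, 3, 4, 5, 6, 7, 8, 9]
nprocs()     # 9: the master plus 8 workers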

Subsection 29.2.3 Removing cores

Although if you are trying to run code as quickly as possible you probably want as many cores as possible, you may need to reduce the number of cores used. We can do this with the rmprocs command, passing in the numbers of the processes that you want to remove.
If you currently have 4 processes numbered 1 through 4 and want to return to just 1 process, then rmprocs(2,3,4) will return you to a single process.
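If you don't want to track the process numbers yourself, a small sketch using the workers() function from above removes everything in one call:
rmprocs(workers())   # remove every worker process
nprocs()             # back to 1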

Section 29.3 Parallel for loops

First, use the rmprocs command as shown above to return to a single process.
One issue with what we did above is that we had to think about spawning to individual processes or cores, which is fairly annoying. Fortunately, another helpful feature of Julia is the parallel for loop. Try the following code:
@time let
  nheads = @distributed (+) for i = 1:2*10^10
    Int(rand(Bool))
  end
end
and this took 24.830175 seconds. Note that we are doing 10 times as many coin flips as at the beginning of the chapter, so you should see this taking about 10 times longer.
Add 3 cores with addprocs(3) and rerun the code above. When running this, I got the results in 9.434582 seconds, which is about twice as fast. If you were expecting (or hoping for) 4 times as fast, recall that there is overhead for handling parallel code. This machine has 8 cores, so calling addprocs() added another 4 cores; rerunning, I got the results in 5.423481 seconds, which is about 5 times faster than running without parallel code.
So what is going on here? The @distributed macro runs a for loop on all available cores, splitting the range across the cores. The (+) says to add the results. Any reducer function (see Section 7.2 for the general reduce function) can be used.
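For example, any associative reducer works, not just (+). Here is a sketch that uses (max) to find the largest of many random numbers in parallel:
biggest = @distributed (max) for i = 1:10^8
  rand()
end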

Section 29.4 Writing a parallel card simulator

For this, we need to reload the PlayingCards module:
include("PlayingCards.jl")
using .PlayingCards, Random
(or you may need to put in a different path to your module)
If we have the following code to simulate a large number of hands:
function countHands(trials::Int,f::Function)
    local deck=map(Card,1:52)   # a standard 52-card deck
    local num_hands=0
    for i=1:trials
        shuffle!(deck)          # shuffle the deck in place
        h = Hand(deck[1:5])     # deal the top five cards
        if f(h)                 # does the hand satisfy the given test?
            num_hands+=1
        end
    end
    num_hands
end
And let’s just check to make sure it works:
@time countHands(10_000_000,isFullHouse)
which returns the result in 4.305718 seconds.
We now wish to write a distributed version of this, in which we will replace the for loop with a distributed for loop. First, the module needs to be loaded on all cores by the code below; but first restart the kernel to avoid errors that occur when rerunning the same code.
using Distributed
@everywhere include("PlayingCards.jl")
@everywhere using .PlayingCards, Random
and then the following will be the parallel version of the hand count function:
@everywhere function paraCountHands(trials::Integer,f::Function)
  local deck=map(Card,1:52)
  function checkHand(f::Function) ## shuffle the deck then check the hand.
    shuffle!(deck)
    f(Hand(deck[1:5]))
  end
  @distributed (+) for i = 1:trials
    Int(checkHand(f))
  end
end
Add 3 cores with the command addprocs(3) and then rerun the code above with
@time fh = paraCountHands(10_000_000,isFullHouse)
which returns the result and cuts the time by about a quarter.


Section 29.5 A Parallel Map Function

We will again use the countHeads function from the beginning of the chapter; recall that it must be declared on all workers:
@everywhere function countHeads(n::Int)
   c::Int = 0
   for i=1:n
       c += rand(Bool)
   end
   c
end
In Chapter 7, we saw the map function, which is the standard way to apply the same function to each element of an array, returning the array of results. If the function that is applied to each element is costly and you have multiple processors/cores to work with, doing this in parallel can be helpful.
We demonstrate how to do this with just the coin flip right now using the pmap function. We first make an array of length 12 (although this isn’t an important number):
num_coins = 1_000_000_000*ones(Int64,12)
so each element is 1 billion. We wish to run the coin flipper on each element of the array. Don’t forget that the countHeads function must be declared everywhere. The following runs the function:
@time pmap(countHeads,num_coins)
and note the resulting time.
In comparison, if we run the regular map function:
@time map(countHeads,num_coins)
the result is about 4 times slower than the pmap version.

Subsection 29.5.1 When to use pmap

It seems like the pmap function should always be used if it speeds up calculations. However, note that in this example the array was small but the function applied to each element took a relatively long time. If we have an array with millions of elements and a cheap function, the pmap function may actually be slower, since every element incurs interprocess communication overhead.
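To see this yourself, try a cheap function on a modest array; the exact timings depend on your machine, but map typically wins easily here because pmap pays a communication cost for every single element:
xs = rand(10^4)
@time map(x -> x^2, xs)    # essentially instantaneous
@time pmap(x -> x^2, xs)   # typically much slower: each element is shipped to a worker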

Section 29.6 Shared arrays

The last example shows that we may have to do something on an array. For simplicity, let’s say we have a fairly large array:
arr = rand(1:100,100_000_000);
A common thing to do is to smooth an array (as is often done to images) with a windowed mean, which means that every element is replaced by the mean of the points around it in some window. We first define a windowed mean with the following function:
function windowMean(arr::Vector{T},i::Integer,width::Integer) where T <: Real
  # clamp the window so it does not run off either end of the array
  window = max(1,i-width):min(i+width,length(arr))
  sum(arr[window])/(last(window)-first(window)+1)
end
which first determines the window (being careful at the beginning and end of the array) and then calculates the mean.
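For example, near the left edge the window is clamped to the start of the array:
windowMean([1.0, 2.0, 3.0, 4.0, 5.0], 1, 2)   # window is 1:3, so this returns 2.0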
Then if we have a new array of zeros:
smoothed_array = zeros(Float64,length(arr))
we fill the new array with the windowed mean:
@time let
  for i=1:length(arr)
    smoothed_array[i]=windowMean(arr,i,100)
  end
end
will fill the array with the smoothed version in about 43 seconds.
Note: the astute reader is probably thinking that using map to do this is the right way to go; however, we can't use map in this instance because the windowMean function doesn't use just a single number (from an array), it uses the entire array.
A natural way to speed this up is to send pieces to individual processes/cores and work on the pieces separately. One problem with this is that we would have to send the entire array to each worker, and that is expensive. Since a 64-bit integer is 8 bytes, the array of 100 million elements is about 800 MB of memory, and that is reasonably expensive to pass around. Instead, we are going to use a shared array from the package SharedArrays, which you will need to add.
using SharedArrays
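To preview where this is heading, here is a minimal sketch of the smoothing done with a SharedArray; note that the array size is reduced and the windowMean signature is relaxed to AbstractVector (so that it accepts a SharedArray), both adjustments made only for this sketch:
using Distributed
addprocs(3)                     # adjust to your machine
@everywhere using SharedArrays
# relaxed signature so a SharedArray can be passed in
@everywhere function windowMean(arr::AbstractVector{T},i::Integer,width::Integer) where T <: Real
  window = max(1,i-width):min(i+width,length(arr))
  sum(arr[window])/length(window)
end
sarr = SharedArray{Int64}(1_000_000)      # smaller than 10^8 to keep the test quick
sarr .= rand(1:100, length(sarr))
smoothed = SharedArray{Float64}(length(sarr))
# every worker writes its piece directly into the shared result array
@time @sync @distributed for i = 1:length(sarr)
  smoothed[i] = windowMean(sarr, i, 100)
end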