Briefly, parallel computing is a method of running code on multiple processors (or multiple cores of the same processor) at the same time. In general, this is a difficult task, and the difficulty depends on where data is stored and retrieved. The Julia Documentation on parallel computing is a good place to start.
And remember that since we are using random numbers, running this code will give different results each time, but the value should be close to the expected value of 0.5.
We now wish to take advantage of the fact that today's processors often have multiple cores and support multithreading. There are some helpful tools in the Distributed package:
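Before any of these tools can be used, the Distributed standard library must be loaded:

```julia
using Distributed   # provides nprocs, addprocs, @everywhere, @distributed, pmap
```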
Unless you specify otherwise, Julia will start up using only 1 process. You can see this with nprocs(), which will return 1 as the number of processes in use. Let's say that we wish to run the code simultaneously on 2 cores (even if we know we have more). If we type
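The command itself is not shown in this excerpt; with the standard Distributed API, adding one worker alongside the original process gives two in total:

```julia
using Distributed

addprocs(1)   # add one worker process; with the original process, nprocs() is now 2
nprocs()      # returns 2
```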
this would then allow Julia up to 2 "processors". There are a couple of ways to now run code. The first gives us more control, while the second allows for some easy code writing. Checking with nprocs() now shows that there are 2 processes available.
First, we need to make sure that the function countHeads is available to each core of the machine. To do this, we will start the function with the @everywhere macro:
@everywhere function countHeads(n::Int)
  c::Int = 0
  for i = 1:n
    c += rand(Bool)
  end
  c
end
Any function that you will need on multiple cores will need to be prefaced with the macro @everywhere. After this, we can now send code to different cores. Type:
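The call itself is missing from this excerpt; with the standard Distributed API, sending work to a particular worker is done with remotecall (the worker id 2 and the number of flips here are assumptions):

```julia
# run countHeads(2*10^9) on worker 2; returns a Future immediately, without waiting
r = remotecall(countHeads, 2, 2 * 10^9)
```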
and you should notice that it doesn't return the number of heads as expected, and it returns almost instantaneously. Don't worry right now about the type of object that is returned.
Even though it returns quickly, the code is already running, and you can see this with your machine's process manager (Activity Monitor on macOS). To get the results, we type
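Again, the call is not shown in this excerpt; with the standard API, the result of a Future (here assumed to be stored in a variable r) is retrieved with fetch:

```julia
fetch(r)   # blocks until the remote computation finishes, then returns the count
```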
This is perhaps a bit disappointing: if we use two cores, we might expect half the time. It's not quite that simple, since there is some overhead for using multiple cores. However, if you use more cores, as in the exercise below, you'll probably get better results, and for longer processes it will pay off.
Julia has a few commands that allow us to manage how many processes are used. These include addprocs, seen above, but also procs and rmprocs. We will see these in action in this section.
Although if you are trying to run code as quickly as possible you probably want as many cores as possible, sometimes you need to reduce the number of processes used. We can do this with the rmprocs command, passing in the id of the process that you want to remove.
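For example (the process ids below are illustrative; procs() shows the ids on your machine):

```julia
using Distributed

procs()      # list all process ids, e.g. [1, 2, 3]
rmprocs(2)   # shut down the worker whose id is 2
procs()      # worker 2 no longer appears in the list
```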
One issue with what we did above is that we have to think about spawning to individual processes or cores. That is fairly annoying. Fortunately, another helpful feature of Julia is the parallel for loop. Try the following code:
@time let
  nheads = @distributed (+) for i = 1:2*10^10
    Int(rand(Bool))
  end
end
and on my machine the results took 24.830175 seconds. Note that we are doing 10 times as many coin flips as at the beginning of the chapter, so you should see this take about 10 times longer.
Add 3 cores with addprocs(3) and rerun the code above. When running this, I got the results in 9.434582 seconds, about twice as fast. If you were expecting (or hoping for) 4 times as fast, recall that there is overhead for handling parallel code. This machine has 8 cores, so calling addprocs() added another 4 cores; rerunning, I got the results in 5.423481 seconds, which is about 5 times faster than running without parallel code.
So what is going on here? The @distributed macro runs a for loop on all available cores, splitting the range across the cores. The (+) says to add the results. Any reducer function (see Section 7.2 for the general reduce function) can be used.
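As an illustration not taken from the original text, here is a sketch using max as the reducer instead of (+); each worker finds the largest of its own random draws and max combines the partial results:

```julia
using Distributed

biggest = @distributed (max) for i = 1:10^6
  rand()   # each iteration produces one value; max combines them across workers
end
```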
function countHands(trials::Int, f::Function)
  local deck = map(Card, 1:52)
  local num_hands = 0
  for i = 1:trials
    shuffle!(deck)
    h = Hand(deck[1:5])
    if f(h)
      num_hands += 1
    end
  end
  num_hands
end
We now wish to write a distributed version of this, replacing the for loop with a distributed for loop. First, the module needs to be loaded on all cores by the code below; but before that, restart the kernel to avoid errors that occur when rerunning the same code.
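The loading step is not shown in this excerpt; assuming the card types live in a module as in the earlier chapter (the module name PlayingCards and the filename are assumptions), it would look something like:

```julia
using Distributed
addprocs(3)

@everywhere include("PlayingCards.jl")   # hypothetical filename for the module with Card and Hand
@everywhere using .PlayingCards          # make Card, Hand, etc. available on every worker
@everywhere using Random                 # for shuffle!
```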
@everywhere function paraCountHands(trials::Integer, f::Function)
  local deck = map(Card, 1:52)
  # shuffle the deck, then check the hand
  function checkHand(f::Function)
    shuffle!(deck)
    f(Hand(deck[1:5]))
  end
  @distributed (+) for i = 1:trials
    Int(checkHand(f))
  end
end
Add 3 cores with the command addprocs(3) and then rerun the code above with
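The call is not included in this excerpt; assuming a predicate such as isFullHouse from the earlier chapter (the function name and the trial count are assumptions), it might look like:

```julia
@time paraCountHands(10_000_000, isFullHouse)   # count full houses among 10 million random hands
```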
In Chapter 7, we saw the map function, which is the standard way to apply the same function to each element of an array, returning the results. If the function applied to each element is costly and you have multiple processors/cores to work with, doing this in parallel can be helpful.
We demonstrate how to do this with just the coin flip, using the pmap function. We first make an array of length 12 (although this isn't an important number):
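The array itself is not shown in this excerpt; a natural way to build it (the variable name nums is an assumption):

```julia
nums = fill(10^9, 12)   # a 12-element array, each entry 1 billion
```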
so each element is 1 billion. We wish to run the coin flipper on each element of the array. Don’t forget that the countHeads function must be declared everywhere. The following runs the function:
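The call is missing from this excerpt; with the standard pmap API (the variable name nums is an assumption), it would be:

```julia
results = pmap(countHeads, nums)   # one task per array element, spread across the workers
sum(results)                       # total heads over all 12 billion flips
```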
It seems like the pmap function should always be used if it speeds up calculations. However, note that in this example the size of the array was small, but the function applied to each element took a relatively long time. If we have an array with millions of elements, the pmap function may actually be slower than map, because the communication overhead per element outweighs the cost of each cheap function call.
A common operation is to smooth an array by a windowed mean (this is often done to images), which means that every element is replaced by the mean of the points around it in some window. We first define a windowed mean with the following function:
function windowMean(arr::Vector{T}, i::Integer, width::Integer) where T <: Real
  window = max(1, i - width):min(i + width, length(arr))
  sum(arr[window]) / length(window)
end
which first determines a window (being careful at the beginning and end of the array) and then calculates the mean.
Note: the astute reader is probably thinking that using map is the right way to do this; however, we can't use map in this instance because the windowMean function doesn't use just a single number from the array, it uses the entire array.
A natural way to speed this up is to send pieces to individual processors/cores and work on them separately. One problem with this is that we would have to send the entire array to each worker, and that is expensive. Since a 64-bit integer is 8 bytes, an array of 100 million elements is about 800 MB of memory, which is reasonably expensive to pass around. Instead, we are going to use a shared array from the SharedArrays package, which you will need to add.
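As a sketch of where this is headed (the array sizes and variable names are assumptions, and windowMean is assumed to have been defined with @everywhere so every worker can call it):

```julia
using Distributed, SharedArrays

# a SharedArray is visible to every worker without copying the data around
data = SharedArray{Float64}(10^6)
data .= rand(10^6)
smoothed = SharedArray{Float64}(length(data))

# @sync waits for all workers to finish; each one fills in part of smoothed
@sync @distributed for i in 1:length(data)
  # sdata gives the underlying Vector, matching windowMean's signature
  smoothed[i] = windowMean(sdata(data), i, 100)
end
```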