Distributed Programming with MPI
Prologue: Logistics
Due Dates
For this lab, you’ll be turning in the following deliverables:
- Checkpoint: Due Friday, February 14, 11:59 pm. Submit your completed code in XXX, your responses to Questions XXX, and a discussion of your progress on Part 1.
- Final Submission: Due Friday, November 21, 11:59 pm. Submit your completed code for all parts of the lab, as well as a PDF writeup containing your answers for Question XXX of the final writeup.
Starter Code
You can get the starter code for this lab by cloning the lab repository:
Section 1: Getting Started with MPI
Goals for This Lab
Part 1: Warm-up Exercises
As a group, complete the following exercises from HPSC.
- Exercise 2.18
- Exercise 2.19
- Exercise 2.21
- Exercise 2.22
- Exercise 2.23
- Exercise 2.27
Include your responses to these exercises in your project write-up.
Part 2: Setup on HPCC
The following is a very quick tutorial on the basics of using HPCC for this class.
- Log in to the HPCC gateway:
  `ssh <netid>@hpcc.msu.edu`
- Then log in to the AMD-20 cluster from the gateway:
  `module load powertools amd20`
- The default environment and software stack should be sufficient for this exercise, but if you run into issues compiling and running the code, try issuing the following commands:
  `module purge`
  `module load intel/2021a`
- When using the HPCC for development and exercises in this class, please do NOT just use the head node, `dev-amd20`. We will swamp the node and no one will get anything done. Instead, request an interactive job using the SLURM scheduler. An easy way to do this is to set up an alias command like so:
  `alias devjob='salloc -n 4 --time 1:30:00'`
- Run `devjob` to request 4 tasks for 90 minutes. This should be sufficient for most of the stuff we do during class, though for your projects you will at times require more resources. The above `module` and `alias` commands can be added to your `.bashrc` so that they are automatically executed when you log in.
Part 3: MPI Basics
- Clone your Project 2 repo from GitHub on HPCC.
- In the project directory you will find a simple "Hello World!" source file in C++. Compile and run the code, e.g.:
  `g++ hello.cpp`
  `./a.out`
- Now run the executable `a.out` using the MPI parallel run command and explain the output:
  `mpiexec -n 4 ./a.out`
- Add the calls `MPI_Init` and `MPI_Finalize` to your code. Put three different print statements in your code: one before the init, one between init and finalize, and one after the finalize. Recompile and run the executable, both in serial and with `mpiexec`, and explain the output. (A minimal sketch of what this might look like appears after this list.)
- Complete Exercises 2.3, 2.4, and 2.5 in the Parallel Programming book.
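For reference, here is a minimal sketch of what the modified hello program might look like (the file name `hello.cpp` and the `mpic++` wrapper are assumptions; use whatever the starter code provides):

```cpp
// Minimal sketch of hello.cpp with MPI_Init/MPI_Finalize added.
// Compile with an MPI wrapper (e.g., mpic++ hello.cpp), then run it
// serially (./a.out) and in parallel (mpiexec -n 4 ./a.out).
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    std::printf("before MPI_Init\n");        // printed by every launched process
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's rank in MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of ranks
    std::printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    std::printf("after MPI_Finalize\n");     // still one line per process
    return 0;
}
```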
Part 4: Eat Some Pi
Pi is the ratio of a circle's circumference to its diameter. As such, the value of pi can be computed as follows.
Consider a circle of radius r inscribed in a square of side length 2r. Randomly generate points within the square.
Determine the number of points in the square that are also in the circle.
If f = nc/ns is the number of points in the circle divided by the number of points in the square, then pi can be approximated as pi ~ 4f, since the ratio of the circle's area to the square's area is (pi r^2)/(2r)^2 = pi/4.
Note that the more points generated, the better the approximation.
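In code, the serial core of this method is just a loop of random "dart" throws. The sketch below uses the equivalent quarter-circle form on the unit square; the names and RNG choice are illustrative assumptions, and the provided `ser_pi_calc` program has its own structure:

```cpp
// Minimal serial sketch of the dart-throwing estimate of pi.
// Uses one quadrant of the inscribed circle, which gives the same ratio pi/4.
#include <cstdio>
#include <cstdlib>

int main() {
    const long n_darts = 1000000;   // total number of points to generate
    long n_in_circle = 0;

    for (long i = 0; i < n_darts; ++i) {
        // Random point in the unit square [0,1] x [0,1].
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        // Count it if it falls inside the quarter circle of radius 1.
        if (x * x + y * y <= 1.0) ++n_in_circle;
    }

    double pi_est = 4.0 * n_in_circle / (double)n_darts;
    std::printf("pi ~ %f\n", pi_est);
    return 0;
}
```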
- Look at the C program `ser_pi_calc`. Extend this program using collective MPI routines to compute `pi` in parallel using the method described above. Feel free to use C++, if you prefer, of course. (A sketch of the MPI calls involved appears after this list.)
- For the first iteration, perform the same number of "rounds" on each MPI rank. Measure the total runtime using `MPI_Wtime()`. Vary the number of ranks used from 1 to 4. How does the total runtime change?
- Now, divide the number of "rounds" up amongst the ranks, using the appropriate MPI routines to decide how to distribute the work. Again, run the program on 1 to 4 ranks. How does the runtime vary now?
- Now let's change the number of "darts" and ranks. Use your MPI program to compute `pi` using total numbers of "darts" of 1E3, 1E6, and 1E9. For each dart count, run your code on HPCC with processor counts of 1, 2, 4, 8, 16, 32, and 64. Keep track of the resulting value of `pi` and the runtimes. Use non-interactive jobs and modify the `submitjob.sb` script as necessary.
- For each processor count, plot the resulting errors in your computed values of `pi` compared to the true value as functions of the number of darts used to compute it. Use log-log scaling on this plot. What is the rate of convergence of your algorithm for computing `pi`? Does this value make sense? Does the error or rate of convergence to the correct answer vary with processor count? Should it?
- For each dart count, make a plot of runtime versus processor count. Each line represents a "strong scaling" study for your code. For each dart count, also plot the "ideal" scaling line. Calculate the parallel scaling efficiency of your code for each dart count. Does the parallel performance vary with dart count? Explain your answer.
- Going further. Try running your code on different node types on HPCC with varied core counts. In particular, try to ensure that your runs utilize multiple nodes so that the communication network is used. Do you see a change in the communication cost when the job is run on more than one node?
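As a hedged illustration of the MPI pieces the exercises above ask for, one possible shape of the parallel version is sketched below: each rank throws its share of the darts, the partial counts are combined with `MPI_Reduce`, and the calculation is timed with `MPI_Wtime()`. The variable names and the simple block split are assumptions, not the required design.

```cpp
// Sketch: distribute n_darts over the ranks, reduce the partial counts on
// rank 0, and time the whole calculation with MPI_Wtime().
#include <cstdio>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_darts = 1000000;                       // total darts (a command-line argument in practice)
    long my_darts = n_darts / size;                     // this rank's share
    if (rank == size - 1) my_darts += n_darts % size;   // last rank takes the remainder

    std::srand(1234 + rank);                            // different seed per rank
    double t0 = MPI_Wtime();

    long my_hits = 0;
    for (long i = 0; i < my_darts; ++i) {
        double x = std::rand() / (double)RAND_MAX;
        double y = std::rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) ++my_hits;
    }

    long total_hits = 0;
    MPI_Reduce(&my_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        std::printf("pi ~ %f  (%.6f s on %d ranks)\n",
                    4.0 * total_hits / (double)n_darts, t1 - t0, size);

    MPI_Finalize();
    return 0;
}
```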
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up including responses to the warm-up exercises, performance and accuracy data for your calculations of pi, the plots made in Part 4, and detailed responses to the questions posed concerning your results.
Section 2: MPI Ping-Pong and Ring Shift
Background
The ping-pong problem is a benchmark often used to evaluate the performance of message-passing interfaces such as MPI in parallel computing. In this problem, two processes exchange messages back and forth a specified number of times, with each process sending a message and receiving a message alternately. In the ping-pong, process i sends a message of size m to process j, then receives a message of size m back from j. The values of i, j, and m to use are given below.
The "ring shift" problem is similar to ping-pong. In the MPI ring shift, a group of processes is arranged in a ring, with each process holding a unique subset of a larger array of data. The goal is to shift the data elements by a specified number of positions around the ring, wrapping around the ends of the ring as necessary.
Part 1: Blocking Ping-Pong
Your task is to implement the ping-pong problem using MPI in C or C++ and analyze the behavior and performance of your code. Specifically, you should:
- Implement the ping-pong problem using MPI in C or C++. Use blocking `MPI_Send()` and `MPI_Recv()` calls. You should define the number of iterations and the size of the message to be exchanged. (A sketch of the core exchange appears after this list.)
- Measure the time taken to complete the ping-pong exchange for different message sizes. You should use the `MPI_Wtime()` function to obtain the time before and after the exchange and calculate the elapsed time. Vary the message size from 2 bytes to 4 kilobytes in powers of 2 (i.e., 2 bytes, 4 bytes, 8 bytes, ..., 2048 bytes, 4096 bytes). For each message size, perform 100 iterations of the ping-pong to build up statistical significance.
- Record the total amount of data sent and received during the ping-pong exchange for each configuration.
- Repeat steps 2 and 3 but ensure that the 2 processes that are communicating reside on different physical hardware nodes on HPCC.
- Plot the average communication time of a single exchange (send and receive) as a function of message size for the two cases. Using this plot, estimate the latency and bandwidth for each case. Are they different? Explain your results.
- Analyze and discuss your results. Explain the behavior of the resulting curves.
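A minimal sketch of the timed blocking exchange between ranks 0 and 1 is shown below. The tag, buffer type, and function name are assumptions; the sweep over message sizes and the statistics are left to your implementation.

```cpp
// Sketch of n_iters timed ping-pong iterations between ranks 0 and 1 using
// blocking MPI_Send/MPI_Recv on a message of msg_size bytes.
#include <vector>
#include <mpi.h>

double ping_pong(int msg_size, int n_iters, int rank) {
    std::vector<char> buf(msg_size, 0);
    const int other = (rank == 0) ? 1 : 0;   // partner rank
    const int tag = 0;

    double t0 = MPI_Wtime();
    for (int it = 0; it < n_iters; ++it) {
        if (rank == 0) {
            MPI_Send(buf.data(), msg_size, MPI_CHAR, other, tag, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), msg_size, MPI_CHAR, other, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), msg_size, MPI_CHAR, other, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), msg_size, MPI_CHAR, other, tag, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    return (t1 - t0) / n_iters;   // average time for one send+receive exchange
}
```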
Part 2: Non-blocking Ping-Pong
Repeat Part 1 using non-blocking MPI communication, i.e., using MPI_Isend() and MPI_Irecv(). You will need to include explicit process synchronization using, e.g., MPI_Wait() calls. Compare the results to the blocking case.
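The exchange itself might look like the following sketch, with separate send and receive buffers and `MPI_Waitall()` providing the explicit synchronization (names are assumptions):

```cpp
// Sketch: one non-blocking exchange, using separate send and receive buffers
// so the two outstanding operations never touch the same memory.
#include <vector>
#include <mpi.h>

void exchange_nonblocking(std::vector<char>& sendbuf, std::vector<char>& recvbuf,
                          int other, int tag) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf.data(), (int)recvbuf.size(), MPI_CHAR, other, tag,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf.data(), (int)sendbuf.size(), MPI_CHAR, other, tag,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // explicit synchronization point
}
```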
Part 3: MPI Ring Shift
- Implement the MPI ring shift in C or C++ for an arbitrary number of processes in the ring and arbitrary message size (i.e., number of elements per process). In your implementation, use `MPI_Sendrecv()` instead of separate `MPI_Send()` and `MPI_Recv()` calls. (A sketch of the core exchange appears after this list.)
- As in Parts 1 and 2, vary the message size from 2 bytes to 4 KB, in powers of 2. Also vary the number of processes used from 2 to N, in powers of 2, where N is sufficiently large that rank 0 and rank N-1 are guaranteed to reside on separate nodes (N will depend on which cluster you are using on HPCC).
- Compute the bandwidth and latency, as above. Plot the bandwidth as a function of message size. Include separate lines for each number of processes used.
- Analyze and discuss your results. Explain the behavior of the resulting curves.
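A sketch of the core shift step using `MPI_Sendrecv()` follows; the data layout and names are assumptions:

```cpp
// Sketch: one ring-shift step. Each rank sends its data block one position
// "up" the ring and receives the block from the rank one position "down".
#include <vector>
#include <mpi.h>

void ring_shift(std::vector<char>& data, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const int dest = (rank + 1) % size;          // neighbor to send to
    const int src  = (rank - 1 + size) % size;   // neighbor to receive from
    std::vector<char> incoming(data.size());

    MPI_Sendrecv(data.data(), (int)data.size(), MPI_CHAR, dest, 0,
                 incoming.data(), (int)incoming.size(), MPI_CHAR, src, 0,
                 comm, MPI_STATUS_IGNORE);

    data.swap(incoming);                         // this rank now holds the shifted block
}
```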
Part 4: Non-blocking MPI Ring Shift
Repeat Part 3 but using non-blocking communication via MPI_Isend() and MPI_Irecv(). Compare the results to the blocking case.
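Reusing the names from the Part 3 sketch (`data`, `incoming`, `src`, `dest`, `comm`), the non-blocking version of the shift step might look like this fragment:

```cpp
// Sketch: non-blocking version of the ring-shift step. As in Part 3, data is
// this rank's block and incoming is a scratch buffer of the same size.
MPI_Request reqs[2];
MPI_Irecv(incoming.data(), (int)incoming.size(), MPI_CHAR, src, 0, comm, &reqs[0]);
MPI_Isend(data.data(), (int)data.size(), MPI_CHAR, dest, 0, comm, &reqs[1]);
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // explicit synchronization
data.swap(incoming);
```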
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up including all plots, and detailed responses to the questions posed concerning your results.
Section 3: Advanced MPI topics
In this project, you will gain experience with some of the slightly more advanced features of MPI for distributed-memory parallelism.
Part 1: Latency hiding in three-point average
Look at the example C++ code in the project repo. This code implements a blocking ghost zone exchange for a 1D vector, assuming periodic boundary conditions. The operation performed in parallel on the vector is a simple three-point rolling average.
- Come up with a design plan for how to adapt this code to implement "latency hiding." By this we mean overlapping communication and calculation: post non-blocking communication calls, then, instead of waiting for those calls to complete, go ahead and perform all calculations that do not depend on the data being communicated.
- Now, go ahead and implement a latency-hiding version of the three-point averaging function (a structural sketch follows this list). Check that your code works and gives the correct result on 1, 2, and 4 processors.
- Compare the performance of the non-blocking version of the code to the blocking version for multiple numbers of ranks, keeping the array size per rank fixed.
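One way the latency-hiding version could be structured is sketched below. The function signature, the array layout with ghost cells at `u[0]` and `u[n+1]`, and the names are assumptions; the repo code defines its own.

```cpp
// Sketch of overlapping the ghost-zone exchange with interior work for a
// three-point average with periodic boundaries. The n local points live in
// u[1..n], with ghost cells u[0] and u[n+1].
#include <vector>
#include <mpi.h>

void average_with_overlap(std::vector<double>& u, std::vector<double>& avg,
                          int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int left  = (rank - 1 + size) % size;   // periodic neighbors
    const int right = (rank + 1) % size;

    // 1. Post the halo exchange without waiting for it.
    MPI_Request reqs[4];
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    // 2. Interior points (2..n-1) depend only on local data: compute them now.
    for (int i = 2; i <= n - 1; ++i)
        avg[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;

    // 3. Wait for the halos, then finish the two boundary points.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    avg[1] = (u[0] + u[1] + u[2]) / 3.0;
    avg[n] = (u[n - 1] + u[n] + u[n + 1]) / 3.0;
}
```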
Part 2: Latency hiding with one-sided MPI
- Watch the lecture by Bill Gropp on One-sided MPI.
- Modify your three-point averaging code to use one-sided MPI communication to achieve the halo exchanges (a sketch of one possible approach follows this list).
- Implement overlapping communication and calculation, and be sure to correctly handle computing the "boundary" values that depend on halo data.
- Verify that you get sensible results for arbitrary numbers of processors and global array sizes.
- Compare the relative performance of the non-blocking and one-sided versions of your three-point averaging code for various numbers of ranks, keeping the array size per rank fixed.
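As a hedged sketch of the one-sided variant of the halo exchange, using `MPI_Win_create`, `MPI_Put`, and fence synchronization (same layout assumptions as the previous sketch; a real code would create the window once and reuse it rather than recreating it on every call):

```cpp
// Sketch: fill the ghost cells u[0] and u[n+1] with one-sided MPI. Each rank
// exposes its whole u array in a window and puts its edge values directly
// into its neighbors' ghost cells.
#include <vector>
#include <mpi.h>

void halo_exchange_one_sided(std::vector<double>& u, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int left  = (rank - 1 + size) % size;
    const int right = (rank + 1) % size;

    MPI_Win win;
    MPI_Win_create(u.data(), (MPI_Aint)(u.size() * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    // My first real value becomes the left neighbor's right ghost cell u[n+1].
    MPI_Put(&u[1], 1, MPI_DOUBLE, left,  (MPI_Aint)(n + 1), 1, MPI_DOUBLE, win);
    // My last real value becomes the right neighbor's left ghost cell u[0].
    MPI_Put(&u[n], 1, MPI_DOUBLE, right, (MPI_Aint)0,       1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);     // ghost cells are now valid on every rank

    MPI_Win_free(&win);
}
```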
Part 3: Custom data types in MPI
This exercise is based on https://tech.io/playgrounds/349/introduction-to-mpi/custom-types.
Note: the example C++ code here uses C++11 features, so make sure you pass any necessary flags to your compiler. This may be an issue on Macs using clang; in that case, pass the -std=c++11 flag to clang (or to mpic++, as the case may be) and you should be good to go.
As you might have noticed, all datatypes in MPI communications are atomic types: an element corresponds to one singular value. Moreover, every communication forces you to use a contiguous buffer with the same datatype. Sometimes, it might be desirable to give additional information and meaning to the communications by creating higher-level structures. MPI allows us to do that in the form of derived or custom datatypes. To make our point, let's take a simple example:
Let's consider a system with N processes where all processes are charged with generating data while process 0 centralizes and stores the data.
The data generated by the processes corresponds to this struct:
struct CustomData {
int n_values;
double dbl_values[10];
};
Every process generates M of these custom structures, and then sends them to process 0.
What we want here is a simple gather on process 0 of all the values,
but we are limited at the moment with MPI and cannot do that in a simple way.
If we wanted to send this kind of data structure with the knowledge we currently have,
we would do it in a fashion similar to that shown in types_example.cpp.
As you can see from this very naive version, everything looks a lot more complicated than it should be. First we have to separate the values from every process into two tables, one for integer values, one for double values. Also note how the indexing part starts to become confusing with linear indexing on the double table. Then we have to gather everything in two passes and finally unpack everything in the final structure.
This problem could be solved in a simpler way using derived datatypes.
A datatype can be defined easily by specifying a sequence of couples.
Each couple represents a block: (type, displacement).
The type is one of the usual types used in MPI, while the displacement indicates the offset in bytes where this data block starts in memory. For instance, if we wanted to use a structure like this:
struct DataType {
int int_val;
char char_val;
float float_val;
};
We could describe this as: [(int, 0), (char, 4), (float, 5)] (ignoring any alignment padding the compiler may insert; see the discussion of offsetof below).
As for the example above, the description is a bit more complicated since we have 10 doubles each time,
but the idea is the same. Now, there are multiple ways of creating datatypes in MPI.
For instance, there is a dedicated way to repeat the same datatype multiple times.
There is also a more complex way of creating datatypes by generating lists such as the one shown above.
We are going to see the simpler version here and the more complex one in the following exercise.
Vectors
Of course the simplest form of custom datatype is the simple repetition of the same type of data.
For instance, if we were handling points in a 3D reference frame,
then we would like to manipulate a Point structure with three doubles in it.
We can achieve this very simply using the MPI_Type_contiguous function. Its prototype is:
int MPI_Type_contiguous(int count, MPI_Datatype old_type, MPI_Datatype *new_type);
So if we want to create a vector datatype, we can easily do:
MPI_Datatype dt_point;
MPI_Type_contiguous(3, MPI_DOUBLE, &dt_point);
We are not entirely done here: we still need to commit the datatype.
The commit operation allows MPI to generate a formal description of the buffers you will be sending and receiving.
This is a mandatory operation; if you don't commit but still use your new datatype in communications,
you are most likely to end up with invalid datatype errors.
You can commit by simply calling MPI_Type_commit.
MPI_Type_commit(&dt_point);
Then we can freely use this datatype in communications; see vector_example.cpp.
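As a quick illustration, here is a minimal fragment (separate from vector_example.cpp; it assumes rank has already been obtained with MPI_Comm_rank and that dt_point was committed as above):

```cpp
// Hedged micro-example: send one 3D point (three contiguous doubles) from
// rank 0 to rank 1 using the committed dt_point datatype.
double point[3] = {1.0, 2.0, 3.0};
if (rank == 0)
    MPI_Send(point, 1, dt_point, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
    MPI_Recv(point, 1, dt_point, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```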
Let's now move on to an exercise on custom datatypes.
Custom types - exercise
Above we saw how to create very basic contiguous datatypes. This way of creating datatypes does not help us when we want to create datatypes that mix different basic types. For instance, in the previous example, we saw a custom structure used to store the data:
struct CustomData {
int n_values;
double dbl_values[10];
};
To represent this using the type/displacement formalism, our datatype would look something like:
[(int, 0), (double, 4), (double, 12), (double, 20), (double, 28), (double, 36),
(double, 44), (double, 52), (double, 60), (double, 68), (double, 76)]
To simplify everything, we can convert every one of these couples into a triplet:
(type, start, number of elements). Thus our list simplifies to:
[(int, 0, 1), (double, 4, 10)]
MPI provides us with a special function to convert such a list into a datatype:
int MPI_Type_create_struct(int count, const int block_length[],
const MPI_Aint displacement[],
const MPI_Datatype types[],
MPI_Datatype *new_type);
Let's see these arguments one by one. count is the number of elements in your list;
in our case we have two entries, so count will be 2.
block_length is an array of integers indicating,
for entry i, the number of contiguous elements of that type.
That will be the third value of our triplets: 1 in the int case, 10 in the double case.
displacement is an array of MPI_Aint.
MPI_Aint stands for "address integer": it is simply an MPI integer type wide enough to hold a memory address.
In our case, the displacements are the second element of each triplet.
types is an array of MPI_Datatype values.
This should be pretty obvious by now:
it's an array of all the different sub-types we are going to use in the custom type.
Finally, we store the resulting datatype in new_type.
Knowing this, you are ready to optimise the example code from above: specifically, to remove all the copies in memory and transfer all the data using only one gather communication.
Displacements
Now there is a catch with the displacements: computing the offsets manually can be tricky.
Some types have sizes that can vary from one system or OS to another (although this tends to be less and less the case), and compilers may insert alignment padding between struct members, so hardcoding the values might lead to trouble.
A cleaner way is to use the offsetof macro from the standard library (you will have to include stddef.h in C or cstddef in C++).
offsetof takes two parameters: a struct type and the name of one member of that struct.
It returns a size_t (implicitly convertible to an MPI_Aint) corresponding to the displacement of that member.
For example, if we had the following structure:
struct MyStruct {
int a;
double b;
char c;
float d;
};
Then, we could define our displacement table as:
MPI_Aint displacements[4] = {
offsetof(MyStruct, a),
offsetof(MyStruct, b),
offsetof(MyStruct, c),
offsetof(MyStruct, d)
};
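Putting the pieces together for MyStruct, the full construction might look like the following sketch (the helper function name is an assumption):

```cpp
// Sketch: building and committing a derived datatype for MyStruct using
// offsetof for the displacements.
#include <cstddef>   // offsetof
#include <mpi.h>

struct MyStruct {
    int a;
    double b;
    char c;
    float d;
};

void make_mystruct_type(MPI_Datatype* dt_mystruct) {
    const int count = 4;                          // four blocks, one per member
    int block_lengths[4] = {1, 1, 1, 1};          // one element in each block
    MPI_Aint displacements[4] = {
        offsetof(MyStruct, a),
        offsetof(MyStruct, b),
        offsetof(MyStruct, c),
        offsetof(MyStruct, d)
    };
    MPI_Datatype types[4] = {MPI_INT, MPI_DOUBLE, MPI_CHAR, MPI_FLOAT};

    MPI_Type_create_struct(count, block_lengths, displacements, types, dt_mystruct);
    MPI_Type_commit(dt_mystruct);                 // required before use in communication
}
```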
Exercise
It's your turn to optimise the program we wrote in the previous section.
Use MPI_Type_create_struct to define a derived datatype and commit it so the data can be gathered on process 0. Start with the code in create_struct.cpp.
Your output from the code should be identical to that in create_struct.txt.
Part 4: MPI Subcommunicators
In this exercise, you will gain some experience with creating subcommunicators in MPI. Refer to section 6.4 of the Parallel Programming text for help.
Look at the non-functional code mpi_subcommReduce.cpp.
Complete this code so that MPI_COMM_WORLD is divided into a 2D array of ranks.
Split MPI_COMM_WORLD into a subcommunicator for each row and for each column.
Then, produce sums of the processes' MPI_COMM_WORLD rank numbers along each row and column.
Run your code for several different total numbers of ranks and row sizes.
Have rank 0 for each communicator output the result. Verify that it is correct.
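One possible shape for the completed code is sketched below purely for orientation; the variable names and the fixed row_size are assumptions, and mpi_subcommReduce.cpp defines its own structure that you should follow.

```cpp
// Sketch: arrange MPI_COMM_WORLD as an (nrows x row_size) grid, split it into
// row and column subcommunicators, and sum the world ranks along each.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int row_size = 4;                   // assumed; world_size should be a multiple
    const int my_row = world_rank / row_size;
    const int my_col = world_rank % row_size;

    // Color selects which subcommunicator a rank joins; key orders ranks within it.
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

    int row_sum = 0, col_sum = 0;
    MPI_Allreduce(&world_rank, &row_sum, 1, MPI_INT, MPI_SUM, row_comm);
    MPI_Allreduce(&world_rank, &col_sum, 1, MPI_INT, MPI_SUM, col_comm);

    int row_rank, col_rank;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_rank(col_comm, &col_rank);
    if (row_rank == 0)
        std::printf("row %d: sum of world ranks = %d\n", my_row, row_sum);
    if (col_rank == 0)
        std::printf("col %d: sum of world ranks = %d\n", my_col, col_sum);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```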
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up including all plots, and detailed responses to the questions posed concerning your results.