Section 1: Eat Some Pi
Pi is the ratio of a circle's circumference to its diameter. As such, the value of pi can be computed as follows:
Consider a circle of radius r inscribed in a square of side length 2r. Randomly generate ns points ("darts") within the square. Then, determine the number of those points that also fall inside the circle (nc).
If f is the ratio of the number of points in the circle to the number of points in the square (f = nc/ns), then f approximates the ratio of the two areas: the circle has area pi r^2 and the square has area (2r)^2 = 4r^2, so f ~ pi/4 and pi can be approximated as pi ~ 4f. The more points generated, the better the approximation. One benefit of this Monte Carlo calculation of pi is that it's embarrassingly parallel: it requires little to no effort to program a parallel implementation. We simply split the computation of pi across different processors and aggregate their individual estimates. To improve the accuracy of our estimate of pi, we'll perform several trials and calculate the mean.
- Look at the C program ser_pi_calc. Extend this program using collective MPI routines to compute pi in parallel using the method described above.
- For the first version of your program, perform the same number of trials on each MPI rank. That is, all ranks will run identical pieces of code. Measure the total runtime using MPI_Wtime(). Vary the number of ranks used from 1 to 4. How does the total runtime change?
- Now, coordinate your MPI ranks to divide the number of "rounds" equally among them. Make sure to use the appropriate MPI routines to (1) distribute and (2) communicate the workload across workers. Again, run the program on 1 to 4 ranks. How does the runtime vary now?
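One possible shape for the coordinated version is sketched below. It is not the reference solution: the total dart count, the seeding scheme, and the even-split rule are illustrative choices. It uses MPI_Reduce to sum the per-rank hit counts onto rank 0 and MPI_Wtime() to time the computation.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long total_darts = 1000000;

    /* (1) Distribute the workload: an equal share per rank, with the
     * remainder spread across the lowest-numbered ranks. */
    long my_darts = total_darts / nranks
                  + (rank < total_darts % nranks ? 1 : 0);

    double t0 = MPI_Wtime();

    /* Seed each rank differently so they sample independent points. */
    unsigned int seed = 12345u + (unsigned int)rank;
    long my_hits = 0;
    for (long i = 0; i < my_darts; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            my_hits++;
    }

    /* (2) Communicate: sum the per-rank hit counts onto rank 0. */
    long total_hits = 0;
    MPI_Reduce(&my_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("pi ~ %f  (%.6f s on %d ranks)\n",
               4.0 * (double)total_hits / (double)total_darts,
               t1 - t0, nranks);

    MPI_Finalize();
    return 0;
}
```

Note that every rank must reach the MPI_Reduce call, since it is a collective operation over the whole communicator.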
- Now let's change the number of "darts" and ranks. Use your MPI program to compute pi using total dart counts of 1E3, 1E6, and 1E9. For each dart count, run your code on HPCC with processor counts of 1, 2, 4, 8, 16, 32, and 64. Keep track of the resulting values of pi and the runtimes. Use non-interactive jobs and modify the submitjob.sb script as necessary.
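A non-interactive run might use a batch script along these lines. This is only a sketch in the spirit of submitjob.sb: the job name, time limit, module name, and executable path are placeholders to adapt to HPCC's environment and your build.

```shell
#!/bin/bash
#SBATCH --job-name=pi_calc
#SBATCH --nodes=1            # increase for the multi-node runs
#SBATCH --ntasks=64          # total MPI ranks: 1, 2, 4, ..., 64
#SBATCH --time=00:10:00
#SBATCH --output=pi_%j.out

# Placeholder module and executable names; match them to HPCC
# and to however you build your program.
module load OpenMPI

srun ./pi_calc 1000000000    # total dart count as an argument
```

Submitting one job per (dart count, rank count) pair, or looping over srun calls inside a single job, both work; just make sure each timing comes from a dedicated run.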
- For each processor count, plot the error in your computed values of pi relative to the true value as a function of the number of darts used. Use log-log scaling on this plot. What is the rate of convergence of your algorithm for computing pi? Does this value make sense? Does the error or rate of convergence to the correct answer vary with processor count? Should it? Feel free to use the scripting language of your preference to perform this analysis and generate the visualizations.
- For each dart count, make a plot of runtime versus processor count. Each line represents a "strong scaling" study for your code (check out this resource to learn more about strong and weak scaling). For each dart count, also plot the "ideal" scaling line. Calculate the parallel scaling efficiency of your code for each dart count. Does the parallel performance vary with dart count? Explain your answer.
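The two quantities asked for above, the convergence rate and the strong-scaling efficiency, can each be reduced to a few lines of analysis. The sketch below assumes Python with NumPy; the function names and any data you feed in are your own (the plotting itself, e.g. matplotlib's loglog, is omitted).

```python
import numpy as np

def convergence_rate(ndarts, errors):
    """Fit |error| ~ C * n^p on a log-log scale and return the slope p.

    Monte Carlo integration converges like 1/sqrt(n), so p should come
    out near -0.5 regardless of the processor count used.
    """
    p, _intercept = np.polyfit(np.log10(ndarts), np.log10(errors), 1)
    return p

def parallel_efficiency(nprocs, runtimes):
    """Strong-scaling efficiency E(p) = T(1) / (p * T(p)).

    The first entry of nprocs/runtimes is taken as the baseline run;
    E = 1 everywhere corresponds to the "ideal" scaling line.
    """
    nprocs = np.asarray(nprocs, dtype=float)
    runtimes = np.asarray(runtimes, dtype=float)
    return runtimes[0] * nprocs[0] / (nprocs * runtimes)
```

For example, runtimes that exactly halve with each doubling of ranks give an efficiency of 1.0 at every processor count, while flat runtimes drive the efficiency toward 1/p.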
- Going further. Try running your code on different node types on HPCC with varied core counts. In particular, try to ensure that your runs utilize multiple nodes so that the communication network is used. Do you see a change in the communication cost when the job is run on more than one node?
What to turn in
To your git project repo, commit your final working code for the above exercises and a concise write-up including:
- Performance and accuracy data for your calculations of pi,
- The plots you made to visualize the strong and weak scaling,
- Detailed responses to the questions.