Supercomputing Challenge Team 070

Please refer to Appendix B for select frames of output from our program.
Click Here for the MPEG Videos

EFFICIENCY OF THE PROGRAM

To test the efficiency of the program, we made several test runs with varying parameters. During each test run, efficiency was measured in two ways. First, we used a Silicon Graphics utility called “pixie” to profile how each thread of execution spent its time. We found that as the number of threads increased, the efficiency of the program decreased even though the wall clock time became shorter. In these graphs, blue represents input/output (I/O), green represents the time used processing neuron data, yellow represents program initialization and the overhead of running multiple threads, and red represents time spent spin locking. All tests were performed outputting 600 frames at various widths and heights. The results from this series of tests at 300x300 are displayed in figures R.2 through R.8.

(Figure R.2 - A profile of processor utilization at 300x300. On a single processor the program is most efficient.)

On a single processor the program was very efficient because there was no overhead from initializing threads and no waiting for work. More time was spent processing than on either initialization or I/O.

(Figure R.3 - A profile of processor utilization at 300x300 on 2 processors. This master process took part in one half of the parallel region's computations.)

(Figure R.4 - A profile of processor utilization at 300x300 on 2 processors. More time is spent waiting for work than doing work.)

(Figure R.5 - A profile of processor utilization at 300x300 on 4 processors. Processors each have one fourth of the parallel computations.)

(Figure R.6 - A profile of processor utilization at 300x300 on 4 processors. The waiting condition of the slave threads worsens.)

(Figure R.7 - A profile of processor utilization at 300x300 on 16 processors. Serial computations are primarily what the master thread is doing.)

(Figure R.8 - A profile of processor utilization at 300x300 on 16 processors. Each slave thread now handles only a small fraction of the parallel computations, so the parallel region is very fast while the serial region remains the same speed. Slave threads spend 91% of their time waiting.)

In a 300x300 run, as the number of threads and processors increases, the slave threads spend more time waiting for work. The parallel region of the program takes less and less time, while the serial region takes a constant amount of time.
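The trend described above can be illustrated with a simple Amdahl-style model. This is only a sketch, not the team's measurement method: it assumes the master thread alone runs the serial region while the slaves wait, and that the parallel region divides evenly across all threads. The function name and the timings in the loop are hypothetical and are not taken from the pixie profiles.

```python
def slave_wait_fraction(serial_time, parallel_time, threads):
    """Fraction of the run a slave thread spends waiting, under a simple model.

    Assumptions (not from the report): the serial region takes the same time
    regardless of thread count, slaves are idle for all of it, and the
    parallel region splits evenly across `threads` threads.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    total = serial_time + parallel_time / threads
    return serial_time / total


# Hypothetical timings for illustration only: 1 unit of serial work and
# 9 units of parallel work. These are not measurements from the report.
for n in (2, 4, 16):
    print(n, "threads:", round(slave_wait_fraction(1.0, 9.0, n), 3))
```

In this model the waiting fraction rises with the thread count because the serial term stays fixed while the parallel term shrinks, which matches the qualitative pattern in figures R.3 through R.8.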

The graphs of processor utilization for a 1200x1200 run outputting 600 frames are displayed in figures R.9 through R.15.

(Figure R.9 - A profile of processor utilization at 1200x1200 on 1 processor. Again, the most efficient run.)

(Figure R.10 - A profile of processor utilization at 1200x1200. At this width and height the same division of work takes place.)

(Figure R.11 - A profile of processor utilization at 1200x1200. When there is more work to do in a wider and higher data set, the processor spends most of its time processing.)

(Figure R.12 - A profile of processor utilization at 1200x1200. The work is being divided as in the 300x300 run.)

(Figure R.13 - A profile of processor utilization at 1200x1200. There is more spin locking taking place in this run but to a relatively lesser degree than at 300x300 because there is more processing to do.)

(Figure R.14 - A profile of processor utilization at 1200x1200. Wall clock time has decreased significantly but at the price of running on 16 processors.)

(Figure R.15 - A profile of processor utilization at 1200x1200. When there is more work for each thread to do, running on multiple processors becomes much more efficient because less time is spent spin locking.)

Although a 1200x1200 run takes more wall clock time than a 300x300 run, the program is more efficient because there are many more computations for the threads to divide among themselves, which decreases the percentage of time each processor spends spin locking. As the program's computations become more complex, its efficiency on multiple processors should increase considerably.
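To give a rough sense of how much more work each thread receives at the larger size, the sketch below counts grid cells per thread per frame. It assumes that the computation scales with the number of cells (width times height) and that the cells are divided evenly among threads. Neither assumption is confirmed by the report, and the function name is hypothetical.

```python
def cells_per_thread(width, height, threads):
    """Cells each thread handles per frame, assuming an even split.

    Assumption (not from the report): work is proportional to the number
    of cells in the width x height grid.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return (width * height) / threads


small = cells_per_thread(300, 300, 16)
large = cells_per_thread(1200, 1200, 16)
print("300x300 on 16 threads:", small, "cells per thread")
print("1200x1200 on 16 threads:", large, "cells per thread")
print("ratio:", large / small)
```

Under these assumptions a 1200x1200 frame gives each thread sixteen times as many cells as a 300x300 frame at the same thread count, so any fixed serial or synchronization cost is spread over more useful work.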

Next, we used a UNIX utility called "time" to measure the wall clock time and the CPU time the program took to complete. The results of these tests are displayed in figures R.16 and R.17.

(Figure R.16 - A display of CPU Time versus Wall Clock Time in a 300x300 simulation. The program takes less time but becomes less efficient.)

(Figure R.17 - A display of CPU Time versus Wall Clock Time in a 1200x1200 simulation. As in the 300x300 run, the program becomes less efficient with more processors, but since there are more computations this run is relatively more efficient.)
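Wall clock times like those reported by "time" are commonly summarized with the standard definitions of speedup and parallel efficiency. The sketch below shows those definitions. It is not necessarily how the figures above were produced, the function names are illustrative, and the numbers in the example are placeholders rather than results from these runs.

```python
def speedup(wall_one_proc, wall_n_procs):
    """Speedup: single-processor wall clock time divided by the
    wall clock time of the multi-processor run."""
    if wall_one_proc <= 0 or wall_n_procs <= 0:
        raise ValueError("wall clock times must be positive")
    return wall_one_proc / wall_n_procs


def parallel_efficiency(wall_one_proc, wall_n_procs, processors):
    """Parallel efficiency: speedup divided by the number of processors.
    A value of 1.0 would mean every processor was fully used."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(wall_one_proc, wall_n_procs) / processors


# Placeholder timings in seconds, for illustration only (not measured data).
one_proc, sixteen_procs = 100.0, 25.0
print("speedup:", speedup(one_proc, sixteen_procs))
print("efficiency:", parallel_efficiency(one_proc, sixteen_procs, 16))
```

With these placeholder values the run finishes four times faster on 16 processors but uses them at only 25% efficiency, which is the kind of trade-off the captions describe: less wall clock time at the cost of lower efficiency.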

New Mexico High School Supercomputing Challenge