Parallel Performance

The spectral element method is often characterized as a coarse-grained algorithm since heavy computations need to be performed before inter-element communication is required. An order of magnitude analysis shows that the speed-up in a typical two-dimensional application is:

		P		   T_c	            1      
 	S= ----------------, and -------  ~ O ( ----------- ) 
            1 + T_c/(K T_o)	  K T_o		sqrt(K) N^2
where P is the number of processors, K the number of spectral elements within each processor, N the spectral truncation, To the computational time, and Tc the computational time. The above formula shows that the speed-up deteriorates as the second term in the denominator increases. This second term decreases quadratically with the spectral truncation, and as the square root of the number of elements in the partition. The formula also shows the distinguishing property of the spectral element method which gives it its coarse grain character: the communication cost increases only linearly with the order of the method while its computational cost increases cubically, yielding a quadratic ratio between the two. High-order finite difference methods, by contrast, show a quadratic increase of the communication cost with the order, since the halo of points needed to be passed between processors increases.

The good scalability characteristics of SEOM are illustrated in the speed-up and MFLOPS curves shown in Figure 1. The computations consist of simulating the oceanic circulation in the NorthAtlantic Basin using a grid with 388 elements in the horizontal with a 7th order polynomial, and 8 elements in the vertical with a 3rd order interpolation. These performance curves were obtained on a CRAY-T3E, after removing the initialization cost. The peak measure floating point operation count is 2.7 GFLOPS on 40 processors, reflecting a per-processor speed of about 67.5 MFLOPS.

  
Figure 1: Speed-up (left panel) and MFLOPS (right panel) as a function of processor number for SEOM-3D on T3E. The speed-up is defined as the ratio Tp/T1, where Tp is the CPU time needed to compute a single time-step for a fixed-size problem. The dashed line represents the ideal linear speed-up. You can obtain a postscript copy of this figure .

SEOM's parallelization is based on a static decomposition of the grid which is then distributed to the processors. Due to the irregularity of the grid and the decomposition, we have adopted a message passing implementation that allows precise control over the data decomposition and the overlapping of communications and computations. The explicit message passing approach also frees us from relying on compiler sophistication.

For further information on the parallelization please refer to the article:
E. N. Curchitser, M. Iskandarani, and D. B. Haidvogel, ``A spectral element solution of the shallow water equations on multiprocessor computers'', Journal of Atmospheric and Oceanographic Technology , 15 No 2, 510--521, 1998.