Re: NEC-LIST: LAPACK LU (was Re: NEC-LIST: Xeon vs Pentium II - 450)

From: Dave Michelson <dmichelson_at_email.domain.hidden>
Date: Fri, 09 Apr 1999 09:57:11 -0700

"D. B. Miron" wrote:
>
> I downloaded the .pdf file but my reader had problems
> translating it so I got 7 blank pages. How about a plain
> text version?

Here's the plain text version. Those who want to see the graphs will
still have to read the pdf version.

--
Dave Michelson
dmichelson_at_home.com
    NEC2 Benchmarks Under Linux: Standard NEC2 Versus
         LAPACK and ASCI Red BLAS Versions
                    I D Flintoft
                       6/4/99
This note compares the performance of NEC2 running under Linux using
the built-in LU decomposition and back substitution routines with that
of a modified version using routines from LAPACK. Two versions of LAPACK
were investigated, one based on the reference BLAS implementation, and
one built on the ASCI Red Optimised BLAS library.
System Hardware
Intel Pentium II at 400MHz on a Supermicro P6DBS motherboard with
128MB PC100 SDRAM (512k pipeline burst L2 cache at half processor
speed).
OS
The Debian GNU/Linux 2.1 (http://www.debian.org) distribution was used
with a 2.2.5 Linux kernel (http://www.linuxhq.com). The benchmarks
were run in single user mode to prevent interference from system
daemons.
Compilers
Two compilers were compared in the benchmarks. The experimental GNU
compiler, egcs, version 1.1.2 (http://egcs.cygnus.com), was used with
the compiler options:
-O2 -ffast-math -funroll-loops -fomit-frame-pointer -fno-emulate-complex
-march=pentiumpro -malign-double
The -fno-emulate-complex option forces the compiler to use the
complex number support in the gcc back end, which is significantly
faster than the emulated complex number support in g77. Note that
there are potential problems with the gcc complex support, which is why
it is not used by default. Care must therefore be taken to check that
valid code is produced when using this option.
The time-limited demo of the Portland Group, Inc. (PGI,
http://www.pgroup.com) Fortran compiler, version 1.7, was also used
with the options:
-O2 -tp p6 -Munroll -Mdalign
NEC2 Source
The NEC2 source code, nec2_src.tar.Z, from the "Unofficial Numerical
Electromagnetics Code (NEC) Archives"
(http://www.qsl.net/wb6tpu/swindex.html) was used. The following
modifications were made to the code:
1. The code was made fully double precision, including all embedded
constants.
2. Common blocks were realigned to comply with the FORTRAN77 standard.
3. Minor modifications were made to allow compilation with the GNU
Fortran compiler.
4. The ETIME intrinsic was added to subroutine SECOND to provide timing
information.
5. Other cleanups.
Standard NEC2
Standard NEC2 binaries were compiled directly from this modified NEC2
source code with the compiler options detailed above.
LAPACK NEC2
The LU decomposition and back substitution routines (FACTRS and
SOLVES) in NEC2 were modified to use the routines ZGETRF and ZGETRS
from the LAPACK library (http://www.netlib.org/lapack). The LAPACK
library was built using the reference BLAS implementation included in
the LAPACK source with the same compilation options as used for the
NEC2 code.
Modifications were made to four Fortran source files in the LAPACK
timing routines to allow compilation with GNU Fortran (which does not
accept calls to intrinsic functions in PARAMETER statements).
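As a rough illustration of what the factor and solve steps described
above compute, the following Python sketch performs an LU decomposition
with partial pivoting, followed by forward and back substitution, on a
small complex matrix. It is not the NEC2 or LAPACK code: lu_factor and
lu_solve are hypothetical names standing in for the roles played by
ZGETRF and ZGETRS, the 3x3 matrix is made up for the example, and the
blocked, BLAS-based LAPACK implementation is organised quite
differently.

```python
def lu_factor(a):
    """Return (lu, piv) for a square matrix given as a list of rows.

    lu holds the unit-diagonal L factor below the diagonal and U on and
    above it. piv[k] is the row that was swapped with row k at step k
    (0-based here, for illustration).
    """
    n = len(a)
    lu = [[complex(v) for v in row] for row in a]  # work on a copy
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        piv[k] = p
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]          # multiplier, stored in L
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]  # update trailing submatrix
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b using the factors returned by lu_factor."""
    n = len(lu)
    x = [complex(v) for v in b]
    # Apply the recorded row interchanges to the right-hand side.
    for k in range(n):
        if piv[k] != k:
            x[k], x[piv[k]] = x[piv[k]], x[k]
    # Forward substitution with the unit-diagonal L.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Back substitution with U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


# Made-up example: build b from a known x, then recover x.
a = [[2 + 1j, 1, 0], [1, 3 - 1j, 1j], [0, 1j, 4]]
x_true = [1 + 2j, -1j, 0.5]
b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]
lu, piv = lu_factor(a)
x = lu_solve(lu, piv, b)
print(max(abs(x[i] - x_true[i]) for i in range(3)))  # small residual
```

The factorisation is done once per matrix and the solve once per
right-hand side, which is why the two steps are separate routines both
in NEC2 and in LAPACK.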
ASCI Red NEC2
The Linux ASCI Red PentiumPro Optimised BLAS libraries, version 1.1n,
by Greg Henry (http://www.cs.utk.edu/~ghenry/distrib) from the Intel
Performance Library Suite
(http://developer.intel.com/design/perftoll/perflibst) and other
sources were used with LAPACK to build optimised versions of NEC2.
Benchmarks
The TEST300.NEC, TEST600.NEC and TEST1200.NEC files used in the PC
NEC4.1 Performance Data benchmark survey posted to the NEC mailing
list were used. These files have 300, 600 and 1200 segments
respectively. A new version of the test file with 2000 segments
(TEST2000.NEC) was also used.
-----------------
Table 1: Details of benchmark results.
                      Fill Time   Factor Time   Run Time
                         (s)            (s)        (s)
TEST300.NEC
NEC4.1         DVF/NT   1.090        0.830        2.040
Standard NEC2, egcs     0.960        0.940        1.990
LAPACK NEC2,   egcs     0.960        0.600        1.650
ASCI Red NEC2, egcs     0.970        0.540        1.580
Standard NEC2, PGI      0.800        1.050        1.920
LAPACK NEC2,   PGI      0.810        0.720        1.600
ASCI Red NEC2, PGI      0.810        0.310        1.190
TEST600.NEC
NEC4.1         DVF/NT   4.130        8.670       13.390
Standard NEC2, egcs     3.500        8.440       12.150
LAPACK NEC2,   egcs     3.490        6.240        9.950
ASCI Red NEC2, egcs     3.470        4.680        8.360
Standard NEC2, PGI      2.900        9.330       12.420
LAPACK NEC2,   PGI      2.960        7.230       10.380
ASCI Red NEC2, PGI      2.930        2.240        5.360
TEST1200.NEC
NEC4.1         DVF/NT  16.560       71.540       89.840
Standard NEC2, egcs    11.010       85.390       97.030
LAPACK NEC2,   egcs    11.010       67.110       78.740
ASCI Red NEC2, egcs    11.030       37.810       49.470
Standard NEC2, PGI      9.340       94.310      104.250
LAPACK NEC2,   PGI      9.480       79.760       89.790
ASCI Red NEC2, PGI      9.310       16.950       26.740
TEST2000.NEC
NEC4.1         DVF/NT     n/a          n/a          n/a
Standard NEC2, egcs    26.290      459.030      486.880
LAPACK NEC2,   egcs    26.240      317.820      345.520
ASCI Red NEC2, egcs    26.320      175.590      203.370
Standard NEC2, PGI     22.620      516.090      540.230
LAPACK NEC2,   PGI     23.040      399.310      423.690
ASCI Red NEC2, PGI     22.440       80.110      103.670
-----------------
Results
Table 1 presents the details of the fill time, factor time and total
run time for each benchmark using the different versions of NEC2. The
results are compared with the corresponding result from the PC
NEC4.1 Performance Data benchmark survey for a 400MHz Pentium II using
Digital Visual Fortran under Windows NT. The results are shown
graphically in Figures 1 to 4.
For standard NEC2 the PGI compiled version is faster than egcs for the
matrix fill (by about 14-17% in Table 1) but slower at the LU
decomposition (by about 10-12%). Compared with the NT version of
NEC4.1, the egcs factorisation is 13% slower for TEST300 and 19% slower
for TEST1200 but about 3% faster for TEST600, while PGI is slower in
all three cases, by 8% to 32%. The variation of the fill time, factor
time and total run time with the number of segments is shown in
Figures 5 to 7.
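For readers without the figures, the variation with the number of
segments can be estimated directly from Table 1. The sketch below
computes the local scaling exponent between successive test sizes,
assuming a power law t ~ N^p, for the standard NEC2 egcs timings. For N
segments the interaction matrix has N x N entries, so one would expect
p near 2 for the fill and near 3 for a dense LU factorisation. The
function name scaling_exponents is made up for this example, and the
power-law model is an assumption, not something taken from the note.

```python
import math

# Standard NEC2, egcs rows of Table 1.
segments = [300, 600, 1200, 2000]
fill_time = [0.960, 3.500, 11.010, 26.290]       # seconds
factor_time = [0.940, 8.440, 85.390, 459.030]    # seconds


def scaling_exponents(n, t):
    """Exponent p in t ~ n**p between each pair of successive sizes."""
    return [
        math.log(t[i + 1] / t[i]) / math.log(n[i + 1] / n[i])
        for i in range(len(n) - 1)
    ]


for label, times in (("fill", fill_time), ("factor", factor_time)):
    exps = scaling_exponents(segments, times)
    print(label, " ".join(f"{p:.2f}" for p in exps))
```

With these timings the fill exponents come out somewhat below 2 and the
factor exponents somewhat above 3. The latter would be consistent with
the larger matrices no longer fitting in cache, though that explanation
is a guess.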
Table 2 shows the speed-up in the factorisation time obtained by using
the LAPACK libraries with both the reference BLAS implementation and
the ASCI Red BLAS for the two compilers, where
speed-up = standard NEC2 factor time / factor time,
with both times taken for the same compiler and test file. For egcs
there is a speed-up of 1.3 to 1.6 using the reference BLAS LAPACK and
1.7 to 2.6 using the ASCI Red BLAS. With the ASCI Red BLAS the
performance boost grows with the number of segments; with the
reference BLAS there is no clear trend. For the PGI compiler the
speed-up with the reference BLAS version is 1.2 to 1.5, slightly lower
than with egcs. However, using the ASCI Red BLAS the PGI compiler gives
an improvement of 3.4 to 6.4, much greater than with egcs.
-----------------
Table 2: Speed-up factors for LAPACK and ASCI Red BLAS versions of NEC2
relative to the standard version.
                   egcs               PGI
              LAPACK  ASCI Red  LAPACK  ASCI Red
TEST300.NEC    1.57     1.74     1.45     3.39
TEST600.NEC    1.35     1.80     1.29     4.17
TEST1200.NEC   1.27     2.26     1.18     5.56
TEST2000.NEC   1.44     2.61     1.29     6.44
------------------
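The entries of Table 2 can be reproduced from the factor times in
Table 1. The short Python sketch below does so, taking the standard
NEC2 factor time for the same compiler and test file as the reference.
The dictionary layout and the function name speed_up are made up for
this example; the numbers are copied from Table 1.

```python
# Factor times (s) from Table 1: (standard, LAPACK, ASCI Red).
factor_times = {
    "TEST300.NEC": {"egcs": (0.940, 0.600, 0.540),
                    "PGI": (1.050, 0.720, 0.310)},
    "TEST600.NEC": {"egcs": (8.440, 6.240, 4.680),
                    "PGI": (9.330, 7.230, 2.240)},
    "TEST1200.NEC": {"egcs": (85.390, 67.110, 37.810),
                     "PGI": (94.310, 79.760, 16.950)},
    "TEST2000.NEC": {"egcs": (459.030, 317.820, 175.590),
                     "PGI": (516.090, 399.310, 80.110)},
}


def speed_up(reference_time, time):
    """Speed-up of a run relative to the reference (standard) run."""
    return reference_time / time


def speed_up_row(test_file):
    """Return [egcs LAPACK, egcs ASCI Red, PGI LAPACK, PGI ASCI Red]."""
    row = []
    for compiler in ("egcs", "PGI"):
        standard, lapack, asci_red = factor_times[test_file][compiler]
        row.append(speed_up(standard, lapack))
        row.append(speed_up(standard, asci_red))
    return row


for test_file in factor_times:
    values = "  ".join(f"{v:5.2f}" for v in speed_up_row(test_file))
    print(f"{test_file:<13s} {values}")
```

Recomputed this way, every entry agrees with Table 2 to within 0.01.
The one visible difference is PGI with reference LAPACK on TEST300.NEC,
which comes out as 1.46 from the three-decimal timings rather than
1.45, presumably because the table was produced from unrounded times.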
The far better performance of PGI with the ASCI Red BLAS may be due to
the known stack alignment problems with the GNU compilers. Serious
performance degradation can result if double precision variables are
not aligned on 64-bit boundaries in memory on 686 architectures: even
with -malign-double egcs does not always make a good job of doing
this.
-----
Received on Fri Apr 09 1999 - 16:02:40 EDT
