Introduction to Sun SparcServer 1000 "Scorpion"

We have four Sun SparcServer 1000 "Scorpion" machines.  Their configurations
are:
    Name	CPUs	Memory	Swap	Disc			For use of:
    Bamboo	8	480 Mb	620 Mb	4x500 Mb		Applied
    Deodar	4	228 Mb	620 Mb	4x500 Mb		Applied
    Poplar	4	228 Mb	620 Mb	4x500 Mb		Applied
    Laplace	2	64 Mb	350 Mb	8x500 Mb		Statistics
    (Sycamore)	2	256 Mb	768 Mb	2x1002 Mb		Grad Students

The processors are 50 MHz TI "Viking" SuperSparc processors.  The primary
swap space (210 Mb) is shared with the /tmp filesystem.  (Sycamore is a 
20/612 without multiprocessing capability, and with 60 MHz processors.)

On calculations equivalent to matrix multiply, arranged for best use of the
cache, simple Fortran do-loops compiled with SC3.0 Fortran with optimization
as suggested in the accompanying handout, with 4 processors in parallel,
speeds of 60 to 63 MFlops/sec are consistently achieved, counting t = t +
a(i,k)*a(k,j) as two flops.  With a Fortran algorithm hand-tuned give the
optimizer the best chances, the same operations are done at up to 134
MFlops/sec.  See below for discussion.

These machines are intended as computation servers.  They are running
SunOS v5.3 (Solaris 2.3), a System V R4 type of UNIX.  Installed on them
is the SC3.0 compiler suite including Fortran, "C", C++ and Pascal;
the multiprocessing extensions are also installed.  (See the accompanying
handout "fortran-SC3.0" for information.)  Standard Mathnet software
support including X11R5 and Matlab is available (mostly).  

At present there is no NCARG, but this is coming soon. [950313]

Accessing the Scorpions is the same as for other Mathnet machines, that is,
"rxterm bamboo" from a machine with X-windows, or "c bamboo" from an office
terminal and the CS-1, or "rlogin bamboo" on a dial connection once you are
logged onto your home site (assuming your home site is not a Scorpion).  
Your home directory is accessible from the Scorpions just as from any other
Mathnet machine.  Put your source code (and small or valuable data files) in
your home directory or a subdirectory.  For self-written programs that you
want to have always available, create a directory "bin.sparc" in your home
directory, and put the executable modules in there; it will be on your path
automatically on the Solaris machines, but will be ignored on other machine
types.  

For large files of a temporary nature and low value, there is 210 Mb
available in /tmp which is shared with swap space.  Files may remain there
for 3 days, but this space is erased at each reboot.  It is suggested to
make your own subdirectory to avoid name conflicts with other users.  

After extensive consultations with Sun and patches to the kernel, we have
gotten reasonable interactive performance on the Scorpions despite heavy
compute-intensive loads.  Occasional sluggishness is still seen, probably
related to memory congestion when jobs are run that occupy most of memory.

Non-parallel jobs share the processors efficiently; in one test, 8 jobs were
set running at the same time (on 4 processors) and the last one finished
after time 2.04*T, whereas the same jobs were run one by one and the last
one finished after 8*T, for an efficiency of 98%.  After the kernel patches,
it appears that parallel jobs also compete efficiently with non-parallel and
parallel jobs.  However, the long-standing problem of internal competition
is still there; when 5 processors (on a 4 processor machine) were specified
for one parallel job running alone, efficiency was 0.1%.  

In conclusion, if you run a parallel job you need to be aware of how many
processors will actually be available to you.  When running uniprocessor
jobs, remember that there are other issues than just switching between
processes; see "handout policies/background.jobs" for a discussion.  Note
that Sycamore can only run uniprocessor jobs.  

The Viking architecture is capable of very fast performance, but it can also
be abused easily and will then be slower than a Sparc-1.  We are still
learning the tricks of making the machine go fast.  However, here are some
preliminary suggestions.  Note that the Sparc 10/30's and 20/61 use the same
chip but at a different speed, so these ideas are directly applicable to
them too, except for parallelization; also, other RISC machines such as the
RS6000 and the IRIS are affected by the same issues.

1.  There are two caches.  The small cache holds 64k bytes (8192 double
precision numbers).  The large cache holds 2 Mbytes (2.6e5 doubles).  (On
Sparc 10 the large cache is 1 Mb.)  In addition, you have to consider the
total size of physical memory as indicated above.  When you access
an item of data, in the worst case it makes a trip from disc to physical
memory to the large cache to the small cache, and each step takes time. 
Data in memory but not in the small cache takes twice as long to access as
data already in the cache; data on disc but not in memory takes almost 10^6
times longer.  However, a large block of data is loaded at once, and if you
use the neighboring data your investment of time is spread out over all the
items. 

You win if you can fully digest one cache load of material, or somewhat
less, before going on to the next set.  Extra complication in nested loops,
so as to enhance local references, can pay off with a 5 to 1 speedup in the
worst examples.  Here is some data showing the effect on performance of
fitting the data in the cache.  The algorithm is multiplying matrices with
one matrix transposed so the data is accessed with a stride of 1, and one
processor is used (results are  proportional with multiple processors).  A
smarter algorithm can do this job faster, but the amount of loss at the
n=50 break would probably be  bigger too.

	Matrix	Bytes	Mfl/s	Comment
	Order	per Mtx
	   4	0.4k	 9.4	Loop setup time is significant for small loops
	  14	4.7k	15.7	Speed increases smoothly with order up to here
	  46	51k	16.8	Nearly constant up to here; using small cache
	  56	75k	14.7	A noticeable drop, starts using large cache
	 261	1.6m	14.0	Nearly constant up to here, break estd at 290
	 562	7.6m	 8.7	Declining to here, overflows large cache
	1000	24m	 8.6	Constant above that (actual test ended at 1000)
	2300	128m	 0.05	Estimated -- there would be a sudden drop 
				when the matrices would not fit in core.  
				Avoid this situation.

2.  In a loop you work on repeated data items, which may be some distance
apart in memory.  "Stride" refers to the separation of the data items.
Processing arrays with a stride of 1 is the best, i.e. handle items one
after another, skipping none.  The reason is that the time investment to
load a block of data into the cache is spread out over all the data items,
not just one or a few.  In Fortran the first subscript varies with stride 1,
so the loop on the left, below, can be 10 times faster than the loop on the
right.  Sometimes it is worthwhile to transpose an array first, then do an
operation like matrix multiply.  Transposition necessarily works at a stride
of N, but only has to be done once, whereas each datum is accessed N times
when multiplying.

    C Inner loop stride 1: Faster  C Inner loop stride N: Slower
	do 10 j = 1, n			do 10 i = 1, n
	do 10 i = 1, n			do 10 j = 1, n
    10  t = t + a(i,j)		    10  t = t + a(i,j)

3.  The machine can overlap operations if you give it the chance.  Compare
these loops:

    do 10 i = 1, n	do 10 i = 1, n, 3	do 10 i = 1, n, 3
 10 t = t + v(i)     10 t=t+v(i)+v(i+1)+v(i+2)  s1 = s1 + v(i)
						s2 = s2 + v(i+1)
						s3 = s3 + v(i+2)
					     10 continue
						t = s1 + s2 + s3

The middle one is just as slow as the one on the left, because before it
can add v(i+1) it has to wait for the previous addition to finish.  The
loop on the right is 1.5 times as fast because each of the additions can be
started before the previous one has finished, including the last addition
of the previous loop.  The optimal number of sub-sums depends on small
details of the problem and of how much of the matrix fits in the cache.
Here are the speeds (on 1 processor) for the inner loop of a matrix
multiply:

	Subsums		Mflops/sec
			24x24	60x60
	1		17.1	16.0
	2		21.7	20.7
	3		21.8	21.4
	4		26.6	23.3
	5*		14.2	21.8
	6		13.0	18.5

		* Problem size 25x25

4.  Within limits, the compiler will automaticaly unroll some loops so as to
use the sub-sum technique, but you can get better results by hand for
critical code sections.  Compile with -pg and use gprof to find out which
these are.  

5.  Another critical area for making parallel programs run fast is loop
setup time.  On a heavily loaded machine, when one of your processors
finishes its assignment, i.e. 1/4 of some loop, it will transfer to a
different job, and for your next parallelized loop you will have to drag it
back, at considerable cost in time estimated as 100 to 200 usec.  Thus you
should do as much as possible in each parallel loop.  In this example, the
code on the left lets all the processors go after the scaling step, then has
to re-recruit them to do the dot product.  Plus the data of matrix "a" has
to be reloaded into the cache, or into physical memory if very large.  But
the code on the right both scales and dots each matrix row before going on
to the next, doing a bigger job with each investment of time to acquire a
processor and load the cache.  

	C Two loops waste time		C One loop is faster
	    do 10 i = 1, n (PARALLEL)	    do 22 i = 1, n (PARALLEL)
	    do 10 j = 1, n		    do 10 j = 1, n
	10  a(j,i) = scale*a(j,i)	10  a(j,i) = scale*a(j,i)
	    do 22 i = 1, n (PARALLEL)
	    t = 0.0			    t = 0.0
	    do 20 j = 1, n		    do 20 j = 1, n
	20  t = t + a(j,i)*b(j,k)	20  t = t + a(j,i)*b(j,k)
	22  v(i) = t			22  v(i) = t

6.  For the same reason, make sure that the loops being parallelized are the
outer ones, as much as the algorithm allows.  See the accompanying Fortran
handout for conditions that prevent parallelization and suggestions for how
to get rid of them.  For example, the code above might be inside a loop over
k, and a little cleverness in the scaling step would let you parallelize the
outer k loop rather than the middle i loop, for even more time saving.