UCLA Department of Mathematics

UCLA Mathematics Consulting Group

NEMO Cluster

The NEMO cluster is a set of nine Intel Xeon rack-mounted machines, each running at 3.4 GHz with 9.0 GB of memory per node. These machines run SuSE Linux 10.3. The nodes are named nemo01 through nemo09.

The nine machines of the NEMO cluster were donated to UCLA Math by Tony DeRose and his colleagues at Pixar. They were formerly part of the giant Pixar rendering farm.

The purpose of these machines is to run serial and parallel computing jobs, generally for parametric studies and algorithm development. The memory available on each node is substantial, which makes the cluster well suited to memory-intensive jobs such as those that use large matrices.

Setting up your .cshrc file to run jobs on NEMO

Cut and paste the following into the end of your .cshrc file:

setenv SGE_ROOT /usr/local/sgen
setenv SGE /usr/local/sgeqn
setenv PGI /usr/local/pgi
setenv LM_LICENSE_FILE $PGI/license.dat
setenv LD_LIBRARY_PATH /usr/dt/lib
setenv MANPATH "$MANPATH":$PGI/man
set path = ( $SGE_ROOT $SGE_ROOT/bin/glinux $PGI/linux86/7.1-6/bin $SGE/local/bin $path )
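
After logging in again so that the new .cshrc takes effect, it can be useful to confirm that the commands used on this page are actually on your path. The following is a minimal sketch, written for sh rather than csh, and the function name check_cmds is only illustrative. It assumes that qstat and job.q are the commands you expect to find.

```shell
#!/bin/sh
# Report whether each named command can be found on the current PATH.
# check_cmds is a hypothetical helper, not part of SGE or PGI.
check_cmds() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# qstat and job.q are the two commands used later on this page.
check_cmds qstat job.q || echo "Check the path line in your .cshrc"
```

If either command is reported as missing, the most likely cause is that the lines above were not pasted into .cshrc exactly, or that the shell has not been restarted since.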

Running jobs on NEMO

Summary: Compile your code, ssh to nemo01, run the batch job submitter, view your job in the queue, view your output and/or job execution statistics.

1) First, it is assumed that you have already compiled your code (in C, C++, F77, F90, etc.).
2) From the directory where your code is located, open a remote shell on nemo01. To do this, type: ssh nemo01
3) To run the batch job submitter, simply type:

job.q

A queueing script will then appear in your UNIX window and guide you through a series of questions for submitting your job. First, you should "build" the command script that runs the job by selecting "b" for build. The queueing script then creates the command script for you.
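
The sample session later on this page submits a script named hello.sh, whose contents are not shown. As a stand-in, a minimal test script might look like the sketch below. The greeting text and the function name hello are assumptions made for illustration only.

```shell
#!/bin/sh
# hello.sh: hypothetical stand-in for the script used in the sample session.

# Print a greeting that names the node the job ran on.
hello() {
    echo "Hello from $(uname -n)"
}

hello
```

A script like this is handy for a first submission, since the node name in the greeting shows which NEMO machine actually ran the job.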

Instead of using the batch submission (queueing) procedure, you can log in directly to one of the nemo nodes and run jobs there. However, you run the risk of overloading a machine that is already running jobs. In addition, the queueing system does not know about jobs started outside of it: as long as your job is running on a machine, the system will not take it into account when placing a new job. As a result, others who want to submit to NEMO may not get an accurate picture of which machines are available.

Reviewing job queue status on NEMO

To review the job queue status on NEMO, simply type:

qstat

Then, you will see a screen which displays the jobs running on the various NEMO nodes. It will look something like this:

job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
---------------------------------------------------------------------------------------------
    21      0  hello.sh.c  ra    t      09/26/2003 15:47:48  nemo02.q  MASTER

To interpret this: the job-ID is 21, the user is ra, the file submitted is hello.sh.cmd (shown truncated as hello.sh.c in the name column), the time of submission is 15:47:48 (3:47 PM) on 09/26/2003, and the job ran on the queue 'nemo02.q' (one of the NEMO nodes). You may also see jobs that other users have submitted to the queue.
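
The same reading of a job line can be done mechanically. The sketch below assumes that the whitespace-separated fields of a job line appear in the order of the sample row above (job-ID, prior, name, user, state, date, time, then an optional queue), and describe_job is a hypothetical helper name rather than an SGE command.

```shell
#!/bin/sh
# Summarize one job line of qstat output (not the header or dashed line).
# Assumed field order, taken from the sample row on this page:
#   job-ID prior name user state date time [queue [master]]
describe_job() {
    printf '%s\n' "$1" | awk '{
        printf "job %s (%s), user %s, state %s, submitted %s %s", $1, $3, $4, $5, $6, $7
        if (NF >= 8) printf ", queue %s", $8   # queue is blank while a job waits
        printf "\n"
    }'
}

describe_job "21 0 hello.sh.c ra t 09/26/2003 15:47:48 nemo02.q MASTER"
```

For the sample row this prints "job 21 (hello.sh.c), user ra, state t, submitted 09/26/2003 15:47:48, queue nemo02.q". If the column layout of qstat on your system differs, the field numbers in the awk program need to be adjusted.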

Sample job run on NEMO

The following is the sequence of screens that appears in your terminal window after running 'job.q':

Notice to users of the job.queue script:

The output for SGE jobs generated by the job.queue script

will be written to two files:

'jobname'.joblog will contain the output from the 'jobname'.cmd script.

'jobname'.output will contain the output from the program or script being executed.

Enter to continue.

Functions (acceptable abbreviations are shown in CAPS)

Menu: Display this menu

Build: Build a SGE .cmd file for Serial

Submit: Submit a SGE .cmd file for execution

STatus: Display the status of SGE jobs for ra

SYsstat: Display the status of SGE jobs for the system

Hold: Hold a SGE job

RELease: Release a SGE job that is held

RESet: Reset the priority of a SGE job

Cancel: Cancel a SGE job

Quit: Exit this script

Command: b (<-- for build)

Enter the name of the program or script to be executed : hello.sh (<-- for example)

Checking for duplicate queue control files.

You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.cmd" file.

Do you want to remove this file and continue (y or n)?
'default n': y (<-- for example)

Enter any arguments for the hello.sh program or script (default none):

The "hello.sh.cmd" file has been built. Would you like to submit it (y or n)? : y (<-- for example)

Checking for duplicate output files.

You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.joblog" file.

You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.output" file.

Do you want to overwrite these files and continue (y or n)?
'default n': y (<-- for example)

your job 19 ("hello.sh.cmd") has been submitted
Current SGE job status for ra

job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
---------------------------------------------------------------------------------------------
    19      0  hello.sh.c  ra    qw     09/25/2003 12:00:05

Enter to continue.

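BRIEF EXPLANATION OF WHAT WENT ON ABOVE.

In the session above, the Build function created the command file hello.sh.cmd for the script hello.sh, replacing an earlier copy of that file. The command file was then submitted and assigned job-ID 19, after the user agreed to overwrite the existing hello.sh.joblog and hello.sh.output files. In the status listing the job's state is qw, meaning that it is queued and waiting to run, and no queue is listed for it yet.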

FURTHER NOTES

The EMAILS that you will receive regarding the job you just ran

Subject: Job 19 (hello.sh.cmd) Started

Job 19 (hello.sh.cmd) Started
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13

Subject: Job 19 (hello.sh.cmd) Complete

Job 19 (hello.sh.cmd) Complete
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13
End Time = 09/25/2003 12:00:13
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:00
CPU = NA
Max vmem = NA
Exit Status = 0
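
Once the completion email has arrived, the two files named in the job.queue notice can be inspected from the directory in which the job was submitted. The following is a minimal sketch that prints both files for a given job name. The function name show_job_files is hypothetical, and hello.sh is used only because it is the job name in the sample session.

```shell
#!/bin/sh
# Print the .joblog and .output files that job.queue writes for a job.
# show_job_files is a hypothetical helper; run it in the submission directory.
show_job_files() {
    job=$1
    for ext in joblog output; do
        file="$job.$ext"
        if [ -f "$file" ]; then
            echo "== $file =="
            cat "$file"
        else
            echo "== $file not found (the job may not have run yet) =="
        fi
    done
}

show_job_files hello.sh
```

The .joblog file is the place to look first if a job fails, since it holds the output of the .cmd script itself, while the .output file holds whatever the program printed.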