UCLA Mathematics Consulting Group
NEMO Cluster

The NEMO cluster consists of nine rack-mounted Intel Xeon machines, each clocked at 3.4 GHz with 9.0 GB of memory. The machines run SuSE Linux 10.3, and the nodes are named nemo01 through nemo09.

The nine machines of the NEMO cluster were donated to UCLA Math by Tony DeRose and his colleagues at Pixar. They were formerly part of the giant Pixar rendering farm.

These machines are intended for serial and parallel computing jobs, generally parametric studies and algorithm development. The memory per node is substantial, which makes the cluster well suited to memory-intensive jobs such as those that work with large matrices.


Setting up your .cshrc file to run jobs on NEMO

Copy and paste the following lines at the end of your .cshrc file:

setenv SGE_ROOT /usr/local/sgen
setenv SGE /usr/local/sgeqn
setenv PGI /usr/local/pgi
setenv LM_LICENSE_FILE $PGI/license.dat
setenv LD_LIBRARY_PATH /usr/dt/lib
setenv MANPATH "$MANPATH":$PGI/man
set path = ( $SGE_ROOT $SGE_ROOT/bin/glinux $PGI/linux86/7.1-6/bin $SGE/local/bin $path )
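
The settings above are written for csh. If your login shell is sh or bash instead, a rough equivalent for a file such as .profile or .bashrc might look like the sketch below. This is an assumption: the page itself only covers csh, and the paths are simply copied from the csh lines.

```shell
# sh/bash sketch of the csh settings above (assumed equivalent, not from the original page).
export SGE_ROOT=/usr/local/sgen
export SGE=/usr/local/sgeqn
export PGI=/usr/local/pgi
export LM_LICENSE_FILE="$PGI/license.dat"
export LD_LIBRARY_PATH=/usr/dt/lib
# ${MANPATH:-} avoids an error if MANPATH is not set yet.
export MANPATH="${MANPATH:-}:$PGI/man"
# Prepend the SGE and PGI directories, in the same order as the csh "set path" line.
export PATH="$SGE_ROOT:$SGE_ROOT/bin/glinux:$PGI/linux86/7.1-6/bin:$SGE/local/bin:$PATH"
```

Log out and back in (or source the file) so that the new settings take effect.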


Running jobs on NEMO

Summary: Compile your code, ssh to nemo01, run the batch job submitter, view your job in the queue, view your output and/or job execution statistics.


1) First, it is assumed that you have already compiled your code (in C, C++, F77, F90, etc.).
2) From the directory where your code is located, open a remote shell on nemo01. To do this, type ssh nemo01
3) To run the batch job submitter, simply type:

job.q

A queueing script will then start in your UNIX window and guide you through a series of questions for submitting your job. The first step is to "build" the command script that runs the job: select "b" for build, and the script will create the command script for you.

Instead of using the batch submission (queueing) procedure, you can also log in directly to one of the nemo nodes and run jobs there. However, you then run the risk of overloading a machine that may already be running jobs. In addition, while your job is running there, the queueing system will overlook that machine when it places new jobs. In other words, users who submit to NEMO may not get an accurate picture of which machines are available if you run your job outside the queueing system.
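
For a first try of the batch submitter, a trivial script is enough. The sketch below shows what a test script such as hello.sh might contain. Its contents are hypothetical (this page does not show the file), and the function name say_hello is made up for illustration.

```shell
#!/bin/sh
# hello.sh: hypothetical test job; the original page does not show this file.

say_hello() {
    # Report which node ran the job and where it ran.
    echo "Hello from $(uname -n)"
    echo "Working directory: $(pwd)"
    # Echo back any arguments given when the command script was built.
    echo "Arguments: $*"
}

say_hello "$@"
```

You may need to make the script executable (chmod +x hello.sh) before submitting it. Whatever the script prints should end up in the job's output file.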


Reviewing job queue status on NEMO

To review the job queue status on NEMO, simply type:

qstat

Then, you will see a screen which displays the jobs running on the various NEMO nodes. It will look something like this:

job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
-----------------------------------------------------------------------------------------
    21      0  hello.sh.c  ra    t      09/26/2003 15:47:48  nemo02.q  MASTER

To interpret this: the job-ID is 21, the user is ra, the file submitted is hello.sh.cmd (the name column truncates it to hello.sh.c), the time of submission is 15:47 (3:47 PM) on 09/26/2003, and the job ran on queue 'nemo02.q' (one of the NEMO nodes). You may also see jobs that other users have submitted to the queue.
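
If the listing gets long, the columns can be picked apart with a small filter. The sketch below reads qstat-style text on standard input and prints the job-ID, user, state and queue for each job. It assumes the column order shown in the sample (job-ID, prior, name, user, state, date, time, queue); the function name summarize_qstat is made up for illustration, and the state meanings cover only the common Grid Engine codes qw, t and r.

```shell
#!/bin/sh
# Summarise qstat-style output read from standard input.
# Assumed column order: job-ID, prior, name, user, state, date, time, queue.

summarize_qstat() {
    awk '
        # Job lines start with a numeric job-ID; header and rule lines do not.
        $1 ~ /^[0-9]+$/ {
            state = $5
            if (state == "qw")      meaning = "queued, waiting"
            else if (state == "t")  meaning = "transferring to a node"
            else if (state == "r")  meaning = "running"
            else                    meaning = "other"
            # Waiting jobs have no queue column yet.
            queue = (NF >= 8) ? $8 : "-"
            printf "%s %s %s (%s) %s\n", $1, $4, state, meaning, queue
        }
    '
}
```

A typical use would be: qstat | summarize_qstat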


Sample job run on NEMO
The following is the sequence of screens that appear in your command-line window after running 'job.q':


Notice to users of the job.queue script:
The output for SGE jobs generated by the job.queue script
will be written to two files:


'jobname'.joblog will contain the output from the 'jobname'.cmd script.


'jobname'.output will contain the output from the program or script being executed.


Enter to continue.




Functions (acceptable abbreviations are shown in CAPS)
Menu: Display this menu
Build: Build a SGE .cmd file for Serial
Submit: Submit a SGE .cmd file for execution
STatus: Display the status of SGE jobs for ra
SYsstat: Display the status of SGE jobs for the system
Hold: Hold a SGE job
RELease: Release a SGE job that is held
RESet: Reset the priority of a SGE job
Cancel: Cancel a SGE job
Quit: Exit this script

Command: b (<-- for build)




Enter the name of the program or script to be executed : hello.sh (<-- for example)



Checking for duplicate queue control files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.cmd" file.

Do you want to remove this file and continue (y or n)?
'default n': y (<-- for example)




Enter any arguments for the hello.sh program or script (default none):



The "hello.sh.cmd" file has been built. Would you like to submit it (y or n)? : y (<-- for example)



Checking for duplicate output files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.joblog" file.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.output" file.

Do you want to overwrite these files and continue (y or n)?
'default n': y (<-- for example)




your job 19 ("hello.sh.cmd") has been submitted
Current SGE job status for ra

job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
-----------------------------------------------------------------------------------------
    19      0  hello.sh.c  ra    qw     09/25/2003 12:00:05

Enter to continue.




    BRIEF EXPLANATION OF WHAT WENT ON ABOVE.

    1) You ran 'job.q'.
    2) It launched a notice screen; you hit enter
    3) It gave you a menu; you selected 'b' to build the command script
    4) It asked for the name of the file to run (e.g., hello.sh)
    5) It found that you already have a command script; asked to remove it; you answered yes
    6) It asked you to enter any arguments. You had none, and hit enter
    7) It told you the command file was built; it asked you to submit it; you answered yes
    8) It checked for duplicate files; it asked to overwrite them; you answered yes
       (please note that each output file is NOW tagged with the job ID number at the end).
    9) It submitted your job, and gave you a message indicating the status in the queue
    10) If you hit enter again, you'll return to the main menu, where you can hit 'q' to quit
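
Once the job has finished, its two output files can be located from the program name. The sketch below lists them. Because the exact form of the job ID tag on the file names is not shown on this page, the trailing * pattern is an assumption, and the function name job_files is made up for illustration.

```shell
#!/bin/sh
# List the .joblog and .output files that exist for one program or script.
# The trailing * is an assumption meant to catch a job-ID tag of unknown form.

job_files() {
    name=$1
    for f in "$name".joblog* "$name".output*; do
        # An unmatched pattern is left as literal text, so keep only real files.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}
```

For example, job_files hello.sh run in the submission directory prints the matching file names, which you can then view with cat or more.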


    FURTHER NOTES

    1) If you have already built your command script and just want to run it again (perhaps with
       different input only), type 's' for submit instead of 'b' for build. When you 'submit' a job,
       you still type only the filename (e.g., hello.sh), and NOT the command script name hello.sh.cmd.
    2) You will receive 2 emails at your address regarding the job: the first tells you when the job
       started, and the second gives full details of the job (runtime, completion time, etc.).
       Samples of these emails are below.


The EMAILS that you will receive regarding the job you just ran

Subject: Job 19 (hello.sh.cmd) Started

Job 19 (hello.sh.cmd) Started
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13

Subject: Job 19 (hello.sh.cmd) Complete

Job 19 (hello.sh.cmd) Complete
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13
End Time = 09/25/2003 12:00:13
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:00
CPU = NA
Max vmem = NA
Exit Status = 0
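
If you save one of these emails to a text file, individual fields can be pulled out with a one-line filter. The sketch below assumes the "Key = Value" layout shown in the samples above; the function name mail_field and the file name complete.txt are made up for illustration.

```shell
#!/bin/sh
# Print the value of one "Key = Value" field from a saved job email on stdin.

mail_field() {
    # Split each line on " = " and print the value of the first matching key.
    awk -F ' = ' -v key="$1" '$1 == key { print $2; exit }'
}
```

For example, mail_field "Exit Status" < complete.txt prints the exit status recorded in the email. By the usual UNIX convention, 0 means the job ended without reporting an error.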