UCLA Mathematics Consulting Group
NEMO Cluster

The NEMO cluster is a set of nine Intel Xeon rack machines, each with a 3.4 GHz clock speed and 9.0 GB of memory. These machines run the current version of SuSE Linux, 10.3. The nodes are numbered nemo01 through nemo09.
The nine machines of the NEMO cluster were donated to UCLA Math by Tony DeRose and his colleagues at Pixar. They were formerly part of the giant Pixar rendering farm.
The purpose of these machines is to run serial and parallel computing jobs, generally for parametric studies and algorithm development. The extended memory on each node is substantial, which makes the cluster well suited to memory-intensive jobs such as those that work with large matrices.

Setting up your .cshrc file to run jobs on NEMO
Cut and paste the following into the end of your .cshrc file:
setenv SGE_ROOT /usr/local/sgen
setenv SGE /usr/local/sgeqn
setenv PGI /usr/local/pgi
setenv LM_LICENSE_FILE $PGI/license.dat
setenv LD_LIBRARY_PATH /usr/dt/lib
setenv MANPATH "$MANPATH":$PGI/man
set path = ( $SGE_ROOT $SGE_ROOT/bin/glinux $PGI/linux86/7.1-6/bin $SGE/local/bin $path )

Running jobs on NEMO
Summary: Compile your code, ssh to nemo01, run the batch job submitter, view your job in the queue, view your output and/or job execution statistics.
1) First, it is assumed that you have already compiled your code (in C, C++, F77, F90, etc.).
2) From the directory where your code is located, open a remote shell on nemo01. To do this, type: ssh nemo01
3) To run the batch job submitter, simply type:
job.q
A queueing script will then start in your UNIX window and guide you through a series of questions for submitting your job. First and foremost, you should "build" the command script that runs the job. This can be done by selecting "b" for build. The script will then create the command script for you.
If, instead of using the batch submission (queueing) procedure, you log in directly to one of the NEMO nodes, you can also run jobs that way. However, you run the risk of overloading a machine that may already be running jobs. In addition, if you are running a job there, the queueing system will overlook that machine when submitting a new job (for as long as your job is running on it). That is, others who want to submit to NEMO may not be apprised of the status of the available machines if you are running your job outside of the queueing system.
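For reference, a job handed to job.q can be as simple as a short shell script. The following is a minimal sketch of what a script like the hello.sh used in the examples on this page might contain. The actual contents of that script are not shown on this page, so the body below (including the say_hello function name) is an assumption made for illustration.

```shell
#!/bin/sh
# hello.sh: hypothetical minimal job script (contents assumed for illustration).
# Anything written to standard output should end up in the 'jobname'.output
# file described in the job.q notice.

say_hello() {
    # Report which node the job ran on (uname -n prints the host name).
    echo "Hello from `uname -n`"
}

say_hello
```

Running the script by hand once (sh hello.sh) before building the command script is a cheap way to catch typos before the job enters the queue.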

Reviewing job queue status on NEMO
To review the job queue status on NEMO, simply type:
qstat
Then, you will see a screen which displays the jobs running on the various NEMO nodes. It will
look something like this:
job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
-----------------------------------------------------------------------------------------
    21      0  hello.sh.c  ra    t      09/26/2003 15:47:48  nemo02.q  MASTER
To interpret this: the job-ID is 21, the user is ra, the file submitted is hello.sh.cmd (the name column truncates it to hello.sh.c), the time of submission is 15:47:48 (3:47 PM) on 09/26/2003, and the job ran on queue 'nemo02.q' (one of the NEMO nodes). You may also see jobs from other users in the queue.
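When the queue is busy, it can help to reduce the qstat listing to just the fields described above. The following is a small sketch, assuming the layout shown on this page: two header lines followed by one line per job, whose first five whitespace-separated fields are job-ID, prior, name, user, and state. The function name summarize_qstat is made up for this example and is not part of SGE.

```shell
#!/bin/sh
# Summarize qstat-style output as one "job-ID user state" line per job.
# Assumes two header lines, then job lines whose first five fields are
# job-ID, prior, name, user, state (the layout shown on this page).

summarize_qstat() {
    # Skip the two header lines; ignore anything too short to be a job line.
    awk 'NR > 2 && NF >= 5 { print $1, $4, $5 }'
}

# Example using the sample job line from this page instead of a live qstat call.
printf '%s\n' \
    'job-ID prior name user state submit/start at queue master ja-task-ID' \
    '----------------------------------------------------------------' \
    '21 0 hello.sh.c ra t 09/26/2003 15:47:48 nemo02.q MASTER' \
    | summarize_qstat
# prints: 21 ra t
```

On the cluster itself, the same function would be fed from the real command: qstat | summarize_qstat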

Sample job run on NEMO
The following is the sequence of screens that appears on your command line after running 'job.q':
Notice to users of the job.queue script:
The output for SGE jobs generated by the job.queue script
will be written to two files:
'jobname'.joblog will contain the output from the 'jobname'.cmd script.
'jobname'.output will contain the output from the program or script being executed.
Enter to continue.
Functions (acceptable abbreviations are shown in CAPS)
Menu: Display this menu
Build: Build a SGE .cmd file for Serial
Submit: Submit a SGE .cmd file for execution
STatus: Display the status of SGE jobs for ra
SYsstat: Display the status of SGE jobs for the system
Hold: Hold a SGE job
RELease: Release a SGE job that is held
RESet Reset the priority of a SGE job
Cancel: Cancel a SGE job
Quit: Exit this script
Command: b (<-- for build)
Enter the name of the program or script to be executed
: hello.sh (<-- for example)
Checking for duplicate queue control files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.cmd" file.
Do you want to remove this file and continue (y or n)?
'default n': y (<-- for example)
Enter any arguments for the hello.sh program or script (default none):
The "hello.sh.cmd" file has been built.
Would you like to submit it (y or n)?
: y (<-- for example)
Checking for duplicate output files.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.joblog" file.
You already have a "/net/tupelo/h1/maint/ra/hello/hello.sh.output" file.
Do you want to overwrite these files and continue (y or n)?
'default n': y (<-- for example)
your job 19 ("hello.sh.cmd") has been submitted
Current SGE job status for ra
job-ID  prior  name        user  state  submit/start at      queue     master  ja-task-ID
-----------------------------------------------------------------------------------------
    19      0  hello.sh.c  ra    qw     09/25/2003 12:00:05
Enter to continue.
BRIEF EXPLANATION OF WHAT WENT ON ABOVE.
1) You ran 'job.q'.
2) It displayed a notice screen; you hit enter
3) It gave you a menu; you selected 'b' to build the command script
4) It asked for the name of the file to run (e.g., hello.sh)
5) It found that you already have a command script; asked to remove it; you answered yes
6) It asked you to enter any arguments. You had none, and hit enter
7) It told you the command file was built; it asked you to submit it; you answered yes
8) It checked for duplicate files; it asked to overwrite them; you answered yes
   (Please note: each output file is now tagged with the job ID number at the end.)
9) It submitted your job, and gave you a message indicating the status in the queue
10) If you hit enter again, you'll return to the main menu, where you can type 'q' to quit

FURTHER NOTES
1) If you have already built your command script and just want to run it again (perhaps with
   different input only), type 's' for submit instead of 'b' for build. When you 'submit' a job, you
   still need only type the filename (e.g., hello.sh), and NOT the command script name hello.sh.cmd.
2) You will receive two emails at your address regarding the job: the first tells you when the job
   started, and the second gives you full details of the job (runtime, completion time, etc.).
   Samples of these emails are below.
The EMAILS that you will receive regarding the job you just ran
Subject: Job 19 (hello.sh.cmd) Started
Job 19 (hello.sh.cmd) Started
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13
Subject: Job 19 (hello.sh.cmd) Complete
Job 19 (hello.sh.cmd) Complete
User = ra
Queue = nemo02.q
Host = nemo02.math.ucla.edu
Start Time = 09/25/2003 12:00:13
End Time = 09/25/2003 12:00:13
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:00
CPU = NA
Max vmem = NA
Exit Status = 0
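If you save the completion email to a file, the exit status can be pulled out with a short script. This is only a sketch: it assumes the "Exit Status = N" line format shown in the sample above, and the function name job_exit_status is invented for this example. By the usual UNIX convention, an exit status of 0 means the job finished without reporting an error.

```shell
#!/bin/sh
# Print the "Exit Status" value from the text of a completion email.
# Assumes the "Name = value" line format shown in the sample email.

job_exit_status() {
    # Split each line on " = " and print the value from the Exit Status line.
    awk -F' = ' '$1 == "Exit Status" { print $2 }'
}

# Example using lines from the sample email instead of a saved message.
printf '%s\n' \
    'Job 19 (hello.sh.cmd) Complete' \
    'User = ra' \
    'Exit Status = 0' \
    | job_exit_status
# prints: 0
```

With a saved message, the function would read from the file instead, for example job_exit_status < complete_mail.txt (the file name here is hypothetical).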