Sun Grid Engine
On Holly Cluster
by
2/18/2003
Description:
Sun Grid Engine (SGE) is a batch job submission system, similar to PBS. It has the built-in advantage of a graphical management and monitoring interface. SGE on Holly has the added feature of specialized scripts that facilitate job submission and aid in troubleshooting submission and execution problems.
Philosophy:
All users should submit their jobs on HollyFS. HollyFS then runs the jobs (through SGE) by parceling them out to the "computational nodes", Holly01-10 (or Holly01-XX, since new machines should be coming in the future). No one should log in to the Holly computational nodes directly; jobs are submitted only on HollyFS. Individual queues will be designed with SGE queue control, but jobs will still be submitted to a central job queue, which parcels them out based on machine, group membership, etc.
Installation location:
HollyFS:    /m1/sge  (Sun Grid Engine files)
            /m1/sgeq (specialized scripts for running serial/parallel queue jobs)
Holly01-10: /usr/local/sge  => /u/hollyfs/m1/sge
            /usr/local/sgeq => /u/hollyfs/m1/sgeq
Configuring the shell (before login):
The user needs only one line added to the shell dotfile (e.g., .cshrc):
set path = (/usr/local/sgeq/local/bin $path)
(i.e., prepend the path to the SGE queueing scripts to your regular path).
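For users whose login shell is bash (dot-file .bashrc, mentioned below), the equivalent line would be a sketch like the following:

```shell
# Equivalent PATH setting for a bash dot-file (e.g., .bashrc);
# prepends the SGE queueing scripts, as in the csh form above.
export PATH=/usr/local/sgeq/local/bin:$PATH
```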
Submitting Jobs:
After configuring the shell file (.cshrc or .bashrc, as above), the user should log in to HollyFS under their own account. They can do this using "rsh" from their home machine (several of the home directories are on tupelo). The default shell on HollyFS is bash, but a user's dot-files may set up a csh environment instead. Again, all jobs should be run from HollyFS (and not from the computational nodes). To submit a job, the user simply types:
job.q
This launches a menu for building, submitting, etc., a command script that will be sent to the Sun Grid Engine for distribution to the computational nodes. The user generally "builds" the script by typing "b", entering the name of the code (C, Fortran, whatever), and is then taken through a series of questions regarding priority, memory usage, time limit, number of nodes, etc. (generally just accepting the defaults). In the future we will probably eliminate the display of these selections; at the current time, however, there are actually no limits on number of nodes, memory usage, etc.
The queue script indicates that the job has been submitted (and provides you with the job's "job-id"), then keeps you in the menu until you "exit". After that, you are on your own to monitor the status of the queue.
Monitoring Status of Jobs Submitted:
To check jobs “currently in the queue”, type
qstat
To check the “qualities” of your specific job (if still in the queue, or running), type
qstat -j <jobid>
To check the “qualities” of your specific job (which has completed), type
qacct -j <jobid>
Deleting/removing jobs:
qdel <jobid>
Files (output, etc.) generated by running 'job.q':
Let’s assume the name of the code is “code.sh”, for example:
1) command file: the command file generated by job.q will be code.sh.cmd; one can run "qsub code.sh.cmd" to make further runs of the job
2) log file: the log file is named code.sh.joblog
3) output file: the output file is named code.sh.output.jobid
The "job log" file is overwritten on each run of the job; the output.jobid file is unique to each run's output.
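For reference, the command file is an ordinary shell script in which lines beginning with "#$" are option directives to SGE's qsub. A minimal hand-written sketch (the directive values here are hypothetical illustrations, not necessarily what job.q emits):

```shell
#!/bin/sh
# Minimal SGE command-file sketch. Lines beginning with "#$" are
# comments to the shell but option directives to qsub; the values
# below are hypothetical, not necessarily what job.q generates.
#$ -N codejob      # job name
#$ -cwd            # execute in the submission directory
#$ -j y            # merge stderr into the output file
echo "job started on $(uname -n)"
# ... run the actual code here, e.g.: ./code.sh
```

Such a file would be submitted with "qsub code.sh.cmd", exactly as in item 1) above.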
Job diagnostics and troubleshooting:
The following are either command names or file names for troubleshooting your job and all jobs run on Holly through SGE:
1) all queues: qmon (see below)
2) specific job: code.sh.joblog (i.e., your job in your home dir.)
3) job queue logs: /usr/local/sgeq/local/apps/queue.log
4) machine message logs: /usr/local/sge/default/spool/machine/messages
In number 4), the machine is one of holly01-10, hollyfs, or the qmaster (which actually resides on hollyfs and dispatches the jobs).
The "qmon" command:
The qmon command is run from hollyfs; to run it properly, you need to set the DISPLAY environment variable for your local display. The command should be run from a UNIX-based X window session (such as Xwin32). The environment setting should look like the following:
(csh):  setenv DISPLAY machine.math.ucla.edu:0.0
(bash): export DISPLAY=machine.math.ucla.edu:0.0
After this is done, you can simply run the qmon command:
qmon
The qmon command is graphical: if everything works right, you should see the "GRID ENGINE" logo pop up on the right side of your screen and the button panel pop up on the left side. If you have a firewall on your local computer (such as ZoneAlarm), it may block this graphical display.
The button panel consists of many items, but the two most important ones (and the ones you should be interested in) are the first two, counting from the left:
1) Job Control: pending jobs, running jobs, and finished jobs
2) Queue Control: a display of all the queues, and their status (by color code)
You can see the state of each machine instantly, and report any anomalies to bugs. To exit qmon, select File/Exit from the top menu bar of the button panel.