When doing massive computations, a user has two problems: which machine to run the giant job on, and how to coordinate with other competing users. The Portable Batch System (PBS) is a POSIX-compliant suite of commands intended to manage large jobs running on multiple compute servers.
At present, PBS is active on these machines:
| | Sixpac | Holly |
|---|---|---|
| Queue name | sixpac@sixpac | holly@holly |
| Number of hosts | 6 | 10 (48 planned) |
| CPU | 933 MHz Pentium III | 1.0 GHz Pentium III |
| Memory (each) | 512 MB | 2 GB |
| Disc (each) | 45 GB | 37 GB |
To run a job on one of these machines, first prepare a shell script which invokes whatever commands you want, including your self-written program. Click here for a simple PBS job script which you can copy. In the script, lines beginning with ``#PBS'' add to the command line switches of qsub (see below), which you use to submit your job; switches given on the actual command line take precedence over whatever the script says. Most of the useful switches are demonstrated here; you don't have to specify all of them.
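As a minimal sketch of such a script (the queue name, limits, and program name are illustrative, not prescriptions):

```sh
#!/bin/sh
#PBS -q sixpac@sixpac      # which queue to submit to (see the table above)
#PBS -l cput=1:00:00       # CPU time limit, hh:mm:ss
#PBS -N myjob              # name for the job (shows up in qstat)
#PBS -j oe                 # merge stderr into stdout (optional)

cd $PBS_O_WORKDIR          # PBS starts the script in $HOME; go back to where qsub ran
./myprogram < input.dat > output.dat
```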
And here is another script illustrating how to set up a local directory, so your job can write scratch files on the local machine and then remove them. That is much faster than sending them over Ethernet to some other machine, such as your home directory server, and it keeps you from filling up your disc quota. (A sketch of the technique appears in the scratch-space discussion below.)
Then just do qsub filename and the job's identifier will be reported back to you. When the job finishes, its stdout and stderr are written to files, named after the job ID, in the directory that was current when qsub was executed (or to files specified with the -o and -e switches).
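A typical interaction might look like this (the job ID and script name are made up; the exact ID format depends on the server):

```sh
$ qsub myjob.sh
1234.sixpac
$ ls                        # after the job finishes
myjob.sh  myjob.o1234  myjob.e1234
```

By default the output files are named from the job name plus the numeric part of the job ID; the -o and -e switches override this.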
(About the CPU limit: the job monitor checks periodically whether the job has exceeded its limit, and kills it if so. Hence the actual CPU time received by the job can be slightly more than the specified limit.)
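In standard PBS the CPU limit is requested via the cput resource, with walltime available for an elapsed-time limit; a sketch with illustrative values:

```sh
#PBS -l cput=2:00:00        # total CPU time for the job, hh:mm:ss
#PBS -l walltime=4:00:00    # optional elapsed (wall-clock) limit
```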
You may optionally specify a memory limit, e.g. ``#PBS -l mem=500mb''. However, each cluster has a default equal to the whole physical memory of the (smallest) member host, and there is no benefit to requesting less memory unless your code has a bug that you are trying to diagnose with a memory limit. If you request more than the smallest node has, your job will run only on a machine with that much memory. If you request more than the largest node has, the job will be ``rejected by all destinations''. It is a bad idea to force a machine to run a job larger than its physical memory, because swap space isn't much larger, and the I/O to access the swap space is very slow.
You may write files in your home directory or its subdirectories, but large files may run you out of disc quota, and massive data flows over NFS are slow. If you read and write a lot of data, it is better to do so on the local machine. A large scratch space, most of the disc as indicated in the table above, has been allocated on each machine for jobs' temporary files. Files are deleted automatically 5 days after they were last written, but please delete files as soon as you are done with them, either in the job script or after visualization and summarization are finished.
The temporary space is called /scr (on Ficus it is /hugetmp). To avoid entanglement with other users, it's best to create your own directory in there, e.g. /scr/jimc. This sample script illustrates how to do it.
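Here is a sketch of the technique, assuming the job's environment defines $USER; the program and file names are illustrative, and $PBS_JOBID keeps simultaneous jobs from colliding:

```sh
#!/bin/sh
#PBS -q sixpac@sixpac
#PBS -l cput=1:00:00

SCR=/scr/$USER/$PBS_JOBID   # private scratch directory on the local disc
mkdir -p $SCR
cd $SCR

cp $PBS_O_WORKDIR/input.dat .             # stage input onto the local disc
$PBS_O_WORKDIR/myprogram < input.dat > output.dat
cp output.dat $PBS_O_WORKDIR/             # send results back home

cd /                        # step out of the directory before removing it
rm -rf $SCR                 # delete the scratch files promptly
```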
Here is a short description of the various administrative commands for PBS available to users, generally with the more useful ones first. For complete details, including the common switches and command line arguments, see the man pages, e.g. do man qsub.
When the description mentions jobs, you need to give the job IDs on the command line; generally multiple jobs can be affected at once.
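As an illustration, a few of the standard PBS user commands (the job IDs are made up):

```sh
$ qstat -a                         # list all jobs in the queue
$ qstat -f 1234.sixpac             # full status of one job
$ qdel 1234.sixpac 1235.sixpac     # kill one or more jobs
$ qhold 1234.sixpac                # hold a queued job; qrls releases it
```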
These commands are present in /usr/local/pbs/bin but are generally not useful to ordinary users.
These commands are present in /usr/local/pbs/bin but are available only to managers.