Hyperion Cluster: Using SGE

The first step to taking advantage of the Hyperion cluster is understanding how to submit jobs to the cluster using SGE. It may be helpful to read the man pages for these commands:

  • qsub
  • qalter
  • qresub

All of the SGE man pages should be in your MANPATH by default, so you can read them with the 'man' command.
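
For example, to read the qsub documentation:

$ man qsub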

Job submission scripts are nothing more than shell scripts with some additional "comment" lines that specify options for SGE. For example, this simple bash script can serve as a job submission script:

#!/bin/bash

myname=$(hostname)
echo "Hello, World! from ${myname}"
sleep 30

The "sleep 30" line is there just to keep this short script to run a little longer for demonstration purposes. This script can how be submitted to sge using the qsub command:

$ qsub bash_hello.sh 
Your job 2302 ("bash_hello.sh") has been submitted

In the above output, "2302" is the job-ID, and the string in parentheses ("bash_hello.sh") is the job name. Once the job is submitted, you can check its status using the qstat command:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   2302 0.55500 bash_hello prentice     r     02/12/2009 11:23:09 all.q@node59.hyperion                1  

The "r" in the state column means the job is running. You may also see "qw", which means the job is queued and waiting to run. By default, SGE runs the job in your home directory and writes the standard output and standard error of jobs to files in your home directory. It also names the job with the name of the submission script by default. The output files are of the format <jobname>.o<jobid> and <jobname>.e<jobid>. The "o" file is for standard output, and the "e" file is for standard error:

$ ls -l ~/bash_hello*
-rw-r--r-- 1 prentice admin    0 Feb 12 11:34 /home/prentice/bash_hello.sh.e2302
-rw-r--r-- 1 prentice admin   33 Feb 12 11:35 /home/prentice/bash_hello.sh.o2302

$ more ~/bash_hello.sh.o2302 
Hello, World! from node64.hyperion

Notice that the standard error file, bash_hello.sh.e2302, has a size of 0. That's good - it means there were no errors running this job.

Normally, it's more convenient to have the program run in the directory you submit it from, so that it puts its output files in that same directory. This is also desirable for programs that read and write data files, since all the input and output files can then live in one directory. You can specify this behavior with the -cwd switch to qsub (see the qsub man page for a full list of qsub options):

$ pwd
/home/prentice/bash_hello

$ qsub -cwd bash_hello.sh 
Your job 2304 ("bash_hello.sh") has been submitted

$ ls -l 
total 4
-rw------- 1 prentice admin 77 Feb 12 11:31 bash_hello.sh
-rw-r--r-- 1 prentice admin  0 Feb 12 11:43 bash_hello.sh.e2304
-rw-r--r-- 1 prentice admin 33 Feb 12 11:44 bash_hello.sh.o2304

Instead of remembering to use the -cwd switch every time you submit a job with qsub, you can add the switch to your submission script by putting it on a line prefixed with "#$":

#!/bin/bash

#$ -cwd

myname=$(hostname)
echo "Hello, World! from ${myname}"
sleep 30

You can add any qsub switch to your submission script this way. If you've already used PBS, you'll find this job-script syntax very similar: the only real difference is that the special comment string for specifying qsub options within a submit script is "#$" instead of "#PBS". You can also change this prefix string with the -C option to qsub, which is handy if you already have submission scripts for Torque/PBS, provided they do not use switches unique to Torque/PBS.
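
For example, to submit an existing Torque/PBS script unchanged, telling SGE to parse its "#PBS" lines instead (the script name here is just a placeholder):

$ qsub -C "#PBS" my_pbs_job.sh

As a more complicated example, here is a submission script that runs an MPI job: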

#!/bin/bash
#$ -N xhpl
#$ -pe orte 512
#$ -cwd
#$ -V
#$ -l h_rt=00:10:00

# Make the Open MPI commands and libraries visible to the job
MPI=/usr/local/openmpi/gcc/x86_64
export PATH=${MPI}/bin:${PATH}
export LD_LIBRARY_PATH=${MPI}/lib
mpirun ./xhpl

The -N and -V switches are identical to those in Torque/PBS. The -pe and -cwd options are unique to SGE. The -pe option specifies the parallel environment; in this example, the parallel environment is 'orte', the parallel environment for Open MPI (ORTE = Open Run-Time Environment). (For more information on Open MPI, see the next section.) In order to run parallel jobs, you MUST specify the parallel environment you wish to use.
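
The parallel environment can also be requested on the command line. For example, to run the mpihello.sh script (which appears again below) on 16 slots in the orte parallel environment (the required h_rt run-time request is explained in the next section):

$ qsub -pe orte 16 -l h_rt=00:10:00 mpihello.sh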

Since mpirun isn't in my PATH, I added it to my PATH in the job script. Since my program 'xhpl' is dynamically linked to the MPI libraries, I also need to make sure the path to them is specified in my LD_LIBRARY_PATH. I do both of these just before calling the mpirun command to start xhpl, and thanks to the -V option, the correct values for these environment variables are passed along to the job. If the -V option isn't used, this job will fail.

Notice that I do not specify the number of processors (cores) I want to use in the mpirun command itself. Open MPI has support for SGE built in, so it gets the number of processors requested directly from SGE. I'll discuss parallel programming and MPI further in the section on Open MPI.
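
If you do need the slot count inside the job script (for logging, for example), SGE provides it in the NSLOTS environment variable. A minimal sketch ('my_mpi_program' is a placeholder):

# NSLOTS is set by SGE to the number of slots granted by -pe
echo "Running on ${NSLOTS} slots"
mpirun ./my_mpi_program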

Specifying a wallclock time (run time) with qsub

You are required to specify a run time for your cluster jobs. This is often referred to as the "wallclock", "elapsed", or "real" time.

This is necessary to make better use of the cluster and to prevent job starvation (the situation where a job can never be scheduled to run on the cluster). If you do not specify a run time, your job will not run, and will remain in the "qw" state (queued and waiting to run) indefinitely.

By knowing the run times of the different jobs, the scheduler can plan a time in the future when enough slots will be available to run larger jobs. When h_rt is not specified, SGE assumes h_rt=INFINITY, so it cannot do this planning.

In SGE, this time is referred to as h_rt (hard run-time), and is requested as a resource, using the -l switch to qsub. For example, to submit a job with a runtime of only 10 minutes:

$ qsub -l h_rt=00:10:00 mpihello.sh

If you forget to specify a value for h_rt, you will get an error message like this:

$ qsub mpihello.sh 
Unable to run job: error: no suitable queues.
Exiting.

Since this is a hard limit, any job that exceeds its requested run time will be killed by SGE. To be safe, you should use checkpointing, so that if you underestimate your job's run time and it is killed prematurely, you can restart it with no work lost.

If your job doesn't use checkpointing, you should overestimate its run time by a safety factor (say, 10-20%). If you are unsure how long your job will run, you can run a test case with a smaller problem size and then extrapolate to the full-size run. This will, of course, require that you know how your job scales with problem size (linearly, exponentially, etc.). If you really don't know how long your job will run, multiplying your best estimate by 2 may be acceptable for the first few runs of a program.
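
For example (with made-up numbers; myjob.sh is a placeholder): if a test run at half the full problem size takes 30 minutes and your code scales roughly linearly with problem size, the full run should take about an hour; adding a 20% safety margin gives a request of roughly 72 minutes:

$ qsub -l h_rt=01:12:00 myjob.sh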

However, PLEASE do not abuse the cluster by specifying overly generous run times. Not only will this prevent the scheduler from scheduling jobs efficiently, you may also prevent your own jobs from running if SGE cannot find a time slot big enough to fit the h_rt you specified.

Requesting exclusive use of a node

Since the cluster nodes have 8 slots (one for each core), it's common for a node to be running processes from two or more separate jobs. There are times when this is undesirable:

  • Memory-intensive jobs, where each process needs the full 32 GB of RAM on a node
  • Threaded programs, where the number of threads created is not controlled by SGE

In these cases, you can request that your job have exclusive use of a node, so that no other jobs will be running on it at the same time. You do this by requesting the resource "exclusive", or "excl" for short, with the -l switch (either name works):

$ qsub -l exclusive=true foo.sh
$ qsub -l excl=true foo.sh
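
As with any other qsub switch, this can also be placed in the submission script itself:

#$ -l excl=true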

If you request this resource, no other jobs will be assigned to the same node once your job starts. If all the nodes are already busy, your job will stay queued until a node frees up so that your job can run on it exclusively.

PLEASE use this feature only when necessary. If you request it for all your jobs, you can severely decrease the effective capacity of the cluster.

SGE Tips

See only the jobs of a specific user

Normally, SGE's qstat command shows only your own jobs. However, on the Hyperion cluster it has been configured to show the jobs of all users by default. To see what jobs are running under a specific username, use the -u switch to qstat:

$ qstat -u prentice

How to see what jobs are running where

One of the first questions most users of this new cluster ask is whether pbstop is installed. (pbstop was a utility on the old apollo cluster that showed which jobs were running on which cluster nodes, along with each node's status.) I haven't yet checked whether pbstop can be ported to the Hyperion cluster. In the meantime, you can use the command 'qstat -f' to see what jobs are running where:

$ qstat -f

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node01.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node02.hyperion            BIP   0/8/8          1.01     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node03.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node04.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node05.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node06.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node07.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2381 0.50500 mpihello_1 prentice     r     02/13/2009 16:15:41     8        
---------------------------------------------------------------------------------
all.q@node08.hyperion            BIP   0/0/8          0.02     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node09.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node10.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node11.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node12.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node13.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node14.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node15.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node16.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node17.hyperion            BIP   0/8/8          1.02     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node18.hyperion            BIP   0/8/8          1.01     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node19.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node20.hyperion            BIP   0/8/8          1.02     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node21.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node22.hyperion            BIP   0/8/8          1.02     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node23.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node24.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node25.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node26.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node27.hyperion            BIP   0/8/8          1.01     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node28.hyperion            BIP   0/8/8          1.01     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node29.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node30.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node31.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node32.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node33.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node34.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node35.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node36.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node37.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2382 0.50500 mpihello_2 prentice     r     02/13/2009 16:15:56     8        
---------------------------------------------------------------------------------
all.q@node38.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2383 0.50500 mpihello_3 prentice     r     02/13/2009 16:15:56     8        
---------------------------------------------------------------------------------
all.q@node39.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2382 0.50500 mpihello_2 prentice     r     02/13/2009 16:15:56     8        
---------------------------------------------------------------------------------
all.q@node40.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node41.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node42.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node43.hyperion            BIP   0/0/8          0.08     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node44.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node45.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node46.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node47.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node48.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node49.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node50.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node51.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2381 0.50500 mpihello_1 prentice     r     02/13/2009 16:15:41     8        
---------------------------------------------------------------------------------
all.q@node52.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node53.hyperion            BIP   0/0/8          0.10     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node54.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node55.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node56.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node57.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node58.hyperion            BIP   0/8/8          1.04     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node59.hyperion            BIP   0/8/8          1.02     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node60.hyperion            BIP   0/8/8          1.01     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node61.hyperion            BIP   0/8/8          0.00     lx24-amd64    
   2383 0.50500 mpihello_3 prentice     r     02/13/2009 16:15:56     8        
---------------------------------------------------------------------------------
all.q@node62.hyperion            BIP   0/8/8          1.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node63.hyperion            BIP   0/0/8          0.00     lx24-amd64    
---------------------------------------------------------------------------------
all.q@node64.hyperion            BIP   0/0/8          0.00     lx24-amd64  

Admittedly, it's not as condensed and easy to read as pbstop, but it does show essentially the same information.