Using Condor

The Physics department operates a submit node for the CHTC Condor cluster. The node is named submit.physics.wisc.edu.
To use this submit node, log in via ssh to submit.physics.wisc.edu using your physics cluster username and password.
The basic idea is to create a working directory containing all necessary programs and data files, then describe the job in a submit file that tells Condor about the working directory and the program to run.

Working directory

The working directory must be accessible to Condor’s submit program. Where should it be kept?
The recommended approach is to store the working directory in your home directory. (This is the directory you are in when you first log in to the submit node.)
Your entire home directory (and the working directory within it) can be accessed from your local computer by installing an AFS client. You can then move, copy, and edit files and directories as usual on the local computer before logging in to the Condor submit node. On Mac OS X and Linux you can also open a terminal, change to your working directory, and run commands there to test that they work.
Your home directory has multiple aliases: ~ , /home/<username>, and /afs/physics.wisc.edu/home/<username>.
Before jobs can be submitted to Condor from your home directory, the permissions need to be changed to allow Condor access. Suppose you have a directory named workdir in your home directory (~). The first command below lets Condor look up entries in your home directory; the second grants it read, lookup, insert, delete, lock, and write rights on workdir and each of its subdirectories.
$ fs setacl -dir ~ -acl condor-submit l
$ find ~/workdir -noleaf -type d -exec fs setacl -dir '{}' -acl condor-submit rlidkw \;
If you are curious about AFS permissions read this page.
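Before changing any permissions, it can help to preview which directories the find command above would touch. The following is a minimal sketch in Python 3, not part of the official procedure: the function name acl_commands is hypothetical, and the code only builds the command strings for inspection. It never runs fs itself.

```python
import os
import shlex


def acl_commands(workdir, principal="condor-submit", rights="rlidkw"):
    """Return the `fs setacl` command lines for workdir and every subdirectory.

    Nothing is executed; the strings are meant for inspection only.
    """
    commands = []
    # os.walk visits workdir first, then each subdirectory below it.
    for dirpath, _dirnames, _filenames in os.walk(workdir):
        commands.append(
            "fs setacl -dir {} -acl {} {}".format(
                shlex.quote(dirpath), principal, rights
            )
        )
    return commands


if __name__ == "__main__":
    for line in acl_commands(os.path.expanduser("~/workdir")):
        print(line)
```

Each printed line corresponds to one directory that the find command would hand to fs setacl.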

Submitting a job (C program)

(Shamelessly adapted from a presentation by Alain Roy. Thank you!)

First you need a job

Before you can submit a job to Condor, you need a job. We will quickly write a small program in C. If you aren’t an expert C programmer, fear not. We will hold your hand throughout this process.
First, create a file called simple.c in your working directory, either with your favorite editor or by pasting the text into cat as shown here. Copy and paste is a good choice:
$ cd ~/workdir
$ cat > simple.c
#include <stdio.h>
#include <stdlib.h>  /* atoi */
#include <unistd.h>  /* sleep */

int main(int argc, char **argv)
{
    int sleeptime;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleeptime = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleeptime);
        sleep(sleeptime);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}
type control-d here
Now compile that program:
$ gcc -o simple simple.c
$ ls -lh simple
-rwxr-x--- 1 temp-01 temp-01 4.9K Mar 15 16:24 simple* 
Finally, run the program and tell it to sleep for four seconds and calculate 10 * 2:
$ ./simple 4 10
Thinking really hard for 4 seconds... 
We calculated: 20
Great! You have a job you can tell Condor to run! Although it clearly isn’t an interesting job, it models some of the aspects of a real scientific program. It takes a while to run and it does a calculation.

Submitting your job

Now that you have a job, you just have to tell Condor to run it. Put the following text into a file called submit:
Universe = vanilla
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
Queue
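Let’s examine each of these lines:
Universe: the vanilla universe runs an ordinary program as is, without any Condor-specific linking.
Executable: the name of the program to run.
Arguments: the arguments passed to the program, exactly as you would type them on the command line.
Log: the file in which Condor records events in the life of the job, such as submission, start, and termination.
Output: the file that receives the program’s standard output.
Error: the file that receives the program’s standard error.
Queue: tells Condor to place one instance of the job in the queue.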
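When many runs differ only in their arguments, the submit file can be generated rather than typed by hand. The following is a minimal sketch in Python 3, assuming the same seven lines as the submit file in this section. The helper name render_submit and its parameters are hypothetical. Only keywords already used above appear in the output.

```python
def render_submit(executable, arguments, basename):
    """Build the text of a vanilla-universe submit description file."""
    lines = [
        "Universe = vanilla",
        "Executable = {}".format(executable),
        "Arguments = {}".format(" ".join(str(a) for a in arguments)),
        "Log = {}.log".format(basename),
        "Output = {}.out".format(basename),
        "Error = {}.error".format(basename),
        "Queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the submit file from this section on standard output.
    print(render_submit("simple", [4, 10], "simple"), end="")
```

Redirecting the output to a file named submit gives the same file as the hand-written one.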
Next, tell Condor to run your job:
$ condor_submit submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6075.
Now, watch your job run:
$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 2.0 temp-01 3/15 16:27 0+00:00:00 I 0 0.0 simple 4 10 
 
1 jobs; 1 idle, 0 running, 0 held
$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 2.0 temp-01 3/15 16:27 0+00:00:01 R 0 0.0 simple 4 10 
 
1 jobs; 0 idle, 1 running, 0 held 
$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 
0 jobs; 0 idle, 0 running, 0 held 
Notice a few things here. In a real pool, when you run condor_q, you might get a long list of everyone’s jobs. You can tell condor_q to list only your jobs with the -sub option, which is short for submitter, as in:
$ condor_q -sub roy
For this tutorial, there is probably only one person per computer, so it probably isn’t necessary. When my job was done, it was no longer listed. Because I told Condor to log information about my job, I can see what happened:
$ cat simple.log
000 (002.000.000) 03/15 16:27:22 Job submitted from host:  < 192.167.2.23:34353 >  
... 
001 (002.000.000) 03/15 16:27:25 Job executing on host:  < 192.167.2.23:33085 >  
... 
005 (002.000.000) 03/15 16:27:29 Job terminated. 
 (1) Normal termination (return value 0) 
 Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 
 0 - Run Bytes Sent By Job 
 0 - Run Bytes Received By Job 
 0 - Total Bytes Sent By Job 
 0 - Total Bytes Received By Job 
... 
That looks good. It took a few seconds for the job to start up, though you will often see slightly slower startups: Condor optimizes for high throughput, not for fast job startup. The job ran for about four seconds. If this had been a real Condor pool, the execution computer would have been different from the submit computer, but otherwise the log would have looked the same. But did our job execute correctly?
$ cat simple.out
Thinking really hard for 4 seconds... 
We calculated: 20 
Excellent! We ran our sophisticated scientific job on a Condor pool!
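The startup delay and run time read off simple.log above can also be computed from the event lines. The following is a minimal sketch in Python 3, assuming the log format is exactly as shown in this section (a three-digit event code, the job ID in parentheses, then MM/DD and HH:MM:SS). The names parse_events and seconds_between are hypothetical, and the year is an assumed placeholder because the log lines do not include one.

```python
import re
from datetime import datetime

# One event header per line: code, (cluster.proc.subproc), MM/DD, HH:MM:SS, text
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text, year=2000):
    """Return a list of (event_code, timestamp, message) tuples.

    The log lines carry no year, so `year` is an assumed placeholder;
    it only matters if a job runs across New Year.
    """
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            code, _cluster, _proc, _sub, date, clock, message = match.groups()
            stamp = datetime.strptime(
                "{}/{} {}".format(year, date, clock), "%Y/%m/%d %H:%M:%S"
            )
            events.append((code, stamp, message))
    return events


def seconds_between(events, first_code, second_code):
    """Seconds from the first event with first_code to the first with second_code."""
    times = {}
    for code, stamp, _message in events:
        times.setdefault(code, stamp)
    return (times[second_code] - times[first_code]).total_seconds()


SAMPLE = """\
000 (002.000.000) 03/15 16:27:22 Job submitted from host:  < 192.167.2.23:34353 >
...
001 (002.000.000) 03/15 16:27:25 Job executing on host:  < 192.167.2.23:33085 >
...
005 (002.000.000) 03/15 16:27:29 Job terminated.
"""

if __name__ == "__main__":
    events = parse_events(SAMPLE)
    print("startup delay:", seconds_between(events, "000", "001"))
    print("run time:", seconds_between(events, "001", "005"))
```

For the sample log this gives a startup delay of 3 seconds (event 000 to 001) and a run time of 4 seconds (event 001 to 005), matching the four-second sleep requested in the submit file.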
Condor project documentation is available here.

Using multiple files (Python program)

Computing tasks often require more than just the program file. Condor uses the submit description file directives transfer_input_files and transfer_output_files to transfer files and directories between the submit node and the compute node.
Below is an example that creates a directory which is transferred to and from the compute nodes, along with a Python program that tests the use of NumPy and SciPy.
To use this example, log in via ssh to submit.physics.wisc.edu using your physics cluster username and password. Create a working directory workdir and follow the instructions in the Working directory section above to set the correct access permissions. Change the directory so that the current directory is workdir .
Next create a file named submitscr.sh . This file is a shell script which creates a uniquely named directory whose contents will be transferred to and from the compute node. The directory should be uniquely named so that files within it do not get overwritten when many jobs are run at the same time. The script also generates a submit description file for the directory and submits the job to Condor. Create submitscr.sh with the following code:
#!/bin/sh
#
# Script to create rundirs and corresponding submit files
#   To make the rundirs unique, use the time this script was run
# and append a random string. (Also check for the directory's existence)


# Create the name of the rundir
while :
do
	DATE="$(date +%Y%m%d-%H%M%S)"
	RND="$(tr -dc A-Za-z0-9 < /dev/urandom | head -c3)"
	RUNDIR="rundir-${DATE}-${RND}"
	if [ ! -d "${RUNDIR}" ]; then
		echo "using ${RUNDIR}"
		break
	fi
done

# actually create the directory
mkdir "${RUNDIR}"

# Here one could copy files into rundir

# create the submit description file
SUBMIT="${RUNDIR}/submit-${DATE}-${RND}"

cat > "${SUBMIT}" << EOF
Universe   = vanilla
Requirements = (PoolName == "CHTC") && HasAFS
#Requirements = HasAFS
Executable = ./init_python.sh
Log        = ${RUNDIR}/condor.log
Output     = ${RUNDIR}/stdout
Error      = ${RUNDIR}/stderr
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ./test.py,./init_python.sh,${RUNDIR}
transfer_output_files = ${RUNDIR}
Queue
EOF

# submit the job
condor_submit "${SUBMIT}"
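For readers who would rather drive submissions from Python, the naming scheme used by submitscr.sh can be sketched as follows. This is an illustration in Python 3 under the same assumptions as the shell script (timestamp plus three random alphanumeric characters). The function name make_rundir is hypothetical.

```python
import os
import secrets
import string
import time


def make_rundir(parent="."):
    """Create and return a uniquely named run directory under `parent`.

    Mirrors the naming in submitscr.sh: rundir-<date>-<time>-<3 random chars>.
    """
    alphabet = string.ascii_letters + string.digits
    while True:
        stamp = time.strftime("%Y%m%d-%H%M%S")
        suffix = "".join(secrets.choice(alphabet) for _ in range(3))
        path = os.path.join(parent, "rundir-{}-{}".format(stamp, suffix))
        try:
            # os.mkdir fails if the directory already exists, so the
            # existence check and the creation happen in a single step.
            os.mkdir(path)
            return path
        except FileExistsError:
            continue  # name collision: draw a new suffix and retry


if __name__ == "__main__":
    print("using", make_rundir())
```

One design difference from the shell version: the script tests for the directory and then creates it in two separate steps, whereas os.mkdir does both at once, which closes the small window in which two simultaneous runs could pick the same name.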
Next, create a shell script named init_python.sh which gathers some information about the host it is being run on and executes the Python test program (below).
#!/bin/sh
#
# script for starting python programs

# cd into run dir
# only one rundir transferred, so using wildcard works
cd rundir*

OUTFILE="hostinfo"
echo "hostname is >$(hostname)<" >> "${OUTFILE}"
uname -a >> "${OUTFILE}"

ls -al >> "${OUTFILE}"
ls -al .. >> "${OUTFILE}"
ls -ald /afs/hep.wisc.edu/  >> "${OUTFILE}"
ls -ald /afs/physics.wisc.edu/  >> "${OUTFILE}"

# initializing python
# put the directory with our favorite python at the
# beginning of the PATH
export PATH=/afs/hep.wisc.edu/cms/sw/python/2.7/bin:$PATH

# run the python program
../test.py
Finally, create the Python test program, which calls some SciPy and NumPy functions and writes the results to a file. The file should be called test.py :
#!/usr/bin/env python
#
# python program to test NumPy and SciPy

import datetime
import numpy
import scipy
import scipy.interpolate

# use numpy
x = numpy.arange(10, dtype='float32') * 0.3
y = numpy.cos(x)

# use scipy
sp = scipy.interpolate.UnivariateSpline(x, y)

# save a result
out = sp(0.5)
print(out)

# also place a date in the output file to verify it was recently written
today = datetime.datetime.today()
print(today)

# write results to output file
with open('./pythontestout', 'w') as f1:
    f1.write(str(today) + "\n")
    f1.write(str(out) + "\n")
All these scripts should have the executable attribute:
$ chmod u+x *
Now run the script, which creates a uniquely named run directory, creates the submit description file, and submits the job to Condor.
$ ./submitscr.sh
using rundir-20130506-101252-U4C
Submitting job(s).
1 job(s) submitted to cluster 24419.
Check the current directory to verify the run directory was created:
$ ls -al
total 12
drwxrwxr-x  3 cwseys cwseys 2048 May  6 10:12 .
drwxr-xr-x 43 cwseys cwseys 4096 Apr 22 14:29 ..
-rwxrwxr-x  1 cwseys cwseys  566 May  6 10:05 init_python.sh
drwxrwxr-x  2 anon   cwseys 2048 May  6 10:12 rundir-20130506-101252-U4C
-rwxrwxr-x  1 anon   cwseys 1060 May  6 10:06 submitscr.sh
-rwxrwxr-x  1 cwseys cwseys  548 May  6 10:03 test.py
Check the status of the condor job:
$ condor_q
-- Submitter: submit01.physics.wisc.edu : <128.104.160.33:52591> : submit01.physics.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
24419.0   cwseys          5/6  10:12   0+00:00:00 I  0   0.0  init_python.sh
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Eventually the job will finish:
$ condor_q
-- Submitter: submit01.physics.wisc.edu : <128.104.160.33:52591> : submit01.physics.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
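The last line of each condor_q listing summarizes the queue, and a script can read it to decide whether anything is still pending. The following is a minimal sketch in Python 3, assuming the summary line has the "N label" form shown in the listings in this document. The names parse_summary and queue_is_empty are hypothetical, and the code parses text only; it does not call condor_q.

```python
import re


def parse_summary(line):
    """Turn a condor_q summary line into a dict of counts keyed by state."""
    counts = {}
    for number, label in re.findall(r"(\d+) (\w+)", line):
        counts[label] = int(number)
    return counts


def queue_is_empty(line):
    """True only when the summary line explicitly reports zero jobs."""
    return parse_summary(line).get("jobs") == 0


if __name__ == "__main__":
    waiting = "1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended"
    done = "0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended"
    print(parse_summary(waiting)["idle"], queue_is_empty(waiting))
    print(parse_summary(done)["idle"], queue_is_empty(done))
```

A line that does not contain a job count at all is treated as "not empty", so unexpected output is never mistaken for a finished job.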
Look at the output.
$ cat rundir-20130506-101252-U4C/pythontestout
2013-05-06 10:14:53.553482
0.881980721213
It appears to have worked!
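A quick sanity check on the number itself: test.py fits a spline to samples of cos(x) and evaluates it at 0.5, so the result should be close to cos(0.5). The sketch below, in Python 3 using only the standard library, compares the value from pythontestout with the exact cosine. The small gap is plausibly due to the spline's default smoothing and the float32 sample points; that explanation is an assumption, not something the output file itself shows.

```python
import math

reported = 0.881980721213   # second line of pythontestout above
exact = math.cos(0.5)       # the function that test.py sampled

difference = abs(reported - exact)
print("exact cos(0.5): {:.6f}".format(exact))
print("difference:     {:.6f}".format(difference))
```

The difference is about 0.0044, roughly half a percent of the exact value, which is consistent with a working NumPy and SciPy installation on the compute node.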
Questions? Email help@physics.wisc.edu