Using HTCondor (aka Condor)

The Physics department has HTCondor submit nodes named login01.physics.wisc.edu and login02.physics.wisc.edu.  These Linux computers are linked to large computer pools including CHTC, HEP, and OSG.  Since our HTCondor system uses Linux, it will help to be familiar with basic Linux commands.

To use a submit node, log in by ssh to login01.physics.wisc.edu (or login02) using your physics cluster username and password.

The basic idea is to create a submit directory with all necessary programs and data files. Then create a Condor job by making a submit file which contains information about the submit directory and the program to be run.

Submit directory

Your home directory is in a network filesystem called AFS. Writing into AFS from Condor is problematic, so putting the submit directory in your AFS home directory is not recommended. Instead, keep the submit directory in /scratch/<username>/<jobname>. The first step is therefore to create a submit directory in /scratch:

$ mkdir -p /scratch/$USER/simplecondor

Submitting a job

(Shamelessly adapted from a presentation by Alain Roy. Thank you!)

First you need a job

Before you can submit a job to Condor, you need a job. We will quickly write a small program in C. A job doesn't have to be written in C. It could be a shell script, a Python script, a MATLAB program, a Fortran program, or anything else that is executable in Linux.

First, create a file called simple.c using your favorite editor. In that file, put the following text. Copy and paste is a good choice:

$ mkdir -p /scratch/$USER/simplecondor
$ cd /scratch/$USER/simplecondor
$ cat > simple.c
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleeptime;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleeptime = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleeptime);
        sleep(sleeptime);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}

Type Control-D here to end the input.

Now compile that program:

$ gcc -o simple simple.c
$ ls -lh simple
-rwxr-x--- 1 temp-01 temp-01 4.9K Mar 15 16:24 simple* 

Finally, run the program and tell it to sleep for four seconds and calculate 10 * 2:

$ ./simple 4 10
Thinking really hard for 4 seconds... 
We calculated: 20

Great! You have a job you can tell Condor to run! Although it clearly isn’t an interesting job, it models some of the aspects of a real scientific program. It takes a while to run and it does a calculation.
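Because a job can be anything executable in Linux, the same toy job could also be written as a script. The following is a minimal Python sketch that mirrors simple.c. The file name simple.py and the function name run_simple are illustrative assumptions, not part of the original tutorial, and unlike atoi, Python's int() raises an error on non-numeric input.

```python
#!/usr/bin/env python
# Hypothetical simple.py: a Python counterpart of simple.c (illustrative only).
import sys
import time


def run_simple(argv):
    """Sleep, double an integer, and return 0 on success or 1 on bad usage."""
    if len(argv) != 3:
        print("Usage: simple <sleep-time> <integer>")
        return 1
    sleeptime = int(argv[1])
    value = int(argv[2])
    print("Thinking really hard for %d seconds..." % sleeptime)
    time.sleep(sleeptime)
    print("We calculated: %d" % (value * 2))
    return 0


# Demonstration call with a zero-second sleep. A real job script would
# instead end with: sys.exit(run_simple(sys.argv))
status = run_simple(["simple.py", "0", "10"])
```

With that final line swapped in and the file made executable, the script could be named as the Executable of a job in the same way as the compiled program.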

Submitting your job

Now that you have a job, you just have to tell Condor to run it. Put the following text into a file called submit:

Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
Queue

Let’s examine each of these lines:

  • Executable: The name of your program
  • Arguments: The command-line arguments passed to your program. They are the same arguments we typed above.
  • Log: This is the name of a file where Condor will record information about your job’s execution. While it’s not required, it is a really good idea to have a log.
  • Output: Where Condor should put the standard output from your job.
  • Error: Where Condor should put the standard error from your job. Our job isn’t likely to have any, but we’ll put it there to be safe.
  • Queue: Submit one job.  It is also possible to submit many similarly configured jobs, such as when doing a parameter sweep.
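The parameter sweep mentioned under Queue can be sketched by generating a submit file with one Queue statement per parameter value, since settings given before a Queue statement apply to the job it queues. This is a minimal sketch that assumes the simple executable from above. The helper name make_sweep_submit and the per-value output file names are illustrative assumptions.

```python
def make_sweep_submit(values, sleep_time=4):
    """Return submit-file text that queues one 'simple' job per value."""
    lines = [
        "Executable = simple",
        "Log = simple.log",
    ]
    for value in values:
        # Each job gets its own arguments and its own output and error
        # files, so results from different jobs do not overwrite each other.
        lines.append("Arguments = %d %d" % (sleep_time, value))
        lines.append("Output = simple.%d.out" % value)
        lines.append("Error = simple.%d.error" % value)
        lines.append("Queue")
    return "\n".join(lines) + "\n"


submit_text = make_sweep_submit([10, 20, 30])
print(submit_text)
```

Writing the printed text to a file and passing that file to condor_submit should queue three jobs that differ only in their second argument.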

Next, tell Condor to run your job:

$ condor_submit submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6075.

Now, watch your job run:

$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 2.0 temp-01 3/15 16:27 0+00:00:00 I 0 0.0 simple 4 10 
 
1 jobs; 1 idle, 0 running, 0 held

$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 2.0 temp-01 3/15 16:27 0+00:00:01 R 0 0.0 simple 4 10 
 
1 jobs; 0 idle, 1 running, 0 held

$ condor_q
-- Submitter: ws-03.gs.unina.it :  < 192.167.2.23:34353 >  : ws-03.gs.unina.it 
 ID OWNER SUBMITTED RUNTIME ST PRI SIZE CMD 
 
0 jobs; 0 idle, 0 running, 0 held 

Notice a few things here. In a real pool, running condor_q might return a long list of everyone’s jobs. You can tell condor_q to list only your jobs with the -sub option, which is short for submitter (substitute your own username for roy):

$ condor_q -sub roy

When my job was done, it was no longer listed. Because I told Condor to log information about my job, I can see what happened:

$ cat simple.log
000 (002.000.000) 03/15 16:27:22 Job submitted from host:  < 192.167.2.23:34353 >  
... 
001 (002.000.000) 03/15 16:27:25 Job executing on host:  < 192.167.2.23:33085 >  
... 
005 (002.000.000) 03/15 16:27:29 Job terminated. 
 (1) Normal termination (return value 0) 
 Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage 
 Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 
 0 - Run Bytes Sent By Job 
 0 - Run Bytes Received By Job 
 0 - Total Bytes Sent By Job 
 0 - Total Bytes Received By Job 
... 

That looks good: It took a few seconds for the job to start up, though you will often see slightly slower startups. Condor doesn’t optimize for fast job startup, but for high throughput.
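The timings in a log like the one above can also be read off with a short script. The following is a minimal Python sketch that assumes event header lines in exactly the format shown in the sample log (event code, job ID, date, time, description) and that all events fall on the same day. The function names parse_events and seconds_between are illustrative, and other HTCondor versions may format the log differently.

```python
import re
from datetime import datetime

# Event header lines as they appear in the sample log, for example:
# "001 (002.000.000) 03/15 16:27:25 Job executing on host: ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+) (\d\d:\d\d:\d\d) (.*)$"
)


def parse_events(log_text):
    """Return a list of (event_code, time_of_day, description) tuples."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            code, _cluster, _proc, _date, clock, description = match.groups()
            when = datetime.strptime(clock, "%H:%M:%S")
            events.append((code, when, description.strip()))
    return events


def seconds_between(events, first_code, second_code):
    """Seconds from the first event with first_code to the first with second_code."""
    times = {}
    for code, when, _description in events:
        times.setdefault(code, when)
    return int((times[second_code] - times[first_code]).total_seconds())


sample_log = """000 (002.000.000) 03/15 16:27:22 Job submitted from host:  < 192.167.2.23:34353 >
...
001 (002.000.000) 03/15 16:27:25 Job executing on host:  < 192.167.2.23:33085 >
...
005 (002.000.000) 03/15 16:27:29 Job terminated.
...
"""

events = parse_events(sample_log)
print("startup delay:", seconds_between(events, "000", "001"), "seconds")
print("run time:", seconds_between(events, "001", "005"), "seconds")
```

For the sample log this reports a startup delay of 3 seconds and a run time of 4 seconds, matching the timestamps of the submitted (000), executing (001), and terminated (005) events.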

The job ran for about four seconds. If this had been a real Condor pool, the execution computer would have been different from the submit computer, but otherwise it would have looked the same. But did our job execute correctly?

$ cat simple.out
Thinking really hard for 4 seconds... 
We calculated: 20 

Excellent! We ran our sophisticated scientific job on a Condor pool! Further documentation is available on the HTCondor project website.

Using multiple files (Python program)

Computing tasks often require more than just the program file. Condor uses the submit description file directives transfer_input_files and transfer_output_files to transfer files and directories between the submit node and the compute node.

Below is an example which creates a directory that is transferred to and from the compute nodes, and a Python program which tests the use of NumPy and SciPy.

To use this example, log in by ssh to login01.physics.wisc.edu using your physics cluster username and password. Create a working directory /scratch/$USER/condorxfer.

Next create a file named submitscr.sh. This file is a shell script which creates a uniquely named directory whose contents will be transferred to and from the compute node. The directory should be uniquely named so that files within it do not get overwritten when many jobs are run at the same time. The script also generates a submit description file for the directory and submits the job to Condor. Create submitscr.sh with the following code:

#!/bin/sh

# Script to create rundirs and corresponding submit files
#  To make the rundirs unique, use the time this script was run
# and append a random string. (Also check for the directory's existence)
# Create the name of the rundir
while :
do
	DATE="`date +%Y%m%d-%H%M%S`"
	RND="`tr -dc A-Za-z0-9 < /dev/urandom | head -c3`"
	RUNDIR="rundir-${DATE}-${RND}"
	if [ ! -d "${RUNDIR}" ]; then
		echo "using ${RUNDIR}"
		break
	fi
done

# actually create the directory
mkdir "${RUNDIR}"

# Here one could copy files into rundir

# create the submit description file
SUBMIT="${RUNDIR}/submit-${DATE}-${RND}"
cat > "${SUBMIT}" << EOF
Executable = ./init_python.sh
Log        = ${RUNDIR}/condor.log
Output     = ${RUNDIR}/stdout
Error      = ${RUNDIR}/stderr
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = test.py,init_python.sh,${RUNDIR}
transfer_output_files = ${RUNDIR}
Queue
EOF

# submit the job
condor_submit "${SUBMIT}"
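The naming scheme used by the script (a timestamp plus three random characters, retried if the name is taken) can also be sketched in Python. This is an illustrative Python 3 sketch, not part of the original example, and the function name make_rundir is an assumption. It differs from the shell script in one respect: it creates the directory directly and retries on failure instead of testing for existence first.

```python
import os
import random
import string
import tempfile
from datetime import datetime


def make_rundir(parent="."):
    """Create and return a new directory named rundir-<date>-<time>-<3 chars>."""
    alphabet = string.ascii_letters + string.digits
    while True:
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        suffix = "".join(random.choice(alphabet) for _ in range(3))
        path = os.path.join(parent, "rundir-%s-%s" % (stamp, suffix))
        try:
            # os.mkdir fails if the name already exists, so the check and
            # the creation happen in a single step.
            os.mkdir(path)
            return path
        except FileExistsError:
            continue


# Demonstrate in a throwaway parent directory so nothing is left behind.
with tempfile.TemporaryDirectory() as parent:
    rundir = make_rundir(parent)
    name = os.path.basename(rundir)
    created = os.path.isdir(rundir)
    print("using", name)
```

Creating the directory in the same step as the uniqueness check avoids a small window in which two scripts started at the same moment could both pick the same unused name.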

Next, create a shell script named init_python.sh which gathers some information about the host it is being run on and executes the Python test program (below).

#!/bin/sh
#
# script for starting python programs

# cd into run dir
# only one rundir transferred, so using wildcard works
cd rundir*

# gather information about the computer this script runs on
OUTFILE="hostinfo"
echo "hostname is >`hostname`<" >> "${OUTFILE}"
uname -a >> "${OUTFILE}"
ls -al >> "${OUTFILE}"
ls -al .. >> "${OUTFILE}"
ls -ald /afs/hep.wisc.edu/  >> "${OUTFILE}"
ls -ald /afs/physics.wisc.edu/  >> "${OUTFILE}"

# initializing python
# put the directory with our favorite python at the
# beginning of the PATH
export PATH=/afs/hep.wisc.edu/cms/sw/python/2.7/bin:$PATH

# run the python program
../test.py

Finally, create the Python test program, which calls some SciPy and NumPy functions and writes the results to a file. The file should be called test.py:

#!/usr/bin/env python
#
# python program to test NumPy and SciPy

import datetime
import numpy
import scipy
import scipy.interpolate

# use numpy
x = numpy.arange(10,dtype='float32') * 0.3
y = numpy.cos(x)

# use scipy
sp = scipy.interpolate.UnivariateSpline(x,y)

# save a result
out = sp(0.5)
print(out)

# also place a date in the output file to verify it was recently written
today = datetime.datetime.today()
print(today)

# write results to output file
with open('./pythontestout', 'w') as f1:
    f1.write(str(today) + "\n")
    f1.write(str(out) + "\n")

All these scripts should have the executable attribute:

$ chmod u+x *

Now run the script, which creates a uniquely named run directory, creates the submit description file, and submits the job to Condor.

$ ./submitscr.sh
using rundir-20130506-101252-U4C
Submitting job(s).
1 job(s) submitted to cluster 24419.

Check the current directory to verify the run directory was created:

$ ls -al
total 12
drwxrwxr-x  3 cwseys cwseys 2048 May  6 10:12 .
drwxr-xr-x 43 cwseys cwseys 4096 Apr 22 14:29 ..
-rwxrwxr-x  1 cwseys cwseys  566 May  6 10:05 init_python.sh
drwxrwxr-x  2 anon   cwseys 2048 May  6 10:12 rundir-20130506-101252-U4C
-rwxrwxr-x  1 anon   cwseys 1060 May  6 10:06 submitscr.sh
-rwxrwxr-x  1 cwseys cwseys  548 May  6 10:03 test.py

Check the status of the condor job:

$ condor_q
-- Submitter: login01.physics.wisc.edu : <128.104.160.33:52591> : login01.physics.wisc.edu
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
24419.0   cwseys          5/6  10:12   0+00:00:00 I  0   0.0  init_python.sh
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Eventually the job will finish:

$ condor_q
-- Submitter: submit01.physics.wisc.edu : <128.104.160.33:52591> : submit01.physics.wisc.edu
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Look at the output:

$ cat rundir-20130506-101252-U4C/pythontestout
2013-05-06 10:14:53.553482
0.881980721213

It worked!
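A result file like the one above can be sanity-checked with a few lines of Python. The sketch below assumes the two-line layout written by test.py (a timestamp, then a number) and a timestamp that includes fractional seconds. The function name read_result is illustrative. Because the spline in test.py is fitted to cos(x), the stored value should lie close to cos(0.5).

```python
import math
from datetime import datetime


def read_result(text):
    """Parse the two lines written by test.py: a timestamp and a float."""
    stamp_line, value_line = text.strip().splitlines()
    # Assumes the timestamp carries microseconds, as in the sample output.
    stamp = datetime.strptime(stamp_line, "%Y-%m-%d %H:%M:%S.%f")
    return stamp, float(value_line)


sample = "2013-05-06 10:14:53.553482\n0.881980721213\n"
stamp, value = read_result(sample)

# Compare the interpolated value against the function the spline was fitted to.
difference = abs(value - math.cos(0.5))
print(stamp.isoformat(), value, difference)
```

For the sample output the difference from cos(0.5) is well under 0.01, which is consistent with a smoothing spline evaluated between the sampled points.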

 

Questions? Email help@physics.wisc.edu

©2013 Board of Regents of the University of Wisconsin System