The Physics department has HTCondor submit nodes named login01.physics.wisc.edu and login02.physics.wisc.edu. These Linux computers are linked to large computer pools including CHTC, HEP, and OSG. Since our HTCondor system uses Linux, it will help to be familiar with basic Linux commands.
To use a submit node, log in by ssh to login01.physics.wisc.edu (or login02) using your physics cluster username and password.
The basic idea is to create a submit directory with all necessary programs and data files. Then create a Condor job by making a submit file which contains information about the submit directory and the program to be run.
Submit directory
Your home directory is in a network filesystem called AFS. Writing into AFS from Condor is problematic, so putting the submit directory in your AFS home directory is not recommended. The recommended approach is to store the working directory in /scratch/<username>/<jobname>. Therefore, the first thing to do is to create a submit directory in /scratch.
$ mkdir -p /scratch/$USER/simplecondor
Submitting a job
(Shamelessly adapted from a presentation by Alain Roy. Thank you!)
First you need a job
Before you can submit a job to Condor, you need a job. We will quickly write a small program in C. A job doesn't have to be written in C. It could be a shell script, a Python script, a MATLAB program, a Fortran program, or anything else that is executable in Linux.
First, create a file called simple.c using your favorite editor. In that file, put the following text. Copy and paste is a good choice:
$ mkdir -p /scratch/$USER/simplecondor
$ cd /scratch/$USER/simplecondor
$ cat > simple.c
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
#include <unistd.h>   /* for sleep() */

int main(int argc, char **argv)
{
    int sleeptime;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleeptime = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleeptime);
        sleep(sleeptime);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}

(type Control-D here to end the input)
Now compile that program:
$ gcc -o simple simple.c
$ ls -lh simple
-rwxr-x--- 1 temp-01 temp-01 4.9K Mar 15 16:24 simple*
Finally, run the program and tell it to sleep for four seconds and calculate 10 * 2:
$ ./simple 4 10
Thinking really hard for 4 seconds...
We calculated: 20
Great! You have a job you can tell Condor to run! Although it clearly isn’t an interesting job, it models some of the aspects of a real scientific program. It takes a while to run and it does a calculation.
Submitting your job
Now that you have a job, you just have to tell Condor to run it. Put the following text into a file called submit:
Executable = simple
Arguments = 4 10
Log = simple.log
Output = simple.out
Error = simple.error
RequestDisk = 500M
RequestMemory = 1G
Queue
Let’s examine each of these lines:
- Executable: The name of your program
- Arguments: The command-line arguments to pass to your program. They are the same arguments we typed above.
- Log: This is the name of a file where Condor will record information about your job’s execution. While it’s not required, it is a really good idea to have a log.
- Output: Where Condor should put the standard output from your job.
- Error: Where Condor should put the standard error from your job. Our job isn’t likely to have any, but we’ll put it there to be safe.
- RequestDisk: How much disk space should be allocated to your job in its temporary working directory.
- RequestMemory: How much memory should be allocated to your job.
- Queue: Submit one job. It is also possible to submit many similarly configured jobs, such as when doing a parameter sweep.
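As a hedged sketch of that last point, HTCondor's built-in $(Process) macro expands to 0, 1, 2, ... for each queued job, so a single submit file can fan out a parameter sweep. The argument and file-name choices below are illustrative, not part of the example above:

```
# Illustrative sketch: submit 10 variants of the simple job.
# $(Process) is HTCondor's per-job counter (0 through 9 here),
# so each job gets its own input value and output files.
Executable = simple
Arguments = 4 $(Process)
Log = simple.log
Output = simple.$(Process).out
Error = simple.$(Process).error
RequestDisk = 500M
RequestMemory = 1G
Queue 10
```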
Next, tell Condor to run your job:
$ condor_submit submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6075.
Now, watch your job run:
$ condor_q

-- Submitter: ws-03.gs.unina.it : <192.167.2.23:34353> : ws-03.gs.unina.it
 ID      OWNER    SUBMITTED     RUNTIME    ST PRI SIZE CMD
   2.0   temp-01  3/15 16:27   0+00:00:00 I  0   0.0  simple 4 10

1 jobs; 1 idle, 0 running, 0 held

$ condor_q

-- Submitter: ws-03.gs.unina.it : <192.167.2.23:34353> : ws-03.gs.unina.it
 ID      OWNER    SUBMITTED     RUNTIME    ST PRI SIZE CMD
   2.0   temp-01  3/15 16:27   0+00:00:01 R  0   0.0  simple 4 10

1 jobs; 0 idle, 1 running, 0 held

$ condor_q

-- Submitter: ws-03.gs.unina.it : <192.167.2.23:34353> : ws-03.gs.unina.it
 ID      OWNER    SUBMITTED     RUNTIME    ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
When my job was done, it was no longer listed. Because I told Condor to log information about my job, I can see what happened:
$ cat simple.log
000 (002.000.000) 03/15 16:27:22 Job submitted from host: <192.167.2.23:34353>
...
001 (002.000.000) 03/15 16:27:25 Job executing on host: <192.167.2.23:33085>
...
005 (002.000.000) 03/15 16:27:29 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
That looks good: It took a few seconds for the job to start up, though you will often see slightly slower startups. Condor doesn’t optimize for fast job startup, but for high throughput.
The job ran for about four seconds. But did our job execute correctly? If this had been a real Condor pool, the execution computer would have been different from the submit computer, but otherwise everything would have looked the same.
$ cat simple.out
Thinking really hard for 4 seconds...
We calculated: 20
Excellent! We ran our sophisticated scientific job on a Condor pool! Condor project documentation is available here.
Using multiple files (Python program)
Often computing tasks require more than just the program file. Condor uses the submit description file directives transfer_input_files and transfer_output_files to transfer files and directories between the submit node and the compute node.
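For reference, each directive takes a comma-separated list, and naming a directory transfers the directory along with its contents. A minimal hedged fragment (data.txt and inputs are placeholder names, not files from this tutorial):

```
# Illustrative fragment only; data.txt and inputs are placeholders.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = data.txt,inputs
```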
Below is an example which creates a directory that is transferred to and from the compute node, and a Python program which tests the use of NumPy and SciPy.
To use this example, log in by ssh to login01.physics.wisc.edu using your physics cluster username and password, then create a working directory /scratch/$USER/condorxfer.
Next create a file named submitscr.sh. This file is a shell script which creates a uniquely named directory whose contents will be transferred to and from the compute node. The directory should be uniquely named so that files within it do not get overwritten when many jobs are run at the same time. The script also generates a submit description file for the directory and submits the job to Condor. Create submitscr.sh with the following code:
#!/bin/sh
# submitscr.sh
# Script to create rundirs and corresponding submit files

# To make the rundirs unique, use the time this script was run
# and append a random string. (Also check for the directory's existence.)

# Create the name of the rundir
while :
do
    DATE="`date +%Y%m%d-%H%M%S`"
    RND="`tr -dc A-Za-z0-9 < /dev/urandom | head -c3`"
    RUNDIR="rundir-${DATE}-${RND}"
    if [ ! -d "${RUNDIR}" ]; then
        echo "using ${RUNDIR}"
        break
    fi
done

# actually create the directory
mkdir "${RUNDIR}"

# Here one could copy files into rundir

# create the submit description file
SUBMIT="${RUNDIR}/submit-${DATE}-${RND}"
cat > "${SUBMIT}" << EOF
Executable = ./init_python.sh
Log = ${RUNDIR}/condor.log
Output = ${RUNDIR}/stdout
Error = ${RUNDIR}/stderr
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = test.py,init_python.sh,${RUNDIR}
transfer_output_files = ${RUNDIR}
request_disk = 500M
request_memory = 1G
Queue
EOF

# submit the job
condor_submit "${SUBMIT}"
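The unique-naming scheme used by the script can be tried on its own. This is just the timestamp-plus-random-suffix idea from submitscr.sh, isolated for illustration:

```shell
# Build a run-directory name from a timestamp plus three random
# alphanumeric characters, as submitscr.sh does.
DATE="$(date +%Y%m%d-%H%M%S)"
RND="$(tr -dc A-Za-z0-9 < /dev/urandom | head -c3)"
RUNDIR="rundir-${DATE}-${RND}"
echo "${RUNDIR}"
```

Two runs in the same second are still extremely unlikely to collide, and the script's existence check catches the remaining cases.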
Next, create a shell script named init_python.sh which gathers some information about the host it is being run on and executes the Python test program (below):
#!/bin/sh
# init_python.sh
# script for initializing and starting python programs
# also gather some debugging information first

# cd into run dir
# only one rundir transferred, so using wildcard works
cd rundir*

# gather information about the computer this script runs on
# for debugging purposes
OUTFILE="hostinfo"
echo "hostname is >`hostname`<" >> "${OUTFILE}"
uname -a >> "${OUTFILE}"
ls -al >> "${OUTFILE}"
ls -al .. >> "${OUTFILE}"
ls -ald /afs/hep.wisc.edu/ >> "${OUTFILE}"
ls -ald /afs/physics.wisc.edu/ >> "${OUTFILE}"

# initializing python
# put the directory with our favorite python at the
# beginning of the PATH
export PATH=/afs/hep.wisc.edu/cms/sw/python/2.7/bin:$PATH

# run the python program
../test.py
Finally, create the Python test program, which calls some SciPy and NumPy functions and writes the results to a file. The file should be called test.py:
#!/usr/bin/env python
# test.py
# python program to test NumPy and SciPy
import datetime
import numpy
import scipy
import scipy.interpolate

# use numpy
x = numpy.arange(10, dtype='float32') * 0.3
y = numpy.cos(x)

# use scipy
sp = scipy.interpolate.UnivariateSpline(x, y)

# save a result
out = sp(0.5)
print out

# also place a date in the output file to verify it was recently written
today = datetime.datetime.today()
print today

# write results to output file
f1 = open('./pythontestout', 'w+')
f1.write(str(today) + "\n")
f1.write(str(out) + "\n")
f1.close()
All these scripts should have the executable attribute:
$ chmod u+x *
Now run the script, which creates a uniquely named run directory, creates the submit description file, and submits the job to Condor.
$ ./submitscr.sh
using rundir-20130506-101252-U4C
Submitting job(s).
1 job(s) submitted to cluster 24419.
Check the current directory to verify the run directory was created:
$ ls -al
total 12
drwxrwxr-x  3 cwseys cwseys 2048 May  6 10:12 .
drwxr-xr-x 43 cwseys cwseys 4096 Apr 22 14:29 ..
-rwxrwxr-x  1 cwseys cwseys  566 May  6 10:05 init_python.sh
drwxrwxr-x  2 anon   cwseys 2048 May  6 10:12 rundir-20130506-101252-U4C
-rwxrwxr-x  1 anon   cwseys 1060 May  6 10:06 submitscr.sh
-rwxrwxr-x  1 cwseys cwseys  548 May  6 10:03 test.py
Check the status of the condor job:
$ condor_q

-- Submitter: login01.physics.wisc.edu : <128.104.160.33:52591> : login01.physics.wisc.edu
 ID       OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
24419.0   cwseys  5/6  10:12   0+00:00:00 I  0   0.0  init_python.sh

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Eventually the job will finish:
$ condor_q

-- Submitter: submit01.physics.wisc.edu : <128.104.160.33:52591> : submit01.physics.wisc.edu
 ID       OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
$ cat rundir-20130506-101252-U4C/pythontestout
2013-05-06 10:14:53.553482
0.881980721213
It worked!
Questions? Further details about HTCondor @ UW Madison Physics can be found here. You can also receive assistance by emailing help@physics.wisc.edu.