Batch computing with HTCondor @ UW Physics

This page contains details about HTCondor at the University of Wisconsin - Madison Physics Department. By using HTCondor, you can utilize large numbers of computers for performing scientific calculations. For an overview of the scientific computing resources at UW, see here.

Quick Tutorial

The tutorial for using HTCondor at UW Physics provides specific examples of how to use UW HTCondor resources from the Physics Department.

CMS Users

Members of the Compact Muon Solenoid (CMS) experiment should refer to CMS User Documentation for information specific to CMS.

Getting an Account

You will need an account for logging into login.physics.wisc.edu or login.hep.wisc.edu. To request an account, contact help AT physics.wisc.edu.

How long can jobs run?

It is best to break up the work you have to do into chunks that take less than 24 hours and more than a few minutes. This reduces time lost to overhead and improves the chances of the job running to completion before being preempted by other higher priority users. See more about preemption below.
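For example, instead of one long job that processes 100 input files, you can submit 100 short jobs that process one file each. A hypothetical submit-file sketch (my_program and the input_N.dat naming scheme are placeholders for your own program and files):

```
executable = my_program
# $(Process) runs from 0 to 99, giving each job its own input file
arguments = input_$(Process).dat
transfer_input_files = input_$(Process).dat
output = stdout.$(Process)
error = stderr.$(Process)
log = condor_log
queue 100
```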

Linux Versions

Some Condor worker nodes may run different versions of Linux. If your job is compiled on Scientific Linux 7, for example, it may fail to run on a Scientific Linux 6 machine. Compiling your program on an older version of Linux is one solution. Statically compiling your program is another way to make it more portable, though in practice the program is still not portable if it makes libc calls that perform DNS lookups or read Unix account information, since these are handled via dynamic library loading.

If the program simply can't run on older versions of Linux, you should specify what version is required. One way to do this is by checking the glibc version.

Example for specifying that Scientific Linux 5 or newer is required:

requirements = TARGET.OSglibc_major == 2 && TARGET.OSglibc_minor >= 5

Another way is to specify a specific operating system version.  (See below for how to list the available versions.)  Example:

requirements = TARGET.OpSysAndVer == "SL6"

Data Management

Case 1: Use AFS for software and /scratch for data files

In some cases it may be convenient to use AFS for your job's data files. However, there are a number of disadvantages to using AFS in this way, including both performance and security, so we strongly recommend that you put your input/output data files in a directory in /scratch/your-user-name. Your program executables and libraries could also exist in /scratch, but it is usually convenient and reasonable to put these in AFS so that libraries can be easily accessed by the Condor job from wherever it runs.

To grant Condor access to your software, you must make the directories containing the software readable without an AFS token and you must make all parent directories listable without an AFS token. We recommend using the condor-hosts AFS group for this purpose. The following example command can be used to grant access to a sub-directory sw in your AFS home directory:

fs setacl -dir ~ -acl condor-hosts l
find ~/sw -type d -exec fs setacl -dir '{}' -acl condor-hosts rl \;

New subdirectories inherit the AFS ACLs of their parent directory, so the above command should only need to be run once unless you add another top-level directory that needs to be accessible.

Once you have compiled your program, and your input files are ready, you can create a submit file describing your job(s) and submit it to Condor. Here is a simple example:

universe = vanilla
executable = /afs/hep.wisc.edu/home/dan/sw/my_program

arguments = arg1 arg2

# Copy environment variables that are set at submit time, such as
# LD_LIBRARY_PATH.
getenv = true

output = stdout
error = stderr
log = condor_log

transfer_input_files = inputfile1,inputfile2

# specify resources needed by the job
request_memory = 1G
request_disk = 1G

# only run on computers that have access to AFS
requirements = TARGET.HasAFS

queue

Once the submit file is ready, you can submit the job to Condor using the following command. Run this command from the directory where you want the output files to go (i.e. in /scratch/your-user-name/...) or explicitly specify an initial working directory in the submit file.

condor_submit submit_file
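
If you prefer not to rely on the current directory, the initial working directory can instead be set in the submit file with the initialdir command. A minimal sketch (the path below is a placeholder for your own scratch directory):

```
initialdir = /scratch/your-user-name/myjob
```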

Important events in the life of your job will be logged in the log file specified in the submit file (condor_log in the example above). This includes the time and place where your job began executing and the time when it finished or was preempted by higher priority users on the machine where it was running. You can view the current status of the job in the job queue using condor_q jobid.

Case 2: Use AFS for software and data files

To allow your job to write to an AFS directory, you must give all processes on all Condor worker nodes and submit machines the ability to write to the directory. This is generally not a good thing to do. Don't do things this way unless you have to! Please inform us at condor-help AT hep.wisc.edu before you make heavy use of this option, because it can cause performance problems on the AFS server when many Condor jobs are writing to it at the same time.

The following command can be used to give all Condor machines write access to a directory:

find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts rlidkw \;

When you are done, you should remove the ability of Condor machines to write to the directory. To do that, use the following command:

find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts none \;

Case 3: Use HDFS for storing data files

The UW HEP group provides a large HDFS storage system that others in the Department can use for data handling that scales beyond local disk space on the submit machine. This is free for small usage (a few TBs). For larger needs, please contact help@hep.wisc.edu.

A convenience script, called runWiscJobs, for submitting jobs that store output in HDFS is described here.

What Condor pools exist at UW Madison?

From physics.wisc.edu and hep.wisc.edu, several Condor pools are accessible. You don't have to do anything special to access them. Once jobs are submitted to Condor, they "flock" to these pools.

  1. condor.hep.wisc.edu: The Physics Department's High Energy Physics (HEP) Condor pool contains the UW CMS Tier-2 Computing Center and a collection of desktops and a few other machines. Since these machines are owned by specific groups, those groups have immediate priority. Guest jobs will be kicked off whenever the owners have work to do. (See Preemption for more information on dealing with your job getting kicked off.)
  2. cm.chtc.wisc.edu: The CHTC Condor pool is for use by UW Madison researchers.
  3. wid-cm.discovery.wisc.edu: A condor pool located in the Wisconsin Institutes for Discovery.
  4. glidein2.chtc.wisc.edu: The dynamic OSG glidein pool for CHTC users. See Open Science Grid for more information.

To view the status of machines in the various condor pools, use condor_status:

condor_status -pool condor.hep.wisc.edu
condor_status -pool cm.chtc.wisc.edu
condor_status -pool wid-cm.discovery.wisc.edu
condor_status -pool glidein2.chtc.wisc.edu

To see more details about the machines in the pools, you can use the -long or -format options to condor_status. For example, the information for each machine includes an attribute named OpSysAndVer that identifies its operating system flavor; you can see it by running condor_status -long. The following command summarizes how many computers are running each operating system flavor in the CHTC condor pool:

condor_status -pool cm.chtc.wisc.edu -constraint 'SlotID==1 && DynamicSlot=!=True' \
  -af OpSysAndVer | sort | uniq -c

To additionally see which glibc versions are in use, the following command could be used:

condor_status -pool cm.chtc.wisc.edu -constraint 'SlotID==1 && DynamicSlot=!=True' \
              -af OpSysAndVer \
              -af OSglibc_major \
              -af OSglibc_minor \
              | sort | uniq -c

Why is my job not running?

There is a tool for analyzing what machines match your job's requirements. Example:

condor_q -better-analyze -pool cm.chtc.wisc.edu jobid

One possible reason for a job not to be running is that Condor hit some error such as a missing input file. In this case, the job will go "on hold", indicated with an 'H' in the status field in the job queue. To remove the job and resubmit it, use condor_rm jobid. If instead you can fix the problem without resubmitting the job, you can release it from hold with condor_release jobid.

Another reason for a job not to be running is that it ran once and Condor observed that it had a very large virtual image size before it was preempted. Future attempts to find a suitable machine may then fail if no slots have sufficient memory to match the observed image size. If you have this problem, try to reduce the amount of memory needed by the job. If that is not possible, contact us for additional options.

My jobs keep getting evicted. What can I do?

Your job may get kicked off of a computer before it finishes. This can happen for two main reasons: the owner of the machine has immediate need for it, or another user in the pool with a better fair-share priority has work for it to do. In the case of preemption by the machine owner, your job is kicked off immediately. In the case of fair-share prioritization, your job can typically run for up to 24 hours before being killed.

If you are willing to restrict yourself to machines that are not owned by anyone else (i.e. machines provided for all UW researchers), then you can avoid having your job evicted by the machine owner. If you have a relatively small number of jobs, this is a reasonable thing to do. The following may be inserted into your Condor submit file to achieve this. If you already have a requirements line, you must logically merge this one with your other requirements.

+estimated_run_hours = 24
requirements = TARGET.MAX_PREEMPT >= MY.estimated_run_hours*3600
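
If your submit file already contains a requirements line, combine the two expressions with &&. A hypothetical sketch, using the TARGET.HasAFS requirement from the earlier example to stand in for whatever requirements you already had:

```
+estimated_run_hours = 24
requirements = TARGET.HasAFS && (TARGET.MAX_PREEMPT >= MY.estimated_run_hours*3600)
```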

Alternatively, you can try to be opportunistic and get work out of the computers that are owned by other people. When your job is preempted by someone else, it returns to the idle state in the job queue and will try to run again. It is possible to make a job save state when it is kicked off so that it can resume from where it left off. Otherwise, it must restart from the beginning.

One way to make it save state is to use Condor's "standard universe". This requires relinking your program with Condor's standard library. Not all programs are compatible with this (e.g. multi-threaded or dynamically linked programs). For more information, see the Condor Manual.

Another option is to have your job intercept the kill signal (SIGTERM) that Condor sends when it wishes to kick the job off the machine. The job should then quickly write out whatever information it needs in order to resume from where it left off. If it does not shut down within the grace period (typically 10 minutes), it will be hard-killed with SIGKILL. To tell Condor to save the intermediate files that your program has generated in the working directory on the worker node, use the following option in the submit description file:

when_to_transfer_output = ON_EXIT_OR_EVICT
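
As an illustration, a job wrapper that checkpoints on SIGTERM might look like the following sketch. The file name checkpoint.dat and the loop body are placeholders for your program's real state and work:

```shell
#!/bin/sh
# Hypothetical sketch of a job that saves state when evicted.
STATE_FILE=checkpoint.dat

# Resume from a previous checkpoint if one was transferred back.
if [ -f "$STATE_FILE" ]; then
    i=$(cat "$STATE_FILE")
else
    i=0
fi

# When Condor sends SIGTERM, quickly save progress and exit cleanly;
# ON_EXIT_OR_EVICT then transfers checkpoint.dat off the worker node.
trap 'echo "$i" > "$STATE_FILE"; exit 0' TERM

while [ "$i" -lt 1000 ]; do
    i=$((i + 1))
    # ... one unit of real work for step $i would go here ...
done

# Finished normally; remove the stale checkpoint.
rm -f "$STATE_FILE"
```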

The Open Science Grid (OSG)

The UW is part of the Open Science Grid. This means that you can use computers at many other campuses when those computers are available for opportunistic use. For the right type of job, this can add up to a lot of additional computing power.

The mechanism that we use to access the OSG is called glideinWMS. When jobs are submitted that express a desire to run on the OSG, this causes the UW's glidein.chtc.wisc.edu Condor pool to be dynamically expanded as computers from the OSG are made available.

Requirements for a job to run on OSG:

  • The job must be submitted from login.hep.wisc.edu or submit.chtc.wisc.edu. If you would like additional submit machines to be supported, please let us know.
  • The job must define WantGlidein=true. To do this, insert the following line in the job's submit file:
    +WantGlidein = true

    Note that the '+' is a required part of the syntax.

  • The job must be a vanilla universe job (the default). Standard universe is not currently supported, due to firewalls that exist at many OSG sites.
  • The job must be entirely self-contained. For example, it must not depend on access to AFS, because the computers at other campuses often do not support AFS.
  • The job should make minimal assumptions about what shared libraries and other programs are available. A variety of Linux versions exist on the OSG. It is best to ship all libraries with the job (or statically compile). As of 2012-02-22, most machines in OSG are compatible with Scientific Linux 5.
  • The job should not need to run for long periods of time to get useful work done. A job that runs for 2 hours or less is ideal. A job that runs for more than a day must specify estimated_run_hours, which may limit the availability of computers that it has access to. If it does not specify estimated_run_hours, the job will likely get interrupted before it finishes, which will cause the job to return to the idle state and start over from the beginning in the next attempt. To set this parameter, put the following in your submit file, adjusting the number of hours to be appropriate for your job:

    +estimated_run_hours = 36
  • The job should ideally use less than 2GB of RAM. If it needs more, request_memory must be set to the required amount of memory. This may reduce the number of computers available to run the job, but it is important to avoid running on computers with not enough memory. Example:
    
    request_memory = 4G

    Note that a + should not precede the setting of request_memory, because this is a built-in command recognized by condor_submit.

  • By default, the job is expected to only keep one CPU busy (i.e. one active thread). If the job needs multiple CPUs, this can be specified in the submit file:
    request_cpus = 8

    Note that there may not be as many computers available to run multi-CPU jobs as single-CPU jobs.

Because of the flexibility of Condor flocking, a batch of identically submitted jobs may use some computers in UW Condor pools as well as OSG computers when and if they become available. Condor will try to find a machine for your job at UW before it attempts to run it on OSG. Therefore, if your job is suitable for running in OSG, there are few disadvantages to doing so. The two main costs to consider are:

  1. Your job may run on computers that are configured differently from UW computers; if this causes problems, you may have to spend a little more time debugging.
  2. Your job may get preempted (killed) at any time when there is higher priority work at the OSG site. When this happens, the job will remain in the queue and will be scheduled to run again when a computer becomes available.

Although normally it is desirable to let jobs run at UW in addition to OSG, for testing, you may wish to submit jobs that will only run in OSG. The following requirements expression can be used to do that. If you already have a requirements expression in your submit file, you will need to logically AND this expression with your existing one.

requirements = IS_GLIDEIN
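
For example, if your submit file already restricted the operating system as shown earlier on this page, the combined expression would look like this:

```
requirements = IS_GLIDEIN && (TARGET.OpSysAndVer == "SL6")
```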

Getting Help

©2013 Board of Regents of the University of Wisconsin System