This page describes the basic configuration and the most important commands used on the Cobalt (Computers On Benches All Linked Together) cluster.
Due to privacy, security, and licensing concerns, some of the links in this document may be inaccessible for systems outside of the Chemistry Department of the University of Calgary. Parts of this page may be displayed incorrectly by browsers without table support, such as lynx.
|Node(s)||Total||Memory (Mbytes)||Scratch (Gbytes)|
|1, 4, 5, 6, 8||5||256||3.1|
Cobalt24 is special - it is the cluster's central computer, serving the operating system and necessary databases to the rest of the cluster. It is outfitted with 128Mbytes of memory and 31Gbytes of disk space organized in mirrored disk arrays for increased reliability and performance. It uses two 100Mbit Ethernet cards to pump the data to other nodes. You cannot run any jobs on Cobalt24, and should not use it for interactive work - it is quite busy as it is.
All Cobalt nodes are connected by four 3COM Fast Ethernet switches, which together provide a gigabit cluster backbone. They also have a very pretty set of flashing lights. The cluster is controlled by means of a VT105 serial terminal, which adds to the rarity value of the whole hardware setup. A CD recorder (Yamaha CRW4260t) is physically located in room 431a in Science B, and is available for backups. The final piece of hardware constituting the Cobalt cluster is an industrial-grade air cooling unit, which prevents the whole thing from going up in flames.
Cobalt is connected to the outside world by a single 100Mbit Ethernet line.
Guided tours of the Cobalt machine room are available on request. In the meantime, please visit the Cobalt gallery.
Cobalt runs Compaq Tru64 Unix (formerly Digital Unix, formerly OSF/1) version 4.0D. All Cobalt nodes have access to: development environment, including dbx and ladebug debuggers, atom (profiling and analysis tool), cvs, rcs and SCCS (source code control systems), etc; Digital C compiler (two versions, see man cc); Digital Fortran compiler (both f77 and f90); KAP optimizing preprocessors for Fortran-77 and Fortran-90; Digital's extended mathematical library (DXML), as well as many other packages available under the Campus Software Licence Grant program. You can find manuals (plain text) and release notes at /usr/local/doc. (Due to licensing restrictions, hosts outside Chemistry department won't be able to access this link).
A lot of free software is available in the /freeware directory (which contains a copy of the freeware CD supplied by Digital), and in /usr/local/ directory tree. Some of the more useful freeware packages installed there are teTeX and netscape. Three versions of netscape are installed on Cobalt: Netscape Navigator 3.04 Gold (installed as netscape), Netscape Communicator 4.07 international version (netscape4), and Netscape Communicator 4.07 US version (netscape4us). Due to the U.S. export restrictions, only U.S. and Canadian nationals or permanent residents are allowed to use netscape4us.
The PVM parallel programming library and manual pages are installed in /usr/people/pvm. The DQS distributed queuing system is installed in /usr/people/dqs. Gaussian-94 is installed in /usr/people/wagstaff/g94. The ADF and PAW density functional codes are installed in /usr/people/program. The MPICH implementation of the MPI message passing standard is installed in /usr/people/mpich.
Environment variables necessary to access most packages (most notably PATH, MANPATH, and LD_LIBRARY_PATH, but also others) are initialized in /etc/csh.login (for csh and tcsh users) or /etc/profile (for sh/ksh users).
Several versions of ADF are installed on the Cobalt cluster. Not all features are available in all versions of ADF, so that depending on your project you may have to use different versions. Here is a short summary of the ADF versions available on Cobalt.
|Version||Binary Location||Source Code Location|
|NMR for 2.3.3||~program/adf/bin/adf.nmr + ~program/adf/bin/nmr||~program/adf/src/nmr2.3/|
|2.3.3 + QM/MM||~program/adf/bin/adf.qmmm||N/A?|
|2.3.3 + COSMO||~program/adf/bin/adf.solv||N/A?|
|2.3.3 + QM/MM + NMR + Parallel||~program/adf/bin/adf.parallel||~patchkov/adf|
|98 + COSMO||~program/adf/bin/adf98||N/A?|
In order to run any of the 2.3.3, 98, 1999, or later binaries, you should set the ADFHOME and ADFBIN environment variables to ~program/adf and ~program/adf/bin, respectively. In C shell, the necessary commands are:
setenv ADFHOME ~program/adf
setenv ADFBIN ~program/adf/bin
Doing this is particularly important if you are going to run ADF in parallel.
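For sh and ksh users, an equivalent might look like the sketch below. It assumes that ~program resolves to /usr/people/program (the installation directory given earlier on this page) and spells the path out in full, because the classic Bourne shell does not expand ~user.

```shell
# Bourne/Korn shell equivalent of the csh commands above.
# Assumption: ~program resolves to /usr/people/program on Cobalt.
ADFHOME=/usr/people/program/adf
ADFBIN=$ADFHOME/bin
export ADFHOME ADFBIN
```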
ADF version 99 and higher requires several environment variables to be set in addition to ADFHOME and ADFBIN. The easiest way to achieve this is to execute a standard initialization script before starting ADF:
|In C shell:||source ~adf99/profile.csh|
|In Bourne or Korn shell:||. /usr/people/adf99/profile.sh|
The 2000 and higher versions have similar scripts in their respective
Please refer to the Queueing System and Parallel Jobs sections for the instructions for submitting your ADF jobs to the queueing system.
The key features supported by each of the pre-compiled ADF binaries are summarized in the following table. This list is by no means complete; please consult the reference manual and the program source code for the corresponding version for the complete list of supported features. The ADF documentation is available at the website of Scientific Computing and Modeling (SCM).
Once you have picked the version of ADF you are going to use, stick with it: binary files (including TAPE21 and TAPE12 files) are generally not compatible between ADF releases.
PAW has its own home page on Cobalt, please go there.
All home directories have a permanent limit of 300 megabytes on disk space usage. For short periods of time (e.g. program builds), somewhat more disk space may be used. The excess has to be cleared within a week; otherwise, write permissions will be withdrawn automagically. If you are over your home quota (i.e. if 'df -k .' in your home directory shows 0 blocks available), you can check the remaining time before write permissions are withdrawn by logging in to the node containing your home directory and issuing the command 'showfsets local people'.
Using the node containing your home directory for interactive logins will improve response time and reduce load on the cluster interconnect. Try to use "your" node for most of interactive work. You can also use cobalt1 to cobalt8 for interactive work which requires more memory than is available on your home node, but please try to avoid doing interactive work on Cobalt nodes with numbers above 40 - these nodes may be used for running parallel jobs (see below). Any interactive load on these nodes may disrupt load balancing.
Running production jobs interactively is strongly discouraged. You can use "your" node for interactive debugging, or for running small test jobs. Everything else should go to the queue. In cases where it is absolutely necessary to run large jobs interactively, individual arrangements must be made with the cluster administrator before the job is started.
All home directories are backed up nightly. Due to the space limitations of the backup media, about 1GByte is available for each Cobalt user. Within this limit, up to about five of the most recent backups will be stored, and can be used for recovery of accidentally removed or damaged files. Long-term backups are the sole responsibility of the users, and can be made using either the CD recorder connected to the Cobalt cluster, or the tape backup services provided by ACS.
Additional personal disk space excluded from nightly backups is available for special projects on request.
Home directories are physically distributed over Cobalt nodes. Here is the list of nodes with home directories which was accurate at the time this document was prepared. (Due to privacy and security concerns, this link is available only within Chemistry department). You can always find the location of your home directory by saying 'ypmatch `whoami` auto.home' while logged on any Cobalt node.
On each of the cluster machines, /local/scratch contains local scratch space. The amount of the scratch space which is (nominally) available on each node is given above. Somewhat more disk space may be available in practice. However, the remaining space is shared with local system temporary directories (/tmp and /var/tmp) and home directories, and should not be counted upon.
Please remove files from the scratch area as soon as you no longer need them, so as to make more space available for other users. All files older than two weeks will be automatically deleted from /local/scratch at midnight.
All scratch and home directories located on cluster hosts are accessible throughout the cluster. All home directories for local cluster users are accessible under /usr/people/user_name. The scratch directory on a cluster machine host_name is accessible as /usr/remote/host_name/local/scratch. For example, for cobalt7's scratch: /usr/remote/cobalt7/local/scratch. This name also works on cobalt7 itself.
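The naming rule for remote scratch directories can be captured in a small shell function. This is only an illustrative sketch: scratch_path is a hypothetical helper name, not a command installed on Cobalt.

```shell
# scratch_path HOST
# Hypothetical helper: prints the cluster-wide name of a node's scratch
# directory, following the /usr/remote/host_name/local/scratch rule.
scratch_path() {
    # $1: host name of a Cobalt node, e.g. cobalt7
    printf '/usr/remote/%s/local/scratch\n' "$1"
}

scratch_path cobalt7
```

Building paths this way keeps scripts independent of the node they happen to run on, since the same name is valid everywhere in the cluster.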
Sharing of file systems throughout the cluster is implemented with the automount service, so that directories are mounted when needed and unmounted if not used for more than a few minutes. As a consequence, taking a directory listing of /usr/people, for example, may not list some or all of the possible home directories. However, referring to any of the remote files by the explicit name (e.g., taking a directory listing of /usr/people/patchkov) will work correctly. The same applies to all /usr/remote entries.
With some shells (sh, csh), making an automounted directory your current directory will result in "strange" directory names being reported by the pwd command. For example, doing "cd /usr/people/patchkov;pwd" on any of the cluster machines except for cobalt18 may report /tmp_mnt/cobalt18/usr/people/patchkov. Please do NOT use these names in any of your scripts, and do not rely on them. Unlike the /usr/people and /usr/remote names, these are internal to the automounter, and do not possess the magic necessary to make your home directory appear on a remote machine.
In Korn shell, pwd works as expected. tcsh (available at /usr/local/bin) can be configured either way, please read the corresponding manual page.
Cobalt nodes trust each other. All r-commands (rlogin, rsh, rcp) will let you access any Cobalt node from any other Cobalt node without a password.
Macintosh users on the Chemistry department network can access their Cobalt home directories using AppleShare. To do this, pick CobaltCluster in the Chooser, and provide your Cobalt password when asked for the password. This service is implemented using CAP (Columbia AppleTalk Protocol); see the corresponding manual page ('man CAP').
The Cobalt cluster is controlled by the Distributed Queuing System (DQS) version 3. Some DQS documentation is available in the form of manual pages for the most important DQS commands (qdel, qstat, and qsub), as well as in HTML form. Either way, the documentation is often misleading or plain wrong, so here is a minimal synopsis of the most important commands:
|qstat||is used to examine status of jobs and nodes controlled by the queuing system. qstat without arguments will list all jobs known to the queueing system. 'qstat -u user_name' will list jobs submitted by any particular user. 'qstat -f' will include status of nodes in the output.|
|qjobs||is a simple Perl script calling qstat and formatting the same information in a (hopefully) more useful form. Without arguments, qjobs will show only your jobs. qjobs -all will show all jobs known to the queuing system, while qjobs -u user_name will show jobs submitted by user user_name.|
|qdel||is used to remove jobs (either running or waiting in the queue). The only argument accepted by qdel is job identification number.|
|qsub||is used to submit jobs to the queuing system. In the simplest form, the only argument supplied to qsub is the name of a shell script which should be executed by the queuing system. A DQS batch job may look like this:|
#!/usr/local/bin/tcsh
#$ -N MyFirstJob
#$ -S /usr/local/bin/tcsh
#$ -j y
#$ -o OutputOfMyFirstJob.out
#$ -l mem.ge.32.and.disk.ge.100.and.EXPRESS
cd $TMPDIR
~program/adf/bin/adf98 <~/InputForMyFirstJob.in
Every batch job consists of two sections: the DQS header (lines beginning with '#$' at the top of the script) and the actual commands constituting the job. The header MUST be at the very top of the job file - any shell commands or comments apart from the magic '#!' line will signal the end of the DQS header.
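Because a misplaced directive is silently ignored, it can help to check a job file before submitting it. The sketch below is a hypothetical helper (check_dqs_header is not part of DQS) that flags any '#$' line appearing after the header has ended. It assumes that a blank line also ends the header, which is the cautious reading of the rule above.

```shell
# check_dqs_header FILE
# Hypothetical helper: reports '#$' directives that appear after the DQS
# header has already ended. Exit status is 0 if none were found.
check_dqs_header() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 && /^#!/ { next }      # the magic line is allowed
        /^#\$/ {
            if (ended) {
                printf "line %d: directive after end of header: %s\n", NR, $0
                bad = 1
            }
            next
        }
        { ended = 1 }                  # anything else ends the header
        END { exit bad }
    ' "$1"
}
```

A typical use would be 'check_dqs_header myjob.csh && qsub myjob.csh', so that the job is only submitted when the header is contiguous.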
The example above uses most of the normally encountered DQS directives, namely:
|-N||sets the name under which the job will appear in the qstat output|
|-S||specifies shell which should be used to start the script. Note that the magic '#!' line, which would normally be used to specify shell in a Unix script is in fact ignored by DQS|
|-j||instructs DQS to combine standard error and standard output in the same file|
|-o||specifies location of the file which will receive standard output produced by the job. Unless this name begins with the slash /, it will be interpreted relative to your home directory|
|-l||specifies resources needed by the job. Currently, DQS on Cobalt knows about the following resources:|
|mem||is the amount of memory available on each node|
|disk||is the amount of scratch space on each node|
|PAW||is true if node can run PAW jobs|
|EXPRESS||is true if node can run express jobs|
|PARALLEL||is true if node can run parallel jobs|
A special resource 'qty' specifies number of nodes for parallel jobs, e.g. 'qty.eq.4' will ask for four nodes. Each node should still satisfy all other resource constraints. Parallel jobs are described in more detail below.
Many (but not all) resource specifiers can be combined (see Queues below to see which combinations are allowed). At the very least, your job should specify the amount of memory (mem resource) it will need.
When DQS attempts to start a job, the resource record specified in the header is matched against the attributes of each available queue. If all the attributes match, the job is started. In particular, the example above requests any queue with 32 or more megabytes of memory, 100 or more megabytes of scratch space, and capable of running express jobs.
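The matching rule can be illustrated with a short shell function. This is a sketch of the idea only, not the code DQS actually runs: node_matches is a hypothetical name, it understands only '.ge.' comparisons and bare keywords, and it expects the queue attributes to be passed as name=value pairs in megabytes.

```shell
# node_matches REQUEST ATTR...
# REQUEST is a DQS-style resource string; each ATTR is either name=value
# (numeric resource) or a bare keyword such as EXPRESS.
# Returns success only if every term of REQUEST is satisfied.
node_matches() {
    request=$1; shift
    attrs=" $* "
    # Split the request on ".and." into separate terms.
    for term in $(printf '%s\n' "$request" | sed 's/\.and\./ /g'); do
        case $term in
            *.ge.*)
                name=${term%%.ge.*}     # e.g. mem
                want=${term##*.ge.}     # e.g. 32
                have=
                for a in "$@"; do
                    case $a in "$name"=*) have=${a#*=} ;; esac
                done
                [ -n "$have" ] && [ "$have" -ge "$want" ] || return 1
                ;;
            *)
                # Bare keyword: must be among the attributes.
                case $attrs in *" $term "*) ;; *) return 1 ;; esac
                ;;
        esac
    done
    return 0
}
```

For instance, 'node_matches mem.ge.32.and.disk.ge.100.and.EXPRESS mem=256 disk=3100 EXPRESS' succeeds, while the same request against a queue without the EXPRESS keyword fails.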
When a job is started, DQS will create a temporary directory in the scratch area (/local/scratch) on the node running the job, and set TMPDIR to point to this directory. Each job is started with the current working directory set to your home directory. If your job needs to create substantial temporary files, you should use the scratch directory provided by DQS for that purpose. DQS will take care of removing any files remaining in the area pointed to by TMPDIR when your job terminates.
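Since everything left in TMPDIR disappears when the job ends, any result worth keeping has to be copied out before the script finishes. A minimal sketch for jobs written in sh or ksh follows; save_results is a hypothetical helper name, and the destination directory is whatever you choose.

```shell
# save_results DEST FILE...
# Hypothetical helper: copies the named files from $TMPDIR to DEST
# before the job ends and DQS cleans up the scratch directory.
save_results() {
    dest=$1; shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        # Skip files the job did not produce instead of failing outright.
        [ -f "$TMPDIR/$f" ] && cp "$TMPDIR/$f" "$dest/"
    done
    return 0
}
```

At the end of an ADF job this could be called as 'save_results $HOME/results TAPE21', keeping in mind the home directory quota when copying large files.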
Some of the programs used on Cobalt provide their own scripts for submitting DQS jobs, in which case you can simply use the ready script instead of writing your own. Scripts used by PAW are described in the PAW documentation.
Cobalt provides the following queues:
|Queue type||CPUs||Max time||Max memory||Max disk||Required keywords|
|Serial ADF||58||4 weeks||110Mb||3.4Gb||none (default)|
|Express ADF||15||30 minutes||45Mb||3.4Gb||EXPRESS|
|Parallel ADF||28(7x4)||4 weeks||110Mb||3.1Gb||PARALLEL|
|Serial PAW||10||4 weeks||480Mb||3.1Gb||PAW|
|Express PAW||8||2 hours||360Mb||3.1Gb||EXPRESS, PAW|
|Parallel PAW||6(2x3)||no limit||480Mb||2.5Gb||PARALLEL, PAW|
Note that the number of queues of each type, as well as the resource limits, are adjusted from time to time depending on the workload. Therefore, the numbers in this table should be treated as a guideline only.
Serial ADF queue is the default queue, which will be used if no resource keywords were specified in the DQS job header. Use of all other queues requires addition of the keywords given above to the resource record. Individual nodes assigned to a given queue may have less memory or disk space than the queue's maximum value given above. At the very least, you should always specify the amount of memory your job will need for execution.
Additional restrictions exist for the use of EXPRESS and PARALLEL queues. You are not allowed to run more than one EXPRESS job at any given time. Parallel ADF jobs can use either four or eight nodes. Other node counts are not allowed for load balancing reasons. Parallel PAW jobs can use three nodes. Neither smaller nor larger node counts are permitted. It is absolutely forbidden to run parallel jobs in regular queues.
These restrictions are not enforced by any mechanical means, so it is your responsibility to ensure they are not violated. Repeated offenders will be barred access to the parallel and express queues.
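Because nothing stops a job with a forbidden node count from being submitted, a submission wrapper could check the count first. The function below is a sketch under that assumption; valid_qty and the program labels adf and paw are hypothetical names chosen for illustration.

```shell
# valid_qty PROGRAM COUNT
# Hypothetical helper: succeeds only for the node counts permitted above,
# i.e. 4 or 8 for parallel ADF jobs and 3 for parallel PAW jobs.
valid_qty() {
    case "$1:$2" in
        adf:4|adf:8|paw:3) return 0 ;;
        *) return 1 ;;
    esac
}
```

The allowed counts are hard-coded from the text above and would need updating whenever the queue layout changes.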
The queuing system places an overall limit on the number of CPUs all your jobs together may use. The limit is set by the system administrator, and is adjusted periodically to ensure fair access to the Cobalt queues. For example, if the overall limit is set to eight, and you have an 8-CPU parallel ADF job running on Cobalt, no new jobs submitted by you will start until the parallel job terminates or the per-user job limit is increased by the system administrator. If 'qstat -u `whoami`' shows all your jobs waiting in the queue with the MAXJOB attribute, this limit is the reason.
If you feel that none of the default queues satisfies your requirements, please do not try to circumvent the queueing system. Come and talk to me, and we'll try to find a solution.
Cobalt is a distributed-memory system, and requires explicit message passing programming in order to run jobs utilizing more than one CPU. Two message passing libraries are currently supported on Cobalt: PVM and MPI. ADF was parallelized with the PVM library, while PAW utilizes MPI.
Running a parallel ADF job requires a slight modification of the sample job script above:
#!/usr/local/bin/tcsh
#$ -N ParallelJob
#$ -S /usr/local/bin/tcsh
#$ -j y
#$ -o OutputOfMyParallelJob.out
#$ -l mem.ge.32.and.disk.ge.100.and.PARALLEL.and.qty.eq.4
#$ -par PVM
cd $TMPDIR
setenv ADFHOME ~program/adf
setenv ADFBIN $ADFHOME/bin
$ADFBIN/start -n 50 adf.parallel <~/InputForMyParallelJob.in
Or, if you prefer to run your parallel job using ADF 99 (just do not forget that you can't mix TAPE21's and TAPE12's between 2.3.3 and 99 - and you will also get slightly different numbers out of them), you can use this script:
#!/usr/local/bin/tcsh
#$ -N ParallelJob
#$ -S /usr/local/bin/tcsh
#$ -j y
#$ -o OutputOfMyParallelJob.out
#$ -l mem.ge.45.and.disk.ge.100.and.PARALLEL.and.qty.eq.4
#$ -par PVM
cd $TMPDIR
source /usr/people/adf99/profile.csh
$ADFBIN/adf -n 4 <~/InputForMyParallelJob.in
The three important changes are: PARALLEL and qty.eq.4 on the resources line; an additional line identifying the job as a PVM application (-par PVM); finally, ADF is now started with the start script, and, for ADF 2.3.3 a special parallel version of ADF is used. Not all calculations can be done in parallel with adf.parallel at this point. It will run single-point SCF and geometry optimizations in parallel, and should be able to execute hybrid QM/MM jobs. ADF 99 should be able to run most of the jobs in parallel.
The second application parallelized for Cobalt is PAW. Instructions describing parallel execution of PAW are given elsewhere.
Several printers are accessible from Cobalt. Chemistry department's black and white Postscript printer (in room SB 229) is accessible as 'chem_ps' printer. The colour Postscript wax printer (at the same location) is accessible as 'wax_ps_color'. Finally, a black and white Postscript printer in SB 433 is known to Cobalt as 'jana_ps'. In order to access any of these printers with lp or lpr use:
lp -d chem_ps a_postscript_file.ps
or:
lpr -Pchem_ps a_postscript_file.ps
You can also set any of the printers as your default printer by doing:
In csh/tcsh:
setenv PRINTER chem_ps
In sh/ksh:
PRINTER=chem_ps; export PRINTER
All three Postscript printers will only accept Postscript input. If you need to print a text file, you will have to convert it to Postscript first:
asc2ps <my_file.txt >my_file.ps ; lp -d chem_ps my_file.ps
Once you send the output to any of the printers, it is not possible to cancel it. Be careful with what and where you print. It is not advisable to print large outputs on the jana_ps printer: people working in SB 433 get mightily annoyed with printer noise and having to replace the paper all the time. It is also a bad idea to print black and white output on wax_ps_color, which uses expensive paper and is much slower than chem_ps.
Both printers in SB 229 have a duplex option installed (for double-sided printing), but it can be tricky to enable this option when printing from Cobalt. For the wax printer, using the duplex_wax_ps_color print queue should do the job. For chem_ps, you can log in to praseodymium.chem.ucalgary.ca and print the file from there with lpr my_file.ps. The print queue on this machine has been set up to support the chem_ps printer's duplex option. Please note, however, that this printer frequently forgets that it has such an option and needs to be switched off and on in order to enable duplex printing. Having informed you about the hazards of trying to print double-sided on this printer, I should also note that the outcome is of really good quality and that it prints very fast.
Cobalt's CD recorder (Yamaha CRW4260t) is physically located in SB 431a and is attached to zinc10, which runs Linux. The CRW4260t supports recording ordinary CDR blanks at 4x speed (600KB/sec for data) and (the "silver") CDRW blanks at 2x speed (300KB/sec). Recorded CDs can be mounted either on zinc10 or, for extended use, directly on 'your' Cobalt node. Alternatively, you can use the CD writer installed in the machine praseodymium (pr), also located in SB 431a.
You are required to bring your own blank CDs for the recording session, as no blank CDs are stocked in SB 431a. Blank CDs are available from the MicroStore (in the basement of the Math Sci building) or from the New Media Centre (in the basement of the MacKimmie library block).
Please contact the system administrator before using the CD recorder for the very first time.
zinc10 and praseodymium will accept your Cobalt password, and will automatically provide access to your home directory and all directories in /usr/people. In order to access scratch directories from zinc10/pr, you will have to use file names in the format: /usr/remote/cobaltNN.local.scratch/. This is different from Cobalt proper, where you have to use slashes instead of dots in the pathname.
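If you keep scratch paths in the Cobalt form, the translation to the dotted form can be sketched with sed. The function name to_zinc_path is hypothetical, and the sketch assumes that anything below the scratch directory keeps its relative path unchanged.

```shell
# to_zinc_path PATH
# Hypothetical helper: rewrites a Cobalt-style scratch path into the
# dotted form used on zinc10 and pr. Other paths pass through unchanged.
to_zinc_path() {
    printf '%s\n' "$1" |
        sed 's|^/usr/remote/\([^/]*\)/local/scratch|/usr/remote/\1.local.scratch|'
}

to_zinc_path /usr/remote/cobalt7/local/scratch/run1
```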
Macintosh users can also prepare data for CD session in the AppleShare volume 'CDR scratch volume' available from AppleShare node CobaltCluster. This volume will appear as the directory '/macVolume' on zinc10. If you use this volume to prepare your CDs, please clean up after yourself once you are finished with your CD.
The easiest way to record your data on a CD is to use the 'record-cd' script which is located in /usr/local/bin directory on zinc10 and pr. This script takes a single argument - name of the directory containing files to be backed up, and creates a hybrid CD containing ISO 9660 file system with both RockRidge and Microsoft Joliet extensions. The resulting CD is readable on most unix systems as well as on Microsoft Windows machines.
If your directory structure contains extended Finder information (i.e. if you created it on the 'CDR scratch volume' or on a Mac volume in your home directory - see man AUFS for instructions on creating a Mac volume on Cobalt), 'record-cd' will recognize extended attributes and create a CD containing Mac HFS volume in addition to the ISO, RockRidge, and Joliet file systems.
In either case, you will be given an opportunity to examine the contents of your CD image before recording it. You can also record multiple copies of your CD using record-cd script.
If you are not satisfied with the features provided by record-cd, you can always create your CD using the mastering and recording utilities directly. The following CD mastering software is installed on zinc10 and pr in /usr/local/bin, with the corresponding manual pages in the /usr/local/man hierarchy:
|mkisofs||Command line utility for mastering ISO9660, RockRidge (Unix) and Joliet (Windows) CDs. Many interesting links on CD formats have been collected by Jörg Schilling. Both RockRidge and Joliet can be on at the same time. mkisofs also supports mastering multi session CDs. Note, however, that Cobalt nodes are unable to read multi session CDs due to a bug in the CD firmware.|
|mkhybrid||Command line utility for mastering Macintosh CDs. It also supports RockRidge and Joliet, but not multi session. Note: newer versions of mkisofs contain this functionality. On pr this program is therefore not installed.|
|cdda2wav||classic DAE tool for UNIX. Not tested and may not work|
|cdrecord||Command line utility for recording CD images created with the programs above. cdrecord supports data and audio CDs, and can create multi session CDs. CDRW blanks are supported in the same way as CDR disks, but can be erased and reused.|
|xcdroast||GUI interface for mkisofs/cdrecord. Although it does not support multi session CDs, mastering and recording single-session audio and data CDs is quite easy with xcdroast. Since this is a development version of the program, there are some minor glitches with it. In particular, if you use xcdroast from a remote machine, you will have to set DISPLAY to the numerical IP address. Host names won't work.|
|Person or project||External name||Internal name|
|Tom Ziegler's group||http://www.cobalt.chem.ucalgary.ca/group/||file:/usr/home_pages/group/|
|Cory C. Pye||http://www.cobalt.chem.ucalgary.ca/cory/||file:/usr/home_pages/cory/|
|The PAW Project||http://www.cobalt.chem.ucalgary.ca/paw/||file:/usr/home_pages/paw/|
|Eva D. Zurek||http://www.cobalt.chem.ucalgary.ca/edzurek/||file:/usr/home_pages/edzurek/|
The internal name of each home page (which always starts with file:) is only accessible to browsers running on the Cobalt cluster itself, but allows modification of the page contents. The external name only allows read-only access, but is valid at any Internet location. The external name is what you want to give to your friends so that they can see your cool home page.
The contents of your home page directory are entirely under your own control. However, you are expected to follow a few simple rules, namely:
|Aug. 29, 2002||Added link to ADF 2002 and home page of E.Z., printer info updated|
|Sept. 28, 2001||Added link to ADF 2000, other updates|
|May 15, 2001||Added link to J.A.'s home page|
|December 6, 2000||Added link to Cobalt NRC presentation|
|September 30, 1999||Documented installation of ADF 1999 on Cobalt|
|May 20, 1999||Added a link to the Cobalt poster|
|March 24, 1999||Added section on personal home pages|
|March 12, 1999||Added a link to the Cobalt gallery|
|March 1, 1999||Fixed links mangled by Netscape Communicator. All links should work now|
This document was initially prepared by Serguei Patchkovskii, who took care of the day-to-day operations of Cobalt until Sept. 2001 and is a nice guy all round. Jochen Autschbach took care of Cobalt from Sept. 2001 to Sept. 2002. He is also a nice guy. Another nice guy, Zhitao Xu, took care of Cobalt from Oct. 2002 to Oct. 2003. Since Nov. 2003, another guy, Michael Seth (who might be nice; we haven't decided yet), has been nursing Cobalt along.