Awoonga User Guide

Contents

 

Introduction

Technical overview

Who can use the Awoonga cluster?

Getting a login

Accessing Awoonga

Major file systems

The batch system

Usage accounting

Software

Getting help

Free cluster connection tools

 

 

Introduction

The name Awoonga comes from Lake Awoonga, formed on the Boyne River in central Queensland.

Awoonga provides a batch scheduler environment and is pre-populated with a range of computational application software. The hardware and configuration of Awoonga are optimised for running many independent jobs, each using a small number of processor cores. It is ideally suited to large parameter sweep or ensemble applications.

 

Technical overview

Awoonga is built on the following hardware:

  • 1032 Intel cores across 43 compute nodes, each with 24 cores, 256GB memory and 300GB of disk

The cluster provides the following resources:

  • 3 login nodes (awoonga1, awoonga2, and awoonga3), behind a load balancer.
  • The open source PBS TORQUE batch system with Maui scheduler.
  • 500TB of shared storage connected via GPFS and accessible across Awoonga, FlashLite, and Tinaroo clusters.
  • Home directories with tight quotas and *no* backups. Backup is the responsibility of the user.
  • Two temporary storage filesystems for staging data: /30days and /90days.
  • 500 GB of local scratch storage on each compute node, available within jobs as $TMPDIR.

 


Who can use the Awoonga cluster?

Awoonga is available for use by any researcher who belongs to a QCIF member institution or partner organisation.

 

 
Getting a login

To access Awoonga, you need to:


1. Create a QRIScloud account.

  • Go to the QRIScloud portal (http://www.qriscloud.org.au)
  • Click on the “Account” link
  • Log in using your Australian Access Federation (AAF) credentials
  • Accept the Terms and Conditions
  • Update your user profile information (click on the “My Profile” link)

2. Once logged into the QRIScloud portal, you can then request a new service.

  • Click on “My Services”
  • Click on “Order new services”
  • Click on “Register to use Awoonga”
  • Complete the request form and submit the request.

3. Once your request has been processed you will need to generate your QRIScloud service access credentials (QSAC).

  • Click on “My Credential”
  • Click on the “Create credential” button
  • You will be presented with the username and password that can be used to login to Awoonga. Please make a careful note of these.

You will be contacted by email when your account has been registered.

 

Accessing Awoonga

Registered users connect to the Awoonga cluster using Secure Shell (SSH) at awoonga.qriscloud.org.au.
If you are connecting from a Linux or Mac system, you can use the ssh command from a command shell. For example:

$ ssh <qsac-account>@awoonga.qriscloud.org.au
# enter your qsac password when prompted

On Windows, you can use the third-party PuTTY tool to SSH to Awoonga.

When you log in to Awoonga, you will find yourself logged in on one of three identical login nodes (awoonga1, awoonga2, or awoonga3). We recommend that you connect to the hostname awoonga.qriscloud.org.au to ensure you get the least loaded available login node. The awoonga hostname provides a load balancer for the three login nodes.

 

Major file systems

 

/home

Your home directory on Awoonga is /home/$USER. It is created automatically when you login for the first time. Your home directory can be accessed on the Awoonga login nodes and the Awoonga compute nodes. The purpose of your home directory is to hold software that you have brought to the system, batch scripts and configuration files, and a relatively small amount of data.

The /home file system has quotas on both storage capacity and number of files (refer to the default filesystem quota settings below).


Important Note: The /home file system is NOT backed up. If files are accidentally deleted we are unable to restore them. It is your responsibility to backup files located in your home directory. We advise you to regularly transfer any valuable files from your home directory to some other system that is backed up.
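
One way to meet this responsibility, run from your own computer rather than from Awoonga, is an rsync over SSH; the directory names here are illustrative:

$ rsync -av <qsac-account>@awoonga.qriscloud.org.au:important_results/ ~/awoonga-backup/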

/30days

Each Awoonga user is allocated a directory on the /30days file system. It is present on the Awoonga login nodes and the Awoonga compute nodes. The main purpose is to hold large data sets on a temporary basis while you are computing against them. It is designed to be a data staging area.


Users have a quota of 1 TB (1000 GB) on the /30days file system. This file system is NOT backed up. Furthermore, files left on this file system are automatically deleted 30 days after they were created.

 

/90days

Each Awoonga user is allocated a directory on the /90days file system. It is present on the Awoonga login nodes and the Awoonga compute nodes. The main purpose is to hold moderately large data sets on a temporary basis while you are computing against them. It is designed to be a data staging area.


Users have a quota of 400GB on the /90days file system. This file system is NOT backed up. Furthermore, files left on this file system are automatically deleted 90 days after they were created.

 

/sw

The /sw file system contains all of the currently available software modules that can be used on Awoonga. It is present on the Awoonga login nodes and the Awoonga compute nodes, and is read-only for normal users. 

 

/RDS/Qxxxx

The RDS collections are available on Awoonga via network filesystem mounts. These mounts are permanently connected, and only members of a collection's project team are able to access its data. Quotas and other limits imposed on the collection also apply when it is accessed from Awoonga.

Quota settings on file systems

 

File System   GB Limit                File Number Limit       Other Limits
-----------   ---------------------   ---------------------   -----------------------------------
/home         20                      204,800                 Indefinite but no backup
/30days       1000                    3,145,728               Files deleted 30 days from creation
/90days       400                     1,048,576               Files deleted 90 days from creation
/RDS/Qxxxx    depends on collection   depends on collection   Indefinite
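
To see how much of your quotas you are consuming, standard Linux tools can be used from a login node. The per-user directory layout on /30days and /90days shown below is an assumption; adjust the paths to match what you see on the system.

$ du -sh $HOME
$ du -sh /30days/$USER /90days/$USER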

 

Filesystem dos and don'ts

  1. Use $TMPDIR when you need local disk for your batch jobs. The $TMPDIR directory is created automatically as part of your batch job and is removed for you automatically at the end of the job.

  2. Saving user data randomly into local disk on a node (outside of $TMPDIR) can adversely impact other users. Please DON'T do that.

  3. Compute nodes are periodically rebuilt and their local disk space is reformatted, so do not rely on using local disk on compute nodes except via $TMPDIR.

  4. Don't forget that $TMPDIR is unique for each job and job-array sub-job.
    Although the path may be the same, the $TMPDIR directory will probably contain different files on different nodes and for different jobs.

  5. If you need to work with many small files, please keep them bound together in a single archive file (ZIP or tar) and copy the archive file to local disk (i.e. $TMPDIR) before unpacking them to work on them in local disk space (see the sketch after this list).

  6. Further information about storage is provided in the storage user guide (via rcc.uq.edu.au).
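
As an illustration of point 5 above, the following minimal sketch shows the archive-and-unpack workflow inside a batch job; the archive name inputs.tar.gz and the program my_analysis are hypothetical placeholders:

# Stage many small files via fast local disk instead of shared storage.
cd $TMPDIR
cp $PBS_O_WORKDIR/inputs.tar.gz .       # copy the single archive to local disk
tar -xzf inputs.tar.gz                  # unpack on local disk
$PBS_O_WORKDIR/my_analysis inputs/      # hypothetical program working on the files
tar -czf results.tar.gz results/        # re-bundle the outputs
cp results.tar.gz $PBS_O_WORKDIR/       # copy results back before the job ends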


The batch system

The Awoonga cluster uses the open source TORQUE resource manager combined with the Maui scheduler as its batch system. To use Awoonga, create a job script and submit it with the qsub command. A sample submission script is provided in the section below.

For example, the command:

$ qsub -A UQ-RCC -l nodes=1:ppn=1 myjob.pbs

submits a job script called myjob.pbs to be run on one node under the account group UQ-RCC. On Awoonga it is mandatory to specify an account group when submitting a job.

You can find out what groups are available to you by running the groups command. Only some of them are valid account groups (see “Usage accounting” below for further information).

Jobs with nodes > 1 are not permitted on Awoonga.

 

To get a bird's-eye view of the batch system, we recommend qstat -Q.
(We are currently restructuring the queues.)

stephenbird@awoonga1:~> qstat -Q

Torque Batch System Status

Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
Single             100      0   yes   yes     0     0     0     0     0     0 E     0
Interact            20      0   yes   yes     0     0     0     0     0     0 E     0
Short                0      0   yes   yes     0     0     0     0     0     0 E     0
DeadEnd              0      0   yes    no     0     0     0     0     0     0 E     0
Long                24      0   yes    no     0     0     0     0     0     0 E     0
workq                0      0   yes   yes     0     0     0     0     0     0 R     0

PBSpro Batch System Status

Queue              Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
Short                0     0 yes yes     0     0     0     0     0     0 Exec
Single               0     0 yes yes     0     0     0     0     0     0 Exec
Special              0     0 yes yes     0     0     0     0     0     0 Exec
Multiple             0     0 yes yes     0     0     0     0     0     0 Exec
workq                0     0 yes yes     0     0     0     0     0     0 Rout
Interact             0     0 yes yes     0     0     0     0     0     0 Exec
Long                 0     0 yes  no     0     0     0     0     0     0 Exec

 

The qstat -q command also provides information about the queue limits, although its output formatting can be slightly misaligned in some situations.

 

stephenbird@awoonga1:~> qstat -q

Torque Batch System Status

server: awongmgmr1.qern.qcif.edu.au

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
Single            124gb    --    168:00:0     1   0   0 10   E R
Interact           --      --    48:00:00     1   0   0 20   E R
Short              --      --    24:00:00     1   0   0 --   E R
DeadEnd            --      --       --      --    0   0 --   E S
Long               --      --    336:00:0     1   0   0 24   E S
workq              --      --       --      --    0   0 --   E R
                                               ----- -----
                                                   0     0

PBSpro Batch System Status

server: awongmgmr1.local

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
Short              --      --    24:00:00  --      0     0   --   E R
Single            120gb    --    168:00:0    1     0     0   --   E R
Special            --      --       --     --      0     0   --   E R
Multiple           --      --    168:00:0   52     0     0   --   E R
workq              --      --       --     --      0     0   --   E R
Interact          120gb    --    48:00:00    1     0     0   --   E R
Long               --      --    336:00:0  --      0     0   --   E S
                                               ----- -----
                                                   0     0

 

Queues

All jobs begin their journey in the default "workq" queue, which should be used for all submissions. From there, each job is routed to the appropriate execution queue according to the resource requests you have made.

 

Job parameters

By design, all jobs on Awoonga are single node jobs (i.e. nodes=1: ... always).

The key parameters to adjust for a job are

  • interactive (-I) or not
  • your -A account string
  • walltime resource request
  • other resource requests
    • processors per node nodes=1:ppn=
    • job memory ,mem=
    • job vmem ,vmem=
    • job memory per processor ,pmem=
  • emailing options -M and -m (email for all jobs has been disabled to avoid spam events)

For memory-related parameters, you can use units such as mb or gb in upper, lower, or mixed case; all should be understood by the job submission filter.
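
Putting these parameters together, a typical non-interactive submission might look like the following; the account string and script name are placeholders to replace with your own:

$ qsub -A UQ-RCC -l nodes=1:ppn=4,mem=8GB,vmem=8GB,walltime=12:00:00 myjob.pbs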

 

Usage Limits

Usage limits are imposed to share the resources fairly amongst the user community, and they may vary over time with total workload and priority workloads.

The batch system limits the number of jobs each user can have queued and running. Queues also have limits on memory and cores, and these form the basis for routing jobs to execution queues.

Usage is controlled through a parameter called PS: the product of walltime (in seconds) and the number of processors.
Think of your PS allowance as a piece of cloth with time along one side and cores along the other: you can cut it into many thin pieces (in time or in number of cores), or into a smaller number of bigger pieces. For example, a 2-core job with a 24-hour walltime consumes 2 × 86,400 = 172,800 processor-seconds, the same as a single-core job running for 48 hours.
The PS parameter is set in the job scheduler and is usually invisible to the user.
Once a user's running jobs reach their PS allowance, further jobs can still be submitted, but will queue until earlier jobs finish.

Walltime and other job queue limits can be examined using the qmgr -c "p s" command, or with qstat -q and qstat -Q.

 

Sample job submissions

Interactive Job Submission

qsub -I -A UQ-RCC -l walltime=04:00:00 -l nodes=1:ppn=2,mem=10GB,vmem=10GB

You will need to change the account string to one that is valid for you.

 

Using a job submission file

Copy and paste the script below into your own file (called filename.pbs) and modify the account string. Then submit it with: qsub filename.pbs

 

#!/bin/bash
#
#PBS -A UQ-RCC
#PBS -l nodes=1:ppn=1,mem=3GB,vmem=3GB,walltime=01:00:00
#PBS -m n

# Now do some things
echo -n "What time is it ? "; date
echo -n "Who am I ? " ; whoami
echo -n "Where am I ?"; pwd
echo -n "What's my PBS_O_WORKDIR ?"; echo $PBS_O_WORKDIR
echo -n "What's my TMPDIR ?"; echo $TMPDIR
echo "Sleep for a while"; sleep 1m
echo -n "What time is it now ? "; date

 

 

 

Batch best practices

  • Do not specify the queue ... let the batch system figure it out for you!
  • Job submissions are filtered and fixed if possible or rejected if not.
  • Avoid the ncpus=4 nomenclature; use the nodes=1:ppn=4 form instead.
  • Setting a walltime for your job is mandatory, because jobs without one cannot be scheduled effectively. If you forget, the submission filter gives your job a walltime of one hour.
  • Specify a realistic walltime (somewhat longer than your expected run time); this will usually result in your job being scheduled more quickly than one with an excessively long walltime. Note that the job will be terminated at the end of its walltime (whether it has finished or not), so you should always add a bit extra.
  • Think carefully about whether you want to receive emails at the start and end of every job you submit (-m options).
    Email has been disabled for job arrays to avoid spamming problems experienced in mid-2016.
    If you do not want the emails then explicitly disable by using the option
    #PBS -m n
  • If you do not want email then you should ensure you keep the stdout and stderr files that are generated when your job runs. These can help you with troubleshooting and with refining your resource requests (the stderr "e" file contains a summary of resources used and requested).
  • If you have a large number of similar independent computations to perform, please consider using a PBS "job array". Job arrays allow you to submit and manage many jobs via a single entry in the PBS queue; see the PBS qsub man page or the PBS User Guide for details, and the sketch after this list. Given the potential to overwhelm the batch and I/O systems, please consider using a pause in your job arrays to "smear" the start times out. This can be done by including a line such as:
    (sleep $(( ($PBS_ARRAY_INDEX % 10) * 15 )))
    
  • Aim to get each job or sub-job to run for at least an hour (if necessary, combine sub-jobs to create a more substantial chunk of work per sub-job). This avoids the problems that arise for all users when the PBS server is turning around very many short-duration jobs and sub-jobs.
  • PLEASE USE THE BATCH SYSTEM EVEN FOR INTERACTIVE WORKLOADS - DO NOT RUN HEAVY PROCESSING ON THE LOGIN NODES
  • Use $TMPDIR space ... it is faster and kinder to your fellow users. You must copy your results back to permanent storage.
  • If you need to run something interactively (perhaps with X11 display) for more than a few minutes (eg. compilation or data analysis) please launch an interactive session on a compute node by issuing a command like:
    qsub -I -l nodes=1:ppn=12,vmem=32gb -A accountString -v DISPLAY
    
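To illustrate the job array advice above, here is a minimal sketch of a job array submission script. It assumes the PBS Pro array syntax (#PBS -J), matching the $PBS_ARRAY_INDEX variable used above; on TORQUE the equivalents are -t and $PBS_ARRAYID. The program and file names are hypothetical placeholders.

#!/bin/bash
#PBS -A UQ-RCC
#PBS -l nodes=1:ppn=1,mem=3GB,vmem=3GB,walltime=02:00:00
#PBS -m n
#PBS -J 1-100

# Smear the sub-job start times: groups of 10, starting 15 seconds apart.
sleep $(( ($PBS_ARRAY_INDEX % 10) * 15 ))

cd $PBS_O_WORKDIR
# Each sub-job processes its own input file (hypothetical naming scheme).
./my_program input.$PBS_ARRAY_INDEX > output.$PBS_ARRAY_INDEX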

 

Usage accounting

We account for all usage on the Awoonga cluster to satisfy our stakeholders that their entitlements are being met, and to assist with planning and resource management.

The Awoonga account groups correspond either to organisational groups or to projects that span multiple organisations.
  • For UQ, the account groups are broken down to the level of a School or Centre. The group qriscloud-uq will not work as an account group for job submission.
  • For other organisations, the account group will be the appropriate qriscloud-xxx group for the organisation.

If you are a member of multiple accounting groups, it is important that you choose the most appropriate group when submitting jobs.
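
You can check your group memberships with the groups command; only some of the listed groups are valid account groups, and the output below is purely illustrative:

$ groups
uqusername UQ-RCC qriscloud-uq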

 

Other Accounting Tips

  1. Changing your Linux default group (using the newgrp command) at the command line does not affect accounting. Make sure you use the -A option with the appropriate account group within your submitted batch job.
  2. If your jobs are being rejected because of an invalid account group, please contact support@qriscloud.org.au for assistance.

 

Software

Awoonga is a ROCKS cluster.

  • A lot of software lives on the local disk of each compute node and has been deployed as part of the cluster imaging mechanism.
  • The command rocks list roll will summarise the deployed rolls. Some rolls (e.g. biotools) contain many individual applications.
  • Some software and software development tools are located on the shared storage under /sw, which is available on all nodes.
  • Project and discipline areas are encouraged to compile and maintain their own software if currency is an issue.

Environment modules

Installed software and software development tools can be found in the /sw file system available on all nodes.

The current list of available software modules is displayed by running the command:

$ module avail

Not all modules show up when you run the module avail command. Modules that depend on compiler modules are hidden from view until the compiler module has been loaded.
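
A typical module workflow looks like the following; the module name gcc is illustrative, so check module avail for the names actually installed:

$ module avail            # list the modules currently visible
$ module load gcc         # load a compiler module (name illustrative)
$ module avail            # compiler-dependent modules may now appear
$ module list             # show what is currently loaded
$ module unload gcc       # unload it when finished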

 

Getting help

In addition to user training and the associated training materials, there are a number of other ways to get help:

  • Documentation on standard Linux commands, the PBS job submission (qsub) and management commands and many other things can be viewed using the Linux man command
  • User documentation is provided for most of the applications and tools available via the modules mechanism.
  • Do NOT post any commercial software documentation that you find in /sw on a website, or forward copies to anyone else
  • The system message of the day will occasionally carry information about forthcoming work and other outages.
  • Requests for assistance with Awoonga should be sent to support@qriscloud.org.au

 

Free cluster connection tools

In order to use the Awoonga cluster, you will typically need a way to connect to it from a laptop or desktop system. You may also need tools for transferring files to and from the cluster, and possibly an X11 server for graphical applications. There are numerous free tools available for these tasks.

To login to Awoonga, you will need a tool that is capable of running an interactive SSH session. The tools you can use include:

  • For Microsoft Windows platforms: the third-party PuTTY or WinSCP tools are available.
  • For Mac OSX and Linux: the ssh command is preinstalled, or available from your platform's package manager.

To transfer files to and from Awoonga, you will need to use an SSH-based file transfer method such as SCP or SFTP. The tools you can use for this include:

  • Cross-platform: CyberDuck (GUI) or FileZilla.
  • For Microsoft Windows platforms: WinSCP.
  • For Mac OSX: RBrowser, Fugu, or the scp and sftp commands.
  • For Linux: distribution-specific file browsers, and the scp and sftp commands.
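
For example, from a Linux or Mac command line, files can be copied to and from your Awoonga home directory with scp (file names illustrative):

$ scp mydata.tar.gz <qsac-account>@awoonga.qriscloud.org.au:
$ scp <qsac-account>@awoonga.qriscloud.org.au:results.tar.gz .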

If you need to run an interactive application on Awoonga with a GUI, then you will need to run an X11 server on your laptop or desktop that the application can connect to:

  • For Microsoft Windows: Xming is a good (i.e. free) option.
  • For Mac OSX: X11 is available in the Utilities folder.
  • For Linux: if you have a “desktop” install (e.g. Gnome, KDE, etcetera) your system will already be running an X11 server.
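
Once an X11 server is running locally, enable X11 forwarding when you connect. For example:

$ ssh -X <qsac-account>@awoonga.qriscloud.org.au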

 

 
