Hopper HPCC User’s Guide

Version 1.3 – February 1, 2018




Introduction

About the Hopper Cluster

The Hopper Cluster is Auburn University’s newest and most powerful system for high-performance parallel computing. It is designed to accelerate research in a wide variety of fields by providing high-end computing resources, including centrally managed compute nodes, storage, software, and technical user support.

The Hopper Cluster is a collaborative, campus-wide effort of Auburn University funded by:

Hopper Namesake

Auburn has named this supercomputer “Hopper” in honor of the late Rear Admiral Dr. Grace Murray Hopper, an extraordinary woman whose contributions to computer science laid the foundation for modern programming. Among her long list of remarkable achievements are the creation of the “A” compiler, key efforts in the development of COBOL and other high-level languages, and significant input into the development of early machines, including UNIVAC.

About this Manual

This manual is the primary reference document for Auburn University’s Hopper High Performance Computing Cluster. It is not intended to be an exhaustive reference, but rather a concise guide to the basics of the system and a starting point for users.

This manual will be revised on an ongoing basis. It is important that users refer to the latest version.

Acceptable Use

Policies outlined in OIT’s Appropriate Use of Information Technology, as well as any other applicable policies, apply to use of the Hopper cluster. Hopper also has its own guidelines and restrictions for acceptable use, described throughout this document.

Citations and Acknowledgements

Please help show the importance of computational resources and OIT support staff in research at Auburn University by acknowledging this support in any publication or presentation made possible by the Hopper Cluster.

Acceptable citations are below:

  • This work was completed in part with resources provided by the Auburn University Hopper Cluster.
  • We are grateful for the support of the Auburn University Hopper Cluster for assistance with this work.
  • We acknowledge the Auburn University Hopper Cluster for support of this work.

If an acknowledgement is included in your work, please send a brief email to hpcadmin@auburn.edu.
Please note that the Hopper Cluster is distinct from, and in no way related to, the Alabama Supercomputer Authority.

System Overview

Hopper is a Lenovo System X based HPC cluster with approximately 5888 cores, 39 TB of RAM, 1.4 PB of disk, and 175 TFlops*. For details, please visit the AU HPC website.

*as of the document version and date


Accessing Hopper

Request an Account

In order to submit jobs to the Hopper Cluster, you must have a Hopper account:

  1. Access the Account Request Form
  2. Login with Auburn credentials
  3. Select your sponsor from drop-down
  4. Complete the form
  5. Accept the Terms of Use
  6. Submit the request

Your request will be emailed to the sponsor that you selected and they will approve or deny the request at their discretion. Upon approval, you will receive a Welcome email. It will confirm that your account was created and include the Hopper Quick Start Guide.

Connect to Hopper

To connect to Hopper, open a terminal program, then ssh to ‘hopper.auburn.edu’ with your Auburn userid.

ssh <auburn-userid>@hopper.auburn.edu

If successfully logged in, you should now see a command prompt and be in your home directory.

You will need to use an AU VPN connection to access Hopper from off-campus. There is more information available online at: http://www.auburn.edu/oit/vpn .


Locations and Resources

Login Node

The login node ( hopper.auburn.edu ) is your sole interface to the Hopper Cluster and is accessed remotely over the network using ssh. It is where you can issue commands, submit jobs, see results and manage files.

Although it is acceptable to run short, non-intensive jobs on the login node during testing, the login node is not intended to be used for computationally intensive work.

Running intense computations on the login node affects the performance and availability of the cluster for all other users and is therefore not allowed.

Any processes that violate this policy will be killed automatically. You can check the impact of your running processes, or the processes of other users with:

top
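
To limit the display to your own processes, you can also filter top by user (a standard option of the Linux top command):

top -u <auburn-userid>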

If you see any activity that affects your ability to access or work effectively on the cluster, please notify hpcadmin@auburn.edu.

Compute Nodes

Your jobs run on the compute nodes in the cluster and you utilize them by submitting work to the Workload Manager, traditionally known as a queue system. The Workload Manager assigns your job to the compute nodes based on the attributes that you indicated in your job submission, and the resources that you have at your disposal.

Hopper has the following compute nodes:

Category     Count  Processor  Speed    Cores  Memory  Queue    Restricted  qsub Example
Standard     190    E5-2660    2.60GHz  20     128 GB  general  N           qsub -l nodes=3:ppn=20 job.sh
Standard 28  55     E5-2680    2.40GHz  28     128 GB  gen28    N           qsub -q gen28 -l nodes=3:ppn=28 job.sh
Fast Fat     13     E5-2667    3.20GHz  16     256 GB  fastfat  Y           qsub -q fastfat -l nodes=2:ppn=16 job.sh
GPU K80      2      E5-2660    2.60GHz  20     128 GB  gpu      N           qsub -q gpu -l nodes=1:ppn=20 job.sh
Phi 7120P    2      E5-2660    2.60GHz  20     128 GB  phi      N           qsub -q phi -l nodes=1:ppn=20 job.sh
Super        1      E7-4809    2.00GHz  64     1 TB    super    Y           qsub -q super -l nodes=1:ppn=64 job.sh

File Storage

Users are provided a high performance GPFS file system which is used for users’ home directories, the scratch directory and the tools directory.

Home Directory

Each user has their own directory, called a home directory, in /home/<userid>. Your home directory is the primary location for your datasets, output, and custom software and is limited to 2TB.

Home directories are backed up (snapshotted) daily and kept for 90 days. However, it is the user’s responsibility to transfer their data from their home directory back to their computer for permanent storage.

Scratch Directory

All users have access to a large, temporary, work-in-progress directory for storing data, called a scratch directory in /scratch.

Use this directory to store very large datasets for a short period of time and to run your jobs. Although you can submit jobs to run from your home directory, scratch is a better option because it is much larger at 1.4 PB. Files in this location are purged after 30 days of inactivity (access time) and are not backed up.

How to use scratch for your data (an example of these steps follows the list)
  1. Create a directory for your job in scratch.
  2. Copy your input data to this directory in scratch.
  3. Run your job that uses the files in that directory.
  4. Within a week, make sure to copy any needed results back to your home directory.
  5. Delete your directory in scratch.
  6. Create a new directory in scratch for every job you run.
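
In practice, the steps above might look like the following sketch, where the directory name, the dataset in ~/mydata, the job script job.sh, and the results directory are all hypothetical examples:

mkdir /scratch/<auburn-userid>_myjob                # 1. create a job directory in scratch
cp -r ~/mydata /scratch/<auburn-userid>_myjob/      # 2. copy your input data there
cd /scratch/<auburn-userid>_myjob
qsub job.sh                                         # 3. run the job from that directory
# ... after the job finishes ...
cp -r results ~/myjob_results                       # 4. copy needed results back to home
rm -rf /scratch/<auburn-userid>_myjob               # 5. delete your scratch directory
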
Warnings – Scratch Directory

Warning: Do not use scratch as long term storage.

Any data left on scratch is automatically erased after 30 days of inactivity, based on the last access time of the file(s). Each time the data is accessed, the time window is renewed, so as long as you are using your data it will remain available. There is no backup of scratch. Thus any data files left on /scratch must be transferred elsewhere within a few days if they need to be kept.

Tools Directory

Each user has access to a directory for installed software called the tools directory located in /tools. Many of the most popular software packages, compilers, and libraries are installed here.

File Storage Summary

Name     Directory       Purpose                                      Quota   Retention  Backup
Home     /home/<userid>  Small datasets, output and custom software   2 TB    long       Y
Scratch  /scratch        Large datasets and output                    1.4 PB  short      N
Tools    /tools          Software packages, compilers and libraries   N/A     long       Y

How to copy files

To transfer files to your home directory from your local machine:

scp -r <source_filename> <userid>@hopper.auburn.edu:~/<target_filename>

To transfer files to your local machine from your home directory:

scp -r <userid>@hopper.auburn.edu:~/<source_filename> <target_filename>
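
For example, to copy a hypothetical results directory named run01 from your Hopper home directory into the current directory on your local machine:

scp -r <userid>@hopper.auburn.edu:~/run01 .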


Software

Existing Software

OIT manages a set of supported software for use on the Hopper Cluster.  Software binaries and libraries installed globally by OIT administrators are most often found in the /tools directory, each accompanied with a corresponding Environment Module.

The best way to get started using a piece of software is to take a look at the corresponding module files.

To see what software is currently available on Hopper, you can look in the /tools directory or view available software modules …

> ls /tools
> module avail

Each module contains detailed information about the software, including a short description, where to obtain documentation, when and how it was installed, and any corresponding changes that are made to your environment when loaded.

> module show <software_name>[/version]

New Software

Any unlicensed open-source software of use to the general user community can be installed by OIT provided that the Hopper Cluster meets the software requirements. Please contact OIT support staff or submit a software request at https://aub.ie/hpcsw.

Licensed software must have a valid license before it can be installed. Since there are many different types of licenses and different vendor definitions of these licenses, any multi-user licensed software installation needs to be coordinated with OIT support staff. Single-user licensed software should be installed in the user’s /home directory.

Custom software should be built in the user’s /home directory. Users are encouraged to maintain and publish their own local module files to configure the environment for their software.

OIT support staff: hpcadmin@auburn.edu
Software request: https://hpcportal.auburn.edu/hpc/forms/login.php?form=/hpc/forms/sw.php.

Set the Environment

Using Environment Modules

Different programs require that OS environment variables be correctly defined in order to use them. For example, the commonly used $PATH and $LD_LIBRARY_PATH environment variables specify the locations of software binaries and shared libraries, respectively. Most software can be properly executed with the correct manipulation of these variables. More complex software also requires the assignment of additional, custom environment variables.

These variables can be set manually or in the user’s profile through the traditional shell methods…

> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/my/library

However, this can be cumbersome when a user wants to use different versions of an application, employ a different compiler, or build against different libraries. A convenient solution is to use environment modules.

Environment modules provide the user a way to easily and dynamically change their environment. With simple commands, modules define the environment variables of your shell environment, so that the correct versions of executables are in the path and compiler toolchains can find the correct libraries.

OIT creates modules for globally installed software in the tools directory. Users must find and load the module(s) needed by their application.

Common Module Commands
Module Command                Description
module avail [module]         List all modules available to be loaded
module whatis <module>        Display info about a module
module list                   List modules currently loaded
module load <module>          Load a module
module unload <module>        Unload a module
module swap <existing> <new>  Replace a loaded module with another
module purge                  Unload all loaded modules
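
For example, a typical session might look like the following (the compiler and MPI versions shown are illustrative; run module avail to see what is actually installed):

> module avail
> module load gcc/5.3.0
> module load openmpi/1.8.3
> module list
> module swap openmpi/1.8.3 openmpi/2.0.1
> module purge
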
Creating your own modules

Users should create their own modules for software installed in their /home directory.

To create your own module files, just create a directory named ‘privatemodules’ in your /home directory and put your modules there. Once that directory exists, the module system will search it for module files the next time you log in, and any modules found there will be listed as available.
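
As a minimal sketch, the following creates a private module for hypothetical software installed under ~/software/mytool/1.0 (all names and paths are illustrative):

mkdir -p ~/privatemodules/mytool
cat > ~/privatemodules/mytool/1.0 <<'EOF'
#%Module1.0
## Hypothetical private module for software installed in ~/software/mytool/1.0
module-whatis "mytool 1.0 (example private module)"
prepend-path PATH            $::env(HOME)/software/mytool/1.0/bin
prepend-path LD_LIBRARY_PATH $::env(HOME)/software/mytool/1.0/lib
EOF

After your next login, mytool/1.0 should appear in the output of ‘module avail’ and can be loaded with ‘module load mytool/1.0’.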

Users can review existing modules in /tools/modulefiles and use them as a template. More information can be found here:

Modules homepage: http://modules.sourceforge.net/
Modules manual: http://modules.sourceforge.net/man/module.html

Compile Software

You are free to compile and run any custom software needed for your research within your home directory, provided that it is from a reputable, trustworthy source and adheres to the acceptable use policies described above.

Hopper provides a variety of development tools that you can use to build software, including several versions of gcc, make, and openmpi.

Your software will most likely provide a README or INSTALL file in the source distribution that should provide hints on what tools you will need and what procedure to follow. Alternatively, check the software’s web site or other Internet sources if you encounter problems.

For parallelized software, you will want to compile with your software’s recommended version of openmpi, e.g…

> module avail openmpi
> module load openmpi/1.8.3
> ./configure --prefix=/home/<userid>/software/build
> make
> make install


Job Scheduling and Resource Allocation

The Hopper Cluster is a shared resource among the many Auburn University principal investigators (PIs). The goal is to maximize utilization of the computing resources while guaranteeing each PI access to the resources that they have purchased. This requires special software to schedule jobs and manage the resources.

Hopper uses the Torque Resource Manager with the Moab scheduler (aka ‘Torque/Moab’).

Torque is an open source fork of OpenPBS and is maintained by Adaptive Computing; OpenPBS itself is an open source implementation of the Portable Batch System.

Torque serves as Hopper’s Resource Manager, and essentially handles the counting and grouping of available resources in the cluster.

Moab serves as the Workload Manager, and performs the majority of decision making for where and when jobs are allocated.

Hopper implements a shared-maximum model of scheduling, which guarantees that each PI has access to the resources that they have purchased while also providing extra computational power by leveraging underutilized processing capacity. This model relies heavily on Moab “reservations,” which are similar to traditional queues but are defined in terms of ownership. When a reservation’s resources are not in use by their owner, they are made available to the system’s global pool of researchers; jobs from that pool are preempted when the owner eventually requests those resources.

If your job(s) consume more than your share (or your sponsor’s share) of available resources, they have a high chance of being preempted. Therefore, cluster researchers are encouraged to be mindful of their primary allocation of cores, the system load, and the current demand from fellow researchers when requesting resources from the Workload Manager using the commands described below.

Torque/Moab provide commands that give users information about resource availability, helping you obtain quicker job turnaround times and more fully utilize the system. Familiarity with these commands is essential for getting useful work out of the machine.

Job Submission Overview

Hopper’s recommended method for job submission and monitoring is the use of the Torque “q” commands, specifically “qsub”.

Torque/Moab provide a multitude of commands that enable you to, among other things, instruct the cluster on what code and data to act upon, poll and request available resources, and estimate and specify how long a given job should run.

Determining optimal values for “qsub” parameters comes with experience, and it is usually best to submit test jobs to get an estimate for the number of processors and wall time.

Familiarity with the qsub command, specifically its expected syntax, input, and behavior, is crucial for getting work done on Hopper.

How to Submit a Job

Job submission is accomplished using the Torque ‘qsub’ command. This command includes numerous directives which are used to specify resource requirements and other attributes for jobs. Torque directives can be in a batch script as header lines (#PBS) or as command-line options to the qsub command.

The general form of the qsub command:

qsub -l <resource-list> -m abe -M <email addr> <job script>
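
The same directives can instead be placed at the top of the job script as #PBS header lines. A minimal sketch is shown below; the job name, script name, queue, resources, email address, module version, and program name are all illustrative:

#!/bin/bash
#PBS -N myjob
#PBS -q general
#PBS -l nodes=2:ppn=20,walltime=20:00:00
#PBS -m abe
#PBS -M nouser@auburn.edu

cd $PBS_O_WORKDIR            # start in the directory the job was submitted from
module load openmpi/1.8.3
mpirun ./myprogram           # Open MPI typically detects the allocated cores from Torque

Such a script would then be submitted simply with ‘qsub myjob.sh’.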

Job Submission Options
qsub Option                      Description
-q queue                         Specifies a queue
-l resource_list                 Defines the resources required by the job
-m abe                           Sends email if the job is (a) aborted, when it (b) begins, and when it (e) ends
-W x=FLAGS:ADVRES:<reservation>  Specifies a reservation

resource_list

Resource     Description
mem          Maximum amount of physical memory used by the job
nodes        Number of nodes for exclusive use by the job
ppn          Number of processors per node allocated for the job
walltime     Maximum length of time the job can be in the running state, from job start to job completion (default: 2 days; maximum: 90 days)
reservation  Dedicated resources (nodes) based on your group

Examples

Example 1:

This job submission requests 40 processors on two nodes for the job ‘test.sh’ and 20 hr of walltime. It will also email ‘nouser’ when the job begins and ends or if the job is aborted. Since no queue is specified, the general queue is used as it is the default.

qsub -l nodes=2:ppn=20,walltime=20:00:00 -m abe -M nouser@auburn.edu test.sh

Example 2:

This job requests a node with 200 MB of available memory in the gen28 queue. Since no walltime is indicated, the job will get the default two-day walltime.

qsub -q gen28 -l mem=200mb /home/user/script.sh

Example 3:

This job specifies a reservation.

qsub -l nodes=2:ppn=20,walltime=20:00:00 -W x=FLAGS:ADVRES:hpctest_lab.56273 test.sh

In this example, the reservation id is hpctest_lab.56273. Your reservation id will be different.
Please run the showres command to determine your reservation id.

Example 4:

This job specifies a reservation using rsub.

rsub -l nodes=2:ppn=20,walltime=20:00:00 -m abe -M nouser@auburn.edu test.sh

This example does the same thing as the previous example: it specifies a reservation in the job submission. However, it does so by means of a wrapper script ‘rsub’ that finds your correct reservation and includes it without you having to do so yourself.

Testing

Compile and run a test job on the login node first, before submitting a job to the queue.

To see how your job will run on a compute node, exactly as it will using the scheduler, you can use an interactive job in the debug queue:

$ qsub -q debug -l nodes=1:ppn=8 -I -V
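
Once the interactive session starts on a compute node, you can set up your environment and run your code by hand; the module version, directory, and program name below are illustrative:

$ module load openmpi/1.8.3
$ cd /scratch/<auburn-userid>_myjob
$ mpirun -np 8 ./myprogram
$ exit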

Common Resource Commands

Command           Description
pbsnodes          Detailed node information
pbsnodes -l free  Available nodes
showres -n        Nodes assigned to you
showbf            Shows what resources are available for immediate use

Examples

What is the estimated start time for a job requiring 20 processors for 30 minutes?

What resources are available for immediate use?

Note: Tasks are equivalent to processors in this context.
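
As a sketch, these questions map to the following commands. showstart is a standard Moab command, although it is not covered elsewhere in this guide and its exact syntax may vary; here 20@30:00 means 20 processors for 30 minutes:

showstart 20@30:00
showbf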

Monitor Jobs

To monitor the status of a queued or running job, use the qstat command:

Example: Display running jobs

qstat -r

qstat Options

qstat Option   Description
-u user_list   Displays jobs for users listed in user_list
-a             Displays all jobs
-r             Displays running jobs
-f jobid       Displays the full status listing for the specified job
-n             Displays nodes allocated to jobs

To display information about active, eligible and blocked jobs, use the showq command:

showq

To display detailed job state information and diagnostic output for a specified job, use the checkjob command:

checkjob -v <jobid>

To cancel a job:

canceljob <jobid>

In some cases, a job can get stuck and will not respond to the canceljob command. If you have a job that refuses to die, you can try:

mjobctl -F <jobid>

Monitor Resources

Use the checkquota command to check your disk space usage.

checkquota

To see if you have files that are scheduled to expire soon:

expiredfiles


Quick Start

If you are already experienced with HPC clusters, please refer to the Hopper Quick Start Guide to get up and running on the Hopper Cluster as quickly as possible.


Best Practices

How to run a program for the first time:

  • First, run on the login node to make sure that your code will run.
  • Then run using qsub in interactive mode to make sure that it will run on a compute node.
  • Finally, run in batch mode using qsub.

Do not run jobs on the login node except as a test.

  • This means short jobs using small amounts of memory to ensure that your code will run.
  • Processes that violate this will be killed.

Do not submit a job and walk away or leave for the weekend.

  • Make sure the job is running or, if not, know why it’s not running.

Specify walltimes in your job submission.

  • Allows the scheduler to maximize utilization, which means your jobs run sooner.
  • Users should receive an email after a job completes that contains the actual walltime.

Submit short-running jobs with fewer resources in order to reduce likelihood of preemption when not using your group’s reservation.

Clean up when your jobs are finished.

  • Hopper does not provide archival or long-term storage.
  • If files no longer need to be available for work on the system, copy them off and delete them so that the space can be used for active projects.

Pay attention to your disk usage.

  • Once the hard limit on disk space or number of files is reached, your program will stop executing.

Do not share passwords or accounts.

  • If you want others to access your files, then set them to read only.

Reference

Grace Hopper biography

Hopper Account Request

Hopper Software Request

Modules


Glossary

