Hopper HPCC User’s Guide
Version 1.3 – February 1, 2018
Table of Contents
Introduction
About the Hopper Cluster
The Hopper Cluster is Auburn University’s newest and most powerful system for high-performance parallel computing. It is designed to accelerate research in a wide variety of fields by providing high-end computing resources. These resources include centrally managed cluster resources, storage, software and technical user support.
The Hopper Cluster is a collaborative, campus-wide effort of Auburn University funded by:
- The National Science Foundation
- Office of the Vice President for Research and Economic Development
- Office of Information Technology
- Office of the Provost
- College of Sciences and Mathematics
- Samuel Ginn College of Engineering
Hopper Namesake
Auburn has named this supercomputer “Hopper” in honor of the late Rear Admiral Dr. Grace Murray Hopper, an extraordinary woman whose contributions to computer science laid the foundation for modern programming. Among her long list of remarkable achievements are the creation of the “A” compiler, key efforts in the development of COBOL and other high level languages, and significant input into the development of early machines, including UNIVAC.
About this Manual
This manual is the primary reference document for Auburn University’s Hopper High Performance Computing Cluster. It is not intended to be an exhaustive reference, but rather a concise guide to the basics of the system and a starting point for users.
This manual will be revised on an ongoing basis. It is important that users refer to the latest version.
Acceptable Use
Policies outlined in OIT’s Appropriate Use of Information Technology, as well as any other applicable policies, apply to use of the Hopper Cluster. Hopper also has its own guidelines and restrictions for acceptable use, described throughout this document.
Citations and Acknowledgements
Please help demonstrate the importance of computational resources and OIT support staff to research at Auburn University by including an acknowledgement in any publication or presentation made possible by the Hopper Cluster.
Acceptable citations are below:
- This work was completed in part with resources provided by the Auburn University Hopper Cluster.
- We are grateful for the support of the Auburn University Hopper Cluster for assistance with this work.
- We acknowledge the Auburn University Hopper Cluster for support of this work.
If an acknowledgement is included in your work, please send a brief email to hpcadmin@auburn.edu.
Please note that the Hopper Cluster is distinct from, and in no way related to, the Alabama Supercomputer Authority.
System Overview
Hopper is a Lenovo System X based HPC Cluster with approximately 5888 Cores, 39 TB RAM, 1.4 PB Disk, and 175 TFlops*. For details, please visit the AU HPC website.
*as of the document version and date
Accessing Hopper
Request an Account
In order to submit jobs to the Hopper Cluster, you must have a Hopper account:
- Access the Account Request Form
- Login with Auburn credentials
- Select your sponsor from drop-down
- Complete the form
- Accept the Terms of Use
- Submit the request
Your request will be emailed to the sponsor that you selected and they will approve or deny the request at their discretion. Upon approval, you will receive a Welcome email. It will confirm that your account was created and include the Hopper Quick Start Guide.
Connect to Hopper
To connect to Hopper, open a terminal program, then ssh to ‘hopper.auburn.edu’ with your Auburn userid.
ssh <auburn-userid>@hopper.auburn.edu
If successfully logged in, you should now see a command prompt and be in your home directory.
You will need to use an AU VPN connection to access Hopper from off-campus. There is more information available online at: http://www.auburn.edu/oit/vpn .
Locations and Resources
Login Node
The login node ( hopper.auburn.edu ) is your sole interface to the Hopper Cluster and is accessed remotely over the network using ssh. It is where you can issue commands, submit jobs, see results and manage files.
Although it is acceptable to run short, non-intensive jobs on the login node during testing, the login node is not intended to be used for computationally intensive work.
Running intense computations on the login node affects the performance and availability of the cluster for all other users and is therefore not allowed.
Processes that violate this policy will be killed automatically. You can check the impact of your running processes, or the processes of other users, with:
top
Although violating processes are killed automatically, please notify hpcadmin@auburn.edu if you see any activity that affects your ability to access or work effectively on the cluster.
Compute Nodes
Your jobs run on the compute nodes in the cluster and you utilize them by submitting work to the Workload Manager, traditionally known as a queue system. The Workload Manager assigns your job to the compute nodes based on the attributes that you indicated in your job submission, and the resources that you have at your disposal.
Hopper has the following compute nodes:
Category | Count | Processor | Speed | Cores | Memory | Queue | Restricted | qsub Example |
---|---|---|---|---|---|---|---|---|
Standard | 190 | E5-2660 | 2.60GHz | 20 | 128 GB | general | N | qsub -l nodes=3:ppn=20 job.sh |
Standard 28 | 55 | E5-2680 | 2.40GHz | 28 | 128 GB | gen28 | N | qsub -q gen28 -l nodes=3:ppn=28 job.sh |
Fast Fat | 13 | E5-2667 | 3.20GHz | 16 | 256 GB | fastfat | Y | qsub -q fastfat -l nodes=2:ppn=16 job.sh |
GPU K80 | 2 | E5-2660 | 2.60GHz | 20 | 128 GB | gpu | N | qsub -q gpu -l nodes=1:ppn=20 job.sh |
Phi 7120P | 2 | E5-2660 | 2.60GHz | 20 | 128 GB | phi | N | qsub -q phi -l nodes=1:ppn=20 job.sh |
Super | 1 | E7-4809 | 2.00GHz | 64 | 1 TB | super | Y | qsub -q super -l nodes=1:ppn=64 job.sh |
File Storage
Users are provided a high performance GPFS file system which is used for users’ home directories, the scratch directory and the tools directory.
Home Directory
Each user has their own directory, called a home directory, in /home/<userid>. Your home directory is the primary location for your datasets, output, and custom software and is limited to 2TB.
Home directories are backed up (snapshotted) daily and kept for 90 days. However, it is the user’s responsibility to transfer their data from their home directory back to their computer for permanent storage.
Scratch Directory
All users have access to a large, temporary, work-in-progress directory for storing data, called a scratch directory in /scratch.
Use this directory to store very large datasets for a short period of time and to run your jobs. Although you can submit jobs to run from your home directory, scratch is a better option as it is much larger at 1.4 PB. Files in this location are purged after 30 days of inactivity (access time) and are not backed up.
How to use scratch for your data
- Create a directory for your job in scratch.
- Copy your input data to this directory in scratch.
- Run your job that uses the files in that directory.
- Within a week, be sure to copy any needed results back to your home directory.
- Delete your directory in scratch.
- Create a new directory in scratch for every job you run.
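A minimal sketch of this workflow from the login node (the directory and file names are hypothetical, and a per-user subdirectory under /scratch is assumed):
mkdir -p /scratch/<userid>/myjob01                      # 1. create a job directory in scratch
cp -r ~/myproject/input /scratch/<userid>/myjob01/      # 2. copy input data into it
cd /scratch/<userid>/myjob01                            # 3. run the job from that directory
qsub -l nodes=1:ppn=20 ~/myproject/job.sh
cp -r /scratch/<userid>/myjob01/results ~/myproject/    # 4. copy needed results back home
rm -rf /scratch/<userid>/myjob01                        # 5. delete the scratch directory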
Warnings – Scratch Directory
Warning: Do not use scratch as long term storage.
Any data left on scratch is automatically erased after 30 days of inactivity, based on the last access time of the file(s). Each time the data is accessed, the time window is renewed, so as long as you are using your data it will remain available. There is no backup of scratch. Thus any data files left on /scratch must be transferred elsewhere within a few days if they need to be kept.
Tools Directory
Each user has access to a directory for installed software called the tools directory located in /tools. Many of the most popular software packages, compilers, and libraries are installed here.
File Storage Summary
Name | Directory | Purpose | Quota | Retention | Backup |
---|---|---|---|---|---|
Home | /home/userid | Small datasets, output and custom software | 2 TB | long | Y |
Scratch | /scratch | Large datasets and output | 1.4 PB | short | N |
Tools | /tools | Software packages, compilers and libraries | N/A | long | Y |
How to copy files
To transfer files to your home directory from your local machine:
scp -r <source_filename> <userid>@hopper.auburn.edu:~/<target_filename>
To transfer files to your local machine from your home directory:
scp -r <userid>@hopper.auburn.edu:~/<source_filename> <target_filename>
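For example, to copy a hypothetical results directory named run01 from your Hopper home directory into the current directory on your local machine:
scp -r <userid>@hopper.auburn.edu:~/run01 .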
Software
Existing Software
OIT manages a set of supported software for use on the Hopper Cluster. Software binaries and libraries installed globally by OIT administrators are most often found in the /tools directory, each accompanied with a corresponding Environment Module.
The best way to get started using a piece of software is to take a look at the corresponding module files.
To see what software is currently available on Hopper, you can look in the /tools directory or view available software modules …
> ls /tools
> module avail
Each module contains detailed information about the software, including a short description, where to obtain documentation, when and how it was installed, and any corresponding changes that are made to your environment when it is loaded:
> module show <software_name>[/version]
New Software
Any unlicensed open-source software of use to the general user community can be installed by OIT provided that the Hopper Cluster meets the software requirements. Please contact OIT support staff or submit a software request at https://aub.ie/hpcsw.
Licensed software must be legally licensed before it can be installed. Since there are many different types of licenses and different vendor definitions of these licenses, any multi-user licensed software installation needs to be coordinated with OIT support staff. Single-user licensed software should be installed in the user’s /home directory.
Custom software should be built in the user’s /home directory. Users are encouraged to maintain and publish their own local module files to configure the environment for their software.
OIT support staff: hpcadmin@auburn.edu
Software request: https://hpcportal.auburn.edu/hpc/forms/login.php?form=/hpc/forms/sw.php.
Set the Environment
Using Environment Modules
Different programs require that OS environment variables be correctly defined in order to use them. For example, the commonly used $PATH and $LD_LIBRARY_PATH environment variables specify where executables and shared libraries, respectively, are located. Most software can be run properly with the correct manipulation of these variables. More complex software may also require additional, custom environment variables to be set.
These variables can be set manually or in the user’s profile through the traditional shell methods…
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/my/library
However, this can be cumbersome when a user wants to use different versions of an application, employ a different compiler, or build against different libraries. A convenient solution is to use environment modules.
Environment modules provide the user a way to easily and dynamically change their environment. With simple commands, modules define the environment variables of your shell environment, so that the correct versions of executables are in the path and compiler toolchains can find the correct libraries.
OIT creates modules for globally installed software in the tools directory. Users must find and load the module(s) needed by their application.
Common Module Commands
Module Command | Description |
---|---|
module avail [module] | List all modules available to be loaded |
module whatis <module> | Display info about a module |
module list | List modules currently loaded |
module load <module> | Load a module |
module unload <module> | Unload a module |
module swap <existing> <new> | Replace one loaded module with another |
module purge | Unload all loaded modules |
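A typical session might look like the following (the versions shown are examples only; run module avail to see what is actually installed):
> module avail openmpi
> module load gcc openmpi/1.8.3
> module list
> module swap openmpi/1.8.3 openmpi/2.0.1
> module purge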
Creating your own modules
Users should create their own modules for software installed in their /home directory.
To create your own module files, just create a directory named ‘privatemodules’ in your /home directory and put your module files there. Once that directory exists, the module system will search it the next time you log in, and your modules will be listed as available.
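For example, a minimal private modulefile for a hypothetical tool installed under ~/apps/mytool/1.0 could be created from the login node like this (the tool name, version, and paths are placeholders):
mkdir -p ~/privatemodules/mytool
cat > ~/privatemodules/mytool/1.0 << 'EOF'
#%Module1.0
## Hypothetical private modulefile for a user-installed tool
module-whatis "mytool 1.0 (installed in the user's home directory)"
prepend-path PATH $env(HOME)/apps/mytool/1.0/bin
prepend-path LD_LIBRARY_PATH $env(HOME)/apps/mytool/1.0/lib
EOF
After your next login, ‘module avail’ should list mytool/1.0 and ‘module load mytool/1.0’ will add it to your environment.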
Users can review existing modules in /tools/modulefiles and use them as a template. More information can be found here:
Modules homepage: http://modules.sourceforge.net/
Modules manual: http://modules.sourceforge.net/man/module.html
Compile Software
You are free to compile and run any custom software needed for your research within your home directory, provided that it is from a reputable, trustworthy source and adheres to the acceptable use policies described above.
Hopper provides a variety of development tools that you can use to build software, including several versions of gcc, make, and openmpi.
Your software will most likely provide a README or INSTALL file in the source distribution that should provide hints on what tools you will need and what procedure to follow. Alternatively, check the software’s web site or other Internet sources if you encounter problems.
For parallelized software, you will want to compile with your software’s recommended version of openmpi, e.g.:
> module avail openmpi
> module load openmpi/1.8.3
> ./configure --prefix=/home/<userid>/software/build
> make
> make install
Job Scheduling and Resource Allocation
The Hopper Cluster is a shared resource among the many Auburn University principal investigators (PIs). The goal is to maximize utilization of the computing resources while guaranteeing each PI access to the resources that they have purchased. This requires special software to schedule jobs and manage the resources.
Hopper uses the Torque Resource Manager with the Moab scheduler (aka ‘Torque/Moab’).
Torque is an open-source fork of OpenPBS maintained by Adaptive Computing; OpenPBS, in turn, is an open-source implementation of the Portable Batch System.
Torque serves as Hopper’s Resource Manager, and essentially handles the counting and grouping of available resources in the cluster.
Moab serves as the Workload Manager, and performs the majority of decision making for where and when jobs are allocated.
Hopper implements a shared-maximum model of scheduling, which guarantees that each PI has access to the resources they have purchased while also providing extra computational power by leveraging underutilized processing capacity. This model relies heavily on Moab “reservations”, which are similar to traditional queues but are defined in terms of ownership. When a reservation’s resources are idle, they are available to the cluster’s general pool of researchers; when the owner eventually requests them, jobs running on those resources may be preempted.
If your job(s) consume more than your share (or your sponsor’s share) of available resources, they have a high chance of being preempted. Therefore, cluster researchers are encouraged to be mindful of their primary allocation of cores, the system load, and the current demand from fellow researchers when requesting resources from the Workload Manager using the commands described below.
Torque/Moab provide commands that give users information about resource availability, helping them obtain quicker job turnaround times and utilize the system more fully. Familiarity with these commands is essential for getting useful work out of the machine.
Job Submission Overview
Hopper’s recommended method for job submission and monitoring is the use of the Torque “q”-based commands, specifically “qsub”.
Torque/Moab provide a multitude of commands that enable you to, among other things, instruct the cluster on what code and data to act upon, poll and request available resources, and estimate and specify how long a given job should run.
Determining optimal values for “qsub” parameters comes with experience, and it is usually best to submit test jobs to get an estimate for the number of processors and wall time.
Familiarity with the qsub command, specifically its expected syntax, input, and behavior, is crucial for getting work done on Hopper.
How to Submit a Job
Job submission is accomplished using the Torque ‘qsub’ command. This command includes numerous directives which are used to specify resource requirements and other attributes for jobs. Torque directives can be in a batch script as header lines (#PBS) or as command-line options to the qsub command.
The general form of the qsub command:
qsub -l <resource-list> -m abe -M <email_addr> <job_script>
Job Submission Options
qsub Option | Description |
---|---|
-q queue | Specifies a queue |
-l resource_list | Defines resources required by the job |
-m abe | Sends email if the job is (a) aborted, when it (b) begins, and when it (e) ends |
-W x=FLAGS:ADVRES:<reservation_id> | Specifies a reservation |
resource_list
Resource | Description |
---|---|
mem | Max amount of physical memory used by the job |
nodes | Number of nodes for exclusive use by the job |
ppn | Number of processors per node allocated for the job |
walltime | Max length of time the job can be in the running state, from job start to job completion ( Default: 2 days / Maximum: 90 days ) |
reservation | Dedicated resources ( nodes ) based on your group |
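As noted above, these directives can also be placed at the top of the job script itself as #PBS header lines. A minimal sketch of such a script (the job name, module version, and program are examples only):
#!/bin/bash
#PBS -q general
#PBS -l nodes=2:ppn=20,walltime=20:00:00
#PBS -m abe
#PBS -M nouser@auburn.edu
#PBS -N testjob
cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
module load openmpi/1.8.3     # load the required environment
mpirun ./my_program           # run the (hypothetical) MPI program
The script is then submitted simply as: qsub test.sh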
Examples
Example 1:
This job submission requests 40 processors on two nodes for the job ‘test.sh’ and 20 hr of walltime. It will also email ‘nouser’ when the job begins and ends or if the job is aborted. Since no queue is specified, the general queue is used as it is the default.
qsub -l nodes=2:ppn=20,walltime=20:00:00 -m abe -M nouser@auburn.edu test.sh
Example 2:
This job requests a node with 200MB of available memory in the gen28 queue. Since no walltime is indicated, the job will get the two day default walltime.
qsub -q gen28 -l mem=200mb /home/user/script.sh
Example 3:
This job specifies a reservation.
qsub -l nodes=2:ppn=20,walltime=20:00:00 -W x=FLAGS:ADVRES:hpctest_lab.56273 test.sh
In this example, the reservation id is hpctest_lab.56273. Your reservation id will be different.
Please run the showres command to determine your reservation id.
Example 4:
This job specifies a reservation using rsub.
rsub -l nodes=2:ppn=20,walltime=20:00:00 -m abe -M nouser@auburn.edu test.sh
This example does the same thing as the previous one: it specifies a reservation in the job submission. However, it does so by means of a wrapper script, ‘rsub’, which finds your correct reservation and includes it for you automatically.
Testing
Compile and run a test job on the login node before submitting it to the queue.
To see how your job will run on a compute node, exactly as it will using the scheduler, you can use an interactive job in the debug queue:
$ qsub -q debug -l nodes=1:ppn=8 -I -V
Common Resource Commands
Command | Description |
---|---|
pbsnodes | Detailed node information |
pbsnodes -l free | Available nodes |
showres -n | Nodes assigned to you |
showbf | Shows what resources are available for immediate use |
Examples
What is the estimated start time for a job requiring 20 processors for 30 minutes?
What resources are available for immediate use?
Note: Tasks are equivalent to processors in this context.
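For example, assuming the standard Moab showstart command is available on your installation, these questions can be answered with:
showstart 20@30:00   # estimated start time for a job needing 20 processors for 30 minutes
showbf               # resources available for immediate use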
Monitor Jobs
To monitor the status of a queued or running job, use the qstat command:
Example: Display running jobs
qstat -r
qstat Options
qstat Option | Description |
---|---|
-u user_list | Displays jobs for users listed in user_list |
-a | Displays all jobs |
-r | Displays running jobs |
-f jobid | Displays the full status listing for the specified job |
-n | Displays nodes allocated to jobs |
To display information about active, eligible and blocked jobs, use the showq command:
showq
To display detailed job state information and diagnostic output for a specified job, use the checkjob command:
checkjob -v <jobid>
To cancel a job:
canceljob <jobid>
In some cases, a job can get stuck and will not respond to the canceljob command. If you have a job that refuses to die, you can try:
mjobctl -F <jobid>
Monitor Resources
Use the checkquota command to check your disk space usage.
checkquota
To see if you have files that are scheduled to expire soon:
expiredfiles
Quick Start
If you are already experienced with HPC clusters, please refer to the Hopper Quick Start Guide to get up and running on the Hopper Cluster as quickly as possible.
Best Practices
How to run a program for the first time:
- First, run on login node to make sure that your code will run.
- Then run using qsub in interactive mode to make sure that it will run on a compute node.
- Finally, run in batch mode using qsub.
Do not run jobs on the login node except as a test.
- This means short jobs using small amounts of memory to ensure that your code will run.
- Processes that violate this will be killed.
Do not submit a job and walk away or leave for the weekend.
- Make sure the job is running or, if not, know why it’s not running.
Specify walltimes in your job submission.
- Allows the scheduler to maximize utilization, which means your jobs run sooner.
- Users should receive an email after a job completes that contains the actual walltime.
Submit short-running jobs with fewer resources in order to reduce likelihood of preemption when not using your group’s reservation.
Clean up when your jobs are finished.
- Hopper does not provide archival or long-term storage.
- If files no longer need to be available for work on the system, copy them off and delete them so that the space can be used for active projects.
Pay attention to your disk usage.
- Once the hard limit on disk space or number of files is reached, your program will stop executing.
Do not share passwords or accounts.
- If you want others to access your files, then set them to read only.
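For example, to give other users read-only access to a hypothetical shared directory in your home directory (they also need execute permission on your home directory in order to traverse it):
chmod o+x ~                   # allow others to traverse your home directory
chmod -R o+rX ~/shared_data   # grant read-only access to the shared directory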
Reference
Grace Hopper biography
Hopper Account Request
Hopper Software Request
Modules
- Modules homepage: http://modules.sourceforge.net/
- Modules manual: http://modules.sourceforge.net/man/module.html