Maintenance Schedule and Notifications

What does the maintenance period mean to you?

System maintenance is required to ensure the clusters continue to work efficiently with minimum downtime. It may include hardware and software upgrades, configuration changes, filesystem maintenance, and other tasks. The maintenance period is the designated window of time during which this work is done.

Jobs cannot run during the maintenance period, and jobs submitted before it are affected as well: if a job's walltime extends beyond the start of the maintenance period, the job will not run and will remain idle. As a result, jobs submitted as the maintenance period approaches must request progressively shorter walltimes in order to run.

A script, max-walltime.sh, has been created to calculate the maximum walltime available before the maintenance period begins. Its output can be used to specify the walltime in your job submissions.
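The idea behind such a script can be sketched as follows; this is an illustrative approximation, not the actual max-walltime.sh on the cluster, and the maintenance start time shown is an assumed example:

```shell
#!/bin/sh
# Sketch of the idea behind max-walltime.sh (hypothetical implementation;
# the real script on the cluster may differ).

# Convert a count of seconds into HH:MM:SS walltime format.
secs_to_walltime() {
    total=$1
    printf '%02d:%02d:%02d' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

MAINT_START="2019-09-03 00:00:00"        # assumed example maintenance start
now=$(date +%s)
start=$(date -d "$MAINT_START" +%s)      # GNU date syntax; BSD date differs
remaining=$((start - now))

if [ "$remaining" -gt 0 ]; then
    echo "Maximum walltime: $(secs_to_walltime "$remaining")"
else
    echo "Maintenance period has already started."
fi
```

On the cluster, the script's output could then be passed directly to a job submission, e.g. `qsub -l walltime=$(max-walltime.sh) job.sh`, using Torque's `-l walltime` resource syntax.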


It is important to note that any maintenance or upgrade carries some risk of data loss. The home and tools directories are backed up, but the scratch directory is not.

It is the individual researcher’s responsibility to ensure that their critical data is copied to a safe location.


2019

Semi-Annual Maintenance Period: Sept. 3-9: Completed

  • Sept. 3
    • Upgraded the quorum node in the GPFS cluster including firmware and OS.
  • Sept. 4
    • Upgraded one of the two GSS servers in the GPFS cluster.
  • Sept. 5
    • Upgraded the remaining GSS server.
    • Upgraded Hopper login and management nodes.
  • Sept. 6
    • Ran post-maintenance health checks and confirmed job submission.
    • End of system maintenance period.

Semi-Annual Maintenance Period: March 4-8: Completed

  • March 4
    • Installed new Bright Cluster Manager licenses for entire cluster.
    • Restarted nodes that had logged hardware events, including the login node.
    • Assigned static IPMI IP addresses for two nodes.
  • March 5
    • Created new vdisks for metadata in GPFS.
    • Increased inode setting for the home fileset in GPFS.
    • Resolved inconsistencies in compute node configuration (hosts file and hyperthreading) and restarted.
    • Replaced failed memory in one of the nodes.
    • Resolved Infiniband networking issues.
    • Cleaned up remnant job files.
  • March 6
    • Continued work on Infiniband issues.
    • Continued cleanup of remnant job files.
  • March 7
    • Completed cleanup of remnant job files.
    • Applied patch to address software bug that caused Torque server crashes in Dec. 2018.
    • Performed health check on CASIC.
    • Restarted NSDs on CASIC.
    • Restarted problem nodes on CASIC.
    • Completed work on Infiniband network issues.
  • March 8
    • Tested job submission.
    • End of system maintenance period.

2018

Annual Maintenance Period: August 6-10: Completed

  • August 6
    • Increased the max client connections on Hopper and restarted the scheduler.
    • Began implementing new database backup strategy on Hopper.
  • August 7
    • Implemented configuration changes for GPFS snapshots on Hopper and restarted GPFS.
  • August 8
    • Committed the GPFS upgrade, giving us access to new features on Hopper.
    • Implemented new login node on Hopper.
    • Implemented configuration changes to greatly enhance performance monitoring on Hopper.
    • Resolved technical issue on CASIC with one of the compute nodes.
  • August 9
    • Completed system maintenance on the CASIC cluster and opened for job submission.
    • Finalized new GPFS functionality related to the upgrade on Hopper.
  • August 10
    • Completed system maintenance on the Hopper cluster and opened for job submission.

2017

Hopper Cluster

GPFS Upgrade: September 19-21: Completed

  • Sep. 21 Update
    • The GPFS upgrade is complete and the cluster is now available. Please let us know if you encounter any problems.
  • This upgrade was originally scheduled for Sep. 5-8, but was postponed due to complications related to Hurricane Irma.

Moab/Torque Upgrade: August 22-24: Completed

  • Upgrade Complete
  • Some post-upgrade tasks still in progress

CASIC Cluster

GPFS Upgrade: September 5-8: Completed

  • Upgrade Complete
  • Some post-upgrade tasks remain
  • What does this mean to me?
    • You can now submit jobs normally.