Ownership and Preemption
As a “condo-model” cluster, Hopper is funded by a group of financial stakeholders known as Principal Investigators (PIs). Each PI owns a subset of nodes within the cluster and is guaranteed that purchased processing power on demand. Alongside this guarantee, a secondary goal of the machine is to promote efficiency by allowing all cluster researchers to leverage any unused resources in the system. Properly implemented, this approach achieves both goals implicitly, without adding complexity to job submission or research workflows.
To strike this balance between node ownership, overall efficiency, and automation, we have chosen Moab’s condo-model approach for Auburn University’s Hopper Cluster scheduler configuration.
For example, consider a small research community consisting of three disciplines: biology, physics, and chemistry. Each group has a Principal Investigator who has purchased some percentage of a hypothetical 100-node cluster. To account for this, cluster administrators define three separate groups within the system:
$ getent group biology_lab physics_lab chemist_lab
biology_lab:*:5001:user1,user4,user8
physics_lab:*:5002:user2,user6,user7,user9
chemist_lab:*:5003:user3,user5
Each of these groups is assigned full ownership of a subset of nodes, guaranteeing that those nodes are available when needed. Let us define our hypothetical division of resources as:
biology_lab: 30 Nodes
physics_lab: 50 Nodes
chemist_lab: 20 Nodes
In preparation for assigning global access, we also define a global “research” group of which all research users are members:
research:*:5000:user1,user2,user3,user4,user5,user6,user7,user8,user9
These groups are then used to maintain division of resources and assign appropriate access in a variety of ways within the system, including job submission.
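For instance, on a Torque/Moab stack a user can name their group explicitly at submission time (a sketch; the script name run.sh is hypothetical):

```
# Hypothetical submission: user1 runs a job under the biology_lab group
$ qsub -W group_list=biology_lab run.sh
```

The scheduler can then match the job’s group against the access lists attached to each block of nodes.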
Efficiency and Research Impact
Imagine that in the Fall semester, most of the physics researchers are busy with instruction, and consequently their nodes are underutilized. Rather than letting hundreds of powerful cores sit idle, these nodes could be used in the short term by other cluster researchers (anyone in the “research” group) who currently need processing power. Once the Fall semester has concluded and physics researchers are once again active in the system, they reclaim their nodes and continue their work.
This is the goal of a condo-model cluster: to guarantee purchased processing power while maximizing efficiency in times of underutilization. If this can be achieved in practical fashion, we have maximized the processing potential of the Hopper cluster, and consequently improved the impact of Auburn University as a research institution.
Finding the balance between these two seemingly conflicting goals is a complex problem, and requires an imaginative solution. One such solution is offered through the use of Moab “reservations”.
Moab reservations are similar to traditional cluster queues, in that they can serve as the gateway to our predefined groups of nodes and place basic constraints on their use. However, Moab reservations are much more capable and intelligent than a traditional queue, as is reflected in the large number of potential configuration options.
To maintain a guaranteed level of ownership and provide global access, we can implement a special type of reservation (a “standing reservation”), which represents a persistent, infinitely long block of processing time on a given group of nodes. As with a traditional reservation, an access list (in our case, a group of users) is associated with each one, granting priority access to the specified resources.
In our hypothetical research cluster, we would create a standing reservation for each of our PI research groups (biology_lab, physics_lab, and chemist_lab), assigning the corresponding number of nodes to each reservation and granting ownership to the matching research group.
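In moab.cfg, these three standing reservations might be sketched as follows (the reservation names and host ranges are hypothetical; PERIOD=INFINITY makes each reservation persistent):

```
# Hypothetical moab.cfg fragment: one persistent standing reservation per PI group
SRCFG[biology]   PERIOD=INFINITY HOSTLIST=node[001-030] GROUPLIST=biology_lab
SRCFG[physics]   PERIOD=INFINITY HOSTLIST=node[031-080] GROUPLIST=physics_lab
SRCFG[chemistry] PERIOD=INFINITY HOSTLIST=node[081-100] GROUPLIST=chemist_lab
```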
As with reservations in everyday life, some are cancelled or simply go unused, and other participants can then take advantage of the available seating. In our case, we grant secondary access to all standing reservations to the global “research” group, using special attributes to indicate the constraints within which these jobs must run and how preemption is handled.
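One way to express this secondary access in moab.cfg is through ACL affinity and owner-preemption flags (a sketch shown for the physics reservation only; the trailing “-” marks negative affinity, so “research” jobs are placed on these nodes only when necessary):

```
# Hypothetical fragment: physics_lab owns the reservation; members of
# "research" may run here as guests, subject to preemption by the owner.
SRCFG[physics] GROUPLIST=physics_lab,research-
SRCFG[physics] OWNER=GROUP:physics_lab
SRCFG[physics] FLAGS=OWNERPREEMPT
```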
Through job preemption, reservations provide guaranteed node access to PIs and their sponsored users.
As seen in our hypothetical cluster, the chemist_lab group owns 20 nodes. In November, however, the chemistry group is preparing a research paper for presentation at an important regional conference. Through their secondary access to the physics_lab standing reservation, user3 and user5 are taking advantage of underutilized processing power from the pool of physics nodes in hopes of expediting their work, currently consuming a combined total of 1400 cores. They have exceeded the number of processors assigned to their own group and have consumed all idle resources owned by the physics group as well.
As user2 from the physics lab finishes up her Fall instruction work, she decides to run some jobs on the cluster. Upon submission, the Moab scheduler attempts to calculate the most efficient allocation of resources and finds that all nodes are currently busy. However, Moab recognizes that nodes owned by physics_lab are currently in use by secondary users in the “research” group, and initiates job preemption. After a short time, Moab gracefully re-queues some of the running chemist_lab jobs and moves user2’s jobs into the running state.
Through preemption, owned resources are made available by suspending or requeueing jobs that originate from users or groups who do not hold primary ownership.
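The requeue behavior described above is governed by Moab’s cluster-wide preemption policy; a minimal sketch:

```
# Hypothetical moab.cfg fragment: preempted jobs are requeued rather than killed
# (other supported policies include SUSPEND, CHECKPOINT, and CANCEL)
PREEMPTPOLICY REQUEUE
```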
Considering the potential benefits, pursuing a practical implementation of this condo-model approach is a worthwhile effort: it offers guaranteed levels of service and optimized utilization of resources with minimal added complexity to job submission.
While successful implementation of this scheme will maximize the processing power at your disposal, cluster researchers are still encouraged to be mindful of their primary allocation of cores, the system load, and the current demand from fellow researchers when requesting resources from the Workload Manager using secondary access.
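In practice, standard Moab queries can help with this kind of courtesy check before requesting secondary resources (output formats vary by site):

```
# Summarize running, idle, and blocked jobs across the cluster
$ showq
# Report resources currently available for immediate use
$ showbf
```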