Doing HW/SW node updates in the era of the reservation system
We have been struggling with how to take specific nodes out of the free pool for hardware or firmware updates without throwing the reservation system into overbook. The traditional way was to use sched_reserve to ensure that the desired nodes would go into hwdown, either immediately if free or whenever they next became free.
Right now, if you do such a sched_reserve when the node type in question is at or near full use, the reservation system will go into overbook when the reserved node is moved into hwdown. Really, this is the case whenever any node goes into hwdown unexpectedly while resources are tight.
We discussed keeping some number of each node type out of the general allocation pool so that we would have some slop to address this situation. However, just having N nodes out of the pool doesn't fully address the problem, as we need a way to free a node from that slop pool whenever a general node is taken out of service.
And note that I am conflating a reservation with an experiment here, essentially saying that we have some new experiment, say hwfixmysorryass, that can have at most N nodes of a particular type in it. Whenever the (N+1)th node gets reserved to it, we free up the oldest of the other N nodes. The "N" and "of a particular type" are reservation-y concepts, while we are reserving specific nodes to an actual experiment. Not sure how to reconcile this.
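To make the "at most N nodes of a type, free the oldest on the (N+1)th" rule concrete, here is a minimal sketch of the bookkeeping only. It is not tied to any existing testbed code: the class name MaintenancePool, its methods, and the node and type names in the usage note are all hypothetical, and it deliberately ignores the open question of whether this state belongs to a reservation or to an experiment.

```python
from collections import OrderedDict, defaultdict


class MaintenancePool:
    """Hypothetical model of a holding experiment capped at N nodes per type."""

    def __init__(self, limit_per_type):
        # N: maximum number of nodes of any one type held at once (assumed
        # to be the same for every type, purely to keep the sketch short).
        self.limit = limit_per_type
        # node type -> held nodes, in the order they were reserved (oldest first)
        self.held = defaultdict(OrderedDict)

    def reserve(self, node_id, node_type):
        """Hold node_id; return the node freed to make room, or None."""
        nodes = self.held[node_type]
        if node_id in nodes:
            return None  # already held, nothing changes
        nodes[node_id] = True
        if len(nodes) > self.limit:
            # The (N+1)th node just arrived: release the oldest held node.
            oldest, _ = nodes.popitem(last=False)
            return oldest
        return None

    def nodes(self, node_type):
        """Nodes of node_type currently held, oldest first."""
        return list(self.held[node_type])
```

With N = 2, reserving pc1 and pc2 of some type frees nothing; reserving pc3 of the same type returns pc1 as the node to put back in the free pool, leaving pc2 and pc3 held. Types are tracked independently, so a node of another type does not displace anything.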