On the dangers of creating future reservations for all nodes in the testbed
We recently hit a situation in Apt where we created a reservation for the SC conference to use all 60 functional c6220 nodes for a couple of weeks in November. (I am oversimplifying here; it is actually two partially overlapping reservations, but the point is that on November 6th, all nodes are committed.)
However, this reservation has consequences. Every time one of the c6220 nodes goes into hwdown between now and then, all allocations of c6220 nodes immediately start failing, regardless of how many are free at the time. The reason, as I understand it, is that the reservation system can only make accurate reservation decisions if the current schedule is sound. In the above scenario, a node that is forced into hwdown is considered allocated to the hwdown experiment forever, which means we have now violated the future reservation for all nodes in November. Hence, we can no longer allocate (reserve) nodes, because the schedule is pooched and we don't want to make it worse.
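To make the failure mode concrete, here is a minimal sketch of the kind of check described above. It is not the actual reservation system code: the names (Reservation, schedule_is_sound, can_allocate) and the day numbers are hypothetical, and it assumes a hwdown node is simply subtracted from the pool with no end date.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    start: int  # first day of the reservation (hypothetical day numbers)
    end: int    # day the reservation ends (exclusive)
    nodes: int  # number of nodes promised


def schedule_is_sound(total_nodes, hwdown_nodes, reservations):
    """True if every future reservation can still be honored.

    Assumption: a node in hwdown is held forever, so it is removed
    from the pool for every future date.
    """
    usable = total_nodes - hwdown_nodes
    # Peak demand always occurs at the start of some reservation.
    for r in reservations:
        demand = sum(o.nodes for o in reservations
                     if o.start <= r.start < o.end)
        if demand > usable:
            return False
    return True


def can_allocate(requested, free_now, total_nodes, hwdown_nodes, reservations):
    """Refuse every request while the schedule is unsound."""
    if not schedule_is_sound(total_nodes, hwdown_nodes, reservations):
        return False  # schedule already violated; do not make it worse
    return requested <= free_now


# All 60 nodes are promised for two weeks (hypothetical days 310 to 324).
sc = [Reservation(start=310, end=324, nodes=60)]

# Healthy testbed: a one-node request succeeds with 40 nodes free.
ok_before = can_allocate(1, free_now=40, total_nodes=60,
                         hwdown_nodes=0, reservations=sc)

# One node in hwdown: the same request fails even with 39 nodes free.
ok_after = can_allocate(1, free_now=39, total_nodes=60,
                        hwdown_nodes=1, reservations=sc)
```

In this sketch the number of free nodes never enters the soundness check, which is why a single hwdown node is enough to block every allocation.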
In practice, this just isn't going to work. While it does encourage us to get nodes out of hwdown more quickly, everything grinds to a halt in the meantime. Either we need to introduce that "reserve pool" for each node type, or we have to be a little more flexible when dealing with a flawed future schedule.
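As a rough illustration of the two options, here is a sketch under stated assumptions. Neither function exists in the reservation system; the names, the pool size of 4, and the day numbers are all hypothetical, and the "flexible" rule shown (admit an allocation that ends before the first over-committed day) is only one possible reading of "more flexible".

```python
def max_reservable(total_nodes, reserve_pool):
    """Option 1: never let reservations claim the reserve pool."""
    return total_nodes - reserve_pool


def sound_with_pool(total_nodes, hwdown_nodes, peak_reserved):
    """With reservations capped by max_reservable(), the schedule stays
    sound until more nodes fail than the reserve pool holds."""
    return peak_reserved <= total_nodes - hwdown_nodes


def allow_despite_deficit(alloc_end, first_conflict_start):
    """Option 2 (assumed rule): even with an unsound schedule, admit an
    allocation that ends before the first over-committed day, since it
    cannot make that deficit worse."""
    return alloc_end <= first_conflict_start


# Option 1: 60 nodes with a hypothetical reserve pool of 4.
cap = max_reservable(60, 4)                   # at most 56 nodes reservable
one_down = sound_with_pool(60, 1, cap)        # one failure is absorbed
five_down = sound_with_pool(60, 5, cap)       # five failures exceed the pool

# Option 2: the conflict starts on hypothetical day 310.
short_job = allow_despite_deficit(alloc_end=200, first_conflict_start=310)
long_job = allow_despite_deficit(alloc_end=315, first_conflict_start=310)
```

The reserve pool trades some reservable capacity for tolerance of hardware failures, while the flexible rule keeps full capacity but requires the allocator to reason about when the deficit actually begins.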