Periodic retries for nodes in `hwdown`
We spend a considerable amount of time on an ongoing basis dealing with nodes in the `hwdown` experiment. By the time we diagnose such a node, quite often the problem has disappeared or we discover the problem is easily fixable. Worse, if we don't notice a node in there quickly enough, it can get pushed down the list by more recent failures and fall off our radar, winding up in `hwdown` for months.
So... @eeide asks, "Can we do better?" Maybe we could periodically release nodes from `hwdown` to give them a second chance (which is often what we do manually when we don't have time to diagnose).
Note that this is mostly an issue for nodes of the less popular types (e.g., `pc3000`s). Moving nodes of frequently used types (e.g., those with GPUs) into `hwdown` will either cause overbooking problems in the reservation system or draw user complaints, so we will act on those quickly.
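
To make the idea concrete, here is a rough sketch of how a periodic job might choose which nodes to release. Everything in it is an assumption for illustration: the `DownNode` record, the `pick_retry_candidates` helper, the seven-day minimum age, and the cap on automatic retries are hypothetical and do not correspond to any existing testbed code. The retry cap is included so that a node that keeps failing eventually stays in `hwdown` for a human to look at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DownNode:
    """Hypothetical record for a node currently held in hwdown."""
    node_id: str
    node_type: str
    entered_hwdown: datetime  # when the node was last moved into hwdown
    retries: int = 0          # automatic releases already attempted


def pick_retry_candidates(nodes, now, min_age=timedelta(days=7), max_retries=3):
    """Return nodes that have sat in hwdown for at least min_age and
    still have automatic retries left, oldest first."""
    eligible = [
        n for n in nodes
        if now - n.entered_hwdown >= min_age and n.retries < max_retries
    ]
    # Oldest entries first, since those are the ones most likely forgotten.
    return sorted(eligible, key=lambda n: n.entered_hwdown)
```

A periodic job could call `pick_retry_candidates` with the current contents of `hwdown`, release each returned node, and increment its `retries` count if it lands back in `hwdown` afterwards.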