Periodic retries for nodes in `hwdown`

We spend a considerable amount of time on an ongoing basis dealing with nodes in the hwdown experiment. Quite often, by the time we diagnose such a node, the problem has disappeared or turns out to be easily fixable. Worse, if we don't notice a node in there quickly enough, it can get pushed down the list by more recent failures and fall off our radar, winding up in hwdown for months.

So... @eeide asks, "Can we do better?" One option is to periodically release nodes from hwdown to give them a second chance, which is often what we do manually anyway when we don't have time to diagnose them.
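
A rough sketch of what the selection step of such a periodic retry might look like, in Python. Everything here is hypothetical: `DownNode`, `nodes_due_for_retry`, and the seven-day `retry_after` default are made-up names and values for illustration, not existing testbed code, and the sketch only picks candidates. It does not perform the actual release from hwdown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DownNode:
    # Hypothetical record for a node sitting in hwdown.
    node_id: str
    entered_hwdown: datetime  # when the node was last moved into hwdown


def nodes_due_for_retry(nodes, now, retry_after=timedelta(days=7)):
    """Return nodes that have been in hwdown for at least retry_after.

    The result is sorted oldest first, so long-forgotten nodes are
    retried before more recent failures. The seven-day default is an
    assumed value, not a recommendation.
    """
    due = [n for n in nodes if now - n.entered_hwdown >= retry_after]
    return sorted(due, key=lambda n: n.entered_hwdown)
```

Sorting oldest first is a deliberate choice here: it targets exactly the nodes that have been pushed down the list and forgotten. A periodic job could call this and hand the result to whatever mechanism frees a node from hwdown.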

Note that this is mostly an issue for nodes of the less popular types (e.g., pc3000s). When nodes of frequently used types (e.g., those with GPUs) land in hwdown, either the reservation system runs into overbooking problems or users complain, so we act quickly.

Edited Jun 29, 2022 by Mike Hibler