More comprehensive node cleaning
We are running into more and more instances of node-hosted devices that have their own persistent state that should really be reset between experiments. We have long talked about doing this for BIOS settings (#234, #652), but new "threats" are emerging:
- BlueField NICs or other processing units that have an SoC. We have run into problems on the Clemson nodes with BF2 NICs where users can load images onto the card that prevent it from presenting properly to the BIOS or naive OS. This leads to long waits, ending with timeouts, at initialization time and ultimately causes a timeout in `stated`, which can result in nodes winding up in `hwdown`.
- NVMe devices. The standard allows for partitioning of the physical devices into logical units and for changing the format of the drives from 512-byte logical blocks to 4K blocks. The former is mostly just confusing to the user; the latter can actually prevent nodes from being imaged by `frisbee`, as the `imagezip` format allows for regions smaller than 4K that need to be written.
The NVMe case could be handled from the MFS pretty easily: there are small utilities for both Linux (`nvme`) and FreeBSD (`nvmecontrol`) that can remove logical devices and put the drives in the correct format, all in short order. The BlueField case involves a much more complex set of NVIDIA tools and a complete Linux image for the card, and the reset process can take 20+ minutes.
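As a rough illustration of the NVMe format check on the Linux side, here is a minimal sketch. It assumes the JSON output of `nvme id-ns` exposes `flbas` and an `lbafs` list with `ds` and `ms` fields, which is how current nvme-cli reports it. The function names are hypothetical and not part of any existing MFS script. It only builds the `nvme format` command rather than running it, and it does not cover removing extra logical devices.

```python
import json
import subprocess

WANTED_BLOCK_SIZE = 512  # bytes; the logical block size imaging needs


def read_id_ns(device):
    """Return the parsed `nvme id-ns` JSON for a namespace device."""
    out = subprocess.run(
        ["nvme", "id-ns", device, "--output-format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def current_block_size(id_ns):
    """Logical block size of the LBA format currently in use."""
    # The low four bits of FLBAS index into the LBA format list
    # (assumption: at most 16 formats, as on the common drives).
    index = id_ns["flbas"] & 0x0F
    # `ds` is the data size expressed as a power of two.
    return 1 << id_ns["lbafs"][index]["ds"]


def find_lbaf(id_ns, block_size=WANTED_BLOCK_SIZE):
    """Index of the first metadata-free LBA format with the wanted size."""
    for index, fmt in enumerate(id_ns["lbafs"]):
        if fmt["ms"] == 0 and (1 << fmt["ds"]) == block_size:
            return index
    return None


def format_command(device, id_ns):
    """Return the `nvme format` argv needed, or None if already 512-byte."""
    if current_block_size(id_ns) == WANTED_BLOCK_SIZE:
        return None
    index = find_lbaf(id_ns)
    if index is None:
        raise ValueError(f"{device} offers no {WANTED_BLOCK_SIZE}-byte LBA format")
    return ["nvme", "format", device, f"--lbaf={index}"]
```

A caller would do something like `format_command(dev, read_id_ns(dev))` and run the result only when it is not `None`. Keeping the decision separate from the execution makes it easy to log what would be done before a destructive format is issued.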
Since these are (at the moment) pretty rare events, our thinking is that we will have a "recovery" path, similar to the "hardware checkup" path, that will run nodes through a custom Linux image with all the necessary tools to fix these and other problems. Nodes could be run through this path if they fail to load a known standard image, or we can run them through on demand. @stoller has such a custom image that he constructed specifically for the BF2 case.
Note that this is related to #540, which covers proper cleaning of the contents of storage devices.