Mike Hibler authored
Previously, a tb-set-node-failure-mode of "nonfatal" applied only to failures when rebooting a node. If there was an error during the disk reload phase, the experiment would still fail. This makes sense, as it is pretty dicey to let a node boot with an unloaded or partially-loaded disk. But there are situations, such as 500+ node experiments on PRObE, where it makes sense not to fail the experiment.

If a node fails to reload, we now clear the OSIDs and partition info for the node and then force it to reboot (by setting the state to TBFAILED, for which there is a REBOOT trigger in stated). This causes the node to come up and park in pxeboot in the PXEWAIT state. It should remain in this state across reboots. The user can manually os_load the machine, or do a swap modify, which will force the node to try to reload the original OS.

Since this may not be for everyone, allowing nonfatal os_load failures requires that the "OsloadFailNonfatal" feature be enabled. This allows the new behavior to be enabled globally, per-group, per-experiment or per-user. The default is disabled.
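
The decision flow described above can be sketched roughly as follows. This is an illustrative Python model only, not the actual implementation: the Node class, handle_reload_failure() and stated_reboot_trigger() are hypothetical names, and the real logic lives in the testbed's own os_load and stated code. Only the state names (TBFAILED, PXEWAIT), the "nonfatal" failure mode and the "OsloadFailNonfatal" feature come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a testbed node record."""
    name: str
    failure_mode: str = "fatal"           # value of tb-set-node-failure-mode
    osid: Optional[str] = "ORIGINAL-OS"   # placeholder OSID
    partitions: Dict[int, str] = field(
        default_factory=lambda: {1: "ORIGINAL-OS"})
    state: str = "RELOADING"              # placeholder pre-failure state


def handle_reload_failure(node: Node, feature_enabled: bool) -> bool:
    """Return True if the experiment may continue despite the failure.

    feature_enabled models whether "OsloadFailNonfatal" is on for the
    relevant scope (global, group, experiment or user).
    """
    if node.failure_mode != "nonfatal" or not feature_enabled:
        # Old behavior: a reload failure fails the experiment.
        return False
    # New behavior: forget what was (partially) loaded on the disk ...
    node.osid = None
    node.partitions.clear()
    # ... and move to TBFAILED so that the REBOOT trigger fires.
    node.state = "TBFAILED"
    return True


def stated_reboot_trigger(node: Node) -> None:
    """Model the REBOOT trigger: with no OSID, the node parks in PXEWAIT."""
    if node.state == "TBFAILED" and node.osid is None:
        node.state = "PXEWAIT"


if __name__ == "__main__":
    n = Node("pc1", failure_mode="nonfatal")
    if handle_reload_failure(n, feature_enabled=True):
        stated_reboot_trigger(n)
    print(n.name, n.state, n.osid, n.partitions)
```

The sketch collapses the reboot and the pxeboot wait into a single state change; in practice the node would stay in PXEWAIT until the user runs os_load or does a swap modify.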
783d3caf