    Logic for making osload failures non-fatal when nonfatal failure mode is set. · 783d3caf
    Mike Hibler authored
    Previously tb-set-node-failure-mode of "nonfatal" only applied to failures
    when rebooting a node. If there was an error during the disk reload phase,
    the experiment would still fail.
    
    This makes sense, as it is pretty dicey to let a node boot with an unloaded
    or partially-loaded disk. But there are situations, such as 500+ node
    experiments on PRObE, where it makes sense to not fail the experiment.
    
    If a node fails reload, we clear the OSIDs and partition info
    for the node and then force it to reboot (by setting the state to TBFAILED,
    for which there is a REBOOT trigger in stated). This causes the node to come
    up and park in pxeboot in the PXEWAIT state. It should remain in this state
    across reboots. The user can manually os_load the machine, or do a swap
    modify, which will force the node to try to reload the original OS.
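
The recovery path described above can be sketched roughly as follows. This is an illustrative model only, not the actual Emulab code: the `Node` class, its fields, the `"RELOADING"` placeholder state, and `handle_reload_failure` are hypothetical names invented for this sketch. Only the TBFAILED state and the "nonfatal" failure mode come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a testbed node record."""
    node_id: str
    failure_mode: str = "fatal"      # "fatal" or "nonfatal" (tb-set-node-failure-mode)
    osids: list = field(default_factory=list)
    partitions: dict = field(default_factory=dict)
    state: str = "RELOADING"         # placeholder state name, assumed for the sketch


def handle_reload_failure(node: Node, feature_enabled: bool) -> bool:
    """Return True if the experiment may continue, False if it must fail."""
    # The new behavior applies only when the feature is on AND the node
    # was marked nonfatal; otherwise the failure stays fatal.
    if not (feature_enabled and node.failure_mode == "nonfatal"):
        return False

    # Clear the OSIDs and partition info so the node has nothing to boot.
    node.osids.clear()
    node.partitions.clear()

    # Setting TBFAILED is what triggers the reboot (via stated, per the
    # text); the node should then park in pxeboot in PXEWAIT.
    node.state = "TBFAILED"
    return True
```

In this sketch a fatal-mode node is left untouched and the caller is expected to fail the experiment as before.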
    
    Since this may not be for everyone, the new non-fatal handling of osload
    failures requires that the "OsloadFailNonfatal" feature be enabled. The
    feature can be enabled globally, per-group, per-experiment or per-user. The
    default is disabled.
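
As a rough illustration of how such a scoped feature check could behave, here is a minimal sketch. It assumes the feature counts as enabled if it is attached at any matching scope; the real lookup rules are not stated here. The `feature_enabled` function, the tuple layout, and the example names are all hypothetical.

```python
def feature_enabled(feature, enabled, user=None, group=None, experiment=None):
    """Check a feature against a set of (feature, scope, target) grants.

    Assumption for this sketch: a grant at any one scope is enough.
    A global grant uses target None.
    """
    candidates = [
        ("global", None),
        ("user", user),
        ("group", group),
        ("experiment", experiment),
    ]
    for scope, target in candidates:
        # Skip scopes for which the caller supplied no identity.
        if scope != "global" and target is None:
            continue
        if (feature, scope, target) in enabled:
            return True
    # Nothing granted: the default is disabled.
    return False
```

For example, with `enabled = {("OsloadFailNonfatal", "group", "example-group")}`, the check passes for an experiment run in `example-group` and fails for any other group.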