    Handle a common failure on the node reload path. · 4dc57d48
    Mike Hibler authored
    Under load, nodes that have just entered reloading and then rebooted
    might fail to get bootinfo.  The default behavior in this case is for the
    node to boot from disk (dubious, but that is a topic for another day).
    This causes the node to fall off the RELOAD path, winding up in either
    TBFAILED or ISUP.  Worse, if the node makes it to ISUP, its reload state
    is cleared, so even if the reload_daemon reboots the node, it will still
    not go through the reloading process.
    
    The result is a bunch of nodes left stuck in reloading.  Now, if a node
    makes an invalid transition to TBFAILED or ISUP while in the RELOAD state
    machine, it fires the new REBOOT trigger, which reboots the node.
    Note that in the ISUP case, this trigger overrides the default handling
    that would otherwise clear the reload state, so a reboot is sufficient to
    get the machine back on the RELOAD track.
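
The trigger logic described above can be sketched roughly as follows. This is an illustrative model only, not the actual change: the `Node` class, the `handle_transition`, `reboot`, and `default_isup_handler` helpers, the `reload_pending` flag, and the `REBOOTING` state name are all hypothetical. Only the RELOAD state machine, the TBFAILED and ISUP states, and the REBOOT trigger come from the message itself.

```python
from dataclasses import dataclass

# States named in the commit message as invalid targets while reloading.
INVALID_RELOAD_TARGETS = {"TBFAILED", "ISUP"}


@dataclass
class Node:
    name: str
    op_mode: str = "RELOAD"      # which state machine the node is in
    state: str = "RELOADING"     # hypothetical state name
    reload_pending: bool = True  # hypothetical stand-in for "reload state"
    reboots: int = 0


def reboot(node):
    # Hypothetical reboot action; a real system would power cycle the node.
    node.reboots += 1
    node.state = "REBOOTING"     # hypothetical state name


def default_isup_handler(node):
    # Default behavior described in the message: reaching ISUP clears
    # the node's reload state.
    node.reload_pending = False


def handle_transition(node, new_state):
    """Apply a state transition; return the trigger fired, or None."""
    node.state = new_state
    if node.op_mode == "RELOAD" and new_state in INVALID_RELOAD_TARGETS:
        # The REBOOT trigger runs instead of the default handler, so the
        # reload state survives and the node can rejoin the reload path.
        reboot(node)
        return "REBOOT"
    if new_state == "ISUP":
        default_isup_handler(node)
    return None
```

The point of the sketch is the ordering: the check for an invalid transition happens before the default ISUP handling, so a node that is still in the RELOAD state machine keeps its reload state and is rebooted rather than silently leaving the reload path.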