    Reloading daemon. Looks for free nodes that have not been reloaded · 74d21844
    Leigh B. Stoller authored
    since the last reservation (as determined by last_reservation table).
    Picks one (randomly) from that set of nodes, and calls sched_reload on
    it. Then waits until the node has finished reloading, as determined by
    the reserved table, which gets cleared by the tmcd when the node first
    reboots after a scheduled reload. Sleeps 30 seconds, and then goes
    around again. So at most one node is tied up in a reload at a time,
    which seems like a good balance between trying to keep the machines in
    a pristine state, and having nodes available for use.
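
The loop described above can be sketched roughly as follows. This is an illustrative Python sketch, not the daemon's actual source: the `db` object and its methods (`free_nodes`, `last_reservation`, `is_reserved`), the `sched_reload` callable, and the 5-second poll interval are all hypothetical stand-ins for the real tables and script. It also assumes that a node counts as "not reloaded since the last reservation" when it still has an entry in last_reservation.

```python
import random
import time


def pick_candidate(free_nodes, last_reservation, rng=random):
    """Pick a random free node that still has a last_reservation entry.

    Assumption: an entry in last_reservation means the node has been
    used since it was last reloaded. Returns None if nothing qualifies.
    """
    candidates = sorted(n for n in free_nodes if n in last_reservation)
    return rng.choice(candidates) if candidates else None


def reload_pass(db, sched_reload, sleep=time.sleep, poll_secs=5, rng=random):
    """Reload at most one node and block until it is finished."""
    node = pick_candidate(db.free_nodes(), db.last_reservation(), rng)
    if node is None:
        return None
    sched_reload(node)
    # The reserved entry is cleared (by the tmcd) when the node first
    # reboots after the scheduled reload, so poll until it disappears.
    while db.is_reserved(node):
        sleep(poll_secs)
    return node


def run_daemon(db, sched_reload, sleep=time.sleep, pause_secs=30):
    """Loop forever: one reload at a time, 30 seconds between passes."""
    while True:
        reload_pass(db, sched_reload, sleep=sleep)
        sleep(pause_secs)
```

Because `reload_pass` blocks until the chosen node is released, no second reload can start while one is in progress, which is what keeps at most one node tied up at a time.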
    
    The advantage of this approach is that instead of calling sched_reload
    on 40 nodes (after generating a new image) and watching the network
    melt down, we can let the nodes reload at a slower pace. We could call
    sched_reload on allocated nodes so that they will reload when freed, but
    we run into the problem of big experiments ending and causing a meltdown.
    
    The downside is that this approach is a little too aggressive. Nodes
    will end up reloading after just a single experiment. We need
    finer-grained control over when to reload, but I will leave that as an
    exercise for later.