Reloading daemon. Looks for free nodes that have not been reloaded since their last reservation (as determined by the last_reservation table), picks one at random from that set, and calls sched_reload on it. It then waits until the node has finished reloading, as determined by the reserved table, which tmcd clears when the node first reboots after a scheduled reload. It sleeps 30 seconds and then goes around again.

So at most one node is tied up in a reload at a time, which seems like a good balance between keeping the machines in a pristine state and having nodes available for use.

The advantage of this approach is that instead of calling sched_reload on 40 nodes (after generating a new image) and watching the network melt down, we can let the nodes reload at a slower pace. We could call sched_reload on allocated nodes so that they reload when freed, but then a big experiment ending would trigger the same meltdown.

The downside is that this approach is a little too aggressive: nodes end up reloading after just a single experiment. We need finer-grained control over when to reload, but I will leave that as an exercise for later.
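
For readers who want the control flow at a glance, here is a minimal Python sketch of the loop described above. It is not the daemon's actual code: the database lookups and the sched_reload call are passed in as callables, and every name here (pick_node, reload_one, run_daemon, needs_reload, still_reloading, poll_secs, max_passes) is hypothetical. The 5-second poll interval is an assumption; only the 30-second pause comes from the description.

```python
import random
import time


def pick_node(free_nodes, needs_reload):
    """Return one random free node that still needs a reload, or None."""
    candidates = [n for n in free_nodes if needs_reload(n)]
    return random.choice(candidates) if candidates else None


def reload_one(free_nodes, needs_reload, sched_reload, still_reloading,
               sleep=time.sleep, poll_secs=5):
    """Schedule a reload on a single node and block until it completes."""
    node = pick_node(free_nodes, needs_reload)
    if node is None:
        return None
    sched_reload(node)
    # Stand-in for checking the reserved table until the entry is cleared.
    while still_reloading(node):
        sleep(poll_secs)  # poll interval is an assumption
    return node


def run_daemon(get_free_nodes, needs_reload, sched_reload, still_reloading,
               sleep=time.sleep, pause_secs=30, max_passes=None):
    """Reload at most one node at a time, pausing between passes.

    max_passes exists only so the sketch can be exercised in a test;
    a real daemon would loop forever (max_passes=None).
    """
    reloaded = []
    passes = 0
    while max_passes is None or passes < max_passes:
        node = reload_one(get_free_nodes(), needs_reload, sched_reload,
                          still_reloading, sleep=sleep)
        if node is not None:
            reloaded.append(node)
        sleep(pause_secs)
        passes += 1
    return reloaded
```

Because reload_one blocks until the node is done, the one-node-at-a-time limit falls out of the structure rather than needing a counter or lock.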
Showing with 232 additions and 1 deletion