Try to be smarter about waiting for nodes to die by looping on short
pings until there are no more replies. Still not great, and it makes the
loop that reboots all the machines kind of long. More important, we have
to wait until all the nodes reboot and come back so that the next tbrun
step does not fail, which adds a bunch of time. The reboot and wait need
to be parallelized, but that's too hard to deal with right now.
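
For whoever picks up the parallel version later, here is a rough sketch of one way it could look. It is not what this change does. Everything in it is an assumption for illustration: the helper names (ping_once, wait_for_state, reboot_and_wait, reboot_all), the caller-supplied reboot callable, the retry limits, and the Linux-style ping flags (-c count, -W timeout in seconds; other platforms differ).

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def ping_once(host, timeout_s=1):
    """Return True if host answers one ping. Assumes Linux-style ping flags."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_state(host, want_up, ping=ping_once, interval=1.0,
                   max_tries=300, sleep=time.sleep):
    """Ping in a short loop until the host is up (or down). False on give-up."""
    for _ in range(max_tries):
        if ping(host) == want_up:
            return True
        sleep(interval)
    return False


def reboot_and_wait(host, reboot, ping=ping_once, **kw):
    """Reboot one host, wait for replies to stop, then wait for them to resume."""
    reboot(host)  # hypothetical caller-supplied reboot action
    if not wait_for_state(host, False, ping, **kw):
        return False  # node never went down
    return wait_for_state(host, True, ping, **kw)


def reboot_all(hosts, reboot, ping=ping_once, **kw):
    """Run reboot_and_wait for every host at once, one thread per host."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        results = pool.map(
            lambda h: reboot_and_wait(h, reboot, ping, **kw), hosts)
        return dict(zip(hosts, results))
```

With something like this the total wait is roughly the slowest node rather than the sum over all nodes, and the returned dict says which nodes failed to come back before the next step runs.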