First crack at surviving down PlanetLab nodes.

If the master barrier sync node sits in the stub or monitor barrier sync for longer than the SYNCTIMO timeout value in common-env.sh, it sends a HUP to syncd, which knocks all the other nodes out of their barrier sync. When that happens, all nodes print a warning message and continue. All nodes wait for both a stub sync and a monitor sync, so if one plab node is down, they will time out on both barrier syncs.

Race conditions? Sure. For example, if everyone times out on the stub barrier due to a slow node, and that node then reaches the barrier, it will hang there while everyone else waits on the monitor barrier. When the monitor barrier times out, the resulting HUP kicks the slow node out of the stub sync, and it then proceeds to hang in the monitor sync until the experiment is stopped. Got that?

As an aside, it would be nice if the initializer of a barrier could specify a timeout value and have a special error code returned to everyone if it timed out, but that would require an incompatible change to the sync protocol.
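
For illustration, the master-side timeout described above might look roughly like the sketch below. This is not the code from this commit: the function name `barrier_sync_with_timeout`, passing the syncd PID as the first argument, and the default of 120 seconds are all assumptions, and the barrier command is whatever the scripts actually use to enter a barrier (passed in as the remaining arguments).

```shell
# Hypothetical sketch, not the actual change. SYNCTIMO is assumed to come
# from common-env.sh; the 120-second fallback here is made up.
SYNCTIMO=${SYNCTIMO:-120}

# usage: barrier_sync_with_timeout <syncd-pid> <barrier command...>
# Runs the barrier command. If it has not returned after SYNCTIMO seconds,
# sends HUP to the sync daemon so that every blocked waiter is released.
# Always returns 0: a failed or timed-out barrier only produces a warning.
barrier_sync_with_timeout() {
    syncd_pid=$1
    shift

    # Enter the barrier in the background so we can time it.
    "$@" &
    barrier_pid=$!

    # Watchdog: after SYNCTIMO seconds, HUP the daemon.
    # Output is discarded so the watchdog never holds a caller's pipe open.
    (
        sleep "$SYNCTIMO"
        kill -HUP "$syncd_pid"
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    wait "$barrier_pid"
    status=$?

    # Barrier returned (normally or because of the HUP); stop the watchdog.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null

    if [ "$status" -ne 0 ]; then
        echo "WARNING: barrier sync failed or timed out; continuing" >&2
    fi
    return 0
}
```

Calling this once for the stub barrier and once for the monitor barrier would reproduce the behavior described above, including the race: a node that arrives late at the first barrier is only released when the second one times out.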
Showing with 63 additions and 7 deletions