pelab/stub/auto-stub.sh · 5f413b470396f3afddfac3f59e2fdf268c0b1de3 · emulab / emulab-devel

First crack at surviving down planetlab nodes.

Mike Hibler authored Aug 10, 2006

If the master barrier sync node sits in the stub or monitor barrier sync for
more than the SYNCTIMO timeout value in common-env.sh, it will send a HUP to
syncd, which will knock all the other nodes out of their barrier sync. If that
happens, all nodes will print a warning message and continue.
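
The timeout-then-HUP idea can be sketched in plain shell. This is only an illustration and not the code from auto-stub.sh: SYNCTIMO is the name used above, but wait_with_timeout is a hypothetical helper, the barrier is simulated with sleep, and the HUP goes straight to the waiting process as a stand-in for signaling syncd.

```shell
#!/bin/sh
# Sketch only. SYNCTIMO is named in the commit message; everything else here
# (wait_with_timeout, using sleep as the "barrier") is a hypothetical stand-in.

SYNCTIMO=${SYNCTIMO:-2}   # seconds to wait before giving up on the barrier

# Run a blocking command in the background and poll it once per second.
# Returns the command's exit status if it finishes in time, or 1 after
# sending it a HUP and printing a warning if SYNCTIMO seconds elapse first.
wait_with_timeout() {
    "$@" &
    waiter=$!
    elapsed=0
    while kill -0 "$waiter" 2>/dev/null; do
        if [ "$elapsed" -ge "$SYNCTIMO" ]; then
            # In the real setup the HUP would go to syncd, which then
            # releases every waiter; here we just interrupt our own waiter.
            kill -HUP "$waiter" 2>/dev/null
            wait "$waiter" 2>/dev/null
            echo "WARNING: barrier timed out after ${SYNCTIMO}s, continuing" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$waiter"
}

# A simulated barrier that never completes in time, then one that does.
wait_with_timeout sleep 5 || echo "continuing without a full sync"
wait_with_timeout sleep 0 && echo "barrier reached"
```

Polling keeps the sketch free of leftover watchdog processes, at the cost of one-second granularity on the timeout.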

All nodes wait for both a stub sync and a monitor sync, so if one plab node
is down, they will time out on both barrier syncs. Race conditions? Sure.
If for example everyone times out on the stub barrier due to a slow node,
and then that node reaches the barrier, it will hang there while everyone
else waits on the monitor barrier. When the latter times out, it will
kick the slow node out of the stub sync and it will then proceed to hang
in the monitor sync until the experiment is stopped. Got that?

As an aside, it would be nice if the initializer of a barrier could specify
a timeout value, and return a special error code to everyone if it timed out,
but that would require an incompatible change to the sync protocol.
