  • Mike Hibler authored · 5f413b47
    First crack at surviving down planetlab nodes. If the master barrier sync
    node sits in the stub or monitor barrier sync for more than the SYNCTIMO
    timeout value in common-env.sh, it will send a HUP to syncd which will
    knock all the other nodes out of their barrier sync.  If that happens,
    all nodes will print a warning message and continue.
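
    A minimal sketch of the timeout-and-kick logic described above, for
    illustration only. SYNC_CMD, KICK_CMD, the 120-second default, and the
    124 return code are hypothetical stand-ins, not the names or values the
    real scripts use. The sketch also kills its own sync client on timeout,
    whereas in the real setup the HUP to syncd is what releases every node.

    ```shell
    #!/bin/sh
    # Sketch only. SYNC_CMD and KICK_CMD are hypothetical placeholders for
    # "block in the named barrier" and "send HUP to syncd" respectively.
    SYNCTIMO=${SYNCTIMO:-120}            # seconds; the real value is set in common-env.sh
    SYNC_CMD=${SYNC_CMD:-sync_client}    # assumed: blocks until the barrier releases
    KICK_CMD=${KICK_CMD:-kick_syncd}     # assumed: knocks all nodes out of the barrier

    # barrier_wait NAME
    # Waits in barrier NAME for at most $SYNCTIMO seconds.
    # Returns the sync client's exit status if it finished in time,
    # or 124 after a timeout (having run $KICK_CMD and printed a warning).
    barrier_wait() {
        name=$1
        $SYNC_CMD "$name" &
        sync_pid=$!
        waited=0
        # Poll once a second while the sync client is still running.
        while kill -0 "$sync_pid" 2>/dev/null; do
            if [ "$waited" -ge "$SYNCTIMO" ]; then
                $KICK_CMD                       # release everyone else
                kill "$sync_pid" 2>/dev/null    # sketch-only: stop our own client too
                wait "$sync_pid" 2>/dev/null
                echo "WARNING: $name barrier timed out after ${SYNCTIMO}s, continuing" >&2
                return 124
            fi
            sleep 1
            waited=$((waited + 1))
        done
        wait "$sync_pid"
    }
    ```

    Under these assumptions, only the master would set KICK_CMD to something
    that signals syncd. Every node would call `barrier_wait stub` and then
    `barrier_wait monitor`, continuing regardless of the return status.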
    
    All nodes wait for both a stub sync and a monitor sync, so if one plab node
    is down, they will time out on both barrier syncs.  Race conditions?  Sure.
    If, for example, everyone times out on the stub barrier due to a slow node,
    and then that node reaches the barrier, it will hang there while everyone
    else waits on the monitor barrier.  When the latter times out, it will
    kick the slow node out of the stub sync and it will then proceed to hang
    in the monitor sync until the experiment is stopped.  Got that?
    
    As an aside, it would be nice if the initializer of a barrier could specify
    a timeout value, and the barrier returned a special error code to everyone
    if it timed out, but that would require an incompatible change to the sync
    protocol.