    Implement heartbeat/status reports in Frisbee. · 2be46ba4
    Mike Hibler authored
    There are three pieces here, a change to the frisbee protocol itself, an
    Emulab event component to get status back to the portal, and the surrounding
    infrastructure to make it all work.
    
    Frisbee heartbeat messages:
    
    Added a new message type to the frisbee protocol, "Progress". In theory it
    operates by having the server send a multicast progress request to its clients
    which includes an interval at which to report (or "just once") and an
    indication of what to report (nothing, progress summary, or full stats). The
    client then sends unicast "fire and forget" UDP replies according to that
    schedule. However, I took a shortcut for the moment and just added a command
    line option to the client to tell it to report a summary at the indicated
    interval (-H <interval>).  So the server never sends requests.
    
    This is implemented in the client by a fourth thread since I wanted it to
    operate independently of packet reception (tying reports to packet reception
    would cause clients to report in a highly synchronized fashion due to
    multicast). The server instance just logs progress reports into its log.
    
    This protocol addition should be fully backward compatible as both client and
    server ignore (but log) unknown messages.
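
    The "ignore (but log) unknown messages" behavior that backward compatibility
    relies on can be sketched roughly as below. This is only an illustration:
    the enum names, the numeric codes, and handle_message() are hypothetical and
    are not taken from the frisbee source.

```c
#include <stdio.h>

/* Hypothetical message type codes; the real frisbee values may differ. */
enum msg_type {
    MSG_JOIN = 1,
    MSG_LEAVE = 2,
    MSG_BLOCK = 3,
    MSG_PROGRESS = 4    /* stands in for the new "Progress" type */
};

/*
 * Dispatch one received message by type.
 * Returns 1 if the type was recognized, 0 if it was unknown and
 * therefore only logged and dropped.
 */
int handle_message(int type, FILE *log)
{
    switch (type) {
    case MSG_JOIN:
    case MSG_LEAVE:
    case MSG_BLOCK:
        /* existing handling would go here */
        return 1;
    case MSG_PROGRESS:
        fprintf(log, "progress report received\n");
        return 1;
    default:
        /* A peer that predates a message type ends up here: log, move on. */
        fprintf(log, "ignoring unknown message type %d\n", type);
        return 0;
    }
}
```

    A peer built before the Progress type existed would have no MSG_PROGRESS
    case, so such a message would fall through to the default branch and be
    logged rather than treated as an error.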
    
    Emulab progress report events:
    
    When this is compiled in (-DEMULAB_EVENTS) and turned on (-E <server>), each
    frisbee server instance will send a FRISBEEPROGRESS event to the indicated
    event server for every progress report it receives (in addition to logging the
    report to its own log). Right now it will create an event with key/value pairs
    for the information in a client summary reply:
    
    TSTAMP is the client's time at which it sends the report. Could be used by the
    receiver to determine latency of the report if it cared (and if it assumed
    that the clocks are in sync). We don't care about this.
    
    SEQUENCE is the report number. Again, could be used by the receiver, in this
    case to detect loss, if it cared. We don't.
    
    CHUNKS_RECV is complete chunks that the client has received from the network.
    CHUNKS_DECOMP is chunks decompressed by the client.  BYTES_WRITTEN is bytes
    written to disk by the client.
    
    Any of the three can be used by the event receiver as an indication of life
    and/or progress. However, only the last would be a reasonable indicator of
    time remaining since it is the last (and slowest) phase of imaging. To
    estimate time remaining we could compare that value to the amount of
    uncompressed data that is in the image. This makes the sketchy assumptions
    that the time for writes to the disk is uniform and that the number and
    distance of seeks are uniform, but it is better than a sharp stick in the eye.
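
    A minimal sketch of that time-remaining estimate, under the same uniform
    write rate assumption. The struct and function names are hypothetical, and
    it assumes the event receiver knows both the uncompressed image size and how
    long the client has been writing; neither is carried in the event itself.

```c
#include <stdint.h>

/* Hypothetical view of one summary report; fields mirror the event keys. */
struct progress_summary {
    uint32_t chunks_recv;    /* CHUNKS_RECV */
    uint32_t chunks_decomp;  /* CHUNKS_DECOMP */
    uint64_t bytes_written;  /* BYTES_WRITTEN */
};

/*
 * Estimate seconds remaining from the bytes written so far.
 *   elapsed_sec: time since the client started (assumed known)
 *   total_bytes: uncompressed size of the image (assumed known)
 * Returns -1.0 when no estimate is possible yet, 0.0 when done.
 */
double estimate_remaining_sec(const struct progress_summary *s,
                              double elapsed_sec, uint64_t total_bytes)
{
    if (s->bytes_written == 0 || elapsed_sec <= 0.0)
        return -1.0;
    if (s->bytes_written >= total_bytes)
        return 0.0;

    /* Average write rate so far, extrapolated over what is left. */
    double rate = (double)s->bytes_written / elapsed_sec;
    return (double)(total_bytes - s->bytes_written) / rate;
}
```

    For example, 250 bytes written out of 1000 after 10 seconds gives an average
    rate of 25 bytes/second and an estimate of 30 seconds remaining.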
    
    Emulab infrastructure:
    
    There is a new sitevar "images/frisbee/heartbeat" which can be set to a
    non-zero value to tell the frisbee MFS to fire off frisbee with -H <value>
    and thus make reports. The default value of zero means to not make reports.
    The tmcd "loadinfo" command sends this through via the HEARTBEAT=<value>
    param.
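
    As an illustration of the client side of this, the sketch below pulls
    HEARTBEAT=<value> out of a loadinfo reply and turns it into the -H argument.
    It assumes the reply is a single line of space-separated KEY=value pairs;
    the helper names are hypothetical and the actual MFS scripts may do this
    quite differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the HEARTBEAT interval from a loadinfo line, or 0 if the
 * parameter is missing, zero, or negative (meaning: make no reports).
 */
int heartbeat_from_loadinfo(const char *line)
{
    const char *key = "HEARTBEAT=";
    size_t klen = strlen(key);
    const char *p = line;

    while ((p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a token. */
        if (p == line || p[-1] == ' ') {
            long v = strtol(p + klen, NULL, 10);
            return v > 0 ? (int)v : 0;
        }
        p += klen;
    }
    return 0;
}

/*
 * Write the extra frisbee client argument into buf.
 * An interval of 0 yields an empty string, so no -H option is passed.
 */
int heartbeat_args(int interval, char *buf, size_t buflen)
{
    if (interval <= 0) {
        if (buflen > 0)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, buflen, "-H %d", interval);
}
```

    With a line ending in HEARTBEAT=5 this would produce "-H 5"; with the
    default of HEARTBEAT=0, or with an older tmcd that does not send the
    parameter at all, it produces an empty string and frisbee runs as before.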
    
    REQUIRED A TMCD VERSION BUMP TO 41.