  1. 19 Jan, 2017 1 commit
    • Add subboss_attributes table. · 1537a3ae
      Mike Hibler authored
      We will populate this with the info we currently hardwire in the
      rc.d/3.mfrisbeed.sh startup script. Will pass to the subboss via
      a new tmcd call or using the mothballed subboss XMLRPC interface.
  2. 17 Jan, 2017 1 commit
    • Implement heartbeat/status reports in Frisbee. · 2be46ba4
      Mike Hibler authored
      There are three pieces here: a change to the frisbee protocol itself, an
      Emulab event component to get status back to the portal, and the surrounding
      infrastructure to make it all work.
      
      Frisbee heartbeat messages:
      
      Added a new message type to the frisbee protocol, "Progress". In theory it
      operates by having the server send a multicast progress request to its clients
      which includes an interval at which to report (or "just once") and an
      indication of what to report (nothing, progress summary, or full stats). The
      client then sends unicast "fire and forget" UDP replies according to that
      schedule. However, I took a shortcut for the moment and just added a command
      line option to the client to tell it to report a summary at the indicated
      interval (-H <interval>).  So the server never sends requests.
      
      This is implemented in the client by a fourth thread since I wanted it to
      operate independently of packet reception (which would cause clients to report
      in a highly synchronized fashion due to multicast). The server instance just
      logs progress reports into its log.
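
The separate reporting thread described above can be sketched as follows. This is a minimal illustration, not the frisbee client code: the class name `HeartbeatReporter`, the `get_summary` callback, and the JSON encoding of the report are all assumptions made for the example (the real protocol defines its own message layout).

```python
import json
import socket
import threading
import time


class HeartbeatReporter(threading.Thread):
    """Send a progress summary every `interval` seconds until stopped."""

    def __init__(self, server_addr, interval, get_summary):
        super().__init__(daemon=True)
        self.server_addr = server_addr
        self.interval = interval
        self.get_summary = get_summary  # hypothetical callback returning a dict of counters
        self.sequence = 0
        self._stop_event = threading.Event()
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called,
        # so the schedule is driven by the timer, not by packet reception.
        while not self._stop_event.wait(self.interval):
            self.send_report()

    def send_report(self):
        self.sequence += 1
        report = {"TSTAMP": time.time(), "SEQUENCE": self.sequence}
        report.update(self.get_summary())
        try:
            # Unicast UDP, "fire and forget": no reply is expected.
            # JSON is an assumption of this sketch, not the wire format.
            self.sock.sendto(json.dumps(report).encode(), self.server_addr)
        except OSError:
            pass  # a lost or failed report is simply dropped

    def stop(self):
        self._stop_event.set()
```

Because each client's timer starts when that client starts, reports are not tied to the arrival of multicast data and so do not cluster around the same instant.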
      
      This protocol addition should be fully backward compatible as both client and
      server ignore (but log) unknown messages.
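
The "ignore but log" behavior that makes the addition backward compatible might look roughly like this. The numeric message types and handler names are hypothetical; the actual protocol values are not given in this message.

```python
import logging

log = logging.getLogger("frisbee-sketch")

# Hypothetical message-type table; the real protocol values are not shown here.
HANDLERS = {}


def handler(msg_type):
    """Register a function as the handler for one message type."""
    def register(fn):
        HANDLERS[msg_type] = fn
        return fn
    return register


@handler(1)
def handle_block(payload):
    return ("block", payload)


def dispatch(msg_type, payload):
    """Run the handler for a known type; log and ignore anything else."""
    fn = HANDLERS.get(msg_type)
    if fn is None:
        log.warning("ignoring unknown message type %d", msg_type)
        return None
    return fn(payload)
```

An old peer that has no handler for a new "Progress" type would take the logging branch and carry on.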
      
      Emulab progress report events:
      
      When this is compiled in (-DEMULAB_EVENTS) and turned on (-E <server>), the
      frisbee server instances will send a FRISBEEPROGRESS event to the indicated
      event server for every progress report it receives (in addition to logging the
      events to its own log). Right now it will create an event with key/value pairs
      for the information in a client summary reply:
      
      TSTAMP is the client's time at which it sends the event. Could be used by the
      receiver to determine latency of the report if it cared (and if it assumed
      that the clocks are in sync). We don't care about this.
      
      SEQUENCE is the report number. Again, could be used by the receiver, in this
      case to detect loss, if it cared. We don't.
      
      CHUNKS_RECV is complete chunks that the client has received from the network.
      CHUNKS_DECOMP is chunks decompressed by the client.  BYTES_WRITTEN is bytes
      written to disk by the client.
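
As an illustration of the key/value pairs listed above, a receiver could hold one report in a small record and use SEQUENCE to count lost reports. The class and function names are hypothetical, and the assumption that values arrive as strings is mine.

```python
from dataclasses import dataclass


@dataclass
class ProgressReport:
    tstamp: float
    sequence: int
    chunks_recv: int
    chunks_decomp: int
    bytes_written: int

    @classmethod
    def from_pairs(cls, pairs):
        """Build a report from event key/value pairs (assumed to be strings)."""
        return cls(
            tstamp=float(pairs["TSTAMP"]),
            sequence=int(pairs["SEQUENCE"]),
            chunks_recv=int(pairs["CHUNKS_RECV"]),
            chunks_decomp=int(pairs["CHUNKS_DECOMP"]),
            bytes_written=int(pairs["BYTES_WRITTEN"]),
        )


def missing_reports(prev, cur):
    """Number of reports lost between two consecutively received reports."""
    return max(0, cur.sequence - prev.sequence - 1)
```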
      
      Any of the three can be used by the event receiver as an indication of life
      and/or progress. However, only the last would be a reasonable indicator of
      time remaining since it is the last (and slowest) phase of imaging. To
      estimate time remaining we could compare that value to the amount of
      uncompressed data that is in the image. This makes the sketchy assumptions
      that the time for writes to the disk is uniform and that the number and distance
      of seeks is uniform, but it is better than a sharp stick in the eye.
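
The time-remaining estimate described above amounts to a linear extrapolation of the write rate. A minimal sketch, with hypothetical function and parameter names and the same uniform-rate assumption:

```python
def estimate_seconds_remaining(bytes_written, elapsed_seconds, total_uncompressed_bytes):
    """Extrapolate time remaining from the average write rate so far.

    Returns None when no progress has been observed yet, since no rate
    can be computed.
    """
    if bytes_written <= 0 or elapsed_seconds <= 0:
        return None
    rate = bytes_written / elapsed_seconds  # bytes per second, assumed uniform
    remaining = max(0, total_uncompressed_bytes - bytes_written)
    return remaining / rate
```

For example, a client that has written 250 bytes of a 1000-byte image in 10 seconds is estimated to need another 30 seconds.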
      
      Emulab infrastructure:
      
      There is a new sitevar "images/frisbee/heartbeat" which can be set to a
      non-zero value to tell the frisbee MFS to fire off frisbee with -H <value>
      and thus make reports. The default value of zero means to not make reports.
      The tmcd "loadinfo" command sends this through via the HEARTBEAT=<value>
      param.
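
The path from the HEARTBEAT parameter to the -H option could be sketched like this. It assumes the "loadinfo" reply is a whitespace-separated line of KEY=VALUE tokens; the helper names and the ADDR field in the usage example are illustrative only.

```python
def parse_loadinfo(line):
    """Split a line of KEY=VALUE tokens into a dict (format is an assumption)."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def heartbeat_args(fields):
    """Return the frisbee client arguments for reporting, or [] when disabled."""
    interval = int(fields.get("HEARTBEAT", "0"))
    # Zero (the default) means no reports; any non-zero value enables -H.
    return ["-H", str(interval)] if interval != 0 else []
```

For instance, `heartbeat_args(parse_loadinfo("ADDR=10.0.0.1 HEARTBEAT=5"))` yields `["-H", "5"]`, while a line without HEARTBEAT yields an empty list.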
      
      REQUIRED A TMCD VERSION BUMP TO 41.
  3. 10 Jan, 2017 1 commit
  4. 09 Jan, 2017 2 commits
  5. 06 Jan, 2017 3 commits
  6. 04 Jan, 2017 3 commits
  7. 27 Dec, 2016 1 commit
  8. 19 Dec, 2016 3 commits
  9. 15 Dec, 2016 1 commit
  10. 13 Dec, 2016 1 commit
  11. 08 Dec, 2016 1 commit
  12. 07 Dec, 2016 1 commit
  13. 17 Nov, 2016 1 commit
  14. 12 Nov, 2016 2 commits
    • Minor tweak to make schemacheck happy. · 459fce68
      Leigh B Stoller authored
    • Bring the cluster monitor "inhouse", rather than depending on the jfed · d7c4230e
      Leigh B Stoller authored
      monitoring system.
      
      New portal_monitor daemon does a GetVersion/ListResources call at each
      of the clusters every five minutes, and updates the new table in the
      DB called apt_aggregate_status. We calculate free/inuse counts for
      physical nodes and a free count for VMs. Failure to contact the
      aggregate for more than 10 minutes sets the aggregate as down, since
      from our perspective if we cannot get to it, the cluster is down.
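
The up/down rule and the node counts described above can be sketched as below. The function names and the `available` key on each node record are hypothetical; the actual daemon and the apt_aggregate_status schema are not shown in this message.

```python
from datetime import datetime, timedelta

# An aggregate is considered down after more than 10 minutes without contact.
DOWN_AFTER = timedelta(minutes=10)


def aggregate_status(last_contact, now):
    """Return "down" if the last successful contact is too old, else "up"."""
    return "down" if now - last_contact > DOWN_AFTER else "up"


def count_physical(nodes):
    """Count free and in-use physical nodes from hypothetical node records."""
    free = sum(1 for n in nodes if n["available"])
    return {"free": free, "inuse": len(nodes) - free}
```

With a five-minute polling interval, this threshold means two consecutive failed polls are tolerated before an aggregate is marked down.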
      
      Unlike the jfed monitoring system, we are not going to try to
      instantiate a new experiment or ssh into it. Wait and see if that is
      necessary in our context.
      
      On the instantiate page, generate a json structure for each cluster,
      similar to the one described in issue #172 by Keith. This way we can easily
      switch the existing code over to this new system, but fall back to the
      old mechanism if this turns out to be a bust.
      
      Some other related changes to how we hand clusters to the several web
      pages.
  15. 03 Nov, 2016 2 commits
  16. 02 Nov, 2016 1 commit
  17. 31 Oct, 2016 1 commit
  18. 26 Oct, 2016 1 commit
  19. 20 Oct, 2016 5 commits
  20. 18 Oct, 2016 1 commit
  21. 12 Oct, 2016 2 commits
  22. 06 Oct, 2016 1 commit
  23. 03 Oct, 2016 2 commits
  24. 28 Sep, 2016 1 commit
  25. 26 Sep, 2016 1 commit