1. 02 Nov, 2017 1 commit
  2. 12 Sep, 2017 1 commit
    • Introduce sitevars to control the sensitivity of alerts. · 2962b32f
      Mike Hibler authored
      The sitevars are a bit obscure:
      
        # cnetwatch/check_interval
        #   Interval at which to collect info.
        #   Zero means don't run cnetwatch (exit immediately).
        #
        # cnetwatch/alert_interval
        #   Interval over which to calculate packet/bit rates and to log alerts.
        #   Should be an integer multiple of the check_interval.
        #
        # cnetwatch/pps_threshold
        #   Packet rate (packets/sec) in excess of which to log an alert.
        #   Zero means don't generate packet rate alerts.
        #
        # cnetwatch/bps_threshold
        #   Data rate (bits/sec) in excess of which to log an alert.
        #   Zero means don't generate data rate alerts.
        #
        # cnetwatch/mail_interval
        #   Interval at which to send email for all alerts logged during the interval.
        #   Zero means don't ever send email.
        #
        # cnetwatch/mail_max
        #   Maximum number of alert emails to send; after this alerts are only logged.
        #   Zero means no limit to the emails.
      
      Basically you can tweak pps_threshold and bps_threshold to define what you
      think an unusual "burst" of cnet traffic is and then alert_interval to
      determine how long a burst has to last before you will send an alert.
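
      As a rough illustration of how the two thresholds and alert_interval
      interact, here is a minimal Python sketch. It is not the cnetwatch code:
      the Sample tuple and check_alert function are hypothetical names, and it
      assumes the port counters are cumulative and that rates are averaged over
      one alert_interval window.

```python
from collections import namedtuple

# Hypothetical record of one counter reading for a port:
# time in seconds, cumulative packet count, cumulative bit count.
Sample = namedtuple("Sample", ["time", "packets", "bits"])


def check_alert(first, last, pps_threshold, bps_threshold):
    """Return the alerts raised for one alert_interval window.

    `first` and `last` are the samples that bound the window. Rates are
    averages over the window, so a burst has to be sustained for the
    whole window to push the average over a threshold. A threshold of
    zero disables that kind of alert, as with the sitevars.
    """
    elapsed = last.time - first.time
    if elapsed <= 0:
        return []
    pps = (last.packets - first.packets) / elapsed
    bps = (last.bits - first.bits) / elapsed
    alerts = []
    if pps_threshold > 0 and pps > pps_threshold:
        alerts.append(("pps", pps))
    if bps_threshold > 0 and bps > bps_threshold:
        alerts.append(("bps", bps))
    return alerts
```

      With a check_interval shorter than alert_interval, `first` and `last`
      would simply be the oldest and newest of the samples collected during
      the window.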
      
      Why would you have check_interval less than alert_interval? You probably
      wouldn't unless you want to record finer-grained port stats using the -l
      option to write stats to a logfile. We do it on the mothership as a data
      source for some student machine learning projects. Note that in an environment
      with lots of control net switches, a single instance of gathering port
      counters from the switches could take 30 seconds or longer (on the mothership
      it can take minutes). So don't set check_interval too low.
      
      The mail_* variables are paranoia about sending too much email due to runaway
      nodes. The mail_interval just coalesces alerts to reduce messages, and
      mail_max is the maximum number of emails that one instance of cnetwatch will
      send. The latter is a pretty silly mechanism, as a long-running cnetwatch will
      probably hit the limit legitimately after 6 months or so and you will have to
      restart it.
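
      The coalescing and capping described above could be sketched as follows.
      This is an assumption-laden Python illustration, not the actual
      implementation: the AlertMailer class, its method names, and the `send`
      callable are all hypothetical.

```python
class AlertMailer:
    """Batch logged alerts into at most one email per mail_interval.

    mail_interval == 0 means never send email.
    mail_max == 0 means no limit on the number of emails.
    `send` is a hypothetical callable that delivers one message body.
    """

    def __init__(self, mail_interval, mail_max, send):
        self.mail_interval = mail_interval
        self.mail_max = mail_max
        self.send = send
        self.pending = []      # alerts logged since the last flush
        self.sent = 0          # emails sent by this instance
        self.last_flush = 0.0

    def log_alert(self, text):
        self.pending.append(text)

    def maybe_flush(self, now):
        """Send one coalesced email if due; return True if mail was sent."""
        if self.mail_interval == 0:
            self.pending.clear()   # alerts are only logged, never mailed
            return False
        if now - self.last_flush < self.mail_interval:
            return False
        self.last_flush = now
        if not self.pending:
            return False
        batch, self.pending = self.pending, []
        if self.mail_max and self.sent >= self.mail_max:
            return False           # limit reached: log only until restart
        self.send("\n".join(batch))
        self.sent += 1
        return True
```

      Because `sent` only ever grows, the sketch has the same weakness noted
      above: once the cap is hit, only a restart resets it.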
  3. 01 Sep, 2017 1 commit
  4. 29 Mar, 2017 1 commit
  5. 24 Mar, 2017 2 commits
  6. 03 Mar, 2017 1 commit
  7. 26 Jan, 2017 1 commit
  8. 17 Jan, 2017 1 commit
    • Implement heartbeat/status reports in Frisbee. · 2be46ba4
      Mike Hibler authored
      There are three pieces here, a change to the frisbee protocol itself, an
      Emulab event component to get status back to the portal, and the surrounding
      infrastructure to make it all work.
      
      Frisbee heartbeat messages:
      
      Added a new message type to the frisbee protocol, "Progress". In theory it
      operates by having the server send a multicast progress request to its clients
      which includes an interval at which to report (or "just once") and an
      indication of what to report (nothing, progress summary, or full stats). The
      client then sends unicast "fire and forget" UDP replies according to that
      schedule. However, I took a shortcut for the moment and just added a command
      line option to the client to tell it to report a summary at the indicated
      interval (-H <interval>).  So the server never sends requests.
      
      This is implemented in the client by a fourth thread since I wanted it to
      operate independent of packet reception (which would cause clients to report
      in a highly synchronized fashion due to multicast). The server instance just
      logs progress reports into its log.
      
      This protocol addition should be fully backward compatible as both client and
      server ignore (but log) unknown messages.
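
      The idea of a separate reporting thread that sends "fire and forget" UDP
      summaries on its own timer can be sketched in Python as below. The real
      client is not written this way; ProgressReporter, make_udp_sender, and
      the payload format are hypothetical and only illustrate the pattern.

```python
import socket
import threading


def make_udp_sender(host, port):
    """Return a send(seq, summary) function using unicast UDP.

    The text payload here is invented for illustration; it is not the
    frisbee wire format.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(seq, summary):
        sock.sendto(f"{seq} {summary}".encode(), (host, port))

    return send


class ProgressReporter(threading.Thread):
    """Report a progress summary every `interval` seconds.

    Runs on its own timer, independent of packet reception, so clients
    do not all report at the same instant.
    """

    def __init__(self, interval, get_summary, send):
        super().__init__(daemon=True)
        self.interval = interval
        self.get_summary = get_summary
        self.send = send
        self.sequence = 0
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop_event.wait(self.interval):
            self.sequence += 1
            try:
                self.send(self.sequence, self.get_summary())
            except OSError:
                pass  # fire and forget: a lost report is not an error

    def stop(self):
        self._stop_event.set()
```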
      
      Emulab progress report events:
      
      When this is compiled in (-DEMULAB_EVENTS) and turned on (-E <server>), the
      frisbee server instances will send a FRISBEEPROGRESS event to the indicated
      event server for every progress report it receives (in addition to logging the
      events to its own log). Right now it will create an event with key/value pairs
      for the information in a client summary reply:
      
      TSTAMP is the client's time at which it sends the event. Could be used by the
      receiver to determine latency of the report if it cared (and if it assumed
      that the clocks are in sync). We don't care about this.
      
      SEQUENCE is the report number. Again, could be used by the receiver, in this
      case to detect loss, if it cared. We don't.
      
      CHUNKS_RECV is complete chunks that the client has received from the network.
      CHUNKS_DECOMP is chunks decompressed by the client.  BYTES_WRITTEN is bytes
      written to disk by the client.
      
      Any of the three can be used by the event receiver as an indication of life
      and/or progress. However, only the last would be a reasonable indicator of
      time remaining since it is the last (and slowest) phase of imaging. To
      estimate time remaining we could compare that value to the amount of
      uncompressed data that is in the image. This makes the sketchy assumptions
      that the time for writes to the disk is uniform and that the number and distance
      of seeks is uniform, but it is better than a sharp stick in the eye.
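
      Under those same sketchy assumptions, the estimate reduces to a simple
      proportion. A minimal Python sketch follows; the function name and
      parameters are hypothetical, and nothing like it is implied to exist in
      the receiver today.

```python
def estimate_seconds_remaining(bytes_written, total_uncompressed_bytes,
                               elapsed_seconds):
    """Estimate time left from BYTES_WRITTEN, assuming a uniform write rate.

    Returns None when no rate can be computed yet (nothing written or no
    time elapsed).
    """
    if bytes_written <= 0 or elapsed_seconds <= 0:
        return None
    remaining = max(total_uncompressed_bytes - bytes_written, 0)
    rate = bytes_written / elapsed_seconds   # bytes per second so far
    return remaining / rate
```

      For example, a client that has written a quarter of the uncompressed
      data in 10 seconds would be estimated to need about 30 more seconds.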
      
      Emulab infrastructure:
      
      There is a new sitevar "images/frisbee/heartbeat" which can be set to a
      non-zero value to tell the frisbee MFS to fire off frisbee with -H <value>
      and thus make reports. The default value of zero means to not make reports.
      The tmcd "loadinfo" command sends this through via the HEARTBEAT=<value>
      param.
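
      For illustration, turning the HEARTBEAT parameter into the client option
      might look like the Python sketch below. It assumes the loadinfo reply is
      a whitespace-separated line of KEY=VALUE pairs; the heartbeat_args
      function is hypothetical and the MFS scripts are not written in Python.

```python
def heartbeat_args(loadinfo_line):
    """Return the extra frisbee client arguments implied by HEARTBEAT.

    A missing, zero, or unparsable HEARTBEAT value yields no arguments,
    matching the default of not making reports.
    """
    params = dict(tok.split("=", 1)
                  for tok in loadinfo_line.split() if "=" in tok)
    try:
        interval = int(params.get("HEARTBEAT", "0"))
    except ValueError:
        interval = 0
    return ["-H", str(interval)] if interval > 0 else []
```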
      
      REQUIRED A TMCD VERSION BUMP TO 41.
  9. 10 Jan, 2017 1 commit
  10. 13 Dec, 2016 1 commit
  11. 17 Nov, 2016 1 commit
  12. 12 Oct, 2016 1 commit
  13. 09 Aug, 2016 1 commit
  14. 16 Jul, 2016 1 commit
  15. 01 Apr, 2016 1 commit
  16. 04 Dec, 2015 1 commit
  17. 03 Dec, 2015 1 commit
  18. 21 Oct, 2015 1 commit
  19. 16 Oct, 2015 1 commit
    • New sitevar to set a default per-project dataset quota. · e6e123f2
      Mike Hibler authored
      In createdataset, if the "usequotas" sitevar is set for the dataset type in
      question but a quota does not exist for the dataset's project, we create
      a quota object using the value from the new "default_quota" sitevar for that
      dataset type. If that sitevar does not exist or has a value of zero, we do
      NOT create a quota object and hence createdataset will fail.
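
      The decision logic above can be summarized in a short Python sketch.
      createdataset itself is not Python; ensure_quota, QuotaError, and the
      dictionary used to stand in for quota objects are hypothetical.

```python
class QuotaError(Exception):
    """Raised when a quota is required but none exists or can be created."""


def ensure_quota(project, quotas, usequotas, default_quota):
    """Return the quota that applies to `project`, creating one if allowed.

    quotas:        dict mapping project -> quota size (stand-in for quota objects)
    usequotas:     value of the "usequotas" sitevar for the dataset type
    default_quota: value of the "default_quota" sitevar, or None if unset
    Returns None when quotas are not enforced for this dataset type.
    """
    if not usequotas:
        return None
    if project in quotas:
        return quotas[project]
    if not default_quota:          # sitevar missing or zero
        raise QuotaError(f"no quota for project {project}")
    quotas[project] = default_quota
    return default_quota
```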
  20. 10 Aug, 2015 1 commit
  21. 18 May, 2015 1 commit
  22. 31 Mar, 2015 2 commits
  23. 25 Mar, 2015 1 commit
  24. 04 Feb, 2015 1 commit
  25. 16 Jan, 2015 1 commit
  26. 12 Jan, 2015 1 commit
  27. 02 Dec, 2014 2 commits
  28. 12 May, 2014 1 commit
  29. 12 Feb, 2014 1 commit
    • Add frisbee master server mechanisms for turning on dynamic rate tuning. · d9ee4a67
      Mike Hibler authored
      For the Emulab configuration, we add the new site variable
      "images/frisbee/maxrate_dyn" which should be set non-zero to enable
      dynamic adjustment. If maxrate_dyn is enabled, then the maxrate_{std,usr}
      values are used as both the initial and maximum values for the BW of any
      instance. Really, if maxrate_dyn is on, then both of those should be set
      to the same value so that all servers operate identically, and that value
      should be just above the link BW.
      
      For the "null" configuration (aka, the subboss configuration),
      this is set by adding command line options:
          -O dynamicbw=1,bandwidth=1100000000
      which would enable it and start/cap the BW at 1.1Gb/sec.
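
      As an illustration of how such a comma-separated option string breaks
      down, here is a small Python sketch. The master server is not written in
      Python; parse_server_options and dynamic_bw_settings are hypothetical
      helpers, and treating a bare key as "1" is an assumption.

```python
def parse_server_options(optstr):
    """Split "key=value,key=value" into a dict of strings."""
    opts = {}
    for item in optstr.split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Assumption: a key given without "=value" is treated as enabled.
        opts[key.strip()] = value.strip() if sep else "1"
    return opts


def dynamic_bw_settings(optstr):
    """Return (enabled, bandwidth_bits_per_sec or None) from an -O string."""
    opts = parse_server_options(optstr)
    enabled = opts.get("dynamicbw", "0") != "0"
    bandwidth = int(opts["bandwidth"]) if "bandwidth" in opts else None
    return enabled, bandwidth
```

      For the option string shown above, this yields dynamic tuning enabled
      with a cap of 1100000000 bits/sec, that is, 1.1Gb/sec.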
  30. 08 Jan, 2014 1 commit
  31. 06 Jan, 2014 1 commit
  32. 17 Oct, 2013 1 commit
  33. 09 Aug, 2013 1 commit
  34. 17 Jun, 2013 1 commit
  35. 18 Jan, 2013 1 commit
  36. 06 Dec, 2012 1 commit
  37. 30 Oct, 2012 1 commit