1. 23 Oct, 2018 1 commit
    • New version of the portal monitor that is specific to the Mothership. · 2a5cbb2a
      Leigh Stoller authored
      This version is intended to replace the old autostatus monitor on bas,
      except for monitoring the Mothership itself. We also notify the Slack
      channel like the autostatus version. Driven from the apt_aggregates
      table in the DB, we do the following.
      
      1. fping all the boss nodes.
      
      2. fping all the ops nodes and dboxen. Aside: there are two special
         cases for now, that will eventually come from the database. 1)
         powder wireless aggregates do not have a public ops node, and 2) the
         dboxen are hardwired into a table at the top of the file.
      
      3. Check all the DNS servers. Different from autostatus (which just
         checks that port 53 is listening), we do an actual lookup at the
         server. This is done with dig @ the boss node with recursion turned
         off. At the moment this is a serialized test of all the DNS servers;
         we might need to change that later. I've lowered the timeout, and if
         things are operational 99% of the time (which I expect), then this
         will be okay until we get a couple of dozen aggregates to test.
      
         Note that this test is skipped if the boss is not pingable in the
         first step, so in general this test will not be a bottleneck.
      
      4. Check all the CMs with a GetVersion() call. As with the DNS check, we
         skip this if the boss does not ping. This test *is* done in parallel
         using ParRun() since it's slower and the most likely to time out when
         the CM is busy. The timeout is 20 seconds. This seems to be the best
         balance between too much email and not hanging for too long on any
         one aggregate.
      
      5. Send email and slack notifications. The current loop is every 60
         seconds, and each test has to fail twice in a row before marking a
         test as a failure and sending notification. Also send a 24 hour
         update for anything that is still down.
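
The notification rule in step 5 can be sketched as a small per-test state machine. This is a minimal illustration in Python, not the monitor's actual code: the names `CheckState` and `update` are hypothetical, and the "up" recovery event is an assumption (the commit message only mentions failure and 24 hour notifications).

```python
from dataclasses import dataclass
from typing import Optional

FAILS_BEFORE_ALERT = 2          # from the commit message: fail twice in a row
REMINDER_INTERVAL = 24 * 3600   # from the commit message: 24 hour update


@dataclass
class CheckState:
    """Hypothetical per-test bookkeeping kept between monitor loops."""
    consecutive_failures: int = 0
    down: bool = False
    last_notified: Optional[float] = None


def update(state: CheckState, ok: bool, now: float) -> Optional[str]:
    """Record one test result and return the notification to send, if any.

    Returns "down" on the second consecutive failure, "still-down" once
    every REMINDER_INTERVAL while the test keeps failing, "up" on recovery
    (an assumption), and None when nothing should be sent.
    """
    if ok:
        was_down = state.down
        state.consecutive_failures = 0
        state.down = False
        state.last_notified = None
        return "up" if was_down else None

    state.consecutive_failures += 1
    if not state.down:
        if state.consecutive_failures >= FAILS_BEFORE_ALERT:
            state.down = True
            state.last_notified = now
            return "down"
        return None  # first failure: wait for the next loop before alerting

    if now - state.last_notified >= REMINDER_INTERVAL:
        state.last_notified = now
        return "still-down"
    return None
```

With a 60 second loop, a single transient failure produces no notification; only a second failure on the next pass does.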
      
      At the moment, the full set of tests takes 15 seconds on our seven
      aggregates when they are all up. Will need more tuning later, as the
      number of aggregates goes up.
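
As a rough sketch of why the slow CM check is run in parallel with a timeout: with one worker per aggregate, total wall time is bounded by the timeout rather than growing with the number of aggregates. The Python below uses threads for brevity and is only an illustration; `run_checks` and the `check` callable are hypothetical names, and ParRun() itself is not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_checks(check, aggregates, timeout=20.0):
    """Run check(name) for every aggregate in parallel.

    Returns {name: bool}. A check that raises, or that has not finished
    before the shared deadline, is counted as a failure.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(aggregates)))
    futures = {name: pool.submit(check, name) for name in aggregates}
    deadline = time.monotonic() + timeout

    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = bool(future.result(timeout=remaining))
        except Exception:
            # Covers both a timeout and an error raised by the check.
            results[name] = False

    # Do not block on stragglers. Note that a thread cannot be killed,
    # so a hung check keeps running in the background until it returns.
    pool.shutdown(wait=False)
    return results
```

A forking design can kill a child that hangs, which threads cannot do; that is the main limitation of this simplified version.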
  2. 01 Oct, 2018 1 commit
    • More work on the aggregate monitoring. · 9f3205c9
      Leigh Stoller authored
      1. Split the resource stuff (where we ask for an advertisement and
         process it) into a separate script, since that takes a long time to
         cycle through because of the size of the ads from the big clusters.
      
      2. On the monitor, distinguish offline (nologins) from actually being
         down.
      
      3. Add a table to store changes in status so we can see over time how
         much time the aggregates are usable.
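
To illustrate what a status-change table makes possible (item 3), here is a minimal Python sketch that totals the time spent in each status over a window. The record layout, a sorted list of `(timestamp, status)` pairs, and the status names are assumptions for the example; the actual table schema is not given in the commit message.

```python
def time_in_status(changes, start, end):
    """Total seconds spent in each status within [start, end).

    changes: list of (timestamp, status) pairs sorted by timestamp, where
    each entry marks the moment the aggregate entered that status. Time
    before the first recorded change is not attributed to any status.
    """
    totals = {}
    for i, (ts, status) in enumerate(changes):
        # A status lasts until the next change, or until the window ends.
        next_ts = changes[i + 1][0] if i + 1 < len(changes) else end
        seg_start = max(ts, start)
        seg_end = min(next_ts, end)
        if seg_end > seg_start:
            totals[status] = totals.get(status, 0) + (seg_end - seg_start)
    return totals
```

Keeping "offline" (nologins) separate from "down" lets the usable fraction be computed as the "up" total divided by the window length.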
  3. 28 Sep, 2018 2 commits
  4. 24 Feb, 2017 1 commit
  5. 15 Dec, 2016 1 commit
  6. 29 Nov, 2016 2 commits
  7. 12 Nov, 2016 1 commit
    • Bring the cluster monitor "inhouse", rather than depending on the jfed monitoring system. · d7c4230e
      Leigh Stoller authored
      
      New portal_monitor daemon does a GetVersion/ListResources call at each
      of the clusters every five minutes, and updates the new table in the
      DB called apt_aggregate_status. We calculate free/inuse counts for
      physical nodes and a free count for VMs. Failure to contact the
      aggregate for more than 10 minutes sets the aggregate as down, since
      from our perspective if we cannot get to it, the cluster is down.
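
The bookkeeping described above can be sketched as follows. This is a hypothetical Python illustration, not the daemon's code: the node record fields (`virtual`, `available`) and the function names are assumptions, and how the real daemon derives the VM free count from an advertisement is not stated here.

```python
POLL_INTERVAL = 5 * 60   # from the commit message: poll every five minutes
DOWN_AFTER = 10 * 60     # from the commit message: down after 10 minutes


def aggregate_status(last_contact, now):
    """Return "down" if the last successful contact is more than 10 minutes old."""
    return "down" if now - last_contact > DOWN_AFTER else "up"


def summarize(nodes):
    """Compute free/inuse counts for physical nodes and a free count for VMs.

    nodes: list of dicts with assumed boolean keys "virtual" and "available".
    """
    phys = [n for n in nodes if not n["virtual"]]
    phys_free = sum(1 for n in phys if n["available"])
    vm_free = sum(1 for n in nodes if n["virtual"] and n["available"])
    return {
        "phys_free": phys_free,
        "phys_inuse": len(phys) - phys_free,
        "vm_free": vm_free,
    }
```

With a five minute poll, the 10 minute threshold means two consecutive missed polls are needed before a cluster is marked down.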
      
      Unlike the jfed monitoring system, we are not going to try to
      instantiate a new experiment or ssh into it. Wait and see if that is
      necessary in our context.
      
      On the instantiate page, generate a json structure for each cluster,
      similar to the one described in issue #172 by Keith. This way we can easily
      switch the existing code over to this new system, but fall back to the
      old mechanism if this turns out to be a bust.
      
      Some other related changes to how we hand clusters into the several web
      pages.