1. 13 Mar, 2019 1 commit
  2. 23 Oct, 2018 1 commit
    • New version of the portal monitor that is specific to the Mothership. · 2a5cbb2a
      Leigh Stoller authored
      This version is intended to replace the old autostatus monitor on bas,
      except for monitoring the Mothership itself. We also notify the Slack
      channel like the autostatus version. Driven from the apt_aggregates
      table in the DB, we do the following.
      
      1. fping all the boss nodes.
      
      2. fping all the ops nodes and dboxen. Aside: there are two special
         cases for now that will eventually come from the database; 1) the
         Powder wireless aggregates do not have a public ops node, and 2) the
         dboxen are hardwired into a table at the top of the file.
      
      3. Check all the DNS servers. Unlike autostatus (which just checks
         that port 53 is listening), we do an actual lookup at the server.
         This is done with dig @ the boss node with recursion turned off. At
         the moment this is a serialized test of all the DNS servers, which
         might need to change later. I've lowered the timeout, and if things
         are operational 99% of the time (which I expect), then this will be
         okay until we get a couple of dozen aggregates to test.

         Note that this test is skipped if the boss is not pingable in the
         first step, so in general this test will not be a bottleneck.
      
      4. Check all the CMs with a GetVersion() call. As with the DNS check, we
         skip this if the boss does not ping. This test *is* done in parallel
         using ParRun(), since it's slower and the most likely to time out
         when the CM is busy. The timeout is 20 seconds. This seems to be the
         best balance between too much email and not hanging for too long on
         any one aggregate.
      
      5. Send email and Slack notifications. The current loop is every 60
         seconds, and each test has to fail twice in a row before we mark a
         test as a failure and send a notification. We also send a 24-hour
         update for anything that is still down. (A rough sketch of the DNS
         check and this fail-twice logic is at the end of this message.)
      
      At the moment, the full set of tests takes 15 seconds on our seven
      aggregates when they are all up. Will need more tuning later, as the
      number of aggregates goes up.
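
      A rough sketch of the DNS check (step 3) and the fail-twice
      notification logic (step 5), written as Python rather than the actual
      implementation; the notify() helper and the thresholds shown are
      hypothetical stand-ins:

          import subprocess

          def check_dns(boss):
              # Query the aggregate's own name server directly, recursion
              # off, with a short timeout so a dead server does not stall
              # the whole loop.
              cmd = ["dig", "@" + boss, boss, "A",
                     "+norecurse", "+time=3", "+tries=1"]
              try:
                  out = subprocess.run(cmd, capture_output=True, timeout=10)
              except subprocess.TimeoutExpired:
                  return False
              return out.returncode == 0 and b"ANSWER SECTION" in out.stdout

          def notify(aggregate):
              # Stand-in for the email/Slack notification path.
              print("aggregate down:", aggregate)

          failures = {}   # aggregate -> consecutive failure count

          def record(aggregate, ok):
              # Require two failures in a row before notifying, so a single
              # blip does not generate email.
              if ok:
                  failures[aggregate] = 0
                  return
              failures[aggregate] = failures.get(aggregate, 0) + 1
              if failures[aggregate] == 2:
                  notify(aggregate)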
  3. 01 Oct, 2018 1 commit
    • More work on the aggregate monitoring. · 9f3205c9
      Leigh Stoller authored
      1. Split the resource stuff (where we ask for an advertisement and
         process it) into a separate script, since that takes a long time to
         cycle through because of the size of the ads from the big clusters.
      
      2. On the monitor, distinguish offline (nologins) from actually being
         down.
      
      3. Add a table to store changes in status so we can see over time how
         long the aggregates are usable (sketched below).
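
      A minimal sketch of recording only status *changes*, so uptime can be
      computed from the gaps between rows (Python; the table name
      aggregate_status_history and the DB cursor are assumptions, not the
      actual schema):

          import time

          last_status = {}    # aggregate urn -> last recorded status

          def record_status(cursor, urn, status):
              # Insert a row only when the status differs from the last one
              # we recorded; successive timestamps then bound how long the
              # aggregate spent in each state.
              if last_status.get(urn) == status:
                  return
              last_status[urn] = status
              cursor.execute(
                  "INSERT INTO aggregate_status_history (urn, status, stamp) "
                  "VALUES (%s, %s, %s)",
                  (urn, status, int(time.time())))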
  4. 29 Aug, 2018 1 commit
  5. 08 Aug, 2018 1 commit
    • Big set of changes for deferred/scheduled/offline aggregates: · 6f17de73
      Leigh Stoller authored
      * I started out to add just deferred aggregates: those that are offline
        when starting an experiment (and marked in the apt_aggregates table
        as being deferrable). When an aggregate is offline, we add an entry
        to the new apt_deferred_aggregates table and periodically retry to
        start the missing slivers. To accomplish this, I split
        create_instance into two scripts; the first creates the instance in
        the DB, and the second (create_slivers) creates the slivers for the
        instance. The daemon calls create_slivers for any instances in the
        deferred table until all deferred aggregates are resolved (a rough
        sketch of the retry loop is at the end of this message).
      
        On the UI side, there are various changes to deal with allowing
        experiments to be partially created. For example, we used to wait
        until we had all the manifests before showing the topology; now we
        show the topo on the first manifest and add the others as they come
        in. Various parts of the UI had to change to deal with missing
        aggregates; I am sure I did not get them all.
      
      * And then once I had that, I realized that "scheduled" experiments
        were an "easy" addition; it is just a degenerate case of deferred.
        For this I added some new slots to the tables to hold the scheduled
        start time, and added a started stamp so we can distinguish between
        the time the experiment was created and the time it was actually
        started. Lots of data.
      
        On the UI side, there is a new fourth step on the instantiate page
        to give the user a choice of an immediate or scheduled start. I
        moved the experiment duration to this step. I was originally going
        to add a calendar choice for termination, but I did not want to
        change the existing 16-hour max duration policy yet.
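
      A rough model of the deferred-aggregate retry loop in the daemon
      (Python sketch; the column name, the retry interval, and the way
      create_slivers is invoked are assumptions for illustration):

          import subprocess, time

          RETRY_INTERVAL = 300    # seconds between passes (assumed)

          def retry_deferred(cursor):
              # One pass: re-run create_slivers for every instance that
              # still has entries in apt_deferred_aggregates. The script is
              # expected to clear entries as the missing slivers come up.
              cursor.execute(
                  "SELECT DISTINCT instance_uuid FROM apt_deferred_aggregates")
              for (uuid,) in cursor.fetchall():
                  subprocess.run(["create_slivers", uuid], check=False)

          # Daemon loop:
          #   while True:
          #       retry_deferred(cursor)
          #       time.sleep(RETRY_INTERVAL)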
  6. 30 Jul, 2018 1 commit
  7. 09 Jul, 2018 1 commit
  8. 30 Mar, 2018 2 commits
    • Initialize port range from defs- vars. · e593d62b
      Mike Hibler authored
    • Support for frisbee direct image upload to fs node. · 99943a19
      Mike Hibler authored
      We have had issues with uploading images to boss, where they are then
      written across NFS to ops. That seems to be a network hop too far on
      CloudLab Utah, where we have a 10Gb control network. We get occasional
      transient timeouts from somewhere in the TCP code. With the convoluted
      path through real and virtual NICs, some with offloading and some
      without, packets wind up getting out of order and someone gets far
      enough behind to cause problems.
      
      So we work around it.
      
      If IMAGEUPLOADTOFS is defined in the defs-* file, we will run a frisbee
      master server on the fs (ops) node, and the image creation path directs
      the nodes to use that server. There is a new hack configuration for the
      master server, "upload-only", which is extremely specific to ops: it
      validates the upload with the boss master server and, if allowed, fires
      up an upload server for the client to talk to. The image is thus
      uploaded directly to the local (ZFS) /proj or /groups filesystems on
      ops. This seems to be enough to get around the problem.
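
      A minimal flow model of the "upload-only" behavior described above
      (Python; validate_with_boss, spawn_upload_server, and the destination
      path layout are illustrative assumptions, not the actual frisbee code):

          def validate_with_boss(imageid):
              # Stand-in for the proxied validation request to the boss
              # master server.
              return True

          def spawn_upload_server(imageid, dest):
              # Stand-in for starting the per-client upload server that
              # writes directly to the local filesystem on ops.
              print("upload server for", imageid, "writing to", dest)
              return dest

          def handle_upload_request(imageid, project):
              # Validate with boss first; only then fire up an uploader that
              # writes to the local ZFS filesystem, bypassing the
              # boss -> NFS -> ops path.
              if not validate_with_boss(imageid):
                  return None
              dest = "/proj/%s/images/%s.ndz" % (project, imageid)
              return spawn_upload_server(imageid, dest)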
      
      Note that we could allow this master server to serve downloads as well to
      avoid the analogous problem in that direction, but this to date has not
      been a problem.
      
      NOTE: the ops node must be in the nodes table in the DB or else boss will
      not validate proxied requests from it. The standard install procedure is
      supposed to add ops, but we have a couple of clusters where it is not in
      the table!
  9. 18 Jan, 2018 1 commit
  10. 16 Jan, 2018 1 commit
  11. 04 Jan, 2018 1 commit
  12. 03 Jan, 2018 1 commit
  13. 30 Oct, 2017 1 commit
  14. 25 Oct, 2017 2 commits
  15. 06 Jul, 2017 1 commit
  16. 02 Jun, 2017 1 commit
  17. 31 May, 2017 2 commits
  18. 19 May, 2017 1 commit
  19. 30 Mar, 2017 1 commit
  20. 04 Feb, 2017 1 commit
  21. 31 Jan, 2017 1 commit
  22. 20 Jan, 2017 1 commit
  23. 19 Jan, 2017 1 commit
  24. 16 Dec, 2016 1 commit
    • Disable --secure-file-priv. · 87214f75
      Mike Hibler authored
      As of MySQL 5.5.53, it defaults to not allowing 'INTO OUTFILE', which
      we use in a number of places (e.g., to save experiment state in
      /usr/testbed/expwork).
  25. 15 Nov, 2016 2 commits
  26. 12 Nov, 2016 1 commit
  27. 27 Oct, 2016 1 commit
  28. 17 Oct, 2016 1 commit
  29. 31 Aug, 2016 1 commit
  30. 14 Jul, 2016 3 commits
  31. 21 Jun, 2016 2 commits
  32. 17 Jun, 2016 1 commit
  33. 19 May, 2016 1 commit