  1. 08 Aug, 2018 1 commit
      Big set of changes for deferred/scheduled/offline aggregates: · 6f17de73
      Leigh Stoller authored
      * I started out to add just deferred aggregates: those that are offline
        when starting an experiment (and marked in the apt_aggregates table as
        being deferrable). When an aggregate is offline, we add an entry to the
        new apt_deferred_aggregates table and periodically retry starting the
        missing slivers. To accomplish this, I split create_instance into two
        scripts: the first creates the instance in the DB, and the second
        (create_slivers) creates the slivers for the instance. The daemon
        calls create_slivers for any instances in the deferred table, until
        all deferred aggregates are resolved (see the sketch after this
        message).
      
        On the UI side, there are various changes to deal with allowing
        experiments to be partially created. For example, we used to wait
        until we had all the manifests before showing the topology; now we
        show the topo on the first manifest, and add the others as they come
        in. Various parts of the UI had to change to deal with missing
        aggregates; I am sure I did not get them all.
      
      * And then once I had that, I realized that "scheduled" experiments were
        an "easy" addition; it is just a degenerate case of deferred. For this
        I added some new slots to the tables to hold the scheduled start time,
        and added a started stamp so we can distinguish between the time the
        experiment was created and the time it was actually started. Lots of
        new data.
      
        On the UI side, there is a new fourth step on the instantiate page
        that gives the user a choice of an immediate or a scheduled start. I
        moved the experiment duration to this step. I was originally going to
        add a calendar choice for termination, but I did not want to change
        the existing 16-hour max duration policy yet.
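
      A minimal sketch of the daemon's retry pass, in Python for
      illustration only (the real daemon is a Perl script); the db handle,
      the start_at column, and the helper names are assumptions:

        import subprocess
        import time

        RETRY_INTERVAL = 60   # seconds between passes (assumed value)

        def retry_pass(db):
            """One pass: retry create_slivers for each deferred aggregate.
            A scheduled experiment is just a deferred entry whose start
            time has not arrived yet."""
            rows = db.query("SELECT uuid, aggregate_urn, start_at "
                            "FROM apt_deferred_aggregates")
            for uuid, urn, start_at in rows:
                if start_at and start_at > time.time():
                    continue            # scheduled start still in the future
                # create_slivers is the second half of the old create_instance.
                if subprocess.call(["create_slivers", "-a", urn, uuid]) == 0:
                    db.execute("DELETE FROM apt_deferred_aggregates "
                               "WHERE uuid=%s AND aggregate_urn=%s", (uuid, urn))

        def main(db):
            while True:                 # until all deferred aggregates resolve
                retry_pass(db)
                time.sleep(RETRY_INTERVAL)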
  2. 16 Feb, 2018 1 commit
      A lot of work on the RPC code, among other things. · 56f6d601
      Leigh Stoller authored
      I spent a fair amount of time improving error handling along the RPC
      path, as well as making the code more consistent across the various
      files. Also be more consistent in how the web interface invokes the
      backend and gets errors back, specifically for errors generated when
      talking to a remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled
      in the database. Also check that we can actually reach the cluster,
      and that the cluster is not offline (NoLogins()), before we try to do
      anything (see the sketch below). I might have to relax this a bit, but
      in general it takes a couple of seconds to check, which is a small
      fraction of what most RPCs take. Return precise errors to the web
      interface for clusters that are not available, and show them to the
      user.
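
      A rough sketch of the pre-RPC gate, in illustrative Python (the real
      code is Perl); NoLogins() is from this commit, while the column,
      method, and helper names are assumptions:

        class ClusterUnavailable(Exception):
            """Raised with a precise, user-facing reason; no RPC is attempted."""

        def check_cluster(db, aggregate):
            # 1. Disabled in the database? (The "disabled" column is assumed.)
            row = db.query_one("SELECT disabled FROM apt_aggregates "
                               "WHERE urn=%s", (aggregate.urn,))
            if row and row["disabled"]:
                raise ClusterUnavailable(aggregate.name + " is disabled")
            # 2. Can we actually reach it? A couple of seconds at most.
            if not aggregate.reachable():
                raise ClusterUnavailable(aggregate.name + " is unreachable")
            # 3. Offline (NoLogins())? Then do not try to do anything.
            if aggregate.no_logins():
                raise ClusterUnavailable(aggregate.name + " is offline")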
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
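
      The webtask convention, roughly (illustrative Python; the attribute
      and return values are assumptions):

        def classify_exit(webtask):
            """Decide what the web interface may show the user."""
            if webtask.exitcode is None:
                # The script exited before setting the exitcode: always an
                # internal failure. Show a generic error; the details go to
                # tbops email, never to the user.
                return "internal-error"
            return "failed" if webtask.exitcode else "ok"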
      
      Show just those RPC errors that would make sense to users; stop
      spewing script output at the user, and send it just to tbops via the
      email that is already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of all
      experiments.
      
      For create_instance, add a new "async" mode that tells CreateSliver()
      to return before the first mapper run, which typically happens very
      quickly. Then watch for errors, for the manifest (via Resolve), or
      for the slice to disappear; see the sketch after the next paragraph.
      I expect this wait to be bounded, so we do not need to worry so much
      about timing it out (which is a problem on very big topologies). When
      we see the manifest, the RedeemTicket() part of the CreateSliver is
      done and we are into the StartSliver() phase.
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment
      into the failed state. I am still working on this.
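
      The async flow, sketched in illustrative Python (the portal backend
      is Perl); the method names on the aggregate handle and the poll and
      timeout values are assumptions:

        import time

        def wait_for_manifest(aggregate, slice_urn, poll=10, timeout=3600):
            """Poll after an async CreateSliver() until the manifest appears.
            Seeing the manifest means RedeemTicket() is done and we are in
            the StartSliver() phase."""
            deadline = time.time() + timeout
            while time.time() < deadline:
                if not aggregate.slice_exists(slice_urn):
                    raise RuntimeError("slice disappeared; CreateSliver failed")
                result = aggregate.resolve(slice_urn)    # the Resolve RPC
                if result.get("error"):
                    raise RuntimeError(result["error"])  # show this to the user
                if result.get("manifest"):
                    return result["manifest"]
                time.sleep(poll)
            raise TimeoutError("no manifest yet; mapper still running?")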
  3. 12 Nov, 2016 1 commit
      Bring the cluster monitor "in-house", rather than depending on the jfed monitoring system. · d7c4230e
      Leigh Stoller authored
      
      The new portal_monitor daemon does a GetVersion/ListResources call at
      each of the clusters every five minutes, and updates the new DB table
      apt_aggregate_status (see the sketch below). We calculate free/in-use
      counts for physical nodes and a free count for VMs. Failure to
      contact the aggregate for more than 10 minutes marks the aggregate as
      down, since from our perspective, if we cannot get to it, the cluster
      is down.
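
      One monitor sweep, sketched in illustrative Python (the daemon itself
      is Perl); apart from apt_aggregate_status, the column names, method
      names, and advertisement shape are assumptions:

        import time

        POLL_INTERVAL  = 5 * 60    # GetVersion/ListResources every five minutes
        DOWN_THRESHOLD = 10 * 60   # unreachable this long => cluster is down

        def monitor_pass(db, clusters):
            for cluster in clusters:
                try:
                    cluster.get_version()              # GetVersion RPC
                    nodes = cluster.list_resources()   # ListResources RPC
                    phys  = [n for n in nodes if not n["virtual"]]
                    pfree = sum(1 for n in phys if n["available"])
                    vfree = sum(n["vm_slots"] for n in nodes if n["virtual"])
                    db.execute("UPDATE apt_aggregate_status SET status='up', "
                               "pfree=%s, pinuse=%s, vfree=%s, last_success=NOW() "
                               "WHERE urn=%s",
                               (pfree, len(phys) - pfree, vfree, cluster.urn))
                except IOError:
                    # From our perspective, unreachable means down, but only
                    # after 10 minutes of failed contact.
                    if time.time() - cluster.last_success > DOWN_THRESHOLD:
                        db.execute("UPDATE apt_aggregate_status SET status='down' "
                                   "WHERE urn=%s", (cluster.urn,))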
      
      Unlike the jfed monitoring system, we are not going to try to
      instantiate a new experiment or ssh into it. We will wait and see if
      that is necessary in our context.
      
      On the instantiate page, generate a json structure for each cluster,
      similar to the one described in issue #172 by Keith. This way we can
      easily switch the existing code over to this new system, but fall
      back to the old mechanism if this turns out to be a bust.
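
      The commit does not reproduce the structure from issue #172; purely
      as a placeholder, the per-cluster json presumably carries the status
      and counts from apt_aggregate_status, along these lines (all field
      names are guesses):

        cluster_status = {
            "name":   "SomeCluster",      # hypothetical cluster name
            "status": "up",               # up/down from apt_aggregate_status
            "pfree":  120,                # free physical nodes
            "pinuse": 310,                # in-use physical nodes
            "vfree":  64,                 # free VMs
        }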
      
      Some other related changes to how we pass cluster information into
      the several web pages.
  4. 18 Oct, 2016 1 commit
      Image alias support for the constraint system. · a5ee67dc
      Leigh Stoller authored
      As per the discussion at the meeting, we use image aliases to create
      an equivalence class for the hardware types the alias refers to. For
      example, UBUNTU16-64-STD is an x86 image on the image server. It is
      also an image alias (at the Moonshot cluster) that points to the x86
      image and the ARM image. So that means UBUNTU16-64-STD runs on x86
      types and Moonshot types (see the sketch below).
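
      Conceptually, for the constraint system (a Python sketch with made-up
      data structures; the image names per architecture and the hardware
      type names are assumptions):

        def image_types(image, aliases, base_types):
            """Hardware types an image runs on: an alias is the union of
            the types of every image it points to."""
            if image in aliases:
                types = set()
                for target in aliases[image]:
                    types |= image_types(target, aliases, base_types)
                return types
            return set(base_types.get(image, ()))

        # UBUNTU16-64-STD at the Moonshot cluster points at an x86 image
        # and an ARM image, so it runs on both sets of types.
        aliases    = {"UBUNTU16-64-STD": ["UBUNTU16-64-STD/x86",
                                          "UBUNTU16-64-STD/arm"]}
        base_types = {"UBUNTU16-64-STD/x86": ["d430"],
                      "UBUNTU16-64-STD/arm": ["m400"]}
        assert image_types("UBUNTU16-64-STD", aliases, base_types) == {"d430", "m400"}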
  5. 08 Dec, 2015 1 commit
      Batch of changes that creates a PhantomNet portal branding. · ba49a457
      Kirk Webb authored
      Also includes some PhantomNet-specific restrictions (e.g., it only
      allows use of the main Utah Emulab testbed aggregate).
      
      This exercise stretched the limits of what we can reasonably do
      before introducing real per-testbed branding/policy mechanisms into
      the php/web front-end. My changes ain't exactly pretty...
      
      Please take care when adding any testbed-specific changes to the
      code.  There are three flavors now to consider in the logic.
  6. 04 Nov, 2015 1 commit
      Changes for Keith to develop the new profile picker: · eafff053
      Leigh Stoller authored
      1. Instead of a plain list of profiles, generate a more detailed list
         that includes last-used time, usage count, project name, and a
         favorite flag, so that the new picker can be sorted/grouped.

         This list is *ordered* by most recent usage (for a real user) or
         most popular (for a guest).
      
      2. Move the modal from quickvm_sup to the template, and generate the
         current list from the new json info.
      
      3. Add new table apt_profile_favorites to record favorite profiles for
         users.
      
      4. Add new ajax calls for the above, MarkFavorite and ClearFavorite,
         which take a single argument: the uuid of the profile. There is no
         UI for this yet; Keith is going to add that. (A sketch of the
         favorites plumbing follows this list.)
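
      A minimal sketch of the favorites plumbing (the real backend is PHP,
      so this Python is purely illustrative); only the table name
      apt_profile_favorites and the single-uuid argument come from above,
      and the column names are assumptions:

        def MarkFavorite(db, uid, profile_uuid):
            """Ajax handler: record a profile as a favorite for this user."""
            db.execute("REPLACE INTO apt_profile_favorites (uid, profile_uuid) "
                       "VALUES (%s, %s)", (uid, profile_uuid))

        def ClearFavorite(db, uid, profile_uuid):
            """Ajax handler: remove the favorite mark."""
            db.execute("DELETE FROM apt_profile_favorites "
                       "WHERE uid=%s AND profile_uuid=%s", (uid, profile_uuid))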