1. 19 Apr, 2018 1 commit
  2. 16 Feb, 2018 1 commit
    • A lot of work on the RPC code, among other things. · 56f6d601
      Leigh B Stoller authored
      I spent a fair amount of time improving error handling along the RPC
      path, as well as making the code more consistent across the various
      files. The web interface is also more consistent in how it invokes the
      backend and gets errors back, specifically for errors that are generated
      when talking to a remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled in
      the database. Also check that we can actually reach the cluster, and
      that the cluster is not offline (NoLogins()) before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most RPCs
      take. Return precise errors for clusters that are not available to the
      web interface, and show them to the user.
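
The order of the checks described above can be sketched as follows. This is a minimal illustration, not the actual backend code: the `Cluster` record, its field names, and the error strings are all hypothetical, and the two probes are passed in as callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster record; not the real schema."""
    name: str
    disabled: bool                     # flag as stored in the database
    is_reachable: Callable[[], bool]   # assumed probe, e.g. a short connect test
    no_logins: Callable[[], bool]      # assumed probe; True when the cluster is offline


def preflight_error(cluster: Cluster) -> Optional[str]:
    """Return a precise, user-facing error, or None if the RPC may proceed."""
    # Cheapest check first: the database flag needs no network traffic.
    if cluster.disabled:
        return f"{cluster.name} is disabled"
    # Only probe for offline status once we know the cluster answers at all.
    if not cluster.is_reachable():
        return f"{cluster.name} is not reachable"
    if cluster.no_logins():
        return f"{cluster.name} is offline"
    return None
```

A caller would run `preflight_error()` before each RPC and hand any returned string to the web interface instead of attempting the call.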
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
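
One way to picture the abnormal-exit rule is the sketch below. It assumes, purely for illustration, that a webtask can be viewed as a dictionary with an optional `exitcode` and an `output` message; the real webtask object and the wording of the generic message are not taken from the source.

```python
from typing import Optional, Tuple

# Hypothetical generic text shown when details must stay hidden from the user.
GENERIC_ERROR = "Internal error"


def user_visible_result(webtask: dict) -> Tuple[bool, Optional[str]]:
    """Map a finished webtask to (success, message shown to the user)."""
    exitcode = webtask.get("exitcode")
    if exitcode is None:
        # The script exited before setting the exitcode: an internal failure,
        # so the user sees only a generic message, never the script output.
        return (False, GENERIC_ERROR)
    if exitcode == 0:
        return (True, None)
    # The script failed deliberately and set an exitcode; its message is
    # assumed to be meaningful to the user.
    return (False, webtask.get("output") or GENERIC_ERROR)
```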
      
      Show just those RPC errors that would make sense to users. Stop spewing
      script output to the user; send it just to tbops via the email that is
      already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of
      experiments.
      
      For create_instance, add a new "async" mode that tells CreateSliver() to
      return before the first mapper run, so the call typically returns very
      quickly.
      Then watch for errors or for the manifest with Resolve or for the slice
      to disappear. I expect this to be bounded and so we do not need to worry
      so much about timing this wait out (which is a problem on very big
      topologies). When we see the manifest, the RedeemTicket() part of the
      CreateSliver is done and now we are into the StartSliver() phase.
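
The wait described above can be sketched as a polling loop. Everything here is an assumption made for illustration: `poll` stands in for a Resolve call and is assumed to return a dictionary with `error`, `gone`, or `manifest` keys, and the `max_polls` cap is a safety net added for the sketch, not something the commit describes.

```python
import time
from typing import Any, Callable, Optional, Tuple


def wait_for_manifest(
    poll: Callable[[], dict],
    interval: float = 5.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[Any]]:
    """Poll until an error, a vanished slice, or the manifest shows up."""
    for _ in range(max_polls):
        state = poll()
        if state.get("error"):
            return ("error", state["error"])
        if state.get("gone"):
            # The slice disappeared, so there is nothing left to wait for.
            return ("gone", None)
        if state.get("manifest"):
            # The RedeemTicket() part is done; StartSliver() comes next.
            return ("manifest", state["manifest"])
        sleep(interval)
    return ("timeout", None)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.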
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment into
      the failed state. I am still working on this.
  3. 04 Dec, 2017 1 commit
    • Changes related to extensions: · e1b6076f
      Leigh B Stoller authored
      * Change the units of extension from days to hours along the extension
        path. The user does not see this directly, but it allows us to extend
        experiments to the hour before they are needed by a different
        reservation, both on the user extend modal and the admin extend modal.
      
        On the admin extend page, the input box still defaults to days, but
        you can also use xDyH to specify days and hours, or just yH for hours
        only.
      
        But to make things easier, there is also a new "max" checkbox to
        extend an experiment out to the maximum allowed by the reservation
        system.
      
      * Changes to "lockout" (disabling extensions). Add a reason field to the
        database; clicking the lockout checkbox now prompts for an optional
        reason.
      
        The user no longer sees the extension modal when extensions are
        disabled; instead we show an alert telling them that extensions are
        disabled, along with the reason.
      
        On the admin extend page there is a new checkbox to disable extensions
        when denying an extension or scheduling termination.
      
        Log extension disable/enable to the audit log.
      
      * Clear out a bunch of old extension code that is no longer used (since
        the extension code was moved from php to perl).
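
The xDyH input format mentioned above could be converted to hours along these lines. This is a sketch under assumptions: the function name is hypothetical, and accepting lowercase letters or a bare `xD` with no hours goes beyond what the commit message states.

```python
import re

# Optional days part followed by an optional hours part, e.g. "2D12H" or "5H".
_DAYS_HOURS = re.compile(r"(?:(\d+)D)?(?:(\d+)H)?", re.IGNORECASE | re.ASCII)


def extension_hours(text: str) -> int:
    """Convert the admin input box value to a number of hours."""
    text = text.strip()
    if re.fullmatch(r"\d+", text, re.ASCII):
        # A bare number still means days.
        return int(text) * 24
    match = _DAYS_HOURS.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"unrecognized extension: {text!r}")
    days, hours = match.groups()
    return int(days or 0) * 24 + int(hours or 0)
```

For example, `extension_hours("2D12H")` gives 60 and `extension_hours("5H")` gives 5, while a plain `"3"` is still read as three days (72 hours).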
  4. 03 Nov, 2017 1 commit
    • Fixes/Changes for reservations: · 79d99fa8
      Leigh B Stoller authored
      1. Fix the user extend modal to show the proper number of days they can
         extend.
      
      2. Fix the admin extend modal warning for when the extension would
         violate the max extension; it was not showing. Add new alerts for
         when we cannot get the max extension from the cluster or when no
         extension at all is allowed.
      
      3. Reduce number of days in the box to max allowed. Warn loudly if you
         type a different number and it is greater than the max extension.
      
      4. Add "force" box to override. Use with caution. Added the plumbing
         through to the back end as a new force option to RenewSliver().
      
      5. Add check in RenewSliver() to ask the reservation system if the
         extension is allowed before doing it. This was missing; adding it
         should solve some of the overbooking problems.
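
The interplay between the max extension, the warnings, and the new force option might look roughly like the sketch below. The function, its return shape, and the message texts are hypothetical. Refusing an unknown maximum unless forced is also an assumption made for the sketch, not something the commit spells out.

```python
from typing import Optional, Tuple


def check_extension(
    requested_days: int,
    max_days: Optional[int],
    force: bool = False,
) -> Tuple[bool, Optional[str]]:
    """Return (allowed, warning) for a requested extension.

    max_days is None when the cluster could not report a maximum.
    """
    if max_days is None:
        problem = "could not get max extension from the cluster"
    elif max_days <= 0:
        problem = "no extension allowed"
    elif requested_days > max_days:
        problem = (f"requested {requested_days} days exceeds "
                   f"max extension of {max_days} days")
    else:
        return (True, None)
    # With force the request goes ahead, but the warning is still reported.
    return (force, problem)
```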
  5. 13 Oct, 2017 1 commit
    • Changes for automatic lockdown of experiments: · 8f4e3191
      Leigh B Stoller authored
      1. First off, we no longer do automatic lockdown of experiments when
         granting an extension longer than 10 days.
      
      2. Instead, we will lock down experiments on a case-by-case basis.
      
      3. Change the lockdown path to ask the reservation system at the target
         cluster if locking down would throw the reservation system into
         chaos. If so, return a refused error and give the admin the choice to
         override. When we do override, send email to local tbops informing
         them that the reservation system is in a chaos state.
  6. 06 Sep, 2017 1 commit
  7. 26 Jun, 2017 1 commit
  8. 07 Jun, 2017 1 commit
    • Implement issue #284 (admin notes). · 8ab4519f
      Leigh B Stoller authored
      You can find and change the admin notes on the admin extension page,
      since that is the page where we show lots of admin-only stuff. Might
      need to rename this page at some point.
  9. 17 Apr, 2017 1 commit
    • Separate user vs admin lockdown, previously they were intertwined. · 8e88917e
      Leigh B Stoller authored
      User lockdown is as before; the user can override that on the terminate
      page. Admin lockdown is like Classic lockdown; the flag must be cleared
      before the experiment can be terminated, and there is no override on the
      termination page.
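
The difference between the two flags can be summarized in a few lines. This is only a sketch of the rule as described above; the function and parameter names are hypothetical.

```python
def can_terminate(user_lockdown: bool, admin_lockdown: bool,
                  user_override: bool = False) -> bool:
    """Decide whether termination may proceed from the terminate page."""
    if admin_lockdown:
        # Admin lockdown has no override; the flag must be cleared first.
        return False
    if user_lockdown and not user_override:
        # User lockdown blocks termination until the user overrides it.
        return False
    return True
```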
      
      UI changes on the status and admin extend pages for the additional
      flag (instead of a single lockdown, there are now two).
  10. 20 Oct, 2016 1 commit
  11. 20 Sep, 2016 1 commit
  12. 12 Sep, 2016 1 commit
  13. 16 Jul, 2016 1 commit
  14. 11 Jul, 2016 1 commit
  15. 10 Jun, 2016 1 commit
  16. 06 Jun, 2016 2 commits
  17. 01 Jun, 2016 1 commit
    • Several sets of changes scattered across all these files. · 0f4a4dfb
      Leigh B Stoller authored
      * More on issue #54; watch for openstack experiments and try to download
        the new openstack stats file via the fast XMLRPC path. Show this as a
        text blob in a new tab on the status page; we still need to graph the
        data.
        The apt_daemon handles the periodic request for the data (every 10
        minutes), which we store in the apt_instances table.
      
      * Addition for Rob on the admin extend page: add a "more info" button that
        sends the contents of the text box as an email message requesting more
        info and stores that in the ongoing interaction log. Responses from the
        user are not stored, though; might look at that someday.
      
      * Another addition for Rob; on the extensions list page, also show expired,
        locked down experiments. Note the sorting; at the top of the list are
        actual extension requests (status='ready'), while at the bottom of the
        list are those with status='expired'.
      
      * Add a "graphs" tab to the status page, which shows the same idle stats
        graphs that were added to the admin extend page. Most of this change is
        refactoring the code and sharing it between the two pages.
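
The list ordering mentioned above (status='ready' first, status='expired' last) can be illustrated with a stable sort on a status rank. The row layout and names are hypothetical; only the two status values come from the commit message.

```python
# Lower rank sorts first; statuses not listed here go after the known ones.
STATUS_RANK = {"ready": 0, "expired": 1}


def sort_for_listing(rows):
    """Order rows so extension requests come before expired experiments."""
    # sorted() is stable, so rows with the same status keep their input order.
    return sorted(rows, key=lambda r: STATUS_RANK.get(r["status"], len(STATUS_RANK)))
```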
  18. 25 May, 2016 1 commit
  19. 03 May, 2016 1 commit
  20. 28 Apr, 2016 2 commits
  21. 25 Apr, 2016 2 commits
    • Minor tweak. · 73513547
      Leigh B Stoller authored
    • More changes as discussed in #62; the main change in this commit is the · ec52cc89
      Leigh B Stoller authored
      switch to a new admin extend page that includes more summary, utilization,
      and cluster info, to make it easier to determine the merits of a particular
      extension. As part of this change, extensions are now first class objects
      associated with an instance (mostly a convenience, better than the ongoing
      text field, which was annoying to do anything interesting with).