1. 02 Mar, 2018 1 commit
  2. 01 Mar, 2018 5 commits
  3. 21 Feb, 2018 1 commit
  4. 16 Feb, 2018 8 commits
    • Leigh Stoller's avatar
      A lot of work on the RPC code, among other things. · 56f6d601
      Leigh Stoller authored
      I spent a fair amount of time improving error handling along the RPC path,
      as well as making the code more consistent across the various files. Also
      be more consistent in how the web interface invokes the backend and gets
      errors back, specifically for errors that are generated when talking to a
      remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled in
      the database. Also check that we can actually reach the cluster, and
      that the cluster is not offline (NoLogins()) before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most RPCs
      take. Return precise errors for unavailable clusters to the web
      interface and show them to the user.
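
The pre-RPC checks described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual backend code: the `Cluster` record, its field names, `precheck`, `call_rpc`, and the error strings are all hypothetical, and the reachability and NoLogins() results are assumed to have been gathered already.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Cluster:
    # Hypothetical record; the field names are illustrative only.
    name: str
    disabled: bool    # cluster is marked disabled in the database
    reachable: bool   # result of a quick connectivity probe
    no_logins: bool   # what a NoLogins()-style query reported


def precheck(cluster: Cluster) -> Optional[str]:
    """Return a precise, user-facing error, or None if the RPC may proceed."""
    if cluster.disabled:
        return f"{cluster.name} is disabled"
    if not cluster.reachable:
        return f"{cluster.name} is not reachable"
    if cluster.no_logins:
        return f"{cluster.name} is offline"
    return None


def call_rpc(cluster: Cluster, rpc: Callable[[], Any]) -> Tuple[bool, Any]:
    """Run the checks first; only invoke the RPC when the cluster is usable."""
    error = precheck(cluster)
    if error is not None:
        return (False, error)
    return (True, rpc())
```

Ordering the checks from cheapest (database flag) to most expensive (remote query) keeps the added latency small in the common failure cases.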
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
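
One way to picture the abnormal-exit rule above, as a hedged Python sketch: the function name `classify`, the `INTERNAL_ERROR` text, and the use of `None` to represent "exitcode never set in the webtask" are assumptions made for illustration, not the real webtask interface.

```python
from typing import Optional, Tuple

# Hypothetical generic message; the real wording is not given in the source.
INTERNAL_ERROR = "Internal error"


def classify(exitcode: Optional[int], message: str) -> Tuple[bool, str]:
    """Return (ok, text_for_user) for a finished backend script.

    exitcode is None when the script exited before recording an exit
    code, which is treated as an internal failure whose output must not
    be shown to the user.
    """
    if exitcode is None:
        return (False, INTERNAL_ERROR)
    if exitcode == 0:
        return (True, "")
    # The script set its own exit code, so its message is meant for the user.
    return (False, message)
```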
      
      Show just those RPC errors that would make sense to users. Stop spewing
      script output to the user; send it just to tbops via the email that is
      already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of experiments.
      
      For create_instance, add a new "async" mode that tells CreateSliver() to
      return before the first mapper run, so it typically returns very quickly.
      Then watch for errors or for the manifest with Resolve or for the slice
      to disappear. I expect this to be bounded and so we do not need to worry
      so much about timing this wait out (which is a problem on very big
      topologies). When we see the manifest, the RedeemTicket() part of the
      CreateSliver is done and now we are into the StartSliver() phase.
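
The wait that follows an async CreateSliver() can be sketched as a simple polling loop. This Python version is only an illustration under stated assumptions: `poll` stands in for whatever Resolve-style query the backend makes, the dictionary keys (`error`, `manifest`, `slice_exists`) are invented for the example, and `max_polls` is a safety cap added here even though the commit message expects the wait to be bounded.

```python
import time
from typing import Callable, Dict


def wait_for_manifest(poll: Callable[[], Dict],
                      interval: float = 5.0,
                      max_polls: int = 1000,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until an error, the manifest, or a vanished slice is seen.

    Returns one of "error", "manifest", "gone", or "timeout".
    """
    for _ in range(max_polls):
        state = poll()
        if state.get("error"):
            return "error"
        if state.get("manifest"):
            # Manifest present: the RedeemTicket() part is done and the
            # StartSliver() phase comes next.
            return "manifest"
        if not state.get("slice_exists", True):
            return "gone"
        sleep(interval)
    return "timeout"
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.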
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment into
      the failed state. I am still working on this.
      56f6d601
    • Leigh Stoller's avatar
      When clicking the max extension checkbox, also make the input box · 7701cb31
      Leigh Stoller authored
      readonly to make it clearer that the input is ignored (the box
      was already cleared).
      7701cb31
    • Leigh Stoller's avatar
      21f08b70
    • Leigh Stoller's avatar
      Build commands with the WebTask arguments. · f8116bef
      Leigh Stoller authored
      f8116bef
    • Leigh Stoller's avatar
      UI changes; add an approve button so most reservations can be approved from · f01c86be
      Leigh Stoller authored
      the listing without going to the edit page. Also hide the approve/deny
      icons on approved reservations.
      f01c86be
  5. 05 Feb, 2018 1 commit
  6. 25 Jan, 2018 1 commit
  7. 23 Jan, 2018 1 commit
  8. 22 Jan, 2018 3 commits
  9. 17 Jan, 2018 1 commit
  10. 16 Jan, 2018 1 commit
    • Leigh Stoller's avatar
      Slightly better handling for GPO users who are not members of · 7a559d31
      Leigh Stoller authored
      any projects at the GPO portal; redirect them after login to a page that
      tells them they have no project membership. We were already not giving them
      any menus, but add back in the right side dropdown that has their login
      name, with only a logout button.
      
      Eventually we want to make it easier for them to promote to a real user,
      but that needs a bit more UI work. I just made a ticket for it, #377.
      7a559d31
  11. 11 Jan, 2018 2 commits
  12. 09 Jan, 2018 1 commit
  13. 02 Jan, 2018 3 commits
  14. 01 Jan, 2018 1 commit
    • Leigh Stoller's avatar
      Changes to reservation system wrt classic interface: · dc90a087
      Leigh Stoller authored
      1. Reservation system now groks experiment lockdown and swappable. When
         swapping in, lockdown and swappable mean the expected end of the
         experiment is never.
      
      2. Reservation library now handles changes to lockdown, swappable, and
         autoswap (timeout). editexp now hands these changes off to a new
         script called manage_expsettings, which can be called by hand since
         we might need to force a change (I am not changing the classic UI; if
         a change is not allowed by the res system, we have to do it by hand).
      
      3. Minor fixes to reservation library.
      dc90a087
  15. 26 Dec, 2017 1 commit
  16. 23 Dec, 2017 1 commit
    • Leigh Stoller's avatar
      Temporary fix for issue #370; disable Jacks (viewer and the constraint · a5a61f6f
      Leigh Stoller authored
      checker) when the number of nodes is greater than 100. There appears to
      be a pathological problem in Jacks that causes it to consume 100% of
      your CPU for 5 minutes or so before finishing. This was making it
      impossible for David to do his 200+ node tests.
      
      Addendum: while testing this just now, I noticed that the problem is
      related to the cross product of nodes and links. 100 nodes and no links
      is actually not that bad, but add a lan and watch out!
      a5a61f6f
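
The temporary guard amounts to a size threshold. A minimal Python sketch follows, assuming hypothetical names (`MAX_JACKS_NODES`, `jacks_enabled`); the real check lives in the web interface and is not shown in the commit message.

```python
# Hypothetical constant mirroring the 100-node limit in the commit message.
MAX_JACKS_NODES = 100


def jacks_enabled(node_count: int) -> bool:
    """Jacks (viewer and constraint checker) is disabled above the limit."""
    return node_count <= MAX_JACKS_NODES
```

Given the addendum, a threshold on node count alone is a blunt instrument, since the cost appears to depend on links as well as nodes.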
  17. 20 Dec, 2017 1 commit
  18. 14 Dec, 2017 2 commits
  19. 13 Dec, 2017 5 commits