  1. 16 Feb, 2018 1 commit
    • A lot of work on the RPC code, among other things. · 56f6d601
      Leigh B Stoller authored
      I spent a fair amount of time improving error handling along the RPC
      path, as well as making the code more consistent across the various
      files. Also be more consistent in how the web interface invokes the
      backend and gets errors back, specifically for errors that are
      generated when talking to a remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled
      in the database. Also check that we can actually reach the cluster,
      and that the cluster is not offline (NoLogins()), before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most
      RPCs take. Return precise errors for clusters that are not available
      to the web interface, and show them to the user.
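
      A minimal sketch of that check order, in Python rather than the
      actual Perl backend; the Cluster fields and helper names here are
      assumptions, not the real schema:

        from dataclasses import dataclass

        class ClusterUnavailable(Exception):
            """Raised instead of attempting an RPC against a dead cluster."""

        @dataclass
        class Cluster:
            name: str
            disabled: bool = False   # "disabled" flag in the database
            reachable: bool = True   # result of a cheap liveness probe
            offline: bool = False    # cluster offline (NoLogins())

        def check_cluster_before_rpc(cluster: Cluster) -> None:
            # Refuse right away if the cluster is marked disabled in the DB.
            if cluster.disabled:
                raise ClusterUnavailable(f"{cluster.name} is disabled")
            # The probe takes a couple of seconds, a small fraction of
            # what most RPCs take.
            if not cluster.reachable:
                raise ClusterUnavailable(f"cannot reach {cluster.name}")
            # Respect the offline flag (NoLogins()) before doing anything.
            if cluster.offline:
                raise ClusterUnavailable(f"{cluster.name} is offline")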
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
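
      A hypothetical sketch of that convention (Python for illustration;
      the webtask is treated as a simple set of key/value results):

        def classify_webtask(webtask: dict) -> str:
            # Exiting before setting the exitcode always means an internal
            # failure; route it to tbops, never to the user.
            if "exitcode" not in webtask:
                return "internal-error"
            # A set, nonzero exitcode is a normal failure whose message is
            # meant to be shown to the user.
            if webtask["exitcode"] != 0:
                return "user-visible-error"
            return "ok"

        assert classify_webtask({}) == "internal-error"
        assert classify_webtask({"exitcode": 1}) == "user-visible-error"
        assert classify_webtask({"exitcode": 0}) == "ok"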
      
      Show just those RPC errors that would make sense to users, stop
      spewing script output at the user, and send it just to tbops via the
      email that is already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of
      experiments.
      
      For create_instance, a new "async" mode that tells CreateSliver() to
      return before the first mapper run, which typically happens very
      quickly. Then watch for errors, for the manifest via Resolve, or for
      the slice to disappear. I expect this to be bounded, so we do not
      need to worry so much about timing this wait out (which is a problem
      on very big topologies). When we see the manifest, the RedeemTicket()
      part of the CreateSliver() is done and we are into the StartSliver()
      phase.
      
      For the StartSliver() phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment
      into the failed state. I am still working on this.
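
      An illustrative sketch of that wait loop (Python, not the actual
      Perl; the resolve() helper and status fields are assumptions):

        import time

        def wait_for_manifest(aggregate, slice_urn, poll_interval=15):
            # CreateSliver() already returned; poll until the manifest
            # shows up, an error is reported, or the slice disappears.
            while True:
                status = aggregate.resolve(slice_urn)
                if status is None:
                    raise RuntimeError("slice disappeared")
                if status.get("error"):
                    raise RuntimeError(status["error"])
                if status.get("manifest"):
                    # RedeemTicket() is done; StartSliver() phase begins.
                    return status["manifest"]
                time.sleep(poll_interval)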
  2. 22 Jan, 2018 3 commits
  3. 01 Jan, 2018 1 commit
  4. 11 Dec, 2017 2 commits
    • Minor wording change. · 0e063f87
      Leigh B Stoller authored
    • Add extension limiting to manage_extensions and request extension paths. · f71d7d95
      Leigh B Stoller authored
      The limit is the number of hours since the experiment was created, so
      a limit of 10 days really just means that experiments cannot live
      past 10 days. I think this makes more sense than anything else. There
      is a flag associated with extension limiting that controls whether
      the user can even request another extension after the limit. The
      normal case is that the user cannot request any more extensions, but
      when the flag is set, the user is granted no free time and goes
      through the admin approval path.
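
      A small sketch of the check (Python for illustration; the names are
      assumptions):

        from datetime import datetime, timedelta

        def within_limit(created: datetime, requested_until: datetime,
                         limit_hours: int) -> bool:
            # The limit is hours since creation, so a 240-hour limit means
            # the experiment cannot live past 10 days, period.
            return requested_until <= created + timedelta(hours=limit_hours)

        def handle_extension(created, requested_until, limit_hours,
                             may_request_past_limit):
            if within_limit(created, requested_until, limit_hours):
                return "grant"
            # Past the limit: refuse outright, or grant no free time and
            # route the request through the admin approval path.
            return "admin-approval" if may_request_past_limit else "refuse"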
      
      Some changes to the email, so that both the user and admin email say
      how many days/hours were requested and granted.
      
      Also a UI change; explicitly tell the user when extensions are
      disabled, and also when no time is granted (so that the user is more
      clearly aware).
  5. 04 Dec, 2017 2 commits
    • Extension policy changes: · bd7d9d05
      Leigh B Stoller authored
      * New tables to store policies for users and projects/groups. At the
        moment, there is only one policy (with associated reason):
        disabled. This allows us to mark projects/groups/users with
        enable/disable flags. Note that policies are applied consecutively,
        so you can disable extensions for a project, but enable them for a
        user in that project (see the sketch after this list).
      
      * Apply the extension policies when experiments are created, and send
        mail to the audit log when policies cause extensions to be
        disabled.
      
      * New driver script (manage_extensions) to change the policy tables.
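
      A minimal sketch of the consecutive application (Python for
      illustration; the policy ordering and shape are assumptions about the
      real tables):

        def extensions_disabled(policies):
            # policies: (scope, disabled) pairs, least specific first,
            # e.g. project, then group, then user. Each policy overrides
            # the one before it.
            disabled = False
            for _scope, flag in policies:
                disabled = flag
            return disabled

        # Disabled for the project, but re-enabled for this user.
        assert extensions_disabled([("project", True),
                                    ("user", False)]) is False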
    • Changes related to extensions: · e1b6076f
      Leigh B Stoller authored
      * Change the units of extension from days to hours along the extension
        path. The user does not see this directly, but it allows us to extend
        experiments to the hour before they are needed by a different
        reservation, both on the user extend modal and the admin extend modal.
      
        On the admin extend page, the input box still defaults to days, but
        you can also use xDyH to specify days and hours, or just yH for
        just hours (see the parsing sketch after this list).
      
        But to make things easier, there is also a new "max" checkbox to
        extend an experiment out to the maximum allowed by the reservation
        system.
      
      * Changes to "lockout" (disabling extensions). Add a reason field to the
        database, clicking the lockout checkbox will prompt for an optional
        reason.
      
        The user no longer sees the extension modal when extensions are
        disabled; we show an alert instead, telling them extensions are
        disabled, and the reason.
      
        On the admin extend page there is a new checkbox to disable extensions
        when denying an extension or scheduling termination.
      
        Log extension disable/enable to the audit log.
      
      * Clear out a bunch of old extension code that is no longer used (since
        the extension code was moved from php to perl).
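
      The xDyH input mentioned above, as an illustrative parser (Python,
      not the actual Portal code):

        import re

        def parse_extension_hours(text: str) -> int:
            # A bare number still means days; "xDyH" is days and hours;
            # "yH" is just hours.
            text = text.strip().upper()
            if text.isdigit():
                return int(text) * 24
            m = re.fullmatch(r"(?:(\d+)D)?(?:(\d+)H)?", text)
            if not m or not (m.group(1) or m.group(2)):
                raise ValueError(f"bad extension: {text!r}")
            return int(m.group(1) or 0) * 24 + int(m.group(2) or 0)

        assert parse_extension_hours("3") == 72     # 3 days
        assert parse_extension_hours("2D5H") == 53  # 2 days + 5 hours
        assert parse_extension_hours("5H") == 5     # just hours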
  6. 20 Nov, 2017 1 commit
  7. 03 Nov, 2017 1 commit
    • Fixes/Changes for reservations: · 79d99fa8
      Leigh B Stoller authored
      1. Fix the user extend modal to show the proper number of days they can
         extend.
      
      2. Fix the admin extend modal warning when the extension would
         violate the max extension; it was not showing. Add new alerts for
         when we cannot get the max extension from the cluster, or no
         extension is allowed at all.
      
      3. Reduce the number of days in the box to the max allowed. Warn
         loudly if you type a different number and it is greater than the
         max extension.
      
      4. Add "force" box to override. Use with caution. Added the plumbing
         through to the back end as new force option to RenewSliver().
      
      5. Add a check in RenewSliver() to ask the reservation system if the
         extension is allowed before doing it. This was missing; it should
         solve some of the overbooking problems.
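
      A sketch of how (4) and (5) fit together (Python for illustration;
      the reservation-system call is an assumption):

        def renew_sliver(slice_obj, new_expiration, reservations,
                         force=False):
            # Ask the reservation system first, rather than overbooking.
            ok, reason = reservations.extension_allowed(slice_obj,
                                                        new_expiration)
            if not ok and not force:
                raise PermissionError(f"extension refused: {reason}")
            # Either the reservation system agreed, or an admin forced it.
            slice_obj.set_expiration(new_expiration)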
  8. 13 Oct, 2017 1 commit
    • Changes for automatic lockdown of experiments: · 8f4e3191
      Leigh B Stoller authored
      1. First off, we no longer do automatic lockdown of experiments when
         granting an extension longer than 10 days.
      
      2. Instead, we will lock down experiments on a case-by-case basis.
      
      3. Changes to the lockdown path to ask the reservation system at the
         target cluster if locking down would throw the reservation system
         into chaos. If so, return a refused error and give the admin the
         choice to override. When we do override, send email to local tbops
         informing them that the reservation system is in a chaos state.
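
      A sketch of that flow (Python for illustration; the chaos query and
      mailer helpers are hypothetical names):

        def send_mail(to: str, subject: str) -> None:
            print(f"mail to {to}: {subject}")  # stand-in for the mailer

        def lockdown(experiment, cluster, admin_override=False):
            # Ask the target cluster's reservation system before locking.
            if cluster.lockdown_causes_chaos(experiment):
                if not admin_override:
                    raise PermissionError(
                        "refused: would throw reservations into chaos")
                send_mail("tbops",
                          f"lockdown override: {cluster.name} reservation "
                          "system now in a chaos state")
            experiment.lockdown()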
  9. 04 Oct, 2017 1 commit
  10. 08 Aug, 2017 2 commits
  11. 07 Jul, 2017 1 commit
    • Deal with user privs (issue #309): · d1516912
      Leigh B Stoller authored
      * Make user privs work across remote clusters (including stitching).
        I took a severe shortcut on this; I do not expect the Cloudlab
        portal will ever talk to anything but an Emulab-based aggregate, so
        I just added the priv indicator to the user keys array we send over
        (see the sketch below). If I am ever proved wrong on this, I will
        come out of retirement and fix it (for a nominal fee of course).
      
      * Do not show the root password for the console to users with user
        privs.
      
      * Make sure users with user privs cannot start experiments.
      
      * Do show the user trust values on the user dashboard membership tab.
      
      * Update tmcd to use the new privs slot in the nonlocal_user_accounts
        table.
      
      This closes issue #309.
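
      A minimal sketch of that shortcut (Python for illustration; the
      payload fields are assumptions):

        from dataclasses import dataclass, field

        @dataclass
        class PortalUser:
            urn: str
            ssh_keys: list = field(default_factory=list)
            privs: str = "full"   # "user" for restricted privs

        def user_keys_payload(u: PortalUser) -> dict:
            # The priv indicator simply rides along with the keys array we
            # already send to the (Emulab-based) aggregate.
            return {"urn": u.urn, "keys": u.ssh_keys, "privs": u.privs}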
  12. 26 Jun, 2017 1 commit
  13. 30 May, 2017 1 commit
    • Rework how we store the sliver/slice status from the clusters: · e5d36e0d
      Leigh B Stoller authored
      In the beginning, the number and size of experiments were small, and
      so storing the entire slice/sliver status blob as json in the web
      task was fine, even though we had to lock tables to prevent races
      between the event updates and the local polling.
      
      But lately the size of those json blobs has gotten huge, and the lock
      is bogging things down, including not being able to keep up with the
      number of events coming from all the clusters; we get really far
      behind.
      
      So I have moved the status blobs out of the per-instance web task and
      into new tables, one per slice and one per node (sliver). This keeps
      the blobs very small and thus the lock time very small, so now we can
      keep up with the event stream.
      
      If we grow big enough that this becomes a problem again, we can
      switch to innodb for the per-sliver table and do row locking instead
      of table locking, but I do not think that will happen.
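
      A sketch of the new layout (illustrative SQLite via Python; the real
      table and column names are assumptions):

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.executescript("""
            CREATE TABLE slice_status  (slice_urn TEXT PRIMARY KEY,
                                        status    TEXT);
            CREATE TABLE sliver_status (slice_urn TEXT,
                                        node_id   TEXT,
                                        status    TEXT,
                                        PRIMARY KEY (slice_urn, node_id));
        """)
        # An event for one node now touches one small row, so the lock is
        # held briefly, instead of rewriting one giant per-instance blob.
        db.execute("REPLACE INTO sliver_status VALUES (?, ?, ?)",
                   ("urn:publicid:slice1", "node3", "ready"))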
  14. 04 May, 2017 1 commit
  15. 17 Apr, 2017 1 commit
  16. 03 Mar, 2017 1 commit
  17. 01 Mar, 2017 1 commit
  18. 28 Feb, 2017 1 commit
  19. 01 Feb, 2017 2 commits
    • Checkpoint the portal side of frisbee events. · 2faf5fd1
      Leigh B Stoller authored
      The igevent_daemon now also forwards frisbee events for slices to the
      Portal pubsubd over the SSL channel.
      
      The aptevent_daemon gets those and adds them to sliverstatus stored in
      the webtask for the instance.
      
      The timeout code in create_instance watches for frisbee events and
      uses them as another indicator of progress (or lack thereof). The
      hope is that we fail sooner, or avoid failing too soon (say, because
      of a giant image backed dataset).
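
      A sketch of that idea as a stall watchdog (Python for illustration;
      the event fields are assumptions):

        import time

        class ProgressWatchdog:
            def __init__(self, stall_limit_secs: float):
                self.stall_limit = stall_limit_secs
                self.last_progress = time.time()

            def on_frisbee_event(self, image: str, mb_written: int) -> None:
                # Any frisbee event is evidence the image load is alive,
                # even if it is slow (e.g. a giant image backed dataset).
                self.last_progress = time.time()

            def stalled(self) -> bool:
                return time.time() - self.last_progress > self.stall_limit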
      
      As an added bonus, the status page will display frisbee progress
      (image name and MB written) in the node status hover popover. I
      mention this because otherwise I would go to my grave without anyone
      ever noticing and giving me a pat on the back or a smiley face in
      Slack.
    • Another tweak to frisbee event code. · ff395072
      Leigh B Stoller authored
  20. 31 Jan, 2017 1 commit
  21. 25 Jan, 2017 1 commit
  22. 27 Dec, 2016 1 commit
  23. 19 Dec, 2016 2 commits
  24. 01 Dec, 2016 1 commit
  25. 29 Nov, 2016 1 commit
  26. 13 Nov, 2016 1 commit
  27. 12 Nov, 2016 2 commits
  28. 07 Nov, 2016 2 commits
    • Some work on restarting (rebooting) nodes. · 18cdfa8b
      Leigh B Stoller authored
      Presently, there is a bit of an inconsistency in SliverAction(); when
      operating on the entire slice, we do the whole thing in the
      background, returning (almost) immediately. Which makes sense; we
      expect the caller to poll for status afterwards.
      
      But when operating on a subset of slivers (nodes), we do it
      synchronously, which means the caller is left waiting until we get
      through rebooting all the nodes. As David pointed out, when rebooting
      nodes in the openstack profile, this can take a long time as the VMs are
      torn down. This leaves the user looking at a spinner modal for a long
      time, which is not a nice UI feature.
      
      So I added a local option to do slivers in the background and return
      immediately. I am doing this for restart and reload at the moment,
      since that is primarily what we use from the Portal.
      
      Note that this has to push out to all clusters.
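
      A sketch of the background option (Python for illustration; the real
      code is Perl, and the perform() helper is hypothetical):

        import os

        def perform(action, node):
            print(f"{action} {node}")   # the slow part (e.g. VM teardown)

        def sliver_action(action, nodes, background=False):
            if background and os.fork() != 0:
                # Parent returns to the caller immediately; the caller
                # polls for status, as in the whole-slice case.
                return "started"
            for node in nodes:
                perform(action, node)
            if background:
                os._exit(0)             # background child exits when done
            return "done"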
  29. 06 Nov, 2016 1 commit
  30. 20 Sep, 2016 1 commit
  31. 11 Jul, 2016 1 commit