1. 16 Feb, 2018 2 commits
    • A lot of work on the RPC code, among other things. · 56f6d601
      Leigh Stoller authored
      I spent a fair amount of time improving error handling along the RPC path,
      as well as making the code more consistent across the various files. Also
      be more consistent in how the web interface invokes the backend and gets
      errors back, specifically for errors that are generated when talking to a
      remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled in
      the database. Also check that we can actually reach the cluster, and
      that the cluster is not offline (NoLogins()) before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most RPCs
      take. Return precise errors for clusters that are not available to the
      web interface, and show them to the user.
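
A minimal sketch of the pre-RPC checks described above, in Python for illustration only. The `Cluster` record, the `is_reachable` and `no_logins` callbacks, and the error strings are all assumptions made for this example; they are not the actual backend code or schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cluster:
    # Hypothetical record; the field names are illustrative only.
    name: str
    disabled: bool  # the "disabled" flag stored in the database


def check_cluster(cluster: Cluster,
                  is_reachable: Callable[[Cluster], bool],
                  no_logins: Callable[[Cluster], bool]) -> Optional[str]:
    """Return None if the RPC may proceed, else a precise, user-presentable error."""
    # Cheapest check first: the database flag needs no network traffic.
    if cluster.disabled:
        return f"{cluster.name} is disabled"
    # Then make sure we can actually reach the cluster.
    if not is_reachable(cluster):
        return f"{cluster.name} is not reachable"
    # Finally, ask whether the cluster is offline (the NoLogins() case).
    if no_logins(cluster):
        return f"{cluster.name} is offline"
    return None
```

A caller would run `check_cluster()` before each RPC and hand any returned string to the web interface instead of attempting the call.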
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
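
The webtask rule above can be sketched as follows, in Python for illustration. The function name, the generic message, and the idea of passing the exitcode as an optional integer are assumptions for this example, not the real webtask interface.

```python
from typing import Optional

# Hypothetical generic message; the real wording is not given in the commit.
GENERIC_ERROR = "Internal error"


def user_visible_error(exitcode: Optional[int], message: str) -> Optional[str]:
    """Decide what, if anything, the user should see for a finished script.

    exitcode is None when the script exited before setting the exitcode in
    the webtask, which is treated as an internal failure.
    """
    if exitcode is None:
        # Abnormal exit: never show the script's own output to the user.
        return GENERIC_ERROR
    if exitcode == 0:
        return None  # success, nothing to report
    # The script set an exitcode itself, so its message is meant for the user.
    return message or GENERIC_ERROR
```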
      
      Show just those RPC errors that would make sense to users; stop spewing
      script output to the user and send it just to tbops via the email that is
      already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of experiments.
      
      For create_instance, add a new "async" mode that tells CreateSliver() to
      return before the first mapper run, so it typically returns very quickly.
      Then watch for errors or for the manifest with Resolve or for the slice
      to disappear. I expect this to be bounded and so we do not need to worry
      so much about timing this wait out (which is a problem on very big
      topologies). When we see the manifest, the RedeemTicket() part of the
      CreateSliver is done and now we are into the StartSliver() phase.
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment into
      the failed state. I am still working on this.
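
A rough sketch of the "async" wait described above, in Python for illustration. The `resolve` callback and its `(state, payload)` return shape are assumptions invented for this example; the actual Resolve call and its results are not shown in the commit.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical states a Resolve-style poll might report:
#   "pending"  - no manifest yet, keep waiting
#   "error"    - the cluster reported an error (payload holds the message)
#   "manifest" - the manifest is available (payload holds it)
#   "gone"     - the slice has disappeared
Result = Tuple[str, Optional[str]]


def wait_for_manifest(resolve: Callable[[], Result],
                      poll_interval: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> Result:
    """Poll until we see an error, the manifest, or the slice disappearing."""
    while True:
        state, payload = resolve()
        if state != "pending":
            return state, payload
        sleep(poll_interval)
```

Once `"manifest"` comes back, the RedeemTicket() part is done and the caller can move on to watching the StartSliver() phase. The loop has no timeout because the commit expects this wait to be bounded; a real caller might still want one.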
  2. 04 Dec, 2017 1 commit
    • Changes related to extensions: · e1b6076f
      Leigh Stoller authored
      * Change the units of extension from days to hours along the extension
        path. The user does not see this directly, but it allows us to extend
        experiments to the hour before they are needed by a different
        reservation, both on the user extend modal and the admin extend modal.
      
        On the admin extend page, the input box still defaults to days, but
        you can also use xDyH to specify days and hours. Or just yH for just
        hours.
      
        But to make things easier, there is also a new "max" checkbox to
        extend an experiment out to the maximum allowed by the reservation
        system.
      
      * Changes to "lockout" (disabling extensions). Add a reason field to the
        database; clicking the lockout checkbox will prompt for an optional
        reason.
      
        The user no longer sees the extension modal when extensions are
        disabled; instead we show an alert telling them that extensions are
        disabled, along with the reason.
      
        On the admin extend page there is a new checkbox to disable extensions
        when denying an extension or scheduling termination.
      
        Log extension disable/enable to the audit log.
      
      * Clear out a bunch of old extension code that is no longer used (since
        the extension code was moved from php to perl).
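
A small sketch of how the xDyH input on the admin extend page could be turned into hours, in Python for illustration. The function name and the exact grammar (case-insensitive, a bare number meaning days, "xD" alone accepted) are assumptions; the commit only states that the box defaults to days and accepts xDyH or yH.

```python
import re

# Optional "<days>D" followed by optional "<hours>H".
_EXTENSION = re.compile(r"(?:(\d+)D)?(?:(\d+)H)?", re.IGNORECASE)


def extension_hours(text: str) -> int:
    """Convert an extension string such as '3', '2D6H' or '12H' to hours."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        # A bare number keeps the old meaning: days.
        return int(text) * 24
    match = _EXTENSION.fullmatch(text)
    if not match or not (match.group(1) or match.group(2)):
        raise ValueError(f"bad extension: {text!r}")
    days = int(match.group(1) or 0)
    hours = int(match.group(2) or 0)
    return days * 24 + hours
```

For example, `extension_hours("2D6H")` gives 54 and `extension_hours("12H")` gives 12.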
  3. 26 Oct, 2017 1 commit
  4. 24 Oct, 2017 1 commit
  5. 08 Aug, 2017 1 commit
  6. 07 Jul, 2017 1 commit
    • Deal with user privs (issue #309): · d1516912
      Leigh Stoller authored
      * Make user privs work across remote clusters (including stitching). I
        took a severe shortcut on this; I do not expect the Cloudlab portal
        will ever talk to anything but an Emulab based aggregate, so I just
        added the priv indicator to the user keys array we send over. If I am
        ever proved wrong on this, I will come out of retirement and fix
        it (for a nominal fee of course).
      
      * Do not show the root password for the console to users with user
        privs.
      
      * Make sure users with user privs cannot start experiments.
      
      * Do show the user trust values on the user dashboard membership tab.
      
      * Update tmcd to use the new privs slot in the nonlocal_user_accounts
        table.
      
      This closes issue #309.
  7. 26 Jun, 2017 1 commit
  8. 07 Jun, 2017 2 commits
  9. 30 May, 2017 1 commit
    • Rework how we store the sliver/slice status from the clusters: · e5d36e0d
      Leigh Stoller authored
      In the beginning, the number and size of experiments was small, and so
      storing the entire slice/sliver status blob as json in the web task was
      fine, even though we had to lock tables to prevent races between the
      event updates and the local polling.
      
      But lately the size of those json blobs is getting huge and the lock is
      bogging things down, to the point that we cannot keep up with the number
      of events coming from all the clusters and get really far behind.
      
      So I have moved the status blobs out of the per-instance web task and
      into new tables, one per slice and one per node (sliver). This keeps
      the blobs very small and thus the lock time very small. So now we can
      keep up with the event stream.
      
      If we grow big enough that this problem comes back, we can switch
      to innodb for the per-sliver table and do row locking instead of table
      locking, but I do not think that will happen.
  10. 10 May, 2017 1 commit
  11. 02 May, 2017 1 commit
    • Speed up the instantiate page response time, it was taking forever! · af8cc34f
      Leigh Stoller authored
      1. Okay, 10-15 seconds for me, which is the same as forever.
      
      2. Do not sort in PHP, sort in javascript, let the client burn those
         cycles instead of poor overworked boss.
      
      3. Store global lastused/usecount in the apt_profiles table so that we
         do not have to compute it every time for each profile.
      
      4. Compute the user's lastused/usecount for each profile in a single
         query and create a local array. Cuts out 100s of queries.
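
Item 4 can be illustrated with a single grouped query, sketched here in Python with sqlite3. The `instance_history` table and its `uid`, `profile_id` and `started` columns are hypothetical stand-ins; the real schema and the PHP code are not shown in the commit.

```python
import sqlite3
from typing import Dict


def user_profile_usage(db: sqlite3.Connection, uid: str) -> Dict[int, dict]:
    """Fetch one user's usecount/lastused for every profile in one query."""
    rows = db.execute(
        "SELECT profile_id, COUNT(*), MAX(started) "
        "FROM instance_history "  # hypothetical table name
        "WHERE uid = ? "
        "GROUP BY profile_id",
        (uid,),
    )
    # Build the local lookup table once, instead of one query per profile.
    return {pid: {"usecount": count, "lastused": last}
            for pid, count, last in rows}
```

The page code can then look up each profile in the returned dictionary while rendering, rather than going back to the database for every profile.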
  12. 25 Jan, 2017 1 commit
  13. 05 Jan, 2017 1 commit
  14. 28 Dec, 2016 1 commit
  15. 15 Dec, 2016 1 commit
  16. 29 Nov, 2016 1 commit
  17. 20 Jul, 2016 1 commit
  18. 01 Jun, 2016 1 commit
    • Several sets of changes scattered across all these files. · 0f4a4dfb
      Leigh Stoller authored
      * More on issue #54; watch for openstack experiments and try to download
        the new openstack stats file via the fast XMLRPC path. Show this as a
        text blob in a new tab on the status page; still need to graph the data.
        The apt_daemon handles the periodic request for the data (every 10
        minutes), which we store in the apt_instances table.
      
      * Addition for Rob on the admin extend page; Add a "more info" button that
        sends the contents of the text box as an email message requesting more
        info and stores that in the ongoing interaction log. Responses from the
        user are not stored though; might look at that someday.
      
      * Another addition for Rob; on the extensions list page, also show expired,
        locked down experiments. Note the sorting; at the top of the list are
        actual extension requests (status='ready') while at the bottom of the
        list are status='expired'.
      
      * Add a "graphs" tab to the status page, which shows the same idle stats
        graphs that were added to the admin extend page. Most of this change is
        refactoring the code and sharing it between the two pages.
  19. 28 Apr, 2016 1 commit
  20. 25 Apr, 2016 1 commit
  21. 12 Apr, 2016 1 commit
  22. 30 Mar, 2016 1 commit
  23. 26 Mar, 2016 1 commit
  24. 25 Feb, 2016 1 commit
  25. 03 Feb, 2016 1 commit
    • Add support for multiple pre-reservations per project: · 103e0385
      Leigh Stoller authored
      When creating a pre-reserve, there is a new -n option to specify a name
      for the reservation, which defaults to "default". All other operations
      require an -n option to avoid messing with the wrong reservation. You are
      not allowed to reuse a reservation name in a project, of course.
      Priorities are probably more important now; we might want to change the
      default from 0 to something higher, and change all the current priorities.
      
      For bookkeeping, the nodes table now has a reservation_name slot that is
      set with the reserved_pid. This allows us to revoke the nodes associated
      with a specific reservation. Bonus feature is that when setting the
      reserved_pid via the web interface, we leave the reservation_name null, so
      those won't ever be revoked by the prereserve command line tool.
      
      New feature; when revoking a pre-reserve, we now look to see if nodes being
      revoked are free and can be assigned to other pre-reserves. We used to not
      do anything, and so had to wait until that node was allocated and released
      later, to see if it could move into a pre-reserve.
      
      Also a change required by node specific reservations; when we free a node,
      we need to make sure we actually use that node, so we have to cycle
      through all reservations in priority order until it can be used. We did
      not need to do this before.
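
The priority-order walk described above might look like the following Python sketch. The `Reservation` fields, the function name, and the assumption that a larger priority number wins are all illustrative; the actual prereserve tool is not shown in the commit.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Reservation:
    # Hypothetical fields, chosen for this example only.
    pid: str
    name: str
    priority: int  # assumption: larger value means higher priority
    needed: int    # nodes this reservation still wants
    node_ids: Optional[FrozenSet[str]] = None  # set for node-specific reservations


def place_freed_node(node_id: str,
                     reservations: List[Reservation]) -> Optional[Reservation]:
    """Give a freed node to the first reservation, by priority, that can use it."""
    for res in sorted(reservations, key=lambda r: r.priority, reverse=True):
        if res.needed <= 0:
            continue  # already satisfied
        if res.node_ids is not None and node_id not in res.node_ids:
            continue  # node-specific reservation that does not name this node
        res.needed -= 1
        return res
    return None  # no reservation can use the node; it stays free
```

The point of the loop is that the highest-priority reservation may be node-specific and unable to take this particular node, so the search has to continue down the list rather than stop at the first entry.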
  26. 28 Jan, 2016 1 commit
  27. 06 Jan, 2016 1 commit
  28. 04 Jan, 2016 2 commits
  29. 21 Dec, 2015 1 commit
  30. 08 Dec, 2015 2 commits
    • Export geni API for our panic mode (level 1, since not all clusters can do · 97347528
      Leigh Stoller authored
      control network port modification), and add front end support to the Portal
      status page (admin mode only of course)
    • Batch of changes that creates a PhantomNet portal branding. · ba49a457
      Kirk Webb authored
      Also includes some PhantomNet-specific restrictions (e.g. only
      allows use of the main Utah Emulab testbed aggregate).
      
      This exercise stretched the limits of what we can reasonably do
      before introducing real per-testbed branding/policy mechanisms to
      the php/web front-end.  My changes ain't exactly pretty...
      
      Please take care when adding any testbed-specific changes to the
      code.  There are three flavors now to consider in the logic.
  31. 01 Dec, 2015 1 commit
    • Add support for cancelation; stopping an experiment setup early, instead of · 32c3d934
      Leigh Stoller authored
      waiting till it finishes setting up (or fails). This is really nice when a
      1000 node experiment has gone awry and it is pointless to wait for it to
      finish. When we do this, we mark the instance as canceled in the DB, and
      then wait for create_instance() to notice it. When it does, it stops
      waiting and invokes terminate with a new cancel option at the backend.
  32. 30 Nov, 2015 1 commit
  33. 02 Nov, 2015 1 commit
    • Add password block decryption and expansion in the instructions panel. · d7d3800a
      Leigh Stoller authored
      Given a password element in the rspec:
      
      	<emulab:password name='foo'></emulab:password>
      
      which the portal has converted to an encrypted secret, when that experiment
      is later shown (the status page), ask the server to decrypt the block, and
      then replace the string "{password-foo}" in the instructions with the
      actual password.
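
The substitution step can be sketched in Python as below. The function name and the dictionary of already-decrypted passwords are assumptions for this example; decryption itself happens on the server and is not modeled here.

```python
import re
from typing import Dict

# Matches placeholders of the form {password-NAME}.
_PLACEHOLDER = re.compile(r"\{password-([^}]+)\}")


def expand_passwords(instructions: str, passwords: Dict[str, str]) -> str:
    """Replace each {password-NAME} with its decrypted value, if known."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched rather than guessing.
        return passwords.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, instructions)
```

For instance, `expand_passwords("Log in with {password-foo}", {"foo": "s3cret"})` yields `"Log in with s3cret"`.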
      
      Need to generalize this a bit more, for arbitrary encryption blocks, when
      we have those.
  34. 28 Oct, 2015 1 commit
  35. 22 Oct, 2015 1 commit
  36. 21 Oct, 2015 1 commit