1. 09 Jul, 2018 1 commit
      Various bits of support for issue #408: · b7fb16a8
      Leigh Stoller authored
      * Add the portal URL to the existing emulab extension that tells the CM
        that the CreateSliver() is coming from the Portal. Always send this
        info, not just for the Emulab Portal (a rough sketch of the idea
        follows this list).
      
      * Stash that info in the geni slice data structure so we can add links
        back to the portal status page for current slices.
      
      * Add routines to generate a portal URL for the history entries, since
        we will not have those links for historical slices. Add links back to
        the portal on the showslice and slice history pages.
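
      For illustration only, here is a minimal sketch of the flow in Python,
      with made-up names and URL formats (the real code is Perl, and the
      actual extension and portal page names may differ):

        PORTAL_BASE = "https://portal.example.net"   # hypothetical portal base

        def create_sliver_options(slice_uuid):
            # Always ship the portal status URL in the (existing) emulab
            # extension so the CM knows this CreateSliver() came from a portal.
            return {"emulab_extension": {
                "portal_url": "%s/status.php?uuid=%s" % (PORTAL_BASE, slice_uuid)}}

        def history_portal_url(slice_uuid):
            # Historical slices never had the link stashed, so regenerate it.
            return "%s/history.php?uuid=%s" % (PORTAL_BASE, slice_uuid)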
  2. 12 Jun, 2018 1 commit
  3. 14 May, 2018 1 commit
  4. 12 May, 2018 1 commit
  5. 30 Apr, 2018 1 commit
  6. 26 Apr, 2018 1 commit
  7. 14 Mar, 2018 1 commit
  8. 06 Mar, 2018 1 commit
  9. 20 Feb, 2018 1 commit
  10. 16 Feb, 2018 2 commits
      A lot of work on the RPC code, among other things. · 56f6d601
      Leigh Stoller authored
      I spent a fair amount of time improving error handling along the RPC
      path, as well as making the code more consistent across the various
      files. Also be more consistent in how the web interface invokes the
      backend and gets errors back, specifically for errors that are generated
      when talking to a remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled in
      the database. Also check that we can actually reach the cluster, and
      that the cluster is not offline (NoLogins()) before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most RPCs
      take. Return precise errors to the web interface for clusters that are
      not available, and show them to the user.
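
      A minimal sketch of that pre-RPC check, in Python with assumed field
      names (the real checks run in the Perl backend against the database):

        import socket

        def reachable(host, port=443, timeout=3):
            # Quick TCP connect to see if the cluster endpoint answers at all.
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        def check_cluster_before_rpc(cluster):
            # Return an error string to show the user, or None if the RPC can
            # proceed. "disabled" and "nologins" are assumed database flags.
            if cluster.get("disabled"):
                return "cluster %s is disabled" % cluster["name"]
            if not reachable(cluster["host"]):
                return "cannot reach cluster %s" % cluster["name"]
            if cluster.get("nologins"):
                return "cluster %s is offline (NoLogins)" % cluster["name"]
            return None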
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
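
      A sketch of that distinction, in Python with assumed webtask fields
      (the real webtask objects live in the database and are handled from
      Perl and PHP):

        def classify_script_result(webtask, process_exited):
            # The script died before recording an exitcode in the webtask:
            # that is always an internal failure, so hide it from the user.
            if process_exited and "exitcode" not in webtask:
                return ("internal-error", None)
            code = webtask.get("exitcode", 0)
            return ("ok", 0) if code == 0 else ("user-visible-error", code)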
      
      Show just those RPC errors that would make sense to users; stop spewing
      script output to the user, and send it just to tbops via the email that
      is already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of
      experiments.
      
      For create_instance, a new "async" mode that tells CreateSliver() to
      return before the first mapper run, which typically happens very
      quickly. Then watch for errors, or for the manifest with Resolve, or
      for the slice to disappear (sketched below). I expect this to be
      bounded, so we do not need to worry so much about timing out this wait
      (which is a problem on very big topologies). When we see the manifest,
      the RedeemTicket() part of the CreateSliver is done and we are now into
      the StartSliver() phase.
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment into
      the failed state. I am still working on this.
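
      A minimal sketch of the async polling described above, in Python with
      stand-in names (the real Portal code is Perl and talks to the geni AM
      API):

        import time

        def wait_for_manifest(resolve, timeout=600, interval=5):
            # `resolve` is assumed to return a dict such as
            # {"error": ..., "manifest": ...}, or None once the slice is gone.
            deadline = time.time() + timeout
            while time.time() < deadline:
                info = resolve()
                if info is None:
                    return ("gone", None)             # slice disappeared
                if info.get("error"):
                    return ("failed", info["error"])  # show this to the user
                if info.get("manifest"):
                    # RedeemTicket() is done; now in the StartSliver() phase.
                    return ("manifest", info["manifest"])
                time.sleep(interval)
            return ("timeout", None)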
  11. 22 Jan, 2018 2 commits
  12. 04 Dec, 2017 1 commit
      Extension policy changes: · bd7d9d05
      Leigh Stoller authored
      * New tables to store policies for users and projects/groups. At the
        moment, there is only one policy (with an associated reason):
        disabled. This allows us to mark projects/groups/users with
        enable/disable flags. Note that policies are applied consecutively,
        so you can disable extensions for a project, but enable them for a
        user in that project (see the sketch after this list).
      
      * Apply the policies when experiments are created; send mail to the
        audit log when policies cause extensions to be disabled.
      
      * New driver script (manage_extensions) to change the policy tables.
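
      A sketch of the consecutive application of policies, in Python with
      assumed table contents (the real policy tables are in the Emulab
      database and are applied in Perl):

        def extensions_disabled(policies, project, group, user):
            # Apply in order: project, then group, then user; later entries
            # override earlier ones, so a user can be re-enabled inside a
            # project whose extensions are disabled.
            disabled = False
            for scope in (("project", project), ("group", group), ("user", user)):
                policy = policies.get(scope)
                if policy is not None:
                    disabled = policy["disabled"]
            return disabled

        policies = {
            ("project", "testbed"): {"disabled": True,  "reason": "overuse"},
            ("user",    "alice"):   {"disabled": False, "reason": "exception"},
        }
        print(extensions_disabled(policies, "testbed", "testbed", "alice"))  # False
        print(extensions_disabled(policies, "testbed", "testbed", "bob"))    # True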
  13. 27 Nov, 2017 1 commit
  14. 23 Oct, 2017 1 commit
  15. 14 Aug, 2017 1 commit
  16. 10 Jul, 2017 1 commit
  17. 30 May, 2017 1 commit
      Rework how we store the sliver/slice status from the clusters: · e5d36e0d
      Leigh Stoller authored
      In the beginning, the number and size of experiments were small, and so
      storing the entire slice/sliver status blob as json in the web task was
      fine, even though we had to lock tables to prevent races between the
      event updates and the local polling.
      
      But lately the size of those json blobs has gotten huge and the lock is
      bogging things down; we cannot keep up with the number of events coming
      from all the clusters and get really far behind.
      
      So I have moved the status blobs out of the per-instance web task and
      into new tables, one per slice and one per node (sliver). This keeps
      the blobs very small and thus the lock time very small, so now we can
      keep up with the event stream.
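
      A tiny sketch of the per-sliver idea, using Python/sqlite with assumed
      table and column names (the real tables are MySQL, updated from Perl):

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("""CREATE TABLE sliver_status (
                          slice_uuid TEXT,
                          node_id    TEXT,
                          status     TEXT,
                          PRIMARY KEY (slice_uuid, node_id))""")

        def update_sliver(slice_uuid, node_id, status):
            # Each event rewrites one small row instead of one huge JSON blob,
            # so the critical section (and hence the lock time) stays tiny.
            db.execute("INSERT OR REPLACE INTO sliver_status VALUES (?, ?, ?)",
                       (slice_uuid, node_id, status))
            db.commit()

        update_sliver("slice-1234", "node1", "ready")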
      
      If we grow big enough that this problem comes back, we can switch to
      innodb for the per-sliver table and do row locking instead of table
      locking, but I do not think that will happen.
  18. 10 May, 2017 1 commit
  19. 02 May, 2017 1 commit
      Speed up the instantiate page response time; it was taking forever! · af8cc34f
      Leigh Stoller authored
      1. Okay, 10-15 seconds for me, which is the same as forever.
      
      2. Do not sort in PHP; sort in javascript and let the client burn those
         cycles instead of poor overworked boss.
      
      3. Store global lastused/usecount in the apt_profiles table so that we
         do not have to compute it every time for each profile.
      
      4. Compute the user's lastused/usecount for each profile in a single
         query and create a local array. Cuts out 100s of queries.
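
      A small sketch of the single-query idea, using Python/sqlite with
      assumed table and column names (the real query runs against MySQL
      from PHP):

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE apt_instance_history "
                   "(profile_id INT, uid TEXT, created TEXT)")
        db.executemany("INSERT INTO apt_instance_history VALUES (?, ?, ?)",
                       [(1, "alice", "2017-04-01"), (1, "alice", "2017-04-20"),
                        (2, "alice", "2017-03-15")])

        # One grouped query for all of the user's profiles, instead of one
        # query per profile.
        usage = {pid: (count, last) for pid, count, last in db.execute(
            """SELECT profile_id, COUNT(*), MAX(created)
                 FROM apt_instance_history
                WHERE uid = ?
                GROUP BY profile_id""", ("alice",))}
        print(usage)  # {1: (2, '2017-04-20'), 2: (1, '2017-03-15')}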
  20. 24 Mar, 2017 1 commit
  21. 10 Mar, 2017 1 commit
  22. 06 Mar, 2017 1 commit
  23. 15 Feb, 2017 1 commit
  24. 09 Feb, 2017 2 commits
  25. 25 Jan, 2017 2 commits
  26. 20 Jan, 2017 2 commits
  27. 29 Dec, 2016 1 commit
  28. 28 Dec, 2016 1 commit
  29. 29 Nov, 2016 1 commit
      A couple of changes to the apt daemon that I did a while back: · 6db95f98
      Leigh Stoller authored
      1. Kill canceled instances; we allow users to "terminate" an instance
         while it is booting up, but we have to pend that till the lock is
         released. We do this with a canceled flag, similar to the Classic
         interface. But I never committed the apt_daemon changes that look for
         canceled instances and kill them! (A sketch of the daemon pass
         follows this list.)
      
      2. Look for stale st/lt datasets and delete them. A stale dataset is one
         that no longer exists at the remote cluster (because its expiration
         was reached and it was reaped). We do not get a notification at the
         Portal, and so those dangling dataset descriptors sit around
         confusing people (okay, confusing me and others of a similar
         vintage).
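
      A minimal sketch of one daemon pass, in Python with assumed helper
      names (the real apt_daemon is Perl; daemon_pass, terminate,
      dataset_exists, and delete are stand-ins):

        def daemon_pass(instances, datasets, dataset_exists, terminate, delete):
            # 1. Kill instances canceled while they were still setting up,
            #    now that the lock has been released.
            for inst in instances:
                if inst.get("canceled") and not inst.get("locked"):
                    terminate(inst)
            # 2. Delete dataset descriptors whose remote dataset was reaped.
            for ds in datasets:
                if not dataset_exists(ds):
                    delete(ds)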
  30. 06 Oct, 2016 1 commit
  31. 28 Sep, 2016 1 commit
  32. 31 Aug, 2016 1 commit
  33. 29 Aug, 2016 3 commits