  1. 27 Feb, 2017 1 commit
  2. 22 Feb, 2017 1 commit
  3. 20 Feb, 2017 1 commit
  4. 17 Feb, 2017 3 commits
  5. 16 Feb, 2017 2 commits
  6. 15 Feb, 2017 1 commit
  7. 01 Feb, 2017 4 commits
    • Remove debugging code. · 4fb328bd
      Leigh B Stoller authored
    • Checkpoint the portal side of frisbee events. · 2faf5fd1
      Leigh B Stoller authored
      The igevent_daemon now also forwards frisbee events for slices to the
      Portal pubsubd over the SSL channel.
      The aptevent_daemon gets those and adds them to sliverstatus stored in
      the webtask for the instance.
      The timeout code in create_instance watches for frisbee events and uses
      them as another indicator of progress (or the lack of it). The hope is
      that we fail sooner, or avoid failing too soon (say, because of a giant
      image backed …).
      As an added bonus, the status page will display frisbee progress (image
      name and MB written) in the node status hover popover. I mention this
      because otherwise I would go to my grave without anyone ever noticing and
      giving me a pat on the back or a smiley face in Slack.
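The timeout scheme above can be sketched as a timer that is reset by each frisbee progress event, so an instance only fails when reports stop arriving. This is a minimal illustration, not the actual create_instance code; the class and parameter names are assumptions.

```python
import time

class ProgressTimeout:
    """Fail only when no frisbee progress events arrive within `window`
    seconds. Hypothetical sketch of the progress-watching idea above."""

    def __init__(self, window):
        self.window = window
        self.last_progress = time.time()

    def on_frisbee_event(self, image, mb_written):
        # Any progress report (image name, MB written) resets the clock.
        self.last_progress = time.time()

    def expired(self, now=None):
        now = time.time() if now is None else now
        return (now - self.last_progress) > self.window
```

A fresh event always buys the load another full window, which is how failing "too soon" on a large image is avoided.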
    • Checkpoint two changes: · bd9613cc
      Leigh B Stoller authored
      1. Use frisbee events in libosload_new as a replacement for
         statically (hand-waved) maxwait times for image loading. When frisbee
         is generating events, we use them to determine whether progress is
         being made.
      2. Convert the CM to using the libosload_new library directly (as
         os_setup does). This is conditional on the NewOsload feature being
         attached to the geniuser; otherwise, we go through the old path.
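The feature gate in point 2 amounts to a simple dispatch: pick the new library only when the NewOsload feature is attached to the user, else keep the old path. A minimal sketch, with all names assumed rather than taken from the real CM code:

```python
def osload_path(user_features, new_loader, old_loader):
    """Return the loader to use for this user.

    Sketch of the gating described above: the NewOsload feature flag
    selects libosload_new; anyone else goes through the old path.
    """
    if "NewOsload" in user_features:
        return new_loader
    return old_loader
```

Gating on a per-user feature like this lets the new path be tried on a few accounts before it becomes the default.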
  8. 25 Jan, 2017 1 commit
  9. 23 Jan, 2017 1 commit
  10. 20 Jan, 2017 5 commits
  11. 18 Jan, 2017 1 commit
  12. 17 Jan, 2017 1 commit
    • Various tweaks to reservation UI: · 29258b2c
      Leigh B Stoller authored
      * Allow start to be optional; it means "now".
      * When selecting the current day, disable hours in the past.
      * Catch a few more form errors.
      * When editing, the start time might be in the past. Do not consider
        that an error; just pass it through, since the backend is okay with it.
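The start-time rules above can be condensed into one validator: a missing start means "now", a past start is an error on a new reservation, but an existing reservation being edited passes through unchanged. This is a hedged sketch; the function and flag names are assumptions, not the portal's actual form code.

```python
from datetime import datetime

def validate_start(start, now, editing=False):
    """Sketch of the reservation start rules described above."""
    if start is None:
        # Optional start means "start now".
        return now
    if start < now and not editing:
        # Only a *new* reservation rejects a start in the past;
        # when editing, the backend is okay with it.
        raise ValueError("start time is in the past")
    return start
```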
  13. 09 Jan, 2017 1 commit
  14. 04 Jan, 2017 1 commit
  15. 03 Jan, 2017 2 commits
  16. 02 Jan, 2017 1 commit
  17. 28 Dec, 2016 1 commit
  18. 08 Dec, 2016 1 commit
  19. 07 Dec, 2016 1 commit
  20. 01 Dec, 2016 1 commit
  21. 29 Nov, 2016 1 commit
    • Fix two small problems with AddNode/DeleteNode. · fd9bd976
      Leigh B Stoller authored
      1. Do not start a second copy of the event scheduler. This is the cause
         of all the slurm error messages on the APT cluster. Clearly this was
         wrong for DeleteNode(); AddNode() is still open for debate, but at
         least now the error mail will stop.
      2. Do not reset the startstatus either; this was causing the web
         interface to think startup services were running, when in fact they
         are not, since the other nodes are not rebooted. In the classic
         interface, node reboot does not change the startstatus either, so
         let's mirror that in the Geni interface.
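The first fix boils down to making scheduler startup idempotent: check for a running instance before spawning one. A toy sketch of that guard, with both callables standing in for the real checks (assumed names, not the actual AddNode/DeleteNode code):

```python
def start_event_scheduler(is_running, spawn):
    """Start the event scheduler only if one is not already running,
    so AddNode/DeleteNode cannot launch a second copy.

    `is_running` and `spawn` are hypothetical stand-ins for the real
    process check and process launch.
    """
    if is_running():
        return False          # leave the existing scheduler alone
    spawn()
    return True
```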
  22. 28 Nov, 2016 1 commit
  23. 12 Nov, 2016 2 commits
    • Bring the cluster monitor "in house", rather than depending on the jfed monitoring system. · d7c4230e
      Leigh B Stoller authored
      The new portal_monitor daemon does a GetVersion/ListResources call at
      each of the clusters every five minutes, and updates the new DB table
      called apt_aggregate_status. We calculate free/in-use counts for
      physical nodes and a free count for VMs. Failure to contact the
      aggregate for more than 10 minutes marks the aggregate as down, since
      from our perspective, if we cannot get to it, the cluster is down.
      Unlike the jfed monitoring system, we are not going to try to
      instantiate a new experiment or ssh into it. We will wait and see if
      that is necessary in our context.
      On the instantiate page, we generate a json structure for each cluster,
      similar to the one described in issue #172 by Keith. This way we can
      easily switch the existing code over to this new system, but fall back
      to the old mechanism if this turns out to be a bust.
      Some other related changes to how we hand clusters into the several web
      pages.
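The bookkeeping the daemon does per poll can be sketched as: on a successful contact, record the timestamp and counts; on failure, flip the aggregate to down only once it has been unreachable longer than the threshold. The dict standing in for the apt_aggregate_status table and all names here are assumptions, not the real schema.

```python
DOWN_AFTER = 10 * 60      # seconds unreachable before we call it down
POLL_INTERVAL = 5 * 60    # GetVersion/ListResources every five minutes

def update_status(agg, reachable, counts, now, status):
    """Sketch of the per-aggregate bookkeeping described above.

    `status` is a dict standing in for the apt_aggregate_status table;
    `counts` holds free/in-use numbers from ListResources.
    """
    row = status.setdefault(agg, {"last_success": now, "down": False})
    if reachable:
        row.update(last_success=now, down=False, **counts)
    elif now - row["last_success"] > DOWN_AFTER:
        # If we cannot get to it, from our perspective it is down.
        row["down"] = True
    return row
```

Note the asymmetry: one failed poll does not mark a cluster down, but one successful poll immediately brings it back up.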
    • Fix a couple of memory leaks. · 1fd592b5
      Leigh B Stoller authored
  24. 11 Nov, 2016 1 commit
  25. 09 Nov, 2016 1 commit
  26. 08 Nov, 2016 1 commit
  27. 07 Nov, 2016 2 commits
    • Minor fix to previous revision. · b0bb1017
      Leigh B Stoller authored
    • Some work on restarting (rebooting) nodes. · 18cdfa8b
      Leigh B Stoller authored
      Presently, there is a bit of an inconsistency in SliverAction(); when
      operating on the entire slice,
      we do the whole thing in the background, returning (almost) immediately.
      Which makes sense: we expect the caller to poll for status afterward.
      But when operating on a subset of slivers (nodes), we do it
      synchronously, which means the caller is left waiting until we get
      through rebooting all the nodes. As David pointed out, when rebooting
      nodes in the openstack profile, this can take a long time as the VMs are
      torn down. This leaves the user looking at a spinner modal for a long
      time, which is not a nice UI feature.
      So I added a local option to do slivers in the background and return
      immediately. I am doing this for restart and reload at the moment, since
      that is primarily what we use from the Portal.
      Note that this has to push out to all clusters.
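The background option described above amounts to running the per-sliver work in a separate thread and returning at once, so the caller polls for status instead of watching a spinner. A minimal sketch with assumed names; the real code lives in SliverAction(), not here:

```python
import threading

def sliver_action(action, slivers, background=False):
    """Apply `action` to each sliver, optionally in the background.

    Sketch of the local background option above: when `background` is
    set, the work runs in a daemon thread and we return immediately.
    """
    def work():
        for sliver in slivers:
            action(sliver)

    if background:
        t = threading.Thread(target=work, daemon=True)
        t.start()
        return t              # caller polls for status later
    work()                    # old behavior: block until done
    return None
```

This mirrors the whole-slice case, where the operation already ran in the background and the caller was expected to poll.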