1. 17 Jan, 2018 1 commit
  2. 16 Jan, 2018 3 commits
  3. 09 Jan, 2018 1 commit
  4. 30 May, 2017 1 commit
    • Rework how we store the sliver/slice status from the clusters: · e5d36e0d
      Leigh Stoller authored
      In the beginning, the number and size of experiments were small, and so
      storing the entire slice/sliver status blob as JSON in the web task was
      fine, even though we had to lock tables to prevent races between the
      event updates and the local polling.
      
      But lately the size of those JSON blobs has gotten huge, and the lock is
      bogging things down; we cannot keep up with the number of events coming
      from all the clusters, and we get really far behind.
      
      So I have moved the status blobs out of the per-instance web task and
      into new tables, one per slice and one per node (sliver). This keeps
      the blobs very small and thus the lock time very short, so now we can
      keep up with the event stream.
      
      If we grow to the point where this problem comes back, we can switch
      to InnoDB for the per-sliver table and do row locking instead of table
      locking, but I do not think that will happen.
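The split described in this commit can be sketched as follows — a hypothetical in-memory stand-in (the `SliverStatusStore` name and its methods are illustrative, not the actual Emulab code or schema), where each slice and sliver gets its own small record and the lock is held only briefly per update, rather than around one giant JSON document:

```python
# Hypothetical in-memory stand-in for the new tables: one small status
# record per slice and one per sliver, each updated under a short-lived
# lock, instead of one giant JSON blob held under a long table lock.
import json
import threading

class SliverStatusStore:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the table lock
        self._slices = {}               # slice_uuid -> status string
        self._slivers = {}              # (slice_uuid, node_id) -> JSON blob

    def update_slice(self, slice_uuid, status):
        with self._lock:                # tiny record, tiny hold time
            self._slices[slice_uuid] = status

    def update_sliver(self, slice_uuid, node_id, status):
        blob = json.dumps(status)       # serialize outside the lock
        with self._lock:
            self._slivers[(slice_uuid, node_id)] = blob

    def sliver_status(self, slice_uuid, node_id):
        with self._lock:
            blob = self._slivers.get((slice_uuid, node_id))
        return json.loads(blob) if blob is not None else None

store = SliverStatusStore()
store.update_slice("slice-1", "ready")
store.update_sliver("slice-1", "node-3", {"state": "ready"})
assert store.sliver_status("slice-1", "node-3") == {"state": "ready"}
```

Because event handlers and the local poller each touch only one small record at a time, contention stays low even as the number of slivers grows; the commit's InnoDB note corresponds to replacing the single lock with per-record locks.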
  5. 21 May, 2017 1 commit
    • Another change for using the Portal on other clusters (not the MotherShip): · 85ec0376
      Leigh Stoller authored
      We need to run the aptevent daemon on all clusters, with the
      pubsubd, while on other clusters it listens to the local pubsubd for
      those same events (slice/sliver status, image, and frisbee status events).
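The source selection described in this commit could be sketched roughly as below; the hosts, ports, event-type names, and the `subscription` helper are all illustrative assumptions, not the real pubsubd configuration:

```python
# Hypothetical sketch of the daemon's event-source selection: on the
# MotherShip, subscribe to the SSL pubsubd that carries events from all
# clusters; elsewhere, subscribe to the local pubsubd for the same
# event types. Endpoints and event-type names are made up.
EVENT_TYPES = ("SLICESTATUS", "SLIVERSTATUS", "IMAGESTATUS", "FRISBEESTATUS")

def subscription(is_mothership):
    if is_mothership:
        transport, host, port = "ssl", "pubsubd.example.net", 16506
    else:
        transport, host, port = "tcp", "localhost", 16505
    return {"transport": transport, "host": host, "port": port,
            "events": list(EVENT_TYPES)}

assert subscription(True)["transport"] == "ssl"
assert subscription(False)["host"] == "localhost"
```

The point of the design is that the daemon's event-handling logic stays identical everywhere; only the subscription endpoint differs by cluster role.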
  6. 24 Apr, 2017 1 commit
  7. 03 Mar, 2017 1 commit
  8. 01 Feb, 2017 1 commit
    • Checkpoint the portal side of frisbee events. · 2faf5fd1
      Leigh Stoller authored
      The igevent_daemon now also forwards frisbee events for slices to the
      Portal pubsubd over the SSL channel.
      
      The aptevent_daemon gets those and adds them to sliverstatus stored in
      the webtask for the instance.
      
      The timeout code in create_instance watches for frisbee events and uses
      them as another indicator of progress (or lack thereof). The hope is
      that we fail sooner, or avoid failing too soon (say, because of a giant
      image-backed dataset).
      
      As an added bonus, the status page will display frisbee progress (image
      name and MB written) in the node status hover popover. I mention this
      because otherwise I would go to my grave without anyone ever noticing
      and giving me a pat on the back or a smiley face in Slack.
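The timeout idea in this commit — treat frisbee progress as a liveness signal so a slow but active image load is not killed while a stalled one still times out — can be sketched like this; `ProgressWatchdog` and its API are illustrative, not the actual create_instance code:

```python
# Hypothetical sketch: each frisbee progress event pushes the deadline
# out by the full timeout, so the instance only fails when progress
# actually stops, not merely because the load is slow.
import time

class ProgressWatchdog:
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock              # injectable for testing
        self.deadline = clock() + timeout
        self.last_mb = -1               # MB written so far

    def frisbee_event(self, mb_written):
        # Any forward progress resets the countdown.
        if mb_written > self.last_mb:
            self.last_mb = mb_written
            self.deadline = self.clock() + self.timeout

    def expired(self):
        return self.clock() >= self.deadline

# With a ten-minute budget, each progress report buys more time.
wd = ProgressWatchdog(timeout=600)
wd.frisbee_event(mb_written=128)
assert not wd.expired()
```

A giant image-backed dataset then simply keeps emitting progress events and never trips the timeout, while a genuinely hung load expires after one quiet timeout interval.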
  9. 29 Nov, 2016 1 commit
  10. 07 Sep, 2016 1 commit
  11. 28 Jan, 2016 1 commit
  12. 27 Jan, 2016 1 commit