1. 17 Jan, 2018 2 commits
  2. 16 Jan, 2018 1 commit
    • Lots of changes: · 98ca9432
      Leigh Stoller authored
      * Big change to how events are forwarded to the Portal. Originally we
        subscribed to events from the local pubsubd, transformed them into
        Geni events, and sent them back to the local pubsubd; pubsub_forward
        would then pick them up and forward them to the Portal SSL pubsubd.
        Now we send them directly to the Portal SSL pubsubd, which reduces
        the load on the main pubsubd, which was throwing errors because of
        too much load (specifically, the subscribers were not keeping up,
        which causes pubsubd to throw errors back to the sender). We can do
        this because pubsub and the event system now include the SSL
        entrypoints. A sketch of the new path is below, after this list.
      
        As an aside, pubsub_forward is multi-threaded while igevent_daemon is
        not, so we might have to play some tricks similar to stated.
      
      * Clean up configure definitions as described in commit 621253f2.
      
      * Various debugging changes to make it possible to run an alternate
        igevent daemon out of my devel tree for debugging. Basically, the
        main igevent daemon ignores all events for a specific slice, while my
        igevent daemon ignores all the other events and processes just the
        ones for my specific slice.
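
      A minimal sketch of the new forwarding path and the slice-based
      debugging split, assuming a hypothetical pubsub client wrapper; the
      module, class, endpoint, and port names below are illustrative
      stand-ins, not the real pubsub or event-system API:

        import ssl

        # Hypothetical wrapper around the pubsub client library; the module,
        # class, and method names are illustrative, not the real pubsub API.
        from hypothetical_pubsub import PubSubClient

        LOCAL_PUBSUBD = ("localhost", 16505)             # assumed local pubsubd
        PORTAL_SSL_PUBSUBD = ("portal.example.net", 16506)  # assumed SSL endpoint
        MY_SLICE = "urn:publicid:IDN+example+slice+devtest"  # devel daemon slice

        def transform_to_geni(notification):
            # Placeholder for the local-to-Geni event transformation.
            return dict(notification, format="geni")

        def forward_events(devel_daemon=False):
            local = PubSubClient(*LOCAL_PUBSUBD)
            portal = PubSubClient(*PORTAL_SSL_PUBSUBD,
                                  ssl_context=ssl.create_default_context())

            def on_event(notification):
                # Debugging split: the main daemon skips events for MY_SLICE,
                # while the devel-tree daemon handles only those events.
                mine = notification.get("slice") == MY_SLICE
                if mine != devel_daemon:
                    return
                # Send straight to the Portal SSL pubsubd instead of
                # re-injecting into the local pubsubd for pubsub_forward.
                portal.notify(transform_to_geni(notification))

            local.subscribe("*", on_event)
            local.run_forever()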
  3. 09 Jan, 2018 1 commit
  4. 30 May, 2017 1 commit
    • Rework how we store the sliver/slice status from the clusters: · e5d36e0d
      Leigh Stoller authored
      In the beginning, the number and size of experiments were small, and so
      storing the entire slice/sliver status blob as JSON in the web task was
      fine, even though we had to lock tables to prevent races between the
      event updates and the local polling.
      
      But lately the size of those JSON blobs has gotten huge, and the lock is
      bogging things down; we cannot keep up with the number of events coming
      from all the clusters and get really far behind.
      
      So I have moved the status blobs out of the per-instance web task and
      into new tables, one per slice and one per node (sliver). This keeps
      the blobs very small and thus the lock time very small. So now we can
      keep up with the event stream.
      
      If we grow big enough that this becomes a problem again, we can switch
      to InnoDB for the per-sliver table and do row locking instead of table
      locking, but I do not think that will happen.
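
      A stand-in sketch of the split into per-slice and per-sliver status
      tables; the table and column names are assumptions for illustration
      (and sqlite3 stands in for the production MySQL database), not the
      actual schema:

        import json
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE instance_slice_status (
            uuid   TEXT PRIMARY KEY,   -- one row per slice
            status TEXT,
            blob   TEXT                -- small per-slice JSON blob
        );
        CREATE TABLE instance_sliver_status (
            uuid      TEXT,            -- slice uuid
            sliver_id TEXT,            -- one row per node (sliver)
            status    TEXT,
            blob      TEXT,            -- small per-sliver JSON blob
            PRIMARY KEY (uuid, sliver_id)
        );
        """)

        def update_sliver_status(uuid, sliver_id, status, details):
            # Each event touches one small row, so the lock (table lock now,
            # row lock if we ever switch to InnoDB) is held very briefly.
            conn.execute(
                "REPLACE INTO instance_sliver_status VALUES (?, ?, ?, ?)",
                (uuid, sliver_id, status, json.dumps(details)))
            conn.commit()

        update_sliver_status("slice-uuid", "node1", "ready",
                             {"state": "ready", "frisbee": {"MB": 512}})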
  5. 01 Feb, 2017 1 commit
    • Checkpoint the portal side of frisbee events. · 2faf5fd1
      Leigh Stoller authored
      The igevent_daemon now also forwards frisbee events for slices to the
      Portal pubsubd over the SSL channel.
      
      The aptevent_daemon gets those and adds them to sliverstatus stored in
      the webtask for the instance.
      
      The timeout code in create_instance watches for frisbee events and uses
      them as another indicator of progress (or lack thereof). The hope is
      that we fail sooner or avoid failing too soon (say, because of a giant
      image-backed dataset).
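
      A rough sketch of the idea behind the timeout change; the sliverstatus
      fields, helper names, and timeout value here are assumptions for
      illustration, not the actual create_instance code:

        import time

        NO_PROGRESS_TIMEOUT = 15 * 60   # assumed: give up after 15 idle minutes

        def wait_for_ready(sliverstatus):
            """Wait for a sliver, counting frisbee events as progress."""
            deadline = time.time() + NO_PROGRESS_TIMEOUT
            last_seen = None
            while sliverstatus.state not in ("ready", "failed"):
                frisbee = sliverstatus.frisbee  # e.g. {"image", "MB", "stamp"}
                if frisbee and frisbee["stamp"] != last_seen:
                    # A new frisbee event counts as progress: push out the
                    # deadline so a giant image-backed dataset does not trip
                    # the timeout, while a stalled sliver still fails sooner.
                    last_seen = frisbee["stamp"]
                    deadline = time.time() + NO_PROGRESS_TIMEOUT
                if time.time() > deadline:
                    raise TimeoutError("no progress reported by the cluster")
                time.sleep(30)
                sliverstatus.refresh()  # assumed helper; re-reads stored status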
      
      As an added bonus, the status page will display frisbee progress (image
      name and MB written) in the node status hover popover. I mention this
      because otherwise I would go to my grave without anyone ever noticing
      and giving me a pat on the back or a smiley face in Slack.
  6. 05 Feb, 2016 1 commit
  7. 27 Jan, 2016 2 commits