1. 14 Mar, 2018 2 commits
  2. 13 Mar, 2018 1 commit
  3. 09 Mar, 2018 1 commit
  4. 08 Mar, 2018 2 commits
  5. 20 Feb, 2018 1 commit
  6. 16 Feb, 2018 4 commits
      Backend part of "async" mode. · ae4af900
      Leigh Stoller authored
      A lot of work on the RPC code, among other things. · 56f6d601
      Leigh Stoller authored
      I spent a fair amount of time improving error handling along the RPC
      path, as well as making the code more consistent across the various
      files. Also be more consistent in how the web interface invokes the
      backend and gets errors back, specifically for errors generated when
      talking to a remote cluster.
      
      Add checks before every RPC to make sure the cluster is not disabled in
      the database. Also check that we can actually reach the cluster, and
      that the cluster is not offline (NoLogins()) before we try to do
      anything. I might have to relax this a bit, but in general it takes a
      couple of seconds to check, which is a small fraction of what most RPCs
      take. Return precise errors for clusters that are not available to the
      web interface, and show them to the user.
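      The pre-RPC checks described above can be sketched as follows. This is
      an illustrative Python sketch; the names (check_cluster_usable, the
      dict fields) are assumptions, not Emulab's actual API, which is Perl.

      ```python
      class ClusterUnavailable(Exception):
          """Raised when a cluster cannot service an RPC right now."""

      def check_cluster_usable(cluster):
          """Run the cheap availability checks before issuing a long RPC."""
          if cluster.get("disabled"):       # disabled flag in the database
              raise ClusterUnavailable(f"{cluster['name']} is disabled")
          if not cluster.get("reachable"):  # e.g. a quick network probe
              raise ClusterUnavailable(f"{cluster['name']} is unreachable")
          if cluster.get("nologins"):       # cluster is offline (NoLogins())
              raise ClusterUnavailable(f"{cluster['name']} is offline")
          return True
      ```

      The point of the ordering is that each check is cheap (seconds at
      most), so failing fast here costs little relative to a full RPC.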
      
      Use webtasks more consistently between the web interface and backend
      scripts. Watch specifically for scripts that exit abnormally (exit
      before setting the exitcode in the webtask), which always means an
      internal failure; do not show those to users.
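      The exitcode convention above might look like this in outline. The
      field names and the classify function are hypothetical stand-ins, not
      the actual webtask schema.

      ```python
      def classify_webtask(webtask):
          """Classify a finished backend script by its webtask state.

          Returns a (kind, message) pair: "internal" for abnormal exits
          (never shown to users), "user" for real errors, "ok" for success.
          """
          if webtask.get("exitcode") is None:
              # Script exited before setting the exitcode: internal failure.
              return ("internal", "backend script exited abnormally")
          if webtask["exitcode"] != 0:
              return ("user", webtask.get("error", "operation failed"))
          return ("ok", "")
      ```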
      
      Show only those RPC errors that would make sense to users; stop spewing
      script output to the user, and send it just to tbops via the email that
      is already generated when a backend script fails fatally.
      
      But do not spew email for clusters that are not reachable or are
      offline. Ditto for several other cases that were generating mail to
      tbops instead of just showing the user a meaningful error message.
      
      Stop using ParRun for single-site experiments, which are 99% of
      experiments.
      
      For create_instance, a new "async" mode that tells CreateSliver() to
      return before the first mapper run, which means it typically returns
      very quickly. Then watch for errors or for the manifest with Resolve,
      or for the slice to disappear. I expect this to be bounded, so we do
      not need to worry so much about timing out this wait (which is a
      problem on very big topologies). When we see the manifest, the
      RedeemTicket() part of the CreateSliver is done and we are now in the
      StartSliver() phase.
      
      For the StartSliver phase, watch for errors and show them to users;
      previously we mostly lost those errors and just sent the experiment
      into the failed state. I am still working on this.
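      The async wait loop described in this commit might be structured
      roughly like this. All names here (wait_for_manifest, the resolve
      callable, the status fields) are illustrative assumptions, not the
      actual geni-lib or Emulab RPC calls.

      ```python
      import time

      def wait_for_manifest(resolve, interval=5, timeout=600):
          """Poll resolve() until a manifest appears or a terminal condition.

          resolve() is assumed to return a status dict, or None if the
          slice has disappeared.
          """
          deadline = time.monotonic() + timeout
          while time.monotonic() < deadline:
              status = resolve()
              if status is None:
                  raise RuntimeError("slice disappeared")
              if status.get("error"):
                  raise RuntimeError(status["error"])
              if status.get("manifest"):
                  # RedeemTicket() phase is done; StartSliver() begins.
                  return status["manifest"]
              time.sleep(interval)
          raise TimeoutError("timed out waiting for manifest")
      ```

      Because CreateSliver() returns before the first mapper run, the wait
      for the manifest is expected to be bounded, which is why a simple
      polling loop with a generous timeout is plausible here.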
      Kill debugging print. · efe0efb1
      Leigh Stoller authored
  7. 09 Feb, 2018 3 commits
  8. 06 Feb, 2018 1 commit
  9. 25 Jan, 2018 1 commit
  10. 22 Jan, 2018 14 commits
  11. 17 Jan, 2018 2 commits
  12. 16 Jan, 2018 3 commits
      More mysql 5.7 fixes. · 283fc466
      Leigh Stoller authored
      Left this out of previous commit 98ca9432 · 56b43e80
      Leigh Stoller authored
      Lots of changes: · 98ca9432
      Leigh Stoller authored
      * Big change to how events are forwarded to the Portal. Originally we
        subscribed to events from the local pubsubd, would transform them to
        Geni events, then send them back to the local pubsubd; pubsub_forward
        would pick them up and then forward them to the Portal SSL pubsubd.
        Now, send them directly to the Portal SSL pubsubd, which reduces the
        load on the main pubsubd, which was throwing errors because of too
        much load (to be specific, the subscribers were not keeping up, which
        causes pubsubd to throw errors back to the sender). We can do this
        because pubsub and the event system now include the SSL entrypoints.
      
        As an aside, pubsub_forward is multi-threaded while igevent_daemon is
        not, so we might have to play some tricks similar to stated.
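        The before/after shape of the forwarding change can be sketched like
        this. The transform and publish callables are stand-ins for the real
        pubsub client API, which this sketch does not attempt to reproduce.

        ```python
        def make_forwarder(transform, portal_publish):
            """Return an event handler that sends transformed events
            directly to the Portal SSL pubsubd (the "after" path)."""
            def handle(local_event):
                geni_event = transform(local_event)
                # One hop: local pubsubd -> Portal SSL pubsubd, skipping
                # the re-publish to the local pubsubd that overloaded it.
                portal_publish(geni_event)
            return handle
        ```

        The design point is simply that each event now crosses the local
        pubsubd once (as a subscription) instead of twice (subscription plus
        re-publish), halving that daemon's share of the forwarding load.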
      
      * Clean up configure definitions as described in commit 621253f2.
      
      * Various debugging changes to make it possible to run an alternate
        igevent daemon out of my devel tree for debugging. Basically, the
        main igevent daemon ignores all events for a specific slice, while my
        igevent daemon ignores all the other events and processes just the
        ones for my specific slice.
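      The production/devel split described above amounts to two complementary
      event predicates. This is a hedged sketch of how such a filter could be
      built; the event shape and function names are assumptions.

      ```python
      def slice_filter(devel_slice, is_devel_daemon):
          """Build an event predicate for one of the two daemons.

          The devel daemon accepts only events for devel_slice; the
          production daemon accepts everything else.
          """
          def wants(event):
              mine = event.get("slice") == devel_slice
              return mine if is_devel_daemon else not mine
          return wants
      ```

      Together the two predicates partition the event stream, so every event
      is handled by exactly one of the two daemons.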
  13. 09 Jan, 2018 1 commit
  14. 26 Dec, 2017 1 commit
  15. 11 Dec, 2017 1 commit
  16. 06 Dec, 2017 2 commits