    Two unrelated bug fixes (with some related cleanups and tweaks) · 9f4edbba
    Leigh B. Stoller authored
    * The first involves swapmod. When a swapmod on an active experiment fails,
      tbswap will reswap the experiment back to the original configuration. The
      problem is that it is reswapping it with the *new* virtual state of the
      experiment in the DB. It is not until later when control returns to
      swapexp that the virtual state is restored. This is plainly wrong, and in
      fact was causing the event scheduler grief because it was starting up,
      reading the virtual topo, which was different, wrong, and about to be
      blown away.
      I reorganized the modify section of swapexp so that virtual state is
      restored only when it's a swapmod on a swapped experiment. On an active
      experiment, I moved that code down into tbswap, which now does all
      of the virtual and physical state restore before it does the reswap back
      to the original experiment. Just for kicks, it's also done if tbswap
      decides to swap the experiment because of a fatal error.
      Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
      deal with !$NoRecover. I know, two knots make a wright for most people.
      Renderer: I was annoyed by the fact that we rerun the renderer on a
      failed swapmod. The original reason is that the renderer runs in the
      background and so vis_nodes cannot be saved with the rest of the virtual
      state tables because the renderer might still be running when the user
      fires off the swapmod. Well, the hell with that. We lock the vis_nodes
      table anyway in the renderer during update, so we are certain to get a
      consistent snapshot. We store the renderer pid in the experiments table,
      so if the renderer was running, just fire off another one; mostly this is
      not going to happen. In addition, tbprerun no longer starts a new
      renderer when doing the swapmod; I start the new renderer later after
      swapmod succeeds. I might end up tweaking this a bit depending on what
      people notice as being different.
    * Termination changes to batchexp and swapexp: I've rearranged the
      termination code using an END block so that any uncontrolled exit from
      either batchexp or swapexp will go through the cleanup code, and
      hopefully insert a stats record, as well as not leave the experiment in
      some in-between state. I've set the max DB retry count to zero in both
      cases, which means infinite retry. I've also added SIGTERM handlers to
      both so that again, we can kill a hung batch/swap and have it clean up
      things more or less. Note that END blocks are not run when a signal
      causes the program to die; you have to catch the signal and then die()
      so that the END block is executed.
      Eventually, we need to clean up the various libraries so that we do not
      use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
      Ditto for the event system interface.
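
The END block and SIGTERM note above can be illustrated with a small sketch. batchexp and swapexp are Perl, so this is only an analog: it assumes Python's atexit handlers as a stand-in for a Perl END block, since they share the same caveat (they do not run when an unhandled signal kills the process). The names CHILD, cleanup, and run_and_kill are hypothetical and do not come from the scripts. POSIX only.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child program: registers an exit-time cleanup (the analog of an END block)
# and, optionally, a SIGTERM handler that converts the signal into a normal exit.
CHILD = textwrap.dedent("""
    import atexit, signal, sys, time

    marker = sys.argv[1]
    catch_sigterm = sys.argv[2] == "catch"

    def cleanup():
        # Hypothetical stand-in for the real cleanup work.
        with open(marker, "w") as fh:
            fh.write("cleaned up")

    atexit.register(cleanup)

    if catch_sigterm:
        # Catch the signal and exit normally so the exit handlers run.
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))

    print("ready", flush=True)
    time.sleep(30)
""")


def run_and_kill(mode):
    """Start the child, send it SIGTERM, and report whether cleanup ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "marker")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker, mode],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()  # wait until the child has installed its handlers
        proc.send_signal(signal.SIGTERM)
        proc.wait(timeout=10)
        proc.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("cleanup ran with handler:   ", run_and_kill("catch"))
    print("cleanup ran without handler:", run_and_kill("default"))
```

The handler does nothing except exit. All of the cleanup stays in one place (the exit-time hook), which is the same arrangement the message describes: catch the signal, then die() so the END block is executed.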
swapexp.in 31.8 KB