tbsetup/swapexp.in · 9f4edbba8fb50f54e5e86a58b30dd500e173a611 · emulab / emulab-devel

Leigh B. Stoller authored Jul 29, 2004

* The first involves swapmod. When a swapmod on an active experiment fails,
tbswap will reswap the experiment back to the original configuration. The
problem is that it is reswapping it with the *new* virtual state of the
experiment in the DB. It is not until later, when control returns to
swapexp, that the virtual state is restored. This is plainly wrong, and in
fact was causing the event scheduler grief because it was starting up and
reading the virtual topo, which was different, wrong, and about to be
blown away.

I reorganized the modify section of swapexp so that swapexp restores
virtual state only when it's a swapmod on a swapped experiment. On an active
experiment, I moved that code down into tbswap, which now does all
of the virtual and physical state restore before it does the reswap back
to the original experiment. Just for kicks, it's also done if tbswap
decides to swap the experiment because of a fatal error.
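
A minimal sketch of the ordering fix, in Python rather than the Perl of
swapexp and tbswap. Everything in it is hypothetical: the dict standing in
for the DB, the swap_in callback, SwapError, and the function name are for
illustration only and are not the actual Emulab code. The only point is
that the saved virtual state goes back in before the reswap, so the reswap
(and anything it starts up) sees the original topology.

```python
class SwapError(Exception):
    """Hypothetical error raised when a swap-in fails."""


def swapmod_active(db, new_virtual, swap_in):
    """Try to modify an active experiment; on failure restore, then reswap.

    db          -- dict with a "virtual" entry (stand-in for the DB tables)
    new_virtual -- the modified virtual state
    swap_in     -- callable that maps the given virtual state onto hardware
    Returns True if the modify succeeded, False if it was rolled back.
    """
    saved = db["virtual"]          # back up the original virtual state
    db["virtual"] = new_virtual    # the modify installs the new state
    try:
        swap_in(db["virtual"])
        return True
    except SwapError:
        # Restore first, so the reswap sees the original state,
        # not the new state that just failed.
        db["virtual"] = saved
        swap_in(db["virtual"])
        return False
```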

Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
deal with !$NoRecover. I know, two knots make a wright for most people.

Renderer: I was annoyed by the fact that we rerun the renderer on a
failed swapmod. The original reason is that the renderer runs in the
background and so vis_nodes cannot be saved with the rest of the virtual
state tables because the renderer might still be running when the user
fires off the swapmod. Well, the hell with that. We lock the vis_nodes
table anyway in the renderer during update, so we are certain to get a
consistent snapshot. We store the renderer pid in the experiments table,
so if the renderer was running, just fire off another one; mostly this is
not going to happen. In addition, tbprerun no longer starts a new
renderer when doing the swapmod; I start the new renderer later after
swapmod succeeds. I might end up tweaking this a bit depending on what
people notice as being different.

* Termination changes to batchexp and swapexp: I've rearranged the
termination code using an END block so that any uncontrolled exit from
either batchexp or swapexp will go through the cleanup code, and
hopefully insert a stats record, as well as not leave the experiment in
some in-between state. I've set the max DB retry count to zero in both
cases, which means infinite retry. I've also added SIGTERM handlers to
both so that again, we can kill a hung batch/swap and have it clean up
things more or less. Note that END blocks are not run when a signal
kills the program; you have to catch the signal and then die() so that
the END block is executed.
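
The same pitfall exists outside Perl. As a rough analog only (this is not
the swapexp code), Python's atexit handlers play the role of an END block:
they are skipped when an unhandled SIGTERM kills the process, and they run
if a handler turns the signal into a normal exit. The sketch below assumes
a POSIX system; run_child and the "cleanup" marker are made-up names for
the illustration.

```python
import subprocess
import sys

# Script run in a child process; argv[1] selects whether SIGTERM is handled.
CHILD = r'''
import atexit, os, signal, sys, time

atexit.register(lambda: print("cleanup"))   # stands in for the END block

if sys.argv[1] == "handle":
    # Catch SIGTERM and exit normally so exit handlers get to run.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))

os.kill(os.getpid(), signal.SIGTERM)         # simulate "kill <pid>"
time.sleep(5)                                # never finishes either way
'''


def run_child(handle_sigterm):
    """Run the child script and return whatever it printed to stdout."""
    mode = "handle" if handle_sigterm else "default"
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout
```

With the handler installed, run_child(True) should include the cleanup
line in its output; with the default disposition, run_child(False) should
not.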

Eventually, we need to clean up the various libraries so that we do not
use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
Ditto for the event system interface.
