Commit 9f4edbba authored by Leigh Stoller

Two unrelated bug fixes (with some related cleanups and tweaks)

* The first involves swapmod. When a swapmod on an active experiment fails,
  tbswap will reswap the experiment back to the original configuration. The
  problem is that it is reswapping it with the *new* virtual state of the
  experiment in the DB. It is not until later when control returns to
  swapexp that the virtual state is restored. This is plainly wrong, and in
  fact was causing the event scheduler grief cause it was starting up,
  reading the virtual topo, which was different, wrong, and about to be
  blown away.

  I reorganized the modify section of swapexp so that virtual state is
  restored only when it's a swapmod on a swapped experiment. On an active
  experiment, I moved that code down into tbswap, which now does all
  of the virtual and physical state restore before it does the reswap back
  to the original experiment. Just for kicks, it's also done if tbswap
  decides to swap the experiment cause of a fatal error.

  Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
  deal with !$NoRecover. I know, two knots make a wright for most people.

  Renderer: I was annoyed by the fact that we rerun the renderer on a
  failed swapmod. The original reason is that the renderer runs in the
  background and so vis_nodes cannot be saved with the rest of the virtual
  state tables cause the renderer might still be running when the user
  fires off the swapmod. Well, the hell with that. We lock the vis_nodes
  table anyway in the renderer during update, so we are certain to get a
  consistent snapshot. We store the renderer pid in the experiments table,
  so if the renderer was running, just fire off another one; mostly this is
  not going to happen. In addition, tbprerun no longer starts a new
  renderer when doing the swapmod; I start the new renderer later after
  swapmod succeeds. I might end up tweaking this a bit depending on what
  people notice as being different.

* Termination changes to batchexp and swapexp: I've rearranged the
  termination code using an END block so that any uncontrolled exit from
  either batchexp or swapexp will go through the cleanup code, and
  hopefully insert a stats record, as well as not leave the experiment in
  some in-between state. I've set the max DB retry count to zero in both
  cases, which means infinite retry. I've also added SIGTERM handlers to
  both so that again, we can kill a hung batch/swap and have it clean up
  things more or less. Note that END blocks are not run when a signal
  kills the program; you have to catch the signal and then die() (or
  exit) so that the END block is executed.

  Eventually, we need to clean up the various libraries so that we do not
  use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
  Ditto for event system interface.
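
The note about END blocks and signals can be checked with a small standalone script. This is a minimal sketch, not code from this commit: `end_block_ran` and the `$report` pipe are hypothetical names made up for the demo. It forks a child, sends it SIGTERM, and reports whether the child's END block ran, once with a TERM handler that calls exit() and once with the default disposition.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

STDOUT->autoflush(1);    # keep buffered output out of forked children

# Hypothetical demo, not code from the commit. $report is only set in the
# forked child, so the parent's own END pass does nothing.
my $report;

END {
    if ($report) {
        my $saved_exitcode = $?;    # preserve the exit status across cleanup
        print $report "cleanup ran\n";
        close($report);
        $? = $saved_exitcode;
    }
}

# Fork a child, send it SIGTERM, and return 1 if its END block ran.
sub end_block_ran {
    my ($install_handler) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $child = fork();
    die "fork failed: $!" unless defined($child);

    if ($child == 0) {
        close($reader);
        $report = $writer;
        $report->autoflush(1);
        # Catching TERM and calling exit() lets END blocks run; with the
        # default disposition the process is killed on the spot.
        $SIG{TERM} = $install_handler ? sub { exit(1); } : 'DEFAULT';
        print $report "ready\n";
        sleep(30);
        exit(0);
    }

    close($writer);
    my $ready = <$reader>;    # wait until the child has set up $SIG{TERM}
    kill('TERM', $child);
    my @rest = <$reader>;     # reads until the child has gone away
    close($reader);
    waitpid($child, 0);
    my $ran = grep { /cleanup ran/ } @rest;
    return $ran ? 1 : 0;
}

print "with handler:    ", end_block_ran(1), "\n";
print "without handler: ", end_block_ran(0), "\n";
```

The handler in the sketch calls exit() directly; the handler added to batchexp in this commit gets the same effect by calling fatal(), which ends in exit($errorstat).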
parent 719a65c4
@@ -2745,6 +2745,9 @@ sub TBGetSiteVar($)
     "virt_programs",
     "virt_node_desires",
     "virt_simnode_attributes",
+    # vis_nodes is locked during update in prerender, so we
+    # will get a consistent dataset when we backup.
+    "vis_nodes",
     "nseconfigs",
     "eventlist",
     "event_groups",
......
@@ -83,8 +83,13 @@ my $user_name;
 my $user_email;
 my $dbuid;
-# Be careful not to exit on transient error
-$libdb::DBQUERY_MAXTRIES = 30;
+# Be careful not to exit on transient error; 0 means infinite retry.
+$libdb::DBQUERY_MAXTRIES = 0;
+# For the END block below.
+my $cleaning = 0;
+my $justexit = 1;
+my $signaled = 0;
 #
 # Turn off line buffering on output
@@ -334,6 +339,12 @@ if (! DBQueryWarn("unlock tables")) {
           " DB error unlocking tables!");
 }
+#
+# At this point, we need to force a cleanup no matter how we exit.
+# See the END block below.
+#
+$justexit = 0;
 #
 # Create a directory structure for the experiment.
 #
@@ -419,8 +430,11 @@ TBExptOpenLogFile($pid, $eid);
 if (my $childpid = TBBackGround($logname)) {
     #
-    # Parent exits normally, unless in waitmode.
+    # Parent exits normally, unless in waitmode. We have to set
+    # justexit to make sure the END block below does not run.
     #
+    $justexit = 1;
     if (!$waitmode) {
         print("Experiment $pid/$eid is now configuring\n".
               "You will be notified via email when the experiment is ".
@@ -470,6 +484,26 @@ if ($waitmode) {
     POSIX::setsid();
 }
+#
+# We need to catch TERM cause sometimes shit happens and we have to kill
+# an experiment setup that is hung or otherwise scrogged. Rather then
+# trying to kill off the children one by one, lets arrange to catch it
+# here and send a killpg to the children. This is not to be done lightly,
+# cause it can leave things worse then they were before!
+#
+sub handler ($) {
+    my ($signame) = @_;
+    $SIG{TERM} = 'IGNORE';
+    my $pgrp = getpgrp(0);
+    kill('TERM', -$pgrp);
+    sleep(1);
+    $signaled = 1;
+    fatal("Caught SIG${signame}! Killing experiment setup ...");
+}
+$SIG{TERM} = \&handler;
+$SIG{QUIT} = 'DEFAULT';
 #
 # The guts of starting an experiment!
 #
@@ -478,29 +512,33 @@ if ($waitmode) {
 #
 #
-# Run the various scripts. We want to propogate the error from tbprerun
+# Run the various scripts. We want to propagate the error from tbprerun
 # and tbrun back out, hence the bogus looking errorstat variable.
 #
-SetExpState($pid, $eid, EXPTSTATE_PRERUN);
+SetExpState($pid, $eid, EXPTSTATE_PRERUN)
+    or fatal("Failed to set experiment state to " . EXPTSTATE_PRERUN());
 print "Running 'tbprerun $pid $eid $nsfile'\n";
 if (system("$tbbindir/tbprerun $pid $eid $nsfile") != 0) {
     $errorstat = $? >> 8;
     fatal("tbprerun failed!");
 }
-SetExpState($pid, $eid, EXPTSTATE_SWAPPED);
+SetExpState($pid, $eid, EXPTSTATE_SWAPPED)
+    or fatal("Failed to set experiment state to " . EXPTSTATE_SWAPPED());
 #
 # If not in frontend mode (preload only) continue to swapping exp in.
 #
 if (! ($frontend || $batchmode)) {
-    SetExpState($pid, $eid, EXPTSTATE_ACTIVATING);
+    SetExpState($pid, $eid, EXPTSTATE_ACTIVATING)
+        or fatal("Failed to set experiment state to ". EXPTSTATE_ACTIVATING());
     print "Running 'tbswap in $pid $eid'\n";
     if (system("$tbbindir/tbswap in $pid $eid") != 0) {
         $errorstat = $? >> 8;
         fatal("tbswap in failed!");
     }
-    SetExpState($pid, $eid, EXPTSTATE_ACTIVE);
+    SetExpState($pid, $eid, EXPTSTATE_ACTIVE)
+        or fatal("Failed to set experiment state to " . EXPTSTATE_ACTIVE());
     #
     # Look for the unsual case of more than 2 nodes and no vlans. Send a
@@ -672,14 +710,8 @@ exit(0);
 #
 #
 #
-sub fatal($)
+sub cleanup()
 {
-    my($mesg) = $_[0];
-    print "*** $0:\n";
-    print " $mesg\n";
-    print "Cleaning up and exiting with status $errorstat ...\n";
     #
     # Failed early (say, in parsing). No point in keeping any of the
     # stats or resource records. Just a waste of space since the
@@ -705,7 +737,7 @@ sub fatal($)
         DBQueryWarn("DELETE from experiment_stats ".
                     "WHERE eid='$eid' and pid='$pid' and exptidx=$exptidx");
-        exit($errorstat);
+        return;
     }
     #
@@ -723,17 +755,26 @@ sub fatal($)
     #
     my $estate = ExpState($pid, $eid);
     if ($estate ne EXPTSTATE_NEW) {
-        if ($estate eq EXPTSTATE_ACTIVE) {
+        #
+        # We do not know exactly where things stopped, so if the
+        # experiment was activating when the signal was delivered,
+        # run tbswap on it.
+        #
+        if ($estate eq EXPTSTATE_ACTIVE ||
+            ($estate eq EXPTSTATE_ACTIVATING && $signaled)) {
             print "Running 'tbswap out -force $pid $eid'\n";
             if (system("$tbbindir/tbswap out -force $pid $eid") != 0) {
                 print "tbswap out failed!\n";
             }
+            SetExpState($pid, $eid, EXPTSTATE_SWAPPED);
         }
         print "Running 'tbend -force $pid $eid'\n";
         if (system("$tbbindir/tbend -force $pid $eid") != 0) {
             print "tbend failed!\n";
         }
     }
+    SetExpState($pid, $eid, EXPTSTATE_TERMINATED);
     #
     # Okay, we *are* going to terminate the experiment.
@@ -748,7 +789,6 @@ sub fatal($)
     #
     SENDMAIL("$user_name <$user_email>",
              "Experiment Configure Failure: $pid/$eid",
-             $mesg . "\n\n" .
              "Please look at the log below to see what happened. If the error\n".
              "resulted from a lack of free nodes, you can use this web page to\n".
              "get a summary of free nodes:\n\n".
@@ -773,7 +813,6 @@ sub fatal($)
     # Clear the record and cleanup.
     #
     TBExptDestroy($pid, $eid);
-    exit($errorstat);
 }
 #
@@ -932,3 +971,46 @@ sub ParseArgs()
         $waitmode = 1;
     }
 }
+#
+# We need this END block to make sure that we clean up after a fatal
+# exit in the library. This is problematic, cause we could be exiting
+# cause the mysql server has gone whacky again.
+#
+sub fatal($)
+{
+    my($mesg) = $_[0];
+    print "*** $0:\n";
+    print " $mesg\n";
+    print "Cleaning up and exiting with status $errorstat ...\n";
+    #
+    # This exit will drop into the END block below.
+    #
+    exit($errorstat);
+}
+END {
+    # Normal exit, nothing to do.
+    if (!$? || $justexit) {
+        return;
+    }
+    my $saved_exitcode = $?;
+    if ($cleaning) {
+        #
+        # We are screwed; a recursive error. Someone will have to clean
+        # up by hand.
+        #
+        SENDMAIL(TBOPS,
+                 "Experiment Configure Failure: $pid/$eid",
+                 "Recursive error in cleanup! This is very bad.");
+        $? = $saved_exitcode;
+        return;
+    }
+    $cleaning = 1;
+    cleanup();
+    $? = $saved_exitcode;
+}
This diff is collapsed.
......@@ -94,14 +94,13 @@ if (!$force &&
#
sub cleanup {
print STDERR "Cleaning up after errors.\n";
if ($state eq EXPTSTATE_PRERUN) {
# Must kill the prerender process before we remove virt state.
print "Killing the renderer.\n";
system("prerender -r $pid $eid");
# When doing a modify, this is handled elsewhere.
if ($state eq EXPTSTATE_PRERUN) {
print "Removing experiment state ... " . TBTimeStamp() . "\n";
TBExptRemoveVirtualState($pid, $eid );
}
print "Removing experiment state.\n";
TBExptRemoveVirtualState($pid, $eid );
}
# Must kill any prerender process first!
@@ -144,9 +143,15 @@ if ($nsfile_string) {
     }
 }
-TBDebugTimeStamp("prerender started in background");
-print "Precomputing visualization ...\n";
-system("prerender -t $pid $eid");
+#
+# In update mode, do not start the renderer until later. If update fails we
+# want to try to restore old render info rather then rerunning.
+#
+if ($state eq EXPTSTATE_PRERUN) {
+    TBDebugTimeStamp("prerender started in background");
+    print "Precomputing visualization ...\n";
+    system("prerender -t $pid $eid");
+}
 #
 # See if using the new ipassign.
......
@@ -202,26 +202,38 @@ elsif ($swapop eq "update") {
     #
     # There were errors; see if we can recover.
     #
-    my $NoRecover = 0;
+    my $CanRecover = 1;
     if ($errors != 7) {
         print STDERR "Update failure occurred _after_ assign phase; ";
-        $NoRecover = 1;
+        $CanRecover = 0;
     }
-    if (! $NoRecover) {
-        print STDERR "Recovering physical state.\n";
-        if (($NoRecover = TBExptRestorePhysicalState($pid,$eid))) {
-            print STDERR "Could not restore backed-up physical state; ";
-        }
+    if ($CanRecover) {
+        print STDERR "Recovering virtual and physical state.\n";
+        if (TBExptRemoveVirtualState($pid, $eid) ||
+            TBExptRestoreVirtualState($pid, $eid) ||
+            TBExptRestorePhysicalState($pid,$eid)) {
+            print STDERR "Could not restore backed-up state; ";
+            $CanRecover = 0;
         }
-    if (! $NoRecover) {
+        else {
             print STDERR "Doing a recovery swap-in of old state.\n";
-            if (($NoRecover = doSwapin(UPDATE_RECOVER))) {
+            if (doSwapin(UPDATE_RECOVER)) {
                 print STDERR "Could not swap in old physical state; ";
+                $CanRecover = 0;
             }
         }
-    if ($NoRecover) {
+    }
+    #
+    # Some part of the recovery failed; must swap it out. swapexp
+    # (caller) will then have to do more clean up, hence the special
+    # exit status indicated by $updatehosed.
+    #
+    if (! $CanRecover) {
         print STDERR "Recovery aborted! Swapping experiment out.\n";
         doSwapout(CLEANUP);
         $updatehosed = 1;
......
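
For readers who want to step through the new recovery logic in the hunk above, here is a minimal sketch that models it outside the testbed. `recover_update` and the step names are hypothetical stand-ins for the real calls, which are assumed (as the diff suggests) to return nonzero on failure. The `%fails` argument just lets you force a given step to fail.

```perl
use strict;
use warnings;

# Hypothetical model of the recovery flow; the step names stand in for the
# real library calls and, like them, are assumed to return nonzero on failure.
sub recover_update {
    my ($errors, %fails) = @_;
    my @log;
    my $step = sub {
        my ($name) = @_;
        push(@log, $name);
        return $fails{$name} ? 1 : 0;
    };

    # The diff treats $errors == 7 as the only recoverable case.
    my $can_recover = ($errors == 7) ? 1 : 0;

    if ($can_recover) {
        # The || chain stops at the first step that fails.
        if ($step->("remove_virtual") ||
            $step->("restore_virtual") ||
            $step->("restore_physical")) {
            $can_recover = 0;
        }
        elsif ($step->("recovery_swapin")) {
            $can_recover = 0;
        }
    }
    # Any failure along the way ends in a cleanup swap-out.
    $step->("swapout_cleanup") if (!$can_recover);

    return ($can_recover, @log);
}

my ($ok, @steps) = recover_update(7, restore_virtual => 1);
print "recovered=$ok steps=@steps\n";
# prints: recovered=0 steps=remove_virtual restore_virtual swapout_cleanup
```

Because the three restore calls are chained with `||`, a failure in an early step skips the later ones, so the swap-in of old state is attempted only when all three restores succeed.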