Commit 4269dad1 authored by Leigh Stoller

Up to now we have had two state variables associated with an experiment,
plus a lock field. The lock field was a simple "experiment locked, go away"
slot that is easy to use when you do not care about the actual state that
an experiment is in, just that it is in "transition" and should not be
messed with.

The other two state variables are "state" and "batchstate". The former
(state) is the original variable that Chris added, and was used by the tb*
scripts to make sure that the experiment was in the state each particular
script wanted it to be in. But over time (and with the addition of so
much wrapper goo around them), "state" has leaked out all over the place to
determine what operations on an experiment are allowed, and if/when it
should be displayed in various web pages. There is also a set of transition
states (like "swapping") in addition to the usual "active", "swapped", etc.,
which makes testing state a pain in the butt.

I added the other state variable ("batchstate") when I did the batch
system, obviously! It was intended as a wrapper state to control access to
the batch queue, and to prevent batch experiments from being messed with
except when it was really okay (for example, it's okay to terminate a
swapped out batch experiment, but not a swapped in batch experiment since
that would confuse the batch daemon). There are fewer of these states, plus
one additional state for "modifying" experiments.

So what I have done is change the system to use "batchstate" for all
experiments to control entry into the swap system, from the web interface,
from the command line, and from the batch daemon. The other state variable
still exists, and will be brutally pushed back under the surface until it's
just a vague memory, used only by the original tb* scripts. This will
happen over time, and the "batchstate" variable will be renamed once I am
convinced that this was the right thing to do and that my changes actually
work as intended.
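
As a rough illustration of what gating on a single variable looks like (a sketch only, not the code in this commit): the state names below are copied from the BATCHSTATE_* constants in the diff, while may_terminate() is a made-up helper that mirrors the check endexp makes for plain (non-batch) experiments.

```perl
use strict;
use warnings;

# State strings as they appear in the BATCHSTATE_* subs in the diff.
use constant {
    BATCHSTATE_POSTED      => "posted",
    BATCHSTATE_ACTIVATING  => "activating",
    BATCHSTATE_RUNNING     => "active",
    BATCHSTATE_MODIFYING   => "modifying",
    BATCHSTATE_PAUSED      => "paused",
    BATCHSTATE_TERMINATING => "terminating",
};

# Hypothetical helper (not in the tree): a plain experiment may begin
# termination only when it is swapped out (paused) or swapped in (active).
# Every other state is treated as "in transition".
sub may_terminate {
    my ($batchstate) = @_;

    return 1
        if ($batchstate eq BATCHSTATE_PAUSED ||
            $batchstate eq BATCHSTATE_RUNNING);
    return 0;
}

foreach my $s (BATCHSTATE_PAUSED, BATCHSTATE_RUNNING, BATCHSTATE_MODIFYING) {
    printf("%-12s %s\n", $s,
           may_terminate($s) ? "okay to terminate" : "in transition, wait");
}
```

With one variable, every entry point (web, command line, batch daemon) can ask the same question instead of each one testing its own list of "state" values.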

Only people who have bothered to read this far will know that I also added
the ability to cancel an experiment swapin in progress. For that I am using
the "canceled" flag (ah, this one was named properly from the start!), and
I test that at various times in assign_wrapper and tbswap. A minor downside
right now is that a canceled swapin looks too much like a failed swapin,
and so tbops gets email about it. I'll fix that at some point (sometime
after the boss complains).
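
The cancel checks can be pictured as a flag that gets polled between the long-running steps. Below is a rough sketch of that pattern, not the actual assign_wrapper or tbswap code: get_cancel_flag() and the %cancel_flags hash are made-up stand-ins for TBGetBatchCancelFlag and the "canceled" column, and run_steps() is a hypothetical driver.

```perl
use strict;
use warnings;

# Stand-in for the "canceled" column of the experiments table
# (assumption: keyed here by "pid/eid" purely for the example).
my %cancel_flags = ();

# Same calling shape as TBGetBatchCancelFlag: the flag comes back
# through a reference; returns 1 if okay, 0 if the pid/eid is unknown.
sub get_cancel_flag {
    my ($pid, $eid, $flagref) = @_;
    my $key = "$pid/$eid";

    return 0
        if (! exists($cancel_flags{$key}));
    $$flagref = $cancel_flags{$key};
    return 1;
}

# Hypothetical driver: run each step in order, polling the cancel flag
# before every step. Returns a status string and the count of steps run.
sub run_steps {
    my ($pid, $eid, @steps) = @_;
    my $done = 0;

    foreach my $step (@steps) {
        my $canceled = 0;

        get_cancel_flag($pid, $eid, \$canceled);
        return ("canceled", $done)
            if ($canceled);
        $step->();
        $done++;
    }
    return ("finished", $done);
}
```

The catch with polling is that a cancel only takes effect at the next check, so a step that is already running gets to finish first.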

I also cleaned up various bits of code, replacing direct calls to exec
with calls to the recently improved SUEXEC interface. This removes
some cruft from each script that calls an external script.

Cleaned up modifyexp.php3 quite a bit, reformatting and indenting.
Also fixed it to not run the parser directly! This was very wrong; it should
call nscheck instead. Changed to use the "nobody" group instead of group
flux (made the same change in nscheck).

There is a script in the sql directory called newstates.pl. It needs
to be run to initialize the batchstate slot of the experiments table
for all existing experiments.
parent ec1ec6cf
@@ -68,8 +68,9 @@ use Exporter;
BATCHSTATE_POSTED BATCHSTATE_RUNNING BATCHSTATE_TERMINATING
BATCHSTATE_ACTIVATING BATCHSTATE_PAUSED
BATCHSTATE_RUNNING_LOCKED BATCHSTATE_MODIFYING
BATCHMODE_CANCELTERM BATCHMODE_CANCELSWAP BATCHMODE_CANCELCLEAR
TBBatchState TBSetBatchState TBSetBatchCancelFlag
TBBatchState TBSetBatchState TBSetBatchCancelFlag TBGetBatchCancelFlag
TB_NODELOGTYPE_MISC TB_NODELOGTYPES TB_DEFAULT_NODELOGTYPE
@@ -285,6 +286,7 @@ sub NODEFAILMODE_FATAL() { "fatal"; }
sub NODEFAILMODE_NONFATAL() { "nonfatal"; }
sub NODEFAILMODE_IGNORE() { "ignore"; }
# These are really "sub" states.
sub EXPTSTATE_NEW() { "new"; }
sub EXPTSTATE_PRERUN() { "prerunning"; }
sub EXPTSTATE_SWAPPED() { "swapped"; }
@@ -294,11 +296,13 @@ sub EXPTSTATE_ACTIVE() { "active"; }
sub EXPTSTATE_TESTING() { "testing"; }
sub EXPTSTATE_TERMINATING() { "terminating"; }
sub EXPTSTATE_TERMINATED() { "ended"; }
sub EXPTSTATE_UPDATING() { "updating"; }
# These are really experiment states (both batch *and* plain).
sub BATCHSTATE_POSTED() { "posted"; }
sub BATCHSTATE_ACTIVATING() { "activating"; }
sub BATCHSTATE_RUNNING() { "active"; }
sub BATCHSTATE_RUNNING_LOCKED() { "active_locked"; }
sub BATCHSTATE_MODIFYING() { "modifying"; }
sub BATCHSTATE_PAUSED() { "paused"; }
sub BATCHSTATE_TERMINATING() { "terminating"; }
# Cancel flags
@@ -1276,18 +1280,19 @@ sub TBSetExpSwapTime($$)
}
#
# Lock Experiment.
# Lock Experiment. Can also provide an optional new state.
#
# usage: TBLockExp(char *pid, char *eid)
# usage: TBLockExp(char *pid, char *eid, char *newstate)
# returns 1 if okay.
# returns 0 if an invalid pid/eid or if an error.
#
sub TBLockExp($$)
sub TBLockExp($$;$)
{
my($pid, $eid) = @_;
my($pid, $eid, $newstate) = @_;
my $query_result =
DBQueryWarn("update experiments set expt_locked=now() ".
(defined($newstate) ? ",batchstate='$newstate' " : "") .
"where eid='$eid' and pid='$pid'");
if (! $query_result ||
@@ -1298,18 +1303,18 @@ sub TBLockExp($$)
}
#
# Test if Experiment is locked
# Test if Experiment is locked. Can provide optional pointer to return state.
#
# usage: TBExpLocked(char *pid, char *eid)
# usage: TBExpLocked(char *pid, char *eid, char **state)
# returns 1 if locked.
# returns 0 if an invalid pid/eid or if an error.
#
sub TBExpLocked($$)
sub TBExpLocked($$;$)
{
my($pid, $eid) = @_;
my($pid, $eid, $curstate) = @_;
my $query_result =
DBQueryWarn("select expt_locked from experiments ".
DBQueryWarn("select expt_locked,batchstate from experiments ".
"where eid='$eid' and pid='$pid'");
if (! $query_result ||
@@ -1317,25 +1322,28 @@ sub TBExpLocked($$)
return 0;
}
my @row = $query_result->fetchrow_array();
if (! defined($row[0])) {
return 0;
}
$$curstate = $row[1]
if (defined($curstate));
return 0
if (! defined($row[0]));
return 1;
}
#
# UnLock Experiment.
# UnLock Experiment. Can also provide an optional new state.
#
# usage: TBUnLockExp(char *pid, char *eid)
# usage: TBUnLockExp(char *pid, char *eid, char *newstate)
# returns 1 if okay.
# returns 0 if an invalid pid/eid or if an error.
#
sub TBUnLockExp($$)
sub TBUnLockExp($$;$)
{
my($pid, $eid) = @_;
my($pid, $eid, $newstate) = @_;
my $query_result =
DBQueryWarn("update experiments set expt_locked=NULL ".
(defined($newstate) ? ",batchstate='$newstate' " : "") .
"where eid='$eid' and pid='$pid'");
if (! $query_result ||
@@ -1413,6 +1421,29 @@ sub TBSetBatchCancelFlag($$$)
return 1;
}
#
# Get BatchMode cancel flag.
#
# usage: TBGetBatchCancelFlag(char *pid, char *eid, char **flag)
# returns 1 if okay.
# returns 0 if an invalid pid/eid or if an error.
#
sub TBGetBatchCancelFlag($$$)
{
my($pid, $eid, $flag) = @_;
my $query_result =
DBQueryWarn("select canceled from experiments ".
"where eid='$eid' and pid='$pid'");
if (! $query_result ||
$query_result->numrows == 0) {
return 0;
}
($$flag) = $query_result->fetchrow_array();
return 1;
}
#
# Return a list of all the nodes in an experiment.
#
......
@@ -422,6 +422,17 @@ if ($plabcount && (keys(%virt_nodes) == $plabcount)) {
$maxrun = 1;
}
while (1) {
my $canceled;
print "Sleeping for a bit ...\n";
sleep(60);
# Check cancel flag before continuing.
TBGetBatchCancelFlag($pid, $eid, \$canceled);
fatal(1, "*** $0:\n".
" Cancel flag set; aborting assign run!")
if ($canceled);
print "Assign Run $currentrun\n";
# Violation counts
@@ -492,6 +503,12 @@ while (1) {
fatal(65, "*** $0:\n".
" Could not open assign logfile!");
# Check cancel flag before continuing.
TBGetBatchCancelFlag($pid, $eid, \$canceled);
fatal(1, "*** $0:\n".
" Cancel flag set; aborting assign run!")
if ($canceled);
if ($assignexitcode == 0)
{
# read output
@@ -868,8 +885,8 @@ if ($needwanassign) {
# Recoverability ends.
# All fatal() calls from this point do not have the recoverable '64' bit set.
#
#
# VIRTNODES HACK: Local virtnodes have to be mapped now. This is a little
# hokey in that the virtnodes just need to be allocated from the pool that
# is on the real node. We know they are free, but we should go through
@@ -882,6 +899,12 @@ foreach my $pnode (keys(%virtnodes)) {
my @oplist = ();
my @ovlist = ();
# Check cancel flag before continuing.
TBGetBatchCancelFlag($pid, $eid, \$canceled);
fatal(1, "*** $0:\n".
" Cancel flag set; aborting assign run!")
if ($canceled);
#
# If updating, need to watch for nodes that are already reserved.
# We save that info in oplist/ovlist, and build a new vlist for
@@ -1007,6 +1030,12 @@ foreach my $pnode (keys(%virtnodes)) {
}
}
# Check cancel flag before continuing.
TBGetBatchCancelFlag($pid, $eid, \$canceled);
fatal(1, "*** $0:\n".
" Cancel flag set; aborting assign run!")
if ($canceled);
# Set port range (see below for how we deal with update).
TBExptSetPortRange();
......
@@ -119,7 +119,7 @@ if (! $debug) {
#
while (1) {
my($count, $i, $query_result, $pending_result, $running_result);
my(%row, %pending_row);
my(%pending_row);
my $retry_wait = TBGetSiteVar("batch/retry_wait");
# Do not allow zero!
@@ -159,8 +159,8 @@ while (1) {
" e2.batchmode=1 and e2.batchstate='$BSTATE_RUNNING' and ".
" e1.pid=e2.pid and e1.eid!=e2.eid ".
"WHERE e2.eid is null and ".
" e1.expt_head_uid!='stoller' and ".
" e1.batchmode=1 and e1.canceled=0 and ".
" e1.expt_locked is null and ".
" e1.batchstate='$BSTATE_POSTED' and ".
" (e1.attempts=0 or ".
" ((UNIX_TIMESTAMP() - ".
@@ -170,8 +170,7 @@ while (1) {
$running_result =
DBQuery("select * from experiments ".
"where batchmode=1 and batchstate='$BSTATE_RUNNING' ".
" and expt_head_uid!='stoller' ".
"ORDER BY expt_start");
"ORDER BY expt_start LIMIT 1");
if (!$pending_result || !$running_result) {
print "DB Error getting batch info. Waiting a bit ...\n";
@@ -207,7 +206,6 @@ while (1) {
goto pause;
}
}
DBQueryWarn("unlock tables");
#
# Okay, first we check the status of running batch mode experiments
@@ -220,23 +218,65 @@ while (1) {
# loop instead of in the child that started the experiment, its so that
# we fire up again and look for them in the event that paper goes down.
#
while (%row = $running_result->fetchhash()) {
my $canceled = $row{'canceled'};
if ($running_result->numrows) {
my %running_row = $running_result->fetchhash();
my $canceled = $running_row{'canceled'};
if ($canceled) {
# Local vars!
my $eid = $running_row{'eid'};
my $pid = $running_row{'pid'};
#
# Have to set the state to busy so that no one will be able
# to mess with the experiment while we deal with termination.
#
TBSetBatchState($pid, $eid, BATCHSTATE_RUNNING_LOCKED());
DBQueryWarn("unlock tables");
# Look at the cancel flag.
if ($canceled == BATCHMODE_CANCELTERM) {
dosomething("cancel", %row);
dosomething("cancel", %running_row);
}
elsif ($canceled == BATCHMODE_CANCELSWAP) {
dosomething("swap", %row);
dosomething("swap", %running_row);
}
else {
print "Improper cancel flag: $canceled\n";
}
next;
}
if (isexpdone(%row)) {
dosomething("swap", %row);
next;
else {
#
# Have to set the state to busy so that no one will be able
# to mess with the experiment while trying to determine if
# the batch is done.
#
TBSetBatchState($pid, $eid, BATCHSTATE_RUNNING_LOCKED());
DBQueryWarn("unlock tables");
if (isexpdone(%running_row)) {
#
# Terminate the experiment. Set the state appropriately
# so that swapexp will accept it. It is okay to do this
# with the table unlocked since no one is allowed to mess
# with a batch experiment in the RUNNING_LOCKED state.
#
dosomething("swap", %running_row);
}
else {
#
# Reset the state to RUNNING. It is okay to do this with
# the table unlocked since no one is allowed to mess
# with a batch experiment in the RUNNING_LOCKED state.
#
TBSetBatchState($pid, $eid, $BSTATE_RUNNING);
}
}
}
else {
# no one above unlocked the tables ...
DBQueryWarn("unlock tables");
}
#
# Finally start an actual experiment!
@@ -245,7 +285,7 @@ while (1) {
dosomething("start", %pending_row);
}
pause:
sleep(30);
sleep(15);
}
#
@@ -344,7 +384,7 @@ sub dosomething($$)
swapexp(%exphash);
}
elsif ($dowhat eq "cancel") {
cancelexp(1, %exphash);
cancelexp(%exphash);
}
exit(0);
}
@@ -378,14 +418,28 @@ sub startexp($)
"where eid='$eid' and pid='$pid'");
if ($query_result) {
@row = $query_result->fetchrow_array();
my ($canceled) = $query_result->fetchrow_array();
$exphash{'canceled'} = $canceled;
if ($row[0]) {
cancelexp($running);
# Yuck: This is strictly for the benefit of swapexp() below.
$exphash{'batchstate'} = BATCHSTATE_RUNNING
if ($running);
if ($canceled) {
# Look at the cancel flag.
if ($canceled == BATCHMODE_CANCELTERM) {
cancelexp(%exphash);
}
elsif ($canceled == BATCHMODE_CANCELSWAP) {
swapexp(%exphash);
}
else {
print "Improper cancel flag: $canceled\n";
}
#
# Never returns, but just to be safe ...
#
exit(0);
exit(-1);
}
}
@@ -406,15 +460,18 @@ sub startexp($)
"where eid='$eid' and pid='$pid'");
#
# The exit value is important. If its -1 or 1, thats bad. Kill the
# batch off. Anything else implies an assign violation that is
# (hopefully) transient. We leave it up the user to kill cancel the
# batch if it looks like its never going to work.
# The exit value is important. If it's -1 or 1, that's bad. Anything
# else implies an assign violation that is (hopefully) transient.
# We leave it up to the user to kill the batch if it looks like it's
# never going to work.
#
if ($exit_status == 1 || $exit_status == -1) {
TBSetBatchState($pid, $eid, $BSTATE_PAUSED);
TBUnLockExp($pid, $eid);
email_status("Experiment startup has failed with a fatal error!\n".
"Batch has been removed from the system.");
TBExptDestroy($pid, $eid);
"Batch has been dequeued so that you may check it.");
exit($exit_status);
}
@@ -440,8 +497,8 @@ sub startexp($)
# There is some state that needs to be reset so that another
# attempt can be made.
#
SetExpState($pid, $eid, EXPTSTATE_SWAPPED);
TBSetBatchState($pid, $eid, $BSTATE_POSTED);
TBUnLockExp($pid, $eid);
exit($exit_status);
}
@@ -450,6 +507,7 @@ sub startexp($)
# Well, it configured! Let's set its state to running.
#
TBSetBatchState($pid, $eid, $BSTATE_RUNNING);
TBUnLockExp($pid, $eid);
email_status("Batch Mode experiment $pid/$eid is now running!\n".
"Please consult the Web interface to see how it is doing.");
@@ -463,22 +521,26 @@ sub startexp($)
#
# A batch has completed. Swap it out.
#
sub swapexp($)
sub swapexp($;$)
{
my(%exphash) = @_;
my $canceled = $exphash{'canceled'};
my $running = ($exphash{'batchstate'} eq BATCHSTATE_RUNNING);
#
# Have to set the state to terminating or else swapexp will not accept it.
#
TBSetBatchState($pid, $eid, $BSTATE_TERMINATING);
system("$swapexp -b -s out $pid $eid");
if ($?) {
if ($running) {
#
# TB admin is going to have to clean up.
# Have to set the state to terminating so that swap/end exp
# will accept it.
#
fatal("Swapping out Batch Mode experiment $pid/$eid");
TBSetBatchState($pid, $eid, $BSTATE_TERMINATING);
system("$swapexp -b -s out $pid $eid");
if ($?) {
#
# TB admin is going to have to clean up.
#
fatal("Swapping out Batch Mode experiment $pid/$eid");
}
}
#
# Set the state to paused to ensure that it is not run again until
@@ -486,6 +548,7 @@ sub swapexp($)
#
TBSetBatchCancelFlag($pid, $eid, BATCHMODE_CANCELCLEAR);
TBSetBatchState($pid, $eid, $BSTATE_PAUSED);
TBUnLockExp($pid, $eid);
if ($canceled) {
email_status("Batch Mode experiment $pid/$eid has been stopped!");
@@ -503,13 +566,19 @@ sub swapexp($)
#
# Cancel an experiment. Never returns.
#
sub cancelexp($$)
sub cancelexp($)
{
my($running) = shift;
my(%exphash) = @_;
#
# Have to set the state to terminating so that swap/end exp will accept it.
#
TBSetBatchState($pid, $eid, $BSTATE_TERMINATING);
#
# It does not matter if the experiment is running; endexp does the
# right thing.
#
system("$endexp -b $pid $eid");
if ($?) {
#
......
@@ -79,7 +79,9 @@ my $idleswaptime = 60 * TBGetSiteVar("idle/threshold");
my $autoswap = 0;
my $autoswaptime = 10 * 60;
my $idleignore = 0;
my $priority = TB_EXPTPRIORITY_LOW;
my $priority = TB_EXPTPRIORITY_LOW;
my $exptstate = EXPTSTATE_NEW();
my $swapstate = BATCHSTATE_ACTIVATING();
#
# Verify user and get his DB uid.
@@ -170,17 +172,19 @@ if ($query_result->numrows) {
#
# Insert the record. This reserves the pid/eid for us. If its a batchmode
# experiment, we will update the record later so that the batch daemon
# will recognize it.
# will recognize it. We insert the record as locked and ACTIVATING so that
# no one can mess with the experiment until later.
#
if (! DBQueryWarn("INSERT INTO experiments ".
"(eid, pid, gid, expt_created, expt_expires, expt_name,".
" expt_head_uid,expt_swap_uid, state, priority, swappable,".
" idleswap, idleswap_timeout, autoswap, autoswap_timeout,".
" idle_ignore, keyhash) ".
" idle_ignore, keyhash, batchstate, expt_locked) ".
"VALUES ('$eid', '$pid', '$gid', now(), '$expires', ".
"$description,'$dbuid', '$dbuid', 'new', $priority, ".
"$description,'$dbuid', '$dbuid', '$exptstate', $priority, ".
"$swappable, $idleswap, '$swaptime', $autoswap, ".
"'$autoswaptime', $idleignore, '$secretkey')")) {
"'$autoswaptime', $idleignore, '$secretkey', ".
"'$swapstate', now())")) {
DBQueryWarn("unlock tables");
die("*** $0:\n".
" Database error inserting record for $pid/$eid!\n");
@@ -222,10 +226,14 @@ if (system("$mkexpdir $pid $gid $eid") != 0) {
}
#
# If no NS file, we are done.
# If no NS file, we are done. We must unlock it and reset its state
# appropriately. We leave the experiment in the "new" state so that
# the user is forced to do a modify first (to give it a topology).
#
exit(0)
if (!defined($tempnsfile));
if (!defined($tempnsfile)) {
TBUnLockExp($pid, $eid, BATCHSTATE_PAUSED);
exit(0);
}
#
# Grab the working directory path, and thats where we work.
@@ -277,6 +285,9 @@ if ($nsfile_string) {
# email later. If its a batch experiment, update the experiment record
# so that the batch daemon will see it and act.
#
# Note that we hand off to startexp with the experiment locked and ACTIVATING.
# This is "okay" since no one else calls startexp.
#
if ($immediate) {
my $optargs = "";
$optargs .= " -f"
......
@@ -12,19 +12,14 @@ use Getopt::Std;
#
# This gets invoked from the Web interface. Terminate an experiment.
# Most of the STDOUT prints are never seen since the web interface
# repeats only errors. My plan is make this script the front end to
# experiment termination and make tbend a backend program that no one
# uses.
# reports only errors, but this script is also intended to be run by the
# user someday. Perhaps.
#
# The -b (batch) argument is so that this script can be part of a batchmode
# that starts/ends experiments offline. In that case, we don't want to put
# it into the background and send email, but just want an exit status
# returned to the batch system.
#
# Note about exit value. -1 means error. 0 means backgrounded. 1 means
# termination happened immediately. The web page uses this to decide
# what kind of message to give the user.
#
sub usage()
{
print STDOUT "Usage: endexp [-b] <pid> <eid>\n";
@@ -32,6 +27,30 @@ sub usage()
}
my $optlist = "b";
#
# Exit codes are important; they tell the web page what has happened so
# it can say something useful to the user. Fatal errors are mostly done
# with die(), but expected errors use this routine. At some point we will
# use the DB to communicate the actual error.
#
# $status < 0 - Fatal error. Something went wrong we did not expect.
# $status = 0 - Termination is proceeding in the background. Notified later.
# $status > 0 - Expected error. User not allowed for some reason.
#
sub ExitWithStatus($$)
{
my ($status, $message) = @_;
if ($status < 0) {
die("*** $0:\n".
" $message\n");
}
else {
print STDERR "$message\n";
}
exit($status);
}
#
# Configure variables
#
@@ -129,6 +148,7 @@ if ($UID && !TBAdmin($UID) &&
" You do not have permission to end this experiment!\n");
}
#
# We have to protect against trying to end an experiment that is currently
# in the process of being terminated. We use a "wrapper" state (actually
@@ -151,15 +171,7 @@ my $estate = $hashrow{'state'};
my $expt_path = $hashrow{'path'};
my $isbatchexpt = $hashrow{'batchmode'};
my $ebatchstate = $hashrow{'batchstate'};
if (defined($hashrow{'expt_locked'})) {
$val = $hashrow{'expt_locked'};
die("*** $0:\n".
" It appears that $pid/$eid went into transition at $val.\n".
" You will be notified via email when the experiment is no\n".
" longer in transition.\n");
}
my $cancelflag = $hashrow{'canceled'};
#
# Batch experiments get a different protocol to avoid races with the
@@ -182,6 +194,9 @@ if ($isbatchexpt) {
if ($ebatchstate ne BATCHSTATE_TERMINATING);
}
else {
ExitWithStatus(1, "Batch experiment $pid/$eid is still canceling!")
if ($cancelflag);
#
# Set the canceled flag. This will prevent the batch_daemon
# from trying to run it (once the table is unlocked). It might
@@ -194,52 +209,54 @@ if ($isbatchexpt) {
# If the state is POSTED or PAUSED, we can do it right away.
# Otherwise, have to let the batch daemon deal with it.
#
if ($ebatchstate ne BATCHSTATE_POSTED &&
$ebatchstate ne BATCHSTATE_PAUSED) {
#
# Exit with non zero status so that caller knows (web
# server) that the batch experiment cannot be ended at
# this time.
#
print "Batch experiment $pid/$eid is currently running.\n".
"You will receive email notification when the experiment is\n".
"torn down and you can reuse the experiment name\n";
exit(0);
}
ExitWithStatus(0,
"Batch experiment $pid/$eid has been canceled.\n".
"You will receive email when the experiment is\n".
"torn down and you can reuse the experiment name.")
if ($ebatchstate ne BATCHSTATE_POSTED &&
$ebatchstate ne BATCHSTATE_PAUSED);
}
#
# Let termination proceed normally.
#
}
else {
#
# If the cancel flag is set, then user must wait for that to clear before
# we can do anything else.
#
ExitWithStatus(1,
"Experiment $pid/$eid has its cancel flag set!\n".
"You must wait for that to clear before you can terminate ".
"the experiment.\n")
if ($cancelflag);
#
# Okay, check state. We do not allow termination to start when the
# experiment is in transition. A future task would be to allow this,
# but for now the experiment must be in one of a few states to proceed
#
# Seems like too many states!
#
if ($estate eq EXPTSTATE_PRERUN ||
$estate eq EXPTSTATE_ACTIVATING ||
$estate eq EXPTSTATE_SWAPPING ||
$estate eq EXPTSTATE_TERMINATING) {
die("*** $0:\n".
" It appears that experiment $pid/$eid is in transition.\n".
" The user that created the experiment will be notified via\n".
" email when the experiment is no longer in transition.\n");
#
# Okay, check state. We do not allow termination to start when the
# experiment is in transition. A future task would be to allow this,
# but for now the experiment must be in one of a few states to proceed.
#
if ($ebatchstate ne BATCHSTATE_PAUSED() &&
$ebatchstate ne BATCHSTATE_RUNNING()) {
ExitWithStatus(1,
"Experiment $pid/$eid is currently in transition.\n".
"You must wait until it is no longer $ebatchstate!");
}
}
#
# Set the timestamp now, and unlock the experiments table.
# Lock the experiment and change state so no one can mess with it.
#
DBQueryFatal("UPDATE experiments SET expt_locked=now() ".
"WHERE eid='$eid' and pid='$pid'");
TBLockExp($pid, $eid, BATCHSTATE_TERMINATING());
DBQueryFatal("unlock tables");
#
# XXX - At this point a failure is going to leave things in an
# inconsistent state.
# inconsistent state. Be sure to call fatal() only since we are
# going into the background, and we have to send email since no
# one is going to see printed error messages (output goes into the
# log file, which will be sent along in the email).
#
#
@@ -251,8 +268,8 @@ my $expt_head_name;
my $expt_head_email;
if (! UserDBInfo($expt_head_login, \$expt_head_name, \$expt_head_email)) {
print STDERR "*** WARNING: ".
"Could not determine name/email for $expt_head_login.\n";
print "*** WARNING: ".
"Could not determine name/email for $expt_head_login.\n";
$expt_head_name = "TBOPS";
$expt_head_email = $TBOPS;
}
@@ -269,11 +286,10 @@ if (! $batch) {
#
# Parent exits normally
#
print STDOUT
"Experiment $pid/$eid is now terminating\n".
"You will be notified via email when the experiment has been\n".
"torn down, and you can reuse the experiment name.\n";
exit(0);
ExitWithStatus(0,
"Experiment $pid/$eid is now terminating.\n".
"You will be notified via email when the experiment has been\n".
"torn down, and you can reuse the experiment name.\n");
}
}
@@ -337,7 +353,7 @@ TBExptClearLogFile($pid, $eid);
TBExptDestroy($pid, $eid);
#
# In batch mode, exit now.
# In batch mode, exit now.
#
if ($batch) {
exit(0);
@@ -369,9 +385,8 @@ sub fatal($)
#
# Kill this for convenience later.
#
DBQueryWarn("update experiments set expt_locked=NULL ".
"WHERE eid='$eid' and pid='$pid'");
#
TBUnLockExp($pid, $eid);
# Copy over the log files so the user can see them.
system("/bin/cp -Rfp $workdir/ $userdir/tbdata");
......
@@ -14,6 +14,11 @@ use Getopt::Std;
#