Commit 0318cc22 authored by Leigh B. Stoller

A wide-ranging set of event system changes:

assign_wrapper.in: Hack in a change that ensures a delay node is
created for any link on which an event is posted (up, down, modify),
no matter what its initial parameters are. That is, if a link is
created with no delay, but there is an event that adds a delay later,
then we must drop in a delay node. The same goes for up/down on a
link, since those are carried out in the delay node. I am reasonably
confident that this change is fine for duplex links, but I am less
sure of the effect on LANs!
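
The delay node rule above can be illustrated with a minimal Perl sketch.
It is not the code from assign_wrapper.in: the event rows, the value of
$delaythresh, and the needs_delay_node() helper are assumptions made for
illustration, and the real test also considers bandwidth, switch-based
delays and shark nodes.

```perl
use strict;
use warnings;

# Hypothetical event rows: [vname, object type]. The real script reads
# these from the eventlist table for one experiment.
my @events = (
    ["link0", "LINK"],
    ["cbr0",  "TRAFGEN"],
);

# Record every link that is the target of at least one event.
my %mustdelay;
foreach my $ev (@events) {
    my ($vname, $objtype) = @$ev;
    $mustdelay{$vname} = 1 if $objtype eq "LINK";
}

# Assumed threshold, for illustration only.
my $delaythresh = 2;

# Simplified decision: shape the link if its initial parameters call for
# it, or if any event will need a delay node to act on later.
sub needs_delay_node {
    my ($name, $delay, $loss) = @_;
    return 1 if $delay >= $delaythresh || $loss != 0;
    return 1 if defined($mustdelay{$name});
    return 0;
}

foreach my $name ("link0", "link1") {
    printf "%s: %s\n", $name,
        needs_delay_node($name, 0, 0) ? "delay node" : "no delay node";
}
```

Here link0 starts with no delay and no loss but still gets a delay node
because an event refers to it, while link1 does not.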

eventsys_control.in: Checkpoint latest changes. Add "replay" option,
which right now just stops and starts the event scheduler so that it
reloads the entire event list. Add check for an existing experiment,
and that the experiment is either active or swapping (do not want to
start a scheduler for a swapped out experiment!). Add check to see if
there are any events, and skip startup if there are no events in the
DB. Lastly, get very serious about preventing more than one scheduler
from being started, either by accident or intentionally. My protocol
is to lock the table, grab the pid and set it to -pid, and test the
pid for a positive value. If it is positive, send the scheduler a
kill(TERM) so that it can clean up, clear the pid to zero in the DB,
and exit. This approach ensures that we do not try to send a kill to
a pid that is no longer active or owned by the user (this last part
is not really necessary because of how pids are reused, but it was
easy to add so why not).
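
The pid negation protocol can be sketched without a database as follows.
This is an illustration under stated assumptions, not the code in
eventsys_control.in: the %event_sched_pid hash stands in for the
experiments table, the table lock is only marked by comments, and
kill('TERM', ...) is replaced by a caller-supplied code reference so the
sketch runs without a real scheduler process. The sketch also negates
only a positive pid, which is a choice made here.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for experiments.event_sched_pid.
my %event_sched_pid = ("testbed/demo" => 4242);

# Claim the scheduler slot: return the recorded pid and store its
# negation, so a concurrent caller sees a non-positive value.
sub claim_scheduler {
    my ($key) = @_;
    # (the table would be locked for writing here)
    my $procid = $event_sched_pid{$key} // 0;
    $event_sched_pid{$key} = -$procid if $procid > 0;
    # (the table would be unlocked here)
    return $procid;
}

# Stop the scheduler. $send_term stands in for kill('TERM', $procid).
sub stop_scheduler {
    my ($key, $send_term) = @_;
    my $procid = claim_scheduler($key);
    return "not running" if $procid <= 0;
    if (!$send_term->($procid)) {
        # Signal not delivered: restore the pid so it is not lost.
        $event_sched_pid{$key} = $procid;
        return "kill failed";
    }
    return "signalled";
}

# Pretend scheduler: on TERM it cleans up and clears its pid to zero.
my $fake_kill = sub {
    my ($procid) = @_;
    $event_sched_pid{"testbed/demo"} = 0;
    return 1;
};

print stop_scheduler("testbed/demo", $fake_kill), "\n";
print stop_scheduler("testbed/demo", $fake_kill), "\n";
```

The first call finds a positive pid and signals it; the second finds
zero and reports that nothing is running.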

exports_setup.in: Trivial change to make it easier to turn this on
temporarily in devel trees.
named_setup.in: Ditto.

node_reboot.in: Add call to TBdbfork() in child because of apparent DB
connection problems across forks. In the child, set the eventstatus
for the node to REBOOT if successful (note: this event status stuff is
temporary, will be recast in next set of revisions).

GNUmakefile:  Add new controlling program, eventsys_control.
power.in:     Ditto previous comment about REBOOT.
os_setup.in:  Non-event-system cleanups.
tbend.in:     Add DB cleanup of the new virt_trafgens and eventlist tables.
tbprerun.in:  Ditto.
tbreport.in:  Print out the event list in a pretty print format.
tbswapin.in:  Add call to start the event system. Also a bug fix; move
              the named script up above the os_setup so that the named
              tables have been updated by the time the first node
              reboots. I noticed that nodes were failing on gethostbyname().
tbswapout.in: Add call to stop the event system.
parent d1d3a034
......@@ -12,7 +12,7 @@ SUBDIRS = checkpass ns2ir
BIN_STUFF = power snmpit tbend tbswapin tbswapout tbprerun tbreport \
os_load startexp endexp batchexp swapexp \
node_reboot nscheck node_update savelogs node_control
node_reboot nscheck node_update savelogs node_control \
# Stuff that mere users get on plastic.
USERBINS = os_load node_reboot nscheck node_update savelogs node_control
......@@ -20,7 +20,7 @@ USERBINS = os_load node_reboot nscheck node_update savelogs node_control
SBIN_STUFF = resetvlans console_setup.proxy sched_reload named_setup \
batch_daemon exports_setup reload_daemon sched_reserve \
console_reset db2ns bwconfig frisbeelauncher \
rmgroup mkgroup mkacct setgroups mkproj
rmgroup mkgroup mkacct setgroups mkproj eventsys_control
LIBEXEC_STUFF = rmproj rmacct-ctrl \
os_setup mkexpdir console_setup webnscheck webreport \
......@@ -94,6 +94,8 @@ post-install:
chmod u+s $(INSTALL_BINDIR)/node_reboot
chown root $(INSTALL_BINDIR)/node_update
chmod u+s $(INSTALL_BINDIR)/node_update
chown root $(INSTALL_SBINDIR)/eventsys_control
chmod u+s $(INSTALL_SBINDIR)/eventsys_control
#
# Control node installation (okay, plastic)
......
......@@ -53,6 +53,7 @@ $ENV{'PATH'} = "/usr/bin:$TBROOT/libexec:$TBROOT/sbin:$TBROOT/bin";
use lib '@prefix@/lib';
use libdb;
use libtestbed;
require exitonwarn;
#
......@@ -235,6 +236,21 @@ while (($vname,$member,$delay,$bandwidth,$lossrate,
}
$result->finish;
#
# Check event list. Anytime we find an event to control a link, we need
# to drop a delay node in. start/stop especially, since thats the easiest
# way to do that, even if the link has no other traffic shaping in it.
#
printdb "Checking events for LINK commands.\n";
$result =
DBQueryFatal("select distinct vname from eventlist as ex ".
"left join event_eventtypes as et on ex.eventtype=et.idx ".
"left join event_objecttypes as ot on ex.objecttype=ot.idx ".
"where ot.type='LINK' and ex.pid='$pid' and ex.eid='$eid'");
while (($vname) = $result->fetchrow_array) {
$mustdelay{$vname} = 1;
}
# Shark hack
foreach $lan (keys(%lans)) {
$realmembers = [];
......@@ -330,10 +346,11 @@ foreach $lan (keys(%lans)) {
$bandwidth = &getbandwidth(&min($bw0,$rbw1));
$rbandwidth = &getbandwidth(&min($rbw0,$bw1));
if ((($delay >= $delaythresh) ||
(($bw != $S100Kbs) && ($bw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
(($bw != $S100Kbs) && (($sharks == 0) || ($nonsharks > 1)))) ||
($loss != 0)) ||
(($bw != $S100Kbs) && ($bw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
(($bw != $S100Kbs) && (($sharks == 0) || ($nonsharks > 1)))) ||
($loss != 0)) ||
(defined($mustdelay{$lan})) ||
(($rdelay >= $delaythresh) ||
(($rbw != $S100Kbs) && ($rbw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
......@@ -374,11 +391,12 @@ foreach $lan (keys(%lans)) {
# XXX The expression below should be modified for better bandwidth support.
# Probably needs to happen post assign somehow.
if ((($delay >= $delaythresh) ||
(($bw != $S100Kbs) && ($bw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
(($bw != $S100Kbs) && (($sharks == 0) ||
($nonsharks > 1)))) ||
($loss != 0)) ||
(($bw != $S100Kbs) && ($bw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
(($bw != $S100Kbs) && (($sharks == 0) ||
($nonsharks > 1)))) ||
($loss != 0)) ||
(defined($mustdelay{$lan})) ||
(($rdelay >= $delaythresh) ||
(($rbw != $S100Kbs) && ($rbw != $S10Kbs)) ||
(($delaywithswitch == 0) &&
......@@ -930,6 +948,24 @@ foreach $vnodeport (keys(%portbw)) {
}
}
#
# Post pass the event list. At present, all LINK operations apply to
# the delay node that is in the middle of it. Rewrite the vnode in
# the event list.
#
$eventlist_result =
DBQueryFatal("select ex.idx,ex.vname,r.vname ".
" from eventlist as ex ".
"left join delays as d on ex.vname=d.vname ".
"left join reserved as r on r.node_id=d.node_id ".
"left join event_objecttypes as ob on ob.idx=ex.objecttype ".
"where ob.type='LINK' and ex.pid='$pid' and ex.eid='$eid'");
while (my ($idx,$vname,$vnode) = $eventlist_result->fetchrow_array) {
DBQueryFatal("update eventlist set vnode='$vnode' ".
"where idx=$idx and pid='$pid' and eid='$eid'");
}
######################################################################
# Subroutines
######################################################################
......
......@@ -12,7 +12,7 @@ use POSIX ":sys_wait_h";
#
sub usage()
{
print STDOUT "Usage: eventsys_control <start|stop> <pid> <eid>\n";
print STDOUT "Usage: eventsys_control <start|stop|replay> <pid> <eid>\n";
exit(-1);
}
my $optlist = "d";
......@@ -50,6 +50,7 @@ delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};
my $evsched = "$TB/sbin/event-sched";
my $debug = 0;
my $expstate;
#
# Parse command arguments. Once we return from getopts, all that should be
......@@ -73,6 +74,7 @@ my $eid = $ARGV[2];
# Untaint args.
#
if ($action ne "start" &&
$action ne "replay" &&
$action ne "stop") {
usage();
}
......@@ -89,6 +91,23 @@ else {
die("Bad data in eid: $eid.");
}
if (! ($expstate = ExpState($pid, $eid))) {
die("*** $0:\n".
" No such experiment $pid/$eid!\n");
}
#
# Do not allow an event system to be controlled if the experiment is not
# active (or swapping). We will eventually give the user the ability
# to control the event system directly.
#
if ($expstate ne EXPTSTATE_ACTIVE &&
$expstate ne EXPTSTATE_ACTIVATING &&
$expstate ne EXPTSTATE_SWAPPING) {
die("*** $0:\n".
" Experiment $pid/$eid must active (or swapping)!\n");
}
#
# Check permission. Only people with permission to destroy the experiment
# can do this.
......@@ -96,34 +115,56 @@ else {
if ($UID &&
! TBExptAccessCheck($UID, $pid, $eid, TB_EXPT_DESTROY)) {
die("*** $0:\n".
" You do not have permission to control the event system!");
" You do not have permission to control the event system!\n");
}
#
# If stopping, find the pid from the DB and send it a kill.
#
if ($action eq "stop") {
if ($action eq "stop" ||
$action eq "replay") {
#
# Simple protocol to prevent concurrent manipulation; If there is a pid
# set it to -pid to prevent another start/stop operation. The event
# scheduler itself will catch the signal and clear the pid to 0 when it
# exits, thus releasing the scheduler.
#
DBQueryFatal("lock tables experiments write");
$query_result =
DBQueryFatal("select event_sched_pid from experiments ".
"where pid='$pid' and eid='$eid'");
DBQueryFatal("update experiments set event_sched_pid=0 ".
"where pid='$pid' and eid='$eid'");
DBQueryFatal("update experiments set ".
"event_sched_pid=-event_sched_pid ".
"where pid='$pid' and eid='$eid'");
DBQueryWarn("unlock tables");
my @row = $query_result->fetchrow_array();
my $procid = $row[0];
if ($procid &&
if ($procid > 0 &&
! kill('TERM', $procid)) {
DBQueryFatal("update experiments set ".
"event_sched_pid=-event_sched_pid ".
"where pid='$pid' and eid='$eid'");
SENDMAIL($TBOPS,
"Failed to stop event system for $pid/$eid",
"Could not kill(TERM) process $procid: $? $!");
die("*** $0:\n".
"Failed to stop event system for $pid/$eid!\n");
}
exit(0);
if ($action eq "stop") {
if ($procid <= 0) {
print "There is no event scheduler running for $pid/$eid!\n";
}
exit(0);
}
# replay continues below
}
#
......@@ -133,15 +174,42 @@ if ($action eq "stop") {
$EUID=$UID;
#
# start the event scheduler, redirecting output to the experiment
# directory.
# Check for a running scheduler, This is a loose check since its the users
# responsibility to make sure that they don't try and do two things
# at the same time. All that *really* matters is that we do not start two
# at a time, and that we make sure we kill off existing ones (making sure
# the pid is not cleared from the DB unless it really does die). This is
# handled by the stop code above. For start, just check for zero pid.
# The scheduler itself will lock the table to prevent concurrent startup.
#
$query_result =
DBQueryFatal("select event_sched_pid,path from experiments ".
"where pid='$pid' and eid='$eid'");
my @row = $query_result->fetchrow_array();
my $procid = $row[0];
my $path = $row[1];
if ($procid != 0) {
die("*** $0:\n".
"There is already an event scheduler running for $pid/$eid!\n");
}
#
# For now, lets not start an event system if there are no events.
#
$query_result =
DBQueryFatal("select path from experiments ".
$query_result =
DBQueryFatal("select distinct pid,eid from eventlist ".
"where pid='$pid' and eid='$eid'");
my @row = $query_result->fetchrow_array();
my $path = $row[0];
if (! $query_result->numrows) {
print "*** There are no events for $pid/$eid. Not starting a scheduler.\n";
exit(0);
}
#
# start the event scheduler, redirecting output to the experiment
# directory.
#
if (my $childpid = TBBackGround("$path/logs/event-sched.log")) {
#
# Delay just in case. The event scheduler needs to be turned into
......@@ -151,7 +219,7 @@ if (my $childpid = TBBackGround("$path/logs/event-sched.log")) {
sleep(1);
my $foo = waitpid($childpid, &WNOHANG);
if ($foo) {
print STDERR "Failed to start event system $foo $?!\n";
print STDERR "Failed to start event system for $pid/$eid: $foo $?!\n";
SENDMAIL($TBOPS,
"Event System Failure: $pid/$eid!\n",
"Failed to start event system for $pid/$eid",
......
......@@ -39,7 +39,7 @@ if ($EUID != 0) {
" Must be root! Maybe its a development version?\n");
}
# XXX Hacky!
if ($TB ne "/usr/testbed") {
if (1 && $TB ne "/usr/testbed") {
print STDERR "*** $0:\n".
" Wrong version. Maybe its a development version?\n";
#
......
......@@ -31,7 +31,7 @@ if ($EUID != 0) {
" Must be root! Maybe its a development version?\n");
}
# XXX Hacky!
if ($TB ne "/usr/testbed") {
if (1 && $TB ne "/usr/testbed") {
die("*** $0:\n".
" Wrong version. Maybe its a development version?\n");
}
......
......@@ -257,6 +257,7 @@ sub RebootNode {
if ($mypid) {
return $mypid;
}
TBdbfork();
#
# See if the machine is pingable. If its not pingable, then we just
......@@ -338,6 +339,7 @@ sub RebootNode {
# punch the power button.
#
if (WaitTillDead($pc) == 0) {
TBSetNodeEventState($pc, TBDB_EVENTTYPE_REBOOT);
exit(0);
}
......
......@@ -481,18 +481,19 @@ sub WaitTillAlive {
# other problems. Any non-zero return indicates "not pingable" to us.
#
if (! $status) {
print STDERR "$pc alive and well\n" if $dbg;
print "$pc is alive and well\n" if $dbg;
return 0;
}
$waittime = time - $waitstart;
if ($waittime > $maxwait) {
print "$pc appears dead; its been ",
(int ($waittime / 60))," minutes since rebooted.\n";
$minutes = int($waittime / 60);
print "*** $pc appears dead - its been $minutes minute(s).\n";
return 1;
}
if (int($waittime / 60) > $minutes) {
$minutes = int($waittime / 60);
print "Still waiting for $pc - its been $minutes minute(s)\n";
print "Still waiting for $pc - its been $minutes minute(s).\n";
}
}
}
......
......@@ -202,6 +202,7 @@ foreach my $power_id (keys %outlets) {
if (!$errors) {
foreach my $node (@nodes) {
print "$node now ",($op eq "cycle" ? "rebooting" : $op),"\n";
TBSetNodeEventState($node, TBDB_EVENTTYPE_REBOOT);
}
} else {
$exitval += $errors;
......
......@@ -81,6 +81,10 @@ DBQueryWarn("DELETE from virt_nodes where pid='$pid' and eid='$eid'") or
$errors++;
DBQueryWarn("DELETE from virt_lans where pid='$pid' and eid='$eid'") or
$errors++;
DBQueryWarn("DELETE from virt_trafgens where pid='$pid' and eid='$eid'") or
$errors++;
DBQueryWarn("DELETE from eventlist where pid='$pid' and eid='$eid'") or
$errors++;
if ($errors == 0) {
SetExpState($pid, $eid, EXPTSTATE_TERMINATED) or
......
......@@ -63,8 +63,10 @@ if (! SetExpState($pid, $eid, EXPTSTATE_PRERUN)) {
#
sub cleanup {
print STDERR "Cleaning up after errors.\n";
DBQueryWarn("DELETE from virt_nodes where pid='$pid' and eid='$eid'");
DBQueryWarn("DELETE from virt_lans where pid='$pid' and eid='$eid'");
DBQueryWarn("DELETE from virt_nodes where pid='$pid' and eid='$eid'");
DBQueryWarn("DELETE from virt_lans where pid='$pid' and eid='$eid'");
DBQueryWarn("DELETE from virt_trafgens where pid='$pid' and eid='$eid'");
DBQueryWarn("DELETE from eventlist where pid='$pid' and eid='$eid'");
SetExpState($pid, $eid, EXPTSTATE_NEW);
}
......
......@@ -213,5 +213,41 @@ if (($state eq EXPTSTATE_ACTIVE) || ($state eq EXPTSTATE_TESTING)) {
}
}
#
# Print time sorted avent list.
#
$result =
DBQueryFatal("select time,vnode,vname,ot.type,et.type,arguments ".
" from eventlist as ex ".
"left join event_eventtypes as et on ex.eventtype=et.idx ".
"left join event_objecttypes as ot on ex.objecttype=ot.idx ".
"order by time");
if ($result->numrows) {
print "\n";
print "Event List:\n";
printf "%-12s %-12s %-10s %-10s %-10s %s\n",
"Time", "Node", "Agent", "Type", "Event", "Arguments";
print "------------ ------------ ---------- ---------- ---------- ".
"------------ \n";
while (($time,$vnode,$vname,$obj,$type,$args) = $result->fetchrow_array()){
printf("%-12s %-12s %-10s %-10s %-10s ",
$time, $vnode, $vname, $obj, $type);
my @arglist = split(" ", $args);
my $arg = shift(@arglist);
if (defined($arg)) {
printf("$arg");
}
printf("\n");
foreach my $arg (@arglist) {
printf("%-58s %s\n", "", $arg);
}
}
}
$result->finish();
0;
......@@ -48,6 +48,11 @@ my $state;
sub cleanup {
print STDERR "Cleaning up after errors.\n";
print "Stopping event system\n";
if (system("eventsys_control stop $pid $eid")) {
print STDERR "*** Failed to stop the event system.\n";
}
if ($cleanvlans) {
print STDERR "Removing VLANs.\n";
if (system("snmpit -r $pid $eid")) {
......@@ -129,6 +134,14 @@ if (system("exports_setup")) {
exit(1);
}
print "Setting up named maps.\n";
if (system("named_setup")) {
print STDERR "*** WARNING: Failed to add node names to named map.\n";
#
# This is a non-fatal error.
#
}
print "Resetting OS and rebooting.\n";
if (system("os_setup $pid $eid")) {
print STDERR "*** Failed to reset OS and reboot nodes.\n";
......@@ -136,12 +149,11 @@ if (system("os_setup $pid $eid")) {
exit(1);
}
print "Setting up named maps.\n";
if (system("named_setup")) {
print STDERR "*** WARNING: Failed to add node names to named map.\n";
#
# This is a non-fatal error.
#
print "Starting the event system.\n";
if (system("eventsys_control start $pid $eid")) {
print STDERR "*** Failed to start the event system.\n";
cleanup;
exit(1);
}
print "Setting up email lists.\n";
......
......@@ -89,6 +89,12 @@ if (! SetExpState($pid, $eid, EXPTSTATE_SWAPPING)) {
}
if (! $TESTMODE) {
print "Stopping the event system.\n";
if (system("eventsys_control stop $pid $eid")) {
print STDERR "*** Failed to stop the event system.\n";
$errors = 1;
}
print "Clearing VLANs.\n";
if (system("snmpit -r $pid $eid")) {
print STDERR "*** Failed to reset VLANs.\n";
......