Commit 9a23fe83 authored by David Anderson

This commit has various changes to Linktest to make it more reliable.

1. The Linktest daemon, linktest.c, now listens for a KILL event. If received,
   the daemon will kill the linktest.pl child process and all of its subchildren.
2. The daemon also installs a SIGCHLD handler for the linktest.pl child, and
   will kill linktest.pl's remaining children if linktest.pl dies
   unexpectedly.
3. linktest.c now holds a lock while a run is executing, and ignores requests
   to start Linktest until that run finishes.
4. The controller script run_linktest.pl now includes the following new options:
   -t   allows the user to specify a timeout in seconds for Linktest.
   -v   prints progress messages from the Linktest run as it proceeds (verbose mode).
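
To illustrate items 1 and 2, here is a minimal sketch of starting a child in its own process group and later killing the whole group. It is not the code from linktest.c: the helper names spawn_in_own_group and kill_group are hypothetical, and only standard POSIX calls (fork, setpgid, execv, kill) are assumed.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from linktest.c): fork a child, move it into
 * a new process group whose id equals its pid, and exec argv[0] there.
 * Returns the child's pid (which is also its pgid), or -1 if fork fails.
 */
static pid_t
spawn_in_own_group(char *const argv[])
{
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become leader of a new group, then exec. */
        setpgid(0, 0);
        execv(argv[0], argv);
        _exit(127);     /* exec failed; never return into the caller */
    }

    /*
     * Parent: set the group as well, so it exists before any kill()
     * is issued. This can fail harmlessly if the child already exec'd.
     */
    (void) setpgid(pid, pid);
    return pid;
}

/*
 * Hypothetical helper: SIGKILL every process in group pgid.
 * A negative pid argument to kill() addresses a process group.
 * ESRCH (group already gone) is treated as success.
 */
static int
kill_group(pid_t pgid)
{
    if (kill(-pgid, SIGKILL) < 0 && errno != ESRCH)
        return -1;
    return 0;
}
```

Because the child leads its own group, the group kill reaches the script and anything it forked without touching the daemon. The caller would still reap the child with waitpid() afterwards.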

Major remaining items are:
1. Avoid NFS mount hups
2. More testing, especially on vnodes
parent c2fe8581
> Linktest needs its own tb_compat.tcl because the linktest version of
> tb_compat.tcl overrides function definitions in order to parse out the
> topology. Maybe a workaround would be to have tb_compat.tcl only include
> definitions it actually overrides, so that the NS script first includes
> the default, up-to-date script, then the linktest version second. I think
> I could do that by preprocessing the NS script to add a directive to
> include the linktest definitions file.
Or perhaps include nstb_compat.tcl since that has stubs of the functions
that I think you want to be stubs.
> OK. I'll need to set up linktest to calculate its runtime, based on the
> topology and the options selected, then use that (plus some buffer) as the
> timeout. Sound reasonable?
Yep. Exit with non-zero if timed out.
===========================================================
> Can you please put your doc into proper html so it can be displayed, since
> I want to link to that from the beginexp page. You can take a look at
> www/tutorial/tutorial.html to see what proper format is.
>
> Right now the output goes to stdout? I think an option (-o filename) to put
> the output into the exp log directory (/proj/pid/eid/exp/logs) is needed
> since when it is done in the swapin path, we want to save that output off
> someplace.
>
> If the "levels" are really ordered, then maybe a "-l level" option to the
> run_linktest.pl.in script?
===========================================================
Aside: This is the wrong way to do this:
#define LINKTEST_SCRIPT "@CLIENT_BINDIR@" "/linktest.pl"
Pass this in from the makefile instead. So, in your makefile do:
-DCLIENT_BINDIR='"$(CLIENT_BINDIR)"'
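
On the C side this could look roughly like the sketch below. The fallback value is a placeholder for illustration only (the real path comes from the makefile), and linktest_script_path is a hypothetical accessor, not a function in linktest.c.

```c
#include <assert.h>

/*
 * CLIENT_BINDIR is expected from the compiler command line as a quoted
 * string. The fallback below is a made-up placeholder so the sketch
 * compiles on its own.
 */
#ifndef CLIENT_BINDIR
#define CLIENT_BINDIR "/tmp/example-bindir"
#endif

/* Adjacent string literals are concatenated at compile time. */
#define LINKTEST_SCRIPT CLIENT_BINDIR "/linktest.pl"

/* Hypothetical accessor returning the full script path. */
static const char *
linktest_script_path(void)
{
    /* Sanity check: the configured directory should be absolute. */
    assert(LINKTEST_SCRIPT[0] == '/');
    return LINKTEST_SCRIPT;
}
```

The single quotes in the makefile flag keep the inner double quotes intact through the shell, so the preprocessor sees a string literal rather than a bare token.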
===========================================================
> {3} nodeg$ gmake
> mkdir fbsd
> mkdir: fbsd: File exists
> gmake: [linktest] Error 1 (ignored)
Please make these errors silent. In fact, the proper approach is to not
have subdirs here, but to have an obj-linux tree. See my testbed directory
on ops. However, just getting rid of the error statements above is okay for
now.
> gcc -static -L../lib -L../../lib/libtb linktest.o ../../lib/libtb/log.o
> ../../lib/libtb/tbdefs.o -levent -lcrypto -lssl `elvin-config --libs
> vin4c` -o fbsd/linktest
> gcc -static -L../lib -L../../lib/libtb ltevent.o ../../lib/libtb/log.o
> ../../lib/libtb/tbdefs.o -levent -lcrypto -lssl `elvin-config --libs
> vin4c` -o fbsd/ltevent
Please fix up the dependencies in the makefile so that these are not
relinked unless something actually changed.
===========================================================
> So, the front end if I understand correctly would have a drop-down for
> Linktest validation. Contents of which would be:
> 0 - Do not run Linktest
> 1 - Connectivity and Latency
> 2 - Static Routing
> 3 - Loss
> 4 - Bandwidth
Sure, sounds good. They should of course correspond to section headers in
the document (which I link to from the menu).
===========================================================
Add copyright headers to all files.
Update linktest.c program style to standard emulab style
tmcd.c is an example
Update linktest.pl program style to standard emulab style
run_linktest.pl is an example
# TODO to protect the script from NFS mount hups,
# always copy the file to /tmp before attempting to read it.
# repeatedly call cp and check the exit code.
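
One possible shape for this TODO, sketched in C rather than in the Perl script it would eventually live in. Note the swap: instead of calling cp and checking its exit code, this sketch copies the file in-process and retries on failure. The names copy_file and copy_with_retry are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Copy src to dst with stdio. Returns 0 on success, -1 on any error. */
static int
copy_file(const char *src, const char *dst)
{
    char buf[4096];
    size_t n;
    FILE *in, *out;
    int ok = 1;

    in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Try the copy up to `attempts` times, sleeping delay_secs between
 * failed tries. Returns 0 once a copy succeeds, -1 if all fail.
 */
static int
copy_with_retry(const char *src, const char *dst, int attempts,
    unsigned int delay_secs)
{
    int i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < attempts; i++) {
        if (copy_file(src, dst) == 0)
            return 0;
        if (i + 1 < attempts)
            sleep(delay_secs);
    }
    return -1;
}
```

The script would then read only the local copy under /tmp, so a stalled NFS mount shows up as a failed copy that can be retried rather than as a hung read.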
####
Note, ignoring project groups. See the line...
$gid = $proj_id;
......@@ -21,13 +21,22 @@
#include "event.h"
#include "linktest.h"
static int debug;
static void callback(event_handle_t handle,
event_notification_t notification, void *data);
#define TRUE 1
#define FALSE 0
static void
start_linktest(char *args, int);
static int debug;
static volatile int locked;
static pid_t linktest_pid;
static char *pideid;
static event_handle_t handle;
static void callback(event_handle_t handle,
event_notification_t notification, void *data);
static void exec_linktest(char *args, int);
static void sigchld_handler(int sig);
static void send_group_kill();
static void send_kill_event();
void
usage(char *progname)
{
......@@ -40,17 +49,17 @@ usage(char *progname)
int
main(int argc, char **argv) {
event_handle_t handle;
address_tuple_t tuple;
char *server = NULL;
char *port = NULL;
char *keyfile = NULL;
char *pideid = NULL;
char *pidfile = NULL;
char *logfile = NULL;
char *progname;
char c;
char buf[BUFSIZ];
pideid = NULL;
progname = argv[0];
......@@ -128,7 +137,7 @@ main(int argc, char **argv) {
*/
handle = event_register_withkeyfile(server, 0, keyfile);
if (handle == NULL) {
fatal("could not register with event system");
fatal("could not register with event system");
}
/*
......@@ -162,6 +171,14 @@ main(int argc, char **argv) {
}
}
/*
* Initialize variables used to control child execution
*/
locked = FALSE;
if(signal(SIGCHLD,sigchld_handler) == SIG_ERR) {
fatal("could not install child handler");
}
/*
* Begin the event loop, waiting to receive event notifications:
*/
......@@ -204,41 +221,217 @@ callback(event_handle_t handle, event_notification_t notification, void *data)
event_notification_get_arguments(handle,
notification, args, sizeof(args));
/* info("Event: %lu:%d %s %s %s\n", now.tv_sec, now.tv_usec,
objname, event, args);*/
/*
* Dispatch the event.
*/
if (strcmp(event, TBDB_EVENTTYPE_START) == 0) {
start_linktest(args, sizeof(args));
}
if(!strcmp(event, TBDB_EVENTTYPE_START)) {
if(!locked) {
/*
* Set locked bit. The bit is not set to false
* until a SIGCHLD signal is received
*/
locked = TRUE;
/*
* The Linktest script is not running, so
* fork a process and execute it.
*/
linktest_pid = fork();
if(linktest_pid < 0) {
error("Could not fork a process to run linktest script!\n");
return;
}
/*
* Changes the process group of the child to itself so
* a sigkill to the child process group will not kill
* the Linktest daemon.
*/
if(!linktest_pid) {
pid_t mypid = getpid();
setpgid(mypid,mypid);
/* Finally, execute the linktest script. */
exec_linktest(args, sizeof(args));
}
}
} else if(!strcmp(event, TBDB_EVENTTYPE_STOP)) {
/*
* STOP is informational and may be ignored.
*/
} else if (!strcmp(event, TBDB_EVENTTYPE_KILL)) {
/*
* Ignore unless we are running.
*/
if(locked) {
/*
* If KILL is received, there is a problem on this
* or some other node. So, kill off linktest
* and its children.
*/
send_group_kill();
}
}
}
/* convert arguments from the event into a form for Linktest.
*/
/*
* Executes Linktest with arguments received from the Linktest
* start event. Does not return.
*/
static void
start_linktest(char *args, int buflen) {
pid_t lt_pid;
int status;
char *word;
char *argv[MAX_ARGS];
int i=1;
word = strtok(args," \t");
do {
argv[i++] = word;
} while ((word = strtok(NULL," \t"))
&& (i<MAX_ARGS));
argv[i] = NULL;
argv[0] = LINKTEST_SCRIPT;
/* info("starting linktest.\n");*/
lt_pid = fork();
if(!lt_pid) {
execv( LINKTEST_SCRIPT,argv);
}
waitpid(lt_pid, &status, 0);
/* info("linktest completed.\n");*/
exec_linktest(char *args, int buflen) {
char *word, *argv[MAX_ARGS];
int i,res;
/*
* Set up arguments for execv call by parsing contents
* of the event string.
*/
word = strtok(args," \t");
i=1;
do {
argv[i++] = word;
} while ((word = strtok(NULL," \t"))
&& (i<MAX_ARGS));
argv[i] = NULL;
argv[0] = LINKTEST_SCRIPT;
/*
* Execute the script with the arguments from the event
*/
res = execv( LINKTEST_SCRIPT,argv);
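if(res < 0) {
error("Could not execute the Linktest script.");
/*
 * This is the forked child: exit rather than return, so it
 * does not fall back into the daemon's event loop.
 */
_exit(1);
}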
}
static
void sigchld_handler(int sig) {
pid_t pid;
int status;
int exit_code;
/*
* If the exit_code is nonzero after a normal exit,
* the daemon sends a KILL to let other nodes know
* that something has failed.
*
* However, ignore the case of a non-normal exit,
* since that is likely the result of a KILL signal.
*/
exit_code = 0;
while((pid = waitpid(-1, &status, 0)) > 0) {
/*
* If Linktest died due to an error, Perl will exit
* the script normally with an error code.
*/
if(WIFEXITED(status)) {
exit_code = WEXITSTATUS(status);
info("Linktest exit code: %d\n",exit_code);
/*
* If this was an abnormal exit (exit status != 0)
* then we must send KILL to the process group of
* Linktest to kill its subchildren. However,
* it doesn't seem to hurt (cause kill errors) to
* send it anyway.
*/
send_group_kill();
} else if (WIFSIGNALED(status)) {
/*
* Linktest exited due to a signal, likely from
* this daemon. If that's the case, group_kill
* has already been sent.
*/
info("Linktest killed by signal %d.\n", WTERMSIG(status));
} else {
/*
* Linktest is stopped unexpectedly.
*/
error("unexpected SIGCHLD received\n");
}
}
if(errno != ECHILD) {
error("waitpid error\n");
}
/*
* Go ahead and unlock since Linktest and its children
* should all be killed and reaped now.
*/
locked = FALSE;
/*
* Now let other nodes know about the problem, if any.
*/
if(exit_code) {
info("Posting KILL event\n");
send_kill_event();
}
return;
}
static
void send_group_kill() {
int res;
/*
* Kill off all processes in the process group of the
* Linktest run. This may include the linktest script
* itself, and any children it forked.
*/
res = kill(-linktest_pid,SIGKILL);
if(res < 0) {
/*
* Not a serious error, likely the process group
* has already exited.
*/
return;
}
}
static
void send_kill_event() {
event_notification_t notification;
address_tuple_t tuple;
/*
* Construct an address tuple for generating the event.
*/
tuple = address_tuple_alloc();
if (tuple == NULL) {
error("could not allocate an address tuple");
return;
}
/*
* Set up event to send a kill...
*/
tuple->objtype = TBDB_OBJECTTYPE_LINKTEST;
tuple->eventtype= TBDB_EVENTTYPE_KILL;
tuple->host = ADDRESSTUPLE_ALL;
tuple->expt = pideid;
/* Generate the event */
notification = event_notification_alloc(handle, tuple);
if (notification == NULL) {
error("could not allocate notification");
return;
}
/* Send the event */
if (event_notify(handle, notification) == 0) {
error("could not send event notification");
}
/* Clean up */
event_notification_free(handle, notification);
}
/*
* Event helper for Linktest events.
* EMULAB-COPYRIGHT
* Copyright (c) 2000-2004 University of Utah and the Flux Group.
* All rights reserved.
*/
#include <stdio.h>
......@@ -13,21 +15,22 @@
#include <time.h>
#include "log.h"
#include "tbdefs.h"
#include "event.h"
#define TRUE 1
#define FALSE 0
static char *progname;
static void
callback(event_handle_t handle,
event_notification_t notification, void *data);
static void callback(event_handle_t handle,
event_notification_t notification, void *data);
void
usage()
{
fprintf(stderr,
"Usage:\t%s -s server [-p port] [-k keyfile] -e pid/eid [-w event | -x event] [ARGS...]\n",
"Usage:\t%s -s server [-p port] [-k keyfile] -e pid/eid [-w | -x event] [ARGS...]\n",
progname);
fprintf(stderr, " -w event\twait for Linktest event\n");
fprintf(stderr, " -w \twait for any Linktest event\n");
fprintf(stderr, " -x event\ttransmit Linktest event\n");
exit(-1);
}
......@@ -53,13 +56,13 @@ main(int argc, char **argv)
char buf[BUFSIZ];
char *pideid = NULL;
char *send_event = NULL;
char *wait_event = NULL;
char event_args[BUFSIZ];
int c;
int c,wait_event;
progname = argv[0];
wait_event = FALSE;
while ((c = getopt(argc, argv, "s:p:w:x:e:k:")) != -1) {
while ((c = getopt(argc, argv, "s:p:wx:e:k:")) != -1) {
switch (c) {
case 's':
server = optarg;
......@@ -68,7 +71,7 @@ main(int argc, char **argv)
port = optarg;
break;
case 'w':
wait_event = optarg;;
wait_event = TRUE;
break;
case 'x':
send_event = optarg;
......@@ -106,7 +109,6 @@ main(int argc, char **argv)
* Uppercase event tags for now. Should be wired in list instead.
*/
up(send_event);
up(wait_event);
/*
* Convert server/port to elvin thing.
......@@ -154,13 +156,12 @@ main(int argc, char **argv)
}
}
/* Send Event */
if (event_notify(handle, notification) == 0) {
fatal("could not send test event notification");
}
/*gettimeofday(&now, NULL);
info("Sent %s at time: %lu:%d\n",send_event ,now.tv_sec, now.tv_usec);*/
/* Cleanup */
event_notification_free(handle, notification);
}
......@@ -182,15 +183,7 @@ main(int argc, char **argv)
tuple->expt = pideid;
tuple->objtype = TBDB_OBJECTTYPE_LINKTEST;
tuple->objname = ADDRESSTUPLE_ANY;
tuple->eventtype = wait_event;
/*
* Register with the event system.
*/
handle = event_register_withkeyfile(server, 0, keyfile);
if (handle == NULL) {
fatal("could not register with event system");
}
/*
* Subscribe to the event we specified above.
......@@ -219,10 +212,38 @@ main(int argc, char **argv)
static void
callback(event_handle_t handle, event_notification_t notification, void *data)
{
char objname[TBDB_FLEN_EVOBJTYPE];
char event[TBDB_FLEN_EVEVENTTYPE];
char args[BUFSIZ];
struct timeval now;
exit(0);
gettimeofday(&now, NULL);
if (! event_notification_get_objname(handle, notification,
objname, sizeof(objname))) {
error("Could not get objname from notification!\n");
return;
}
up(objname);
if (! event_notification_get_eventtype(handle, notification,
event, sizeof(event))) {
error("Could not get event from notification!\n");
return;
}
event_notification_get_arguments(handle,
notification, args, sizeof(args));
/*
* Print out events and args.
*/
printf("%s %s\n",event,args);
fflush(stdout);
}
......@@ -7,6 +7,8 @@
use strict;
use Getopt::Std;
use English;
use POSIX;
#
# Wrapper for running the linktest daemon. This script is currently
......@@ -17,12 +19,17 @@ use English;
sub usage()
{
print "Usage: run_linktest.pl ".
"[-q] [-d] [-s server] [-p port] [-k keyfile] [-l level] [-o logfile] -e pid/eid\n".
"Use -q for quick termination mode, same as -l 3\n";
"[-q] [-d] [-t] [-v] [-s server] [-p port] [-k keyfile] [-l level] [-o logfile] -e pid/eid\n".
"Use -q for quick termination mode, which skips the Bandwidth test\n".
"Use -v for verbose feedback messages\n" .
"Use -t <time> to set a timeout in seconds\n";
exit(1);
}
my $optlist = "qd:s:p:k:e:l:o:";
my $optlist = "vqd:s:p:k:e:l:o:t:";
my $debug = 0;
my $verbose = 0;
my $timeout = 0;
my $server;
my $keyfile;
my $port;
......@@ -38,6 +45,9 @@ my $TMCC = "@CLIENT_BINDIR@/tmcc";
my $LTEVENT = "@CLIENT_BINDIR@/ltevent";
my $LTEVENTBOSS = "$TB/libexec/ltevent";
my $BOSSNODE = "@BOSSNODE@";
my $STOPEVENT = "STOP";
my $KILLEVENT = "KILL";
my $REPORTEVENT = "REPORT";
#
# This script should be run as a real person!
......@@ -74,6 +84,19 @@ if (defined($options{"d"})) {
" Bad data in debug: $debug\n");
}
}
if (defined($options{"v"})) {
$verbose = 1;
}
if (defined($options{"t"})) {
$timeout = $options{"t"};
if ($timeout =~ /^(\d+)$/) {
$timeout = $1;
}
else {
die("*** $0:\n".
" Bad data in timeout: $timeout\n");
}
}
if (defined($options{"l"})) {
$stopAt = $options{"l"};
if ($stopAt =~ /^(\d)$/) {
......@@ -156,11 +179,6 @@ else {
" Bad data in eid: $eid\n");
}
# signal handler in case the process is killed.
$SIG{INT} = sub {
print "Aborted. Linktest continues on nodes.\nErrors so far:\n";
exit &analyze;
};
#
# Need to figure out the elvind server. Since this script runs on boss
......@@ -184,7 +202,7 @@ if (!defined($server)) {
}
#
# These days, must use a keyfile! Hmm, linktest.c is not using a keyfile.
# These days, must use a keyfile!
#
if (!defined($keyfile)) {
$keyfile = "/proj/$pid/exp/$eid/tbdata/eventkey";
......@@ -213,48 +231,120 @@ print "Quick termination requested.\n"
print "Debug mode requested.\n"
if ($debug);
# wait for the shutdown event.
#
# Now that linktest has started, wait for events to be reported
# by ltevent. It will print out the event followed by args,
# which are informational. The events sent are KILL, STOP and REPORT.
#
$args = starter();
$args .= " -w STOP";
$args .= " -w";
if(my $pid =fork) {
system($args);
if ($?) {
die("*** $0:\n".
" Error running '$args'\n");
#
# Install signal handlers to wait for a kill or a timeout.
# If the process is killed, kill Linktest!
#
$SIG{INT} = sub {
&kill_linktest_run;
print "Linktest KILL event has been sent, aborting the run.\n"
if ($verbose);
exit(&analyze);
};
#
# Set timeout behavior if requested.
#
if($timeout) {
$SIG{ALRM} = sub {
&kill_linktest_run;
print "Timeout expired.\n"
if ($verbose);
exit(&analyze);
};
alarm($timeout);
}
waitpid($pid,0);
alarm 0;
exit(&analyze());
} else {
#
# Open child process to read in the output from ltevent,
# and just print out the return values for feedback.
#
open(ARGS,"$args |") || die("*** $0:\n"." Error running '$args'\n");
while(<ARGS>) {
chomp;
if(/(\w+)\s(.*)/) {
my $eventtype = $1;
my $eventargs = $2;
if($eventtype eq $STOPEVENT) {
print "Linktest completed normally.\n"
if($verbose);
exit;
} elsif ($eventtype eq $KILLEVENT) {
print "Linktest has been cancelled due to a timeout or unrecoverable error.\n"
if ($verbose);
exit;
} else {
#
# Print out report messages if in verbose mode.
#
print lc($eventargs) . "\n"
if ($verbose);
}
} else {
# parse error, exit.
print "error parsing: " . $_ . "\n";
exit;
}
}
exit;
}
#
# Spit out the results?
# Spits out the results from the Linktest path,
# with a return code that indicates whether errors were found
# by Linktest on the nodes.
#
my @dir_contents;
opendir(DIR, $linktest_path) ||
die("*** $0:\n".
" Cannot open $linktest_path\n");
@dir_contents = grep(/\.fatal$|\.error$/, readdir(DIR));
closedir(DIR);
foreach my $file (@dir_contents) {
# Hmm, need to taint check the filenames. Ick.
if ($file =~ /^([-\w\.\/]+)$/) {
$file = $1;
}
else {
sub analyze {
my @dir_contents;
opendir(DIR, $linktest_path) ||
die("*** $0:\n".
" Bad data in filename: $file\n");
}
if(defined($options{"o"})) {
open(LOG_FILE, ">>$logfile") || die "Could not open $logfile for append: $!";
open(NODE_TRACE, "$linktest_path/$file") || die "Could not open $file for read: $!";