Commit 9a23fe83 authored by David Anderson

This commit has various changes to Linktest to make it more reliable.

1. The Linktest daemon, linktest.c, now listens for a KILL event. If received,
   the daemon will kill the linktest.pl child process and all of its subchildren.
2. The daemon also handles SIGCHLD signals from the linktest.pl child, and
   will kill the linktest.pl process group (the script and its children) if
   linktest.pl dies unexpectedly.
3. Locking has been implemented in linktest.c to ignore requests to start linktest
   while a current run is executing.
4. The controller script run_linktest.pl now includes the following new options:
   -t   allows the user to specify a timeout in seconds for Linktest.
   -v   prints out better feedback from the Linktest run as it proceeds.

Major remaining items are:
1. Avoid NFS mount hups
2. More testing, especially on vnodes
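
The process-group kill described in items 1 and 2 can be sketched in isolation as follows. This is a minimal illustration, not the daemon's code: the function name kill_child_group is hypothetical, the sleeping child and grandchild stand in for linktest.pl and its subchildren, and the extra setpgid() call in the parent is an assumption added here to avoid a race between fork() and kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that leads its own process group, then kill the whole
 * group. Returns the signal that terminated the child, 0 if the child
 * exited normally, or -1 on error. (Hypothetical helper.)
 */
static int
kill_child_group(void)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: become leader of a new process group. */
		setpgid(0, 0);
		/* A grandchild inherits the child's process group. */
		if (fork() == 0) {
			sleep(30);
			_exit(0);
		}
		sleep(30);
		_exit(0);
	}
	/*
	 * Parent: set the group as well, so that kill(-pid, ...) cannot
	 * run before the child has called setpgid() itself.
	 */
	setpgid(pid, pid);

	/* A negative pid addresses every process in that group. */
	if (kill(-pid, SIGKILL) < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Because the child is in its own group, the SIGKILL reaches the child and grandchild but not the calling process, which is the property the daemon relies on.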
parent c2fe8581
> Linktest needs its own tb_compat.tcl because the linktest version of
> tb_compat.tcl overrides function definitions in order to parse out the
> topology. Maybe a workaround would be to have tb_compat.tcl only include
> definitions it actually overrides, so that the NS script first includes
> the default, up-to-date script, then the linktest version second. I think
> I could do that by preprocessing the NS script to add a directive to
> include the linktest definitions file.
Or perhaps include nstb_compat.tcl since that has stubs of the functions
that I think you want to be stubs.
> OK. I'll need to set up linktest to calculate its runtime, based on the
> topology and the options selected, then use that (plus some buffer) as the
> timeout. Sound reasonable?
Yep. Exit with non-zero if timed out.
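
One way to read "exit with non-zero if timed out" is sketched below in C (the controller itself is a Perl script, so this only illustrates the pattern). The names run_with_timeout and EXIT_TIMED_OUT, and the value 124, are assumptions made for this example; computing the timeout from the topology is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define EXIT_TIMED_OUT 124	/* hypothetical non-zero "timed out" code */

/*
 * Run argv[0] with the given arguments. Returns the command's exit
 * code, EXIT_TIMED_OUT if it ran longer than timeout_sec seconds,
 * or -1 on any other failure.
 */
static int
run_with_timeout(char *const argv[], int timeout_sec)
{
	struct timespec nap = { 0, 50 * 1000 * 1000 };	/* 50 ms */
	time_t deadline;
	pid_t pid, done;
	int status;

	if (timeout_sec <= 0)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execv(argv[0], argv);
		_exit(127);	/* only reached if execv failed */
	}
	deadline = time(NULL) + timeout_sec;
	for (;;) {
		/* Poll without blocking so the deadline can be checked. */
		done = waitpid(pid, &status, WNOHANG);
		if (done == pid)
			break;
		if (done < 0)
			return -1;
		if (time(NULL) >= deadline) {
			kill(pid, SIGKILL);
			waitpid(pid, &status, 0);	/* reap the child */
			return EXIT_TIMED_OUT;
		}
		nanosleep(&nap, NULL);
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass the estimated runtime plus some buffer as timeout_sec and propagate the return value as its own exit code.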
===========================================================
> Can you please put your doc into proper html so it can be displayed, since
> I want to link to that from the beginexp page. You can take a look at
> www/tutorial/tutorial.html to see what proper format is.
>
> Right now the output goes to stdout? I think an option (-o filename) to put
> the output into the exp log directory (/proj/pid/eid/exp/logs) is needed
> since when it is done in the swapin path, we want to save that output off
> someplace.
>
> If the "levels" are really ordered, then maybe a "-l level" option to the
> run_linktest.pl.in script?
===========================================================
Aside: This is the wrong way to do this:
#define LINKTEST_SCRIPT "@CLIENT_BINDIR@" "/linktest.pl"
Pass this in from the makefile instead. So, in your makefile do:
-DCLIENT_BINDIR='"$(CLIENT_BINDIR)"'
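
On the C side, a definition passed this way could be consumed roughly as below. This is a sketch under assumptions: the fallback path is a placeholder so the file still compiles without the -D flag, and linktest_script() is a hypothetical accessor added only for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * CLIENT_BINDIR normally arrives from the makefile via -D.
 * The fallback below is a placeholder, not a real install path.
 */
#ifndef CLIENT_BINDIR
#define CLIENT_BINDIR "/path/to/client/bin"
#endif

/* Adjacent string literals are concatenated by the compiler. */
#define LINKTEST_SCRIPT CLIENT_BINDIR "/linktest.pl"

/* Hypothetical accessor, used here only to show the resulting string. */
static const char *
linktest_script(void)
{
	return LINKTEST_SCRIPT;
}
```

The inner double quotes in the makefile line are what make CLIENT_BINDIR expand to a C string literal, so the concatenation in LINKTEST_SCRIPT works without configure-time substitution in the source file.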
===========================================================
> {3} nodeg$ gmake
> mkdir fbsd
> mkdir: fbsd: File exists
> gmake: [linktest] Error 1 (ignored)
Please make these errors silent. In fact, the proper approach is to not
have subdirs here, but to have an obj-linux tree. See my testbed directory
on ops. However, just getting rid of the error statements above is okay for
now.
> gcc -static -L../lib -L../../lib/libtb linktest.o ../../lib/libtb/log.o
> ../../lib/libtb/tbdefs.o -levent -lcrypto -lssl `elvin-config --libs
> vin4c` -o fbsd/linktest
> gcc -static -L../lib -L../../lib/libtb ltevent.o ../../lib/libtb/log.o
> ../../lib/libtb/tbdefs.o -levent -lcrypto -lssl `elvin-config --libs
> vin4c` -o fbsd/ltevent
Please fix up the dependencies in the makefile so that these are not
relinked unless something actually changed.
===========================================================
> So, the front end if I understand correctly would have a drop-down for
> Linktest validation. Contents of which would be:
> 0 - Do not run Linktest
> 1 - Connectivity and Latency
> 2 - Static Routing
> 3 - Loss
> 4 - Bandwidth
Sure, sounds good. They should of course correspond to section headers in
the document (which I link to from the menu).
===========================================================
Add copyright headers to all files.
Update linktest.c program style to standard emulab style
tmcd.c is an example
Update linktest.pl program style to standard emulab style
run_linktest.pl is an example
# TODO to protect the script from NFS mount hups,
# always copy the file to /tmp before attempting to read it.
# repeatedly call cp and check the exit code.
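
The retry idea in this TODO can be sketched as follows. Note the differences from the TODO: this is C rather than Perl, and it copies with read()/write() instead of calling cp and checking its exit code. The function names, the one-second pause, and the attempt count are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst once. Returns 0 on success, -1 on any error. */
static int
copy_once(const char *src, const char *dst)
{
	char buf[4096];
	ssize_t n = 0, off, w;
	int in, out, rc = 0;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	while (rc == 0 && (n = read(in, buf, sizeof(buf))) > 0) {
		/* write() may be partial, so loop until the chunk is out. */
		for (off = 0; off < n; off += w) {
			w = write(out, buf + off, (size_t)(n - off));
			if (w < 0) {
				rc = -1;
				break;
			}
		}
	}
	if (n < 0)
		rc = -1;
	close(in);
	if (close(out) < 0)
		rc = -1;
	return rc;
}

/* Try the copy up to `attempts` times, pausing between failures. */
static int
copy_with_retry(const char *src, const char *dst, int attempts)
{
	int i;

	for (i = 0; i < attempts; i++) {
		if (copy_once(src, dst) == 0)
			return 0;
		if (i + 1 < attempts)
			sleep(1);	/* give the mount time to recover */
	}
	return -1;
}
```

The caller would then read only the local copy under /tmp, so a stall on the shared mount shows up as a failed copy attempt rather than a partial read.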
####
Note, ignoring project groups. See the line...
$gid = $proj_id;
@@ -21,13 +21,22 @@
#include "event.h"
#include "linktest.h"
static int debug;
static void callback(event_handle_t handle,
event_notification_t notification, void *data);
#define TRUE 1
#define FALSE 0
static void
start_linktest(char *args, int);
static int debug;
static volatile int locked;
static pid_t linktest_pid;
static char *pideid;
static event_handle_t handle;
static void callback(event_handle_t handle,
event_notification_t notification, void *data);
static void exec_linktest(char *args, int);
static void sigchld_handler(int sig);
static void send_group_kill();
static void send_kill_event();
void
usage(char *progname)
{
@@ -40,17 +49,17 @@ usage(char *progname)
int
main(int argc, char **argv) {
event_handle_t handle;
address_tuple_t tuple;
char *server = NULL;
char *port = NULL;
char *keyfile = NULL;
char *pideid = NULL;
char *pidfile = NULL;
char *logfile = NULL;
char *progname;
char c;
char buf[BUFSIZ];
pideid = NULL;
progname = argv[0];
@@ -128,7 +137,7 @@ main(int argc, char **argv) {
*/
handle = event_register_withkeyfile(server, 0, keyfile);
if (handle == NULL) {
fatal("could not register with event system");
fatal("could not register with event system");
}
/*
@@ -162,6 +171,14 @@ main(int argc, char **argv) {
}
}
/*
* Initialize variables used to control child execution
*/
locked = FALSE;
if(signal(SIGCHLD,sigchld_handler) == SIG_ERR) {
fatal("could not install child handler");
}
/*
* Begin the event loop, waiting to receive event notifications:
*/
@@ -204,41 +221,217 @@ callback(event_handle_t handle, event_notification_t notification, void *data)
event_notification_get_arguments(handle,
notification, args, sizeof(args));
/* info("Event: %lu:%d %s %s %s\n", now.tv_sec, now.tv_usec,
objname, event, args);*/
/*
* Dispatch the event.
*/
if (strcmp(event, TBDB_EVENTTYPE_START) == 0) {
start_linktest(args, sizeof(args));
}
if(!strcmp(event, TBDB_EVENTTYPE_START)) {
if(!locked) {
/*
* Set locked bit. The bit is not set to false
* until a SIGCHLD signal is received
*/
locked = TRUE;
/*
* The Linktest script is not running, so
* fork a process and execute it.
*/
linktest_pid = fork();
if(linktest_pid < 0) {
error("Could not fork a process to run linktest script!\n");
return;
}
/*
* Changes the process group of the child to itself so
* a sigkill to the child process group will not kill
* the Linktest daemon.
*/
if(!linktest_pid) {
pid_t mypid = getpid();
setpgid(mypid,mypid);
/* Finally, execute the linktest script. */
exec_linktest(args, sizeof(args));
}
}
} else if(!strcmp(event, TBDB_EVENTTYPE_STOP)) {
/*
* STOP is informational and may be ignored.
*/
} else if (!strcmp(event, TBDB_EVENTTYPE_KILL)) {
/*
* Ignore unless we are running.
*/
if(locked) {
/*
* If KILL is received, there is a problem on this
* or some other node. So, kill off linktest
* and its children.
*/
send_group_kill();
}
}
}
/* convert arguments from the event into a form for Linktest.
*/
/*
* Executes Linktest with arguments received from the Linktest
* start event. Does not return.
*/
static void
start_linktest(char *args, int buflen) {
pid_t lt_pid;
int status;
char *word;
char *argv[MAX_ARGS];
int i=1;
word = strtok(args," \t");
do {
argv[i++] = word;
} while ((word = strtok(NULL," \t"))
&& (i<MAX_ARGS));
argv[i] = NULL;
argv[0] = LINKTEST_SCRIPT;
/* info("starting linktest.\n");*/
lt_pid = fork();
if(!lt_pid) {
execv( LINKTEST_SCRIPT,argv);
}
waitpid(lt_pid, &status, 0);
/* info("linktest completed.\n");*/
exec_linktest(char *args, int buflen) {
char *word, *argv[MAX_ARGS];
int i,res;
/*
* Set up arguments for execv call by parsing contents
* of the event string.
*/
word = strtok(args," \t");
i=1;
do {
argv[i++] = word;
} while ((word = strtok(NULL," \t"))
&& (i < MAX_ARGS - 1)); /* leave room for the terminating NULL */
argv[i] = NULL;
argv[0] = LINKTEST_SCRIPT;
/*
* Execute the script with the arguments from the event
*/
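res = execv( LINKTEST_SCRIPT,argv);
if(res < 0) {
error("Could not execute the Linktest script.");
/*
 * We are still in the forked child here. Exit instead of
 * returning into the daemon's event loop.
 */
_exit(1);
}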
}
static
void sigchld_handler(int sig) {
pid_t pid;
int status;
int exit_code;
/*
* If the exit_code is nonzero after a normal exit,
* the daemon sends a KILL to let other nodes know
* that something has failed.
*
* However, ignore the case of a non-normal exit,
* since that is likely the result of a KILL signal.
*/
exit_code = 0;
while((pid = waitpid(-1, &status, 0)) > 0) {
/*
* If Linktest died due to an error, Perl will exit
* the script normally with an error code.
*/
if(WIFEXITED(status)) {
exit_code = WEXITSTATUS(status);
info("Linktest exit code: %d\n",exit_code);
/*
* If this was an abnormal exit (exit status != 0)
* then we must send KILL to the process group of
* Linktest to kill its subchildren. However,
* it doesn't seem to hurt (cause kill errors) to
* send it anyway.
*/
send_group_kill();
} else if (WIFSIGNALED(status)) {
/*
* Linktest exited due to a signal, likely from
* this daemon. If that's the case, group_kill
* has already been sent.
*/
info("Linktest killed by signal %d.\n", WTERMSIG(status));
} else {
/*
* Linktest is stopped unexpectedly.
*/
error("unexpected SIGCHLD received\n");
}
}
if(errno != ECHILD) {
error("waitpid error\n");
}
/*
* Go ahead and unlock since Linktest and its children
* should all be killed and reaped now.
*/
locked = FALSE;
/*
* Now let other nodes know about the problem, if any.
*/
if(exit_code) {
info("Posting KILL event\n");
send_kill_event();
}
return;
}
static
void send_group_kill() {
int res;
/*
* Kill off all processes in the process group of the
* Linktest run. This may include the linktest script
* itself, and any children it forked.
*/
res = kill(-linktest_pid,SIGKILL);
if(res < 0) {
/*
* Not a serious error, likely the process group
* has already exited.
*/
return;
}
}
static
void send_kill_event() {
event_notification_t notification;
address_tuple_t tuple;
/*
* Construct an address tuple for generating the event.
*/
tuple = address_tuple_alloc();
if (tuple == NULL) {
error("could not allocate an address tuple");
return; /* do not dereference a NULL tuple below */
}
/*
* Set up event to send a kill...
*/
tuple->objtype = TBDB_OBJECTTYPE_LINKTEST;
tuple->eventtype= TBDB_EVENTTYPE_KILL;
tuple->host = ADDRESSTUPLE_ALL;
tuple->expt = pideid;
/* Generate the event */
notification = event_notification_alloc(handle, tuple);
if (notification == NULL) {
error("could not allocate notification");
return; /* nothing to send or free */
}
/* Send the event */
if (event_notify(handle, notification) == 0) {
error("could not send event notification");
}
/* Clean up */
event_notification_free(handle, notification);
}
@@ -24,13 +24,17 @@ use English;
sub usage()
{
print "Usage: linktest.pl\n".
" [STARTAT=<test step, 1-4>]\n".
" [STOPAT=<test step, 1-4>]\n".
" [DEBUG=<debugging level, ie 1 for on>]\n".
" [SCRIPT=<alternate ns script to parse>]\n";
exit(1);
" [STARTAT=<test step, 1-4>]\n".
" [STOPAT=<test step, 1-4>]\n".
" [DEBUG=<debugging level. 1=on, 0=off>]\n";
exit(0);
}
##############################################################################
# Constants
##############################################################################
@@ -45,6 +49,7 @@ use constant PATH_PATHRATE_SND => "/usr/local/bin/pathrate_snd";
use constant PATH_PATHRATE_RCV => "/usr/local/bin/pathrate_rcv";
use constant PATH_EMULAB_SYNC => "@CLIENT_BINDIR@/emulab-sync";
use constant PATH_LTEVENT => "@CLIENT_BINDIR@/ltevent";
use constant RUN_PATH => "@CLIENT_BINDIR@"; # where the linktest-ns runs.
# log files used by tests.
use constant CRUDE_DAT => "/tmp/crude.dat"; # binary data
@@ -74,6 +79,7 @@ use constant BSD => "FreeBSD";
use constant LINUX => "Linux";
use constant RTPROTO_STATIC => "Static";
use constant EVENT_STOP => "STOP";
use constant EVENT_REPORT => "REPORT";
use constant PING_SEND_COUNT => 10;
use constant SYNC_NAMESPACE => "linktest";
@@ -89,14 +95,15 @@ use constant NAME_LATENCY => "Latency (Round Trip)";
use constant NAME_LOSS => "Loss";
use constant NAME_BW => "Bandwidth";
# error suffix for logging linktest and development (fatal) errors
# error suffix for logs
use constant SUFFIX_ERROR => ".error";
use constant SUFFIX_FATAL => ".fatal";
use constant SUFFIX_TOPO => ".topology";
use constant DEBUG_ALL => 2; # debug level for all debug info, not just msgs.
# more paths
use constant RUN_PATH => "@CLIENT_BINDIR@"; # where the linktest-ns runs.
# exit codes
use constant EXIT_ABORTED => -1;
use constant EXIT_OK => 0;
# struct for representing a link.
struct ( edge => {
@@ -109,10 +116,6 @@ struct ( edge => {
# fixes emacs colorization woes introduced by above struct definition.
struct ( unused => { foo => '$'});
# security: constrain the path
$ENV{'PATH'} = '/bin:/usr/bin:/usr/local/bin';
# application goo
use constant TRUE => 1;
use constant FALSE => 0;
@@ -120,8 +123,8 @@ use constant FALSE => 0;
# Globals
##############################################################################
# see init() for initialization of globals
my $ns_file; # ns file full path
my $ns_file; # location of the customized ns for Linktest
my $topology_file; # location of the topology input file.
my $synserv; # synch server node
my $rtproto; # routing protocol
my $hostname; # this host's name
@@ -151,57 +154,237 @@ my $linktest_path; # log path (ie tbdata/linktest) set by init.
# full path to custom NS build.
my $ns_cmd;
# signal handler in case the process is killed.
$SIG{INT} = sub {
&debug("Linktest killed by SIGINT.\n");
&cleanup;
exit(1);
};
# security
$ENV{'PATH'} = '/bin:/usr/bin';
##############################################################################
# Main control
##############################################################################
&proc_cmd;
&init;
# un-taint path
$ENV{'PATH'} = '/bin:/usr/bin:/usr/local/bin';
delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};
$| = 1; # Enable autoflush so output is not held in a buffer
#
# Parse command arguments. Since Linktest is run via the event system,
# parse out pairs of <symbol>=<value>.
#
foreach my $arg (@ARGV) {
if($arg =~ /STOPAT=(\d)/) {
$stopat=$1;
}
if($arg =~ /STARTAT=(\d)/) {
$startat=$1;
}
if($arg =~ /DEBUG=(\d)/) {
$debug_level=$1;
}
}
#
# Parse the nickname file to obtain the host name,
# experiment ID and the project ID.
#
my $fname = PATH_NICKNAME;
die("Could not locate $fname\n") unless -e $fname;
my @results = &read_file($fname);
($hostname, $exp_id, $proj_id) = split /\./, $results[0];
chomp $hostname;
chomp $exp_id;
chomp $proj_id;
$gid = $proj_id;
#
# Set path variables storing the experiment logging path,
# the current ns file and the output file for topology info.
#
$expt_path = "/proj/$proj_id/exp/$exp_id/tbdata";
$linktest_path = "$expt_path/linktest";
$topology_file = "$linktest_path/$exp_id" . SUFFIX_TOPO;
if(-e "$expt_path/$exp_id-modify.ns") {
$ns_file = "$expt_path/$exp_id-modify.ns";
} elsif (-e "$expt_path/$exp_id.ns") {
$ns_file = "$expt_path/$exp_id.ns";
} else {
die("Could not locate an ns file.\n");
}
#
# Determine location of the customized ns binary for Linktest.
#
($platform) = POSIX::uname();
if($platform eq BSD) {
$ns_cmd = LINKTEST_NSPATH ."/fbsd/ns";
} elsif ($platform eq LINUX) {
$ns_cmd = LINKTEST_NSPATH ."/linux/ns";
} else {
die ("Platform $platform is not currently supported.\n");
}
#
# Parse the syncserver file to find out which node is the sync server.
#
my $ssname = "@CLIENT_VARDIR@/boot/syncserver";
die("Could not locate an emulab-sync server\n") unless -e $ssname;
@results = &read_file($ssname);
($synserv) = split/\./, $results[0];
chomp $synserv;
&debug_top;
#
# Execute tmcc to find out the name of the event server.
#
@results = &my_tick("@CLIENT_BINDIR@/tmcc","bossinfo");
if(@results && $results[0] =~ /^([\w\.]*)\s/) {
$event_server = $1;
} else {
die("Could not determine event server name\n");
}
#
# If the current node is the special node (arbitrarily the sync
# server node), do some housekeeping and run ns to generate
# the topology input file, which is read by all nodes to obtain
# the experiment topology.
#
if(&is_special_node()) {
#
# If the shared path used by Linktest for logging and temporary
# files already exists, clear its contents for this run.
#
if( -e $linktest_path ) {
die("Path $linktest_path is not a directory\n")
unless -d $linktest_path;
opendir (DIR,$linktest_path)
|| die("Could not open $linktest_path: $!");
my @dirfiles = grep (/error$/,readdir(DIR));
foreach (@dirfiles) {
&do_unlink("$linktest_path/$_");
}
closedir(DIR);
} else {
#
# The shared path does not exist, create it.
#
mkdir (&check_filename($linktest_path),0777)
|| die("Could not create directory $linktest_path: $!");
}
#