Commit 9bfe3d61 authored by Leigh B. Stoller

Converted os_load and node_reboot into libraries. Basically that meant
splitting the existing code between a frontend script that parses arguments
and does taint checking, and a backend library where all the work is done
(including permission checks). The interface to the libraries is simple
right now (I didn't want to spend a lot of time designing the interface
without knowing whether the approach would work long term).

	use libreboot;
	use libosload;

        nodereboot(\%reboot_args, \%reboot_results);
        osload(\%reload_args, \%reload_results);

Arguments are passed to the libraries in the form of a hash. For example,
in os_setup:

	$reload_args{'debug'}     = $dbg;
	$reload_args{'asyncmode'} = 1;
	$reload_args{'imageid'}   = $imageid;
	$reload_args{'nodelist'}  = [ @nodelist ];

Results are passed back both as a return code (-1 means total failure right
away, while a positive value indicates the number of nodes that failed),
and in the results hash, which gives the status for each individual node. At
the moment it is just success or failure (0 or 1), but in the future it might
be something more meaningful.
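
As a rough illustration of that calling convention, here is a minimal sketch. It does not use the real libraries: fake_load() is a hypothetical stand-in that only mimics the return-code and results-hash behavior described above, and the node names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for osload()/nodereboot(); NOT the real library.
# It pretends that the node named "pc2" fails and all others succeed.
sub fake_load {
    my ($args, $result) = @_;

    # Total failure right away: no node list supplied.
    return -1
        if (!defined($args->{'nodelist'}));

    my $failures = 0;
    foreach my $node (@{ $args->{'nodelist'} }) {
        $result->{$node} = ($node eq "pc2") ? 1 : 0;
        $failures++
            if ($result->{$node});
    }
    # Positive value: number of nodes that failed.
    return $failures;
}

my %args    = ('nodelist' => [ "pc1", "pc2", "pc3" ]);
my %results = ();
my @failed  = ();

my $rv = fake_load(\%args, \%results);
if ($rv < 0) {
    print STDERR "total failure; nothing was attempted\n";
}
elsif ($rv > 0) {
    # The results hash says which individual nodes failed.
    @failed = grep { $results{$_} } sort(keys(%results));
    print "$rv node(s) failed: @failed\n";
}
```

The caller only has to look at the return code to decide whether anything went wrong, and only consults the hash when it needs to know which nodes to stop waiting for.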

os_setup can now find out about individual failures, both in reboot and
reload, and alter how it operates afterwards. The main thing is to not wait
for nodes that fail to reboot/reload, and to terminate with no retry when
this happens, since at the moment it indicates an unusual failure, and it
is better to terminate early. In the past an os_load failure would result
in a tbswap retry, and then another failure (multiple times). I have already
tested this by trying to load images that have no file on disk; it is nice
to see those failures caught early and the experiment fail much more
quickly!

A note about "asyncmode" above. In order to promote parallelism in
os_setup, asyncmode tells the library to fork off a child and return
immediately. Later, os_setup can block and wait for status by calling
back into the library:

	my $foo = nodereboot(\%reboot_args, \%reboot_results);
	nodereboot_wait($foo);

If you are wondering how the child reports individual node status back to
the parent (so it can fill in the results hash), Perl really is a kitchen
sink. I create a pipe with Perl's pipe function and then fork a child to do
the work; the child writes the results to the pipe (status for each node),
and the parent reads that back later when nodereboot_wait() is called,
moving the results into the %reboot_results hash. The parent meanwhile can
go on and, in the case of os_setup, make more calls to reboot/reload other
nodes, later calling the wait() routines once all have been initiated.

Also worth noting is that in order to make the libraries "reentrant" I had to
do some cleaning up and reorganizing of the code. Nothing too major though,
just removal of lots of global variables. I also did some mild unrelated
cleanup of code that had been run over once too many times with a tank.
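
The pipe-and-fork handoff described above can be sketched roughly as follows. This is a simplified illustration, not the library code: start_async() and wait_async() are hypothetical names, the "work" is faked (any node with "bad" in its name fails), and lexical filehandles are used so that several children can be outstanding at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# pid => [ read end of the pipe, caller's results hash ]
my %children = ();

# Hypothetical async launcher: fork a child and return its pid at once.
sub start_async {
    my ($nodes, $result) = @_;

    pipe(my $reader, my $writer)
        or return -1;
    $writer->autoflush(1);

    my $pid = fork();
    return -1
        if (!defined($pid));

    if ($pid) {
        # Parent: remember where to read the results from later.
        close($writer);
        $children{$pid} = [ $reader, $result ];
        return $pid;
    }

    # Child: do the (fake) work and write one "node,status" line per node.
    close($reader);
    my $failures = 0;
    foreach my $node (@$nodes) {
        my $status = ($node =~ /bad/) ? -1 : 0;
        $failures++
            if ($status);
        print $writer "$node,$status\n";
    }
    close($writer);
    exit($failures);
}

# Hypothetical wait routine: drain the pipe into the results hash,
# then collect the child's exit status (the failure count).
sub wait_async {
    my ($pid) = @_;

    return -1
        if (!exists($children{$pid}));
    my ($reader, $result) = @{ delete($children{$pid}) };

    while (my $line = <$reader>) {
        chomp($line);
        $result->{$1} = $2
            if ($line =~ /^([-\w]+),(-?\d+)$/);
    }
    close($reader);
    waitpid($pid, 0);
    return $? >> 8;
}

my %results = ();
my $pid     = start_async([ "pc1", "pc-bad", "pc3" ], \%results);
# ... the parent is free to start more work here ...
my $failed  = wait_async($pid);
print "$failed node(s) failed\n";
```

Reading until end-of-file before calling waitpid() matters: the parent sees EOF only once the child has closed its end of the pipe, so no status lines are lost.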

So how did this work out? Well, for os_setup/os_load it works rather
nicely!

node_reboot is another story. I probably should have left it alone, but
since I had already climbed the curve on osload, I decided to go ahead and
do reboot. The problem is that node_reboot needs to run as root (it's a
setuid script), which means it can only be used as a library from something
that is already setuid. os_setup and os_load run as the user. However,
having a consistent library interface and the ability to cleanly figure out
which individual nodes failed is a very nice thing.

So I came up with a suitable approach that is hidden in the library. When the
library is entered without proper privs, it silently execs an instance of
node_reboot (the setuid script), and then uses the same trick mentioned
above to read back individual node status. I create the pipe in the parent
before the exec, and clear the close-on-exec flag on it so that it stays
open across the exec. I pass the fileno along in an environment variable,
and the library uses that to write the results to, just like above. The
result is that os_setup sees the same interface for both os_load and
node_reboot, without having to worry that one or the other needs to be run
setuid.
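
For the curious, the descriptor-passing trick can be sketched like this. It is only an illustration under stated assumptions: the environment variable name RESULTS_FD is made up, the exec'ed program is a one-line Perl script standing in for the setuid node_reboot, and the fork is there only so the demo can read the results back in the same script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(F_GETFD F_SETFD FD_CLOEXEC);

# Hypothetical variable name; the commit does not say which one is used.
my $ENVNAME = "RESULTS_FD";

pipe(my $reader, my $writer)
    or die("pipe: $!");

# Perl marks new descriptors above $^F close-on-exec, so clear that flag
# on the write end to keep it open in the exec'ed program.
my $flags = fcntl($writer, F_GETFD, 0)
    or die("fcntl F_GETFD: $!");
fcntl($writer, F_SETFD, ($flags + 0) & ~FD_CLOEXEC)
    or die("fcntl F_SETFD: $!");

my $pid = fork();
die("fork: $!")
    if (!defined($pid));

if (!$pid) {
    # Child: tell the new program which descriptor to write to, then exec.
    close($reader);
    $ENV{$ENVNAME} = fileno($writer);

    # Stand-in for the setuid script: reopen the inherited descriptor by
    # number and write one "node,status" line per node.
    my $script = 'open(my $out, ">&=", $ENV{"RESULTS_FD"}) or die("fdopen: $!"); ' .
                 'print $out "pc1,0\npc2,-1\n"; close($out);';
    exec($^X, "-e", $script)
        or die("exec: $!");
}

# Parent: read the per-node status lines back, exactly as in the fork case.
close($writer);
my %results = ();
while (my $line = <$reader>) {
    chomp($line);
    $results{$1} = $2
        if ($line =~ /^([-\w]+),(-?\d+)$/);
}
close($reader);
waitpid($pid, 0);

foreach my $node (sort(keys(%results))) {
    print "$node => $results{$node}\n";
}
```

Passing the descriptor number through the environment means the exec'ed script needs no extra command-line option, so its normal argument parsing is left alone.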
parent 5b23201c
@@ -48,7 +48,7 @@ LIBEXEC_STUFF = rmproj wanlinksolve wanlinkinfo \
 LIB_STUFF = libtbsetup.pm exitonwarn.pm libtestbed.pm snmpit_intel.pm \
 snmpit_cisco.pm snmpit_lib.pm snmpit_apc.pm power_rpc27.pm \
 snmpit_cisco_stack.pm snmpit_intel_stack.pm \
-libaudit.pm
+libaudit.pm libreboot.pm libosload.pm
 #
 # Force dependencies on the scripts so that they will be rerun through
#!/usr/bin/perl -wT
#
# EMULAB-COPYRIGHT
# Copyright (c) 2000-2004 University of Utah and the Flux Group.
# All rights reserved.
#
# Osload library. Basically the backend to the osload script, but also used
# where we need finer control of loading of nodes.
#
package libosload;
use strict;
use Exporter;
use vars qw(@ISA @EXPORT);
@ISA = "Exporter";
@EXPORT = qw ( osload osload_wait );
# Must come after package declaration!
use lib '@prefix@/lib';
use libdb;
use libreboot;
use English;
use File::stat;
use IO::Handle;
# Configure variables
my $TB = "@prefix@";
my $TESTMODE = @TESTMODE@;
my $TBOPS = "@TBOPSEMAIL@";
# Max number of retries (per node) before it's deemed fatal. This allows
# for the occasional pxeboot failure.
my $MAXRETRIES = 1;
my $FRISBEELAUNCHER = "$TB/sbin/frisbeelauncher";
my $osselect = "$TB/bin/os_select";
my $FRISBEEOSID = TB_OSID_FRISBEE_MFS();
# Locals
my %imageinfo = (); # Per imageid DB info.
my %maxwaits = (); # Per imageid max wait time.
my $debug = 0;
my %children = (); # Child pids when asyncmode=1
sub osload ($$) {
my ($args, $result) = @_;
# These come in from the caller.
my $imageid;
my $waitmode = 1;
my @nodes = ();
my $noreboot = 0;
my $asyncmode = 0;
# Locals
my %imageids = ();
my %retries = ();
my $failures = 0;
my $usedefault = 1;
my $mereuser = 0;
my $rowref;
if (!defined($args->{'nodelist'})) {
print STDERR "*** osload: Must supply a node list!\n";
return -1;
}
@nodes = sort(@{ $args->{'nodelist'} });
if (defined($args->{'waitmode'})) {
$waitmode = $args->{'waitmode'};
}
if (defined($args->{'noreboot'})) {
$noreboot = $args->{'noreboot'};
}
if (defined($args->{'debug'})) {
$debug = $args->{'debug'};
}
if (defined($args->{'imageid'})) {
$imageid = $args->{'imageid'};
$usedefault = 0;
}
if (defined($args->{'asyncmode'})) {
$asyncmode = $args->{'asyncmode'};
}
#
# Figure out who called us. Root and admin types can do whatever they
# want. Normal users can only change nodes in experiments in their
# own projects.
#
if ($UID && !TBAdmin($UID)) {
$mereuser = 1;
if (! TBNodeAccessCheck($UID, TB_NODEACCESS_LOADIMAGE, @nodes)) {
print STDERR
"*** osload: Not enough permission to load images on one or ".
"more nodes!\n";
return -1;
}
}
#
# Check permission to use the imageid.
#
if (defined($imageid) && $mereuser &&
! TBImageIDAccessCheck($UID, $imageid, TB_IMAGEID_READINFO)) {
print STDERR
"*** osload: You do not have permission to load '$imageid'!\n";
return -1;
}
#
# This is somewhat hackish. To promote parallelism during os_setup, we
# want to fork off the osload from the parent so it can do other things.
# The problem is how to return status via the results vector. Well,
# let's do it with some simple IPC. Since the results vector is simply
# a hash of node name to an integer value, it's easy to pass that back.
#
# We return the pid to the caller, which it can wait on either directly
# or by calling back into this library if it wants to actually get the
# results from the child!
#
if ($asyncmode) {
#
# Create a pipe to read back results from the child we will create.
#
if (! pipe(PARENT_READER, CHILD_WRITER)) {
print STDERR "*** osload: creating pipe: $!\n";
return -1;
}
CHILD_WRITER->autoflush(1);
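my $childpid = fork();
# fork() returns undef on failure; do not fall through as the "child".
if (!defined($childpid)) {
print STDERR "*** osload: fork failed: $!\n";
return -1;
}
if ($childpid) {
close(CHILD_WRITER);
$children{$childpid} = [ *PARENT_READER, $result ];
return $childpid;
}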
#
# Child keeps going.
#
close(PARENT_READER);
TBdbfork();
}
# Loop for each node.
foreach my $node (@nodes) {
# All nodes start out as being successful; altered later as needed.
$result->{$node} = 0;
# Get default imageid for this node.
my $default_imageid;
if (! ($default_imageid = DefaultImageID($node))) {
print STDERR "*** osload ($node): No default imageid defined!\n";
goto failednode;
}
if ($usedefault) {
$imageid = $default_imageid;
}
print STDERR "osload: Using $imageid for $node\n"
if $debug;
$imageids{$node} = $imageid;
#
# Try to avoid repeated queries to DB for info that does not change!
#
if (exists($imageinfo{$imageid})) {
$rowref = $imageinfo{$imageid};
}
else {
my $query_result =
DBQueryWarn("select * from images where imageid='$imageid'");
if (! $query_result || $query_result->numrows < 1) {
print STDERR
"*** osload ($node): Imageid $imageid is not defined!\n";
goto failednode;
}
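$rowref = $query_result->fetchrow_hashref();
# Remember the row so later nodes using this imageid skip the query.
$imageinfo{$imageid} = $rowref;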
}
my $loadpart = $rowref->{'loadpart'};
my $loadlen = $rowref->{'loadlength'};
my $imagepath = $rowref->{'path'};
my $defosid = $rowref->{'default_osid'};
# Check for a few errors early!
if (!defined($imagepath)) {
print STDERR
"*** osload ($node): No filename associated with $imageid!\n";
goto failednode;
}
if (! -R $imagepath) {
print STDERR
"*** osload ($node): ".
"$imagepath does not exist or cannot be read!\n";
goto failednode;
}
#
# If there's a maximum number of concurrent loads listed, check to
# see if we'll go over the limit, by checking to see how many other
# nodes are currently booting this image's default_osid. This is NOT
# intended to be strong enforcement of license restrictions, just a way
# to catch mistakes.
# XXX This could go outside the @nodes loop, but so could most of this
# stuff
#
if (!TBImageLoadMaxOkay($imageid, scalar(@nodes), @nodes)) {
print STDERR
"*** osload ($node): Exceeded maximum concurrent instances\n";
goto failednode;
}
#
# Compute a maxwait time based on the image size plus a constant
# factor for the reboot cycle. We store this globally for later in
# WaitTillReloadDone(), and so we do not recompute each time
# through the loop!
#
if (!exists($maxwaits{$imageid})) {
my $sb = stat($imagepath);
my $chunks = $sb->size / (1024 * 1024);
$maxwaits{$imageid} = int((($chunks / 100.0) * 30) + (5 * 60));
}
# 0 means load the entire disk.
my $diskpart = "";
if ($loadpart) {
$diskpart = "wd0:s${loadpart}";
}
else {
$diskpart = "wd0";
}
print "osload ($node): Changing default OS to $defosid\n";
if (!$TESTMODE) {
system("$osselect $defosid $node");
if ($?) {
print STDERR "*** osload ($node): os_select $defosid failed!\n";
goto failednode;
}
}
#
# If loading an image (which is not the default) then schedule
# a reload for it so that when the experiment is terminated it
# will get a fresh default image before getting reallocated to
# another experiment.
#
if ($imageid ne $default_imageid &&
!TBSetSchedReload($node, $default_imageid)) {
print STDERR
"*** osload ($node): Could not schedule default reload\n";
goto failednode;
}
#
# Assign partition table entries for each partition in the image.
# This is complicated by the fact that an image that covers only
# part of the slices, should only change the partition table entries
# for the subset of slices that are written to disk.
#
my $startpart = $loadpart == 0 ? 1 : $loadpart;
my $endpart = $startpart + $loadlen;
for (my $i = $startpart; $i < $endpart; $i++) {
my $partname = "part${i}_osid";
my $dbresult;
if (defined($rowref->{$partname})) {
my $osid = $rowref->{$partname};
$dbresult =
DBQueryWarn("replace into partitions ".
"(partition, osid, node_id) ".
"values('$i', '$osid', '$node')");
}
else {
$dbresult =
DBQueryWarn("delete from partitions ".
"where node_id='$node' and partition='$i'");
}
if (!$dbresult) {
print STDERR
"*** osload ($node): Could not update partition table\n";
goto failednode;
}
}
print "Setting up reload for $node\n";
if (!$TESTMODE) {
if (SetupReload($node, $imageid) < 0) {
print STDERR
"*** osload ($node): Could not set up reload. Skipping.\n";
goto failednode;
}
}
next;
failednode:
$result->{$node} = -1;
$failures++;
}
#
# Remove any failed nodes from the list we are going to operate on.
#
my @temp = ();
foreach my $node (@nodes) {
push(@temp, $node)
if (! $result->{$node});
}
@nodes = @temp;
# Exit if not doing an actual reload.
if ($TESTMODE) {
print "osload: Stopping in Testmode!\n";
goto done;
}
if (! @nodes) {
print STDERR "*** osload: Stopping because of previous failures!\n";
goto done;
}
# Fire off a mass reboot and quit if not in waitmode.
if (! $waitmode) {
if (! $noreboot) {
print "osload: Rebooting nodes.\n";
my %reboot_args = ();
my %reboot_failures = ();
$reboot_args{'debug'} = $debug;
$reboot_args{'waitmode'} = 0;
$reboot_args{'nodelist'} = [ @nodes ];
if (nodereboot(\%reboot_args, \%reboot_failures)) {
foreach my $node (@nodes) {
if ($reboot_failures{$node}) {
$result->{$node} = $reboot_failures{$node};
$failures++;
}
}
}
}
goto done;
}
#
# The retry vector is initialized to the number of retries we allow per
# node, after which it's a fatal error.
#
foreach my $node (@nodes) {
$retries{$node} = $MAXRETRIES;
}
while (@nodes) {
if (! $noreboot) {
# Reboot them all.
print "osload: Issuing reboot for @nodes and then waiting ...\n";
my %reboot_args = ();
my %reboot_failures = ();
$reboot_args{'debug'} = $debug;
$reboot_args{'waitmode'} = 0;
$reboot_args{'nodelist'} = [ @nodes ];
if (nodereboot(\%reboot_args, \%reboot_failures)) {
#
# If we get any failures in the reboot, we want to
# alter the list of nodes accordingly for the next phase.
#
my @temp = ();
foreach my $node (@nodes) {
if ($reboot_failures{$node}) {
$result->{$node} = $reboot_failures{$node};
$failures++;
}
else {
push(@temp, $node);
}
}
@nodes = @temp;
}
}
# Now wait for them.
my $startwait = time;
my @failednodes = WaitTillReloadDone($startwait, \%imageids, @nodes);
@nodes=();
while (@failednodes) {
my $node = shift(@failednodes);
if ($retries{$node}) {
print "*** osload ($node): Trying again ...\n";
# Possible race with reboot?
if (SetupReload($node, $imageids{$node}) < 0) {
print(STDERR
"*** osload ($node): ".
"Could not set up reload. Skipping.\n");
$result->{$node} = -1;
$failures++;
next;
}
push(@nodes, $node);
# Retry until count hits zero.
$retries{$node} -= 1;
}
else {
print "*** osload ($node): failed too many times. Skipping!\n";
$result->{$node} = -1;
$failures++;
}
}
}
done:
print "osload: Done! There were $failures failures.\n";
if ($asyncmode) {
#
# We are a child. Send back the results to the parent side
# and *exit* with status instead of returning it.
#
foreach my $node (keys(%{ $result })) {
my $status = $result->{$node};
print CHILD_WRITER "$node,$status\n";
}
close(CHILD_WRITER);
exit($failures);
}
return $failures;
}
# Wait for a reload to finish by watching its state
sub WaitTillReloadDone($$@)
{
my ($startwait, $imageids, @nodes) = @_;
my %done = ();
my $count = @nodes;
my @failed = ();
foreach my $node ( @nodes ) { $done{$node} = 0; }
print STDERR "Waiting for @nodes to finish reloading\n".`date` if $debug;
# Start a counter going, relative to the time we rebooted the first
# node.
my $waittime = 0;
my $minutes = 0;
while ($count) {
# Wait first to make sure reboot is done, and so that we don't
# wait one more time after everyone is up.
sleep(5);
foreach my $node (@nodes) {
if (! $done{$node}) {
my $maxwait = $maxwaits{$imageids->{$node}};
my $query_result =
DBQueryWarn("select * from current_reloads ".
"where node_id='$node'");
#
# There is no point in quitting if this query fails. Just
# try again in a little bit.
#
if (!$query_result) {
print STDERR
"*** osload ($node): Query failed; waiting a bit.\n";
next;
}
#
# We simply wait for stated to clear the current_reloads entry.
#
if (!$query_result->numrows) {
print STDERR "osload ($node): left reloading mode at ".`date`
if ($debug);
$count--;
$done{$node} = 1;
next;
}
# Soon we will have stated's timeouts take care of
# rebooting once or twice if we get stuck during
# reloading.
$waittime = time - $startwait;
if ($waittime > $maxwait) {
my $t = (int ($waittime / 60));
print STDERR "*** osload ($node): appears wedged; ".
"it has been $t minutes since it was rebooted.\n";
$count--;
$done{$node} = 1;
push(@failed, $node);
next;
}
if (int($waittime / 60) > $minutes) {
$minutes = int($waittime / 60);
print STDERR "osload ($node): still waiting; ".
"it has been $minutes minute(s)\n";
}
}
}
}
return @failed;
}
# Setup a reload.
sub SetupReload($$)
{
my ($node, $imageid) = @_;
#
# Put it in the current_reloads table so that nodes can find out which
# OS to load. See tmcd.
#
my $query_result =
DBQueryWarn("replace into current_reloads ".
"(node_id, image_id) values ('$node', '$imageid')");
return -1
if (!$query_result);
system("$osselect -1 $FRISBEEOSID $node");
if ($?) {
print STDERR "*** osload ($node): os_select $FRISBEEOSID failed!\n";
return -1;
}
system("$FRISBEELAUNCHER " . ($debug ? "-d ": "") . "$imageid");
if ($?) {
print STDERR "*** osload ($node): Frisbee Launcher ($imageid) failed!\n";
return -1;
}
return 0;
}
#
# This gets called in the parent, to wait for an async osload that was
# launched earlier (asyncmode). The child will print the results back
# on the pipe that was opened between the parent and child. They
# are stuffed into the original results hash.
#
sub osload_wait($)
{
my ($childpid) = @_;
if (!exists($children{$childpid})) {
print STDERR "*** osload: No such child pid $childpid!\n";
return -1;
}
my ($PARENT_READER, $result) = @{ $children{$childpid}};
#
# Read back the results.
#
while (<$PARENT_READER>) {
chomp($_);
# Capture the whole status (e.g. "-1"), not just its last character.
if ($_ =~ /^([-\w]+),(-?\d+)$/) {
$result->{$1} = $2;
print STDERR "reload ($1): child returned $2 status.\n";
}
else {
print STDERR "*** osload: Improper response from child: $_\n";
}
}
#
# And get the actual exit status.
#
waitpid($childpid, 0);
return $? >> 8;
}
# _Always_ make sure that this 1 is at the end of the file...
1;
#!/usr/bin/perl -wT
#
# EMULAB-COPYRIGHT
# Copyright (c) 2000-2004 University of Utah and the Flux Group.
# All rights reserved.
#
use English;
use Getopt::Std;
use File::stat;
# Be careful not to exit on transient error - this script is in the 'critical
# path' for reload_daemon, so we want to give it the same retries as the daemon
$libdb::DBQUERY_MAXTRIES = 30;
# XXX boss.emulab.net and users.emulab.net wired in.
# wd0 wired in. Should come from node_types table in DB
#
# Load an image onto a disk. The image must be in the DB images table,
# which defines how/where to load, and what partitions are affected.
# The nodes and partitions tables are updated appropriately.
#
sub usage()
{
print STDOUT
@@ -27,91 +19,127 @@ sub usage()
 " os_load [-s] [[-p <pid>] -i <imagename>] -e pid,eid\n".
 " os_load -l\n".
 "Use -i to specify an image name. Use node default otherwise.\n".
-"Use -m to specify an image ID (internal name, TB admins only!).\n".
+"Use -m to specify an image ID (internal name).\n".