Commit 9bfe3d61 authored by Leigh B. Stoller

Converted os_load and node_reboot into libraries. Basically that meant
splitting the existing code between a frontend script that parses arguments
and does taint checking, and a backend library where all the work is done
(including permission checks). The interface to the libraries is simple
right now (didn't want to spend a lot of time on designing interface
without knowing if the approach would work long term).

	use libreboot;
	use libosload;

        nodereboot(\%reboot_args, \%reboot_results);
        osload(\%reload_args, \%reload_results);

Arguments are passed to the libraries in the form of a hash. For example,
in os_setup:

	$reload_args{'debug'}     = $dbg;
	$reload_args{'asyncmode'} = 1;
	$reload_args{'imageid'}   = $imageid;
	$reload_args{'nodelist'}  = [ @nodelist ];

Results are passed back both as a return code (-1 means total failure right
away, while a positive value indicates the number of nodes that failed),
and in the results hash, which gives the status for each individual node. At
the moment it is just success or failure (0 or 1), but in the future it might
be something more meaningful.
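
A minimal sketch of how a caller might interpret the two result channels described above. fake_osload() and failed_nodes() are hypothetical names made up for this illustration and are not the real libosload code; only the return code convention (-1 for total failure, a positive count of failed nodes) and the per-node 0/1 status come from the description. It shows the synchronous case only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for osload(); NOT the real library routine.
# It pretends that every node whose name ends in "bad" failed to load.
sub fake_osload {
    my ($args, $results) = @_;
    my @nodes = @{ $args->{'nodelist'} };
    return -1 if (!@nodes);                      # total failure right away
    my $failed = 0;
    foreach my $node (@nodes) {
        my $status = ($node =~ /bad$/) ? 1 : 0;  # 0 = success, 1 = failure
        $results->{$node} = $status;
        $failed += $status;
    }
    return $failed;                              # number of nodes that failed
}

# Turn the return code plus the results hash into a list of failed nodes.
sub failed_nodes {
    my ($rv, $args, $results) = @_;
    # A negative return code means the whole list failed.
    return @{ $args->{'nodelist'} } if ($rv < 0);
    my @bad = grep { $results->{$_} } keys(%{ $results });
    return sort(@bad);
}

my %reload_args    = ('imageid'  => 'SOME-IMAGE',
                      'nodelist' => [ 'pc1', 'pc2bad', 'pc3' ]);
my %reload_results = ();
my $rv     = fake_osload(\%reload_args, \%reload_results);
my @failed = failed_nodes($rv, \%reload_args, \%reload_results);
print "failed ($rv): @failed\n";
```

A caller in this style can stop waiting on exactly the nodes returned by failed_nodes() instead of treating the whole batch as failed.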

os_setup can now find out about individual failures, both in reboot and
reload, and alter how it operates afterwards. The main thing is to not wait
for nodes that fail to reboot/reload, and to terminate with no retry when
this happens, since at the moment it indicates an unusual failure, and it
is better to terminate early. In the past an os_load failure would result
in a tbswap retry, and another failure (multiple times). I have already
tested this by trying to load images that have no file on disk; it is nice
to see those failures caught early and the experiment fail much quicker!

A note about "asyncmode" above. In order to promote parallelism in
os_setup, asyncmode tells the library to fork off a child and return
immediately. Later, os_setup can block and wait for status by calling
back into the library:

	my $foo = nodereboot(\%reboot_args, \%reboot_results);
	nodereboot_wait($foo);

If you are wondering how the child reports individual node status back to
the parent (so it can fill in the results hash), Perl really is a kitchen
sink. I create a pipe with Perl's pipe function and then fork a child to do
the work; the child writes the results to the pipe (status for each node),
and the parent reads that back later when nodereboot_wait() is called,
moving the results into the %reboot_results hash. The parent meanwhile can
go on and, in the case of os_setup, make more calls to reboot/reload other
nodes, later calling the wait() routines once all have been initiated.

It is also worth noting that in order to make the libraries "reentrant" I had
to do some cleaning up and reorganizing of the code. Nothing too major though,
just removal of lots of global variables. I also did some mild unrelated
cleanup of code that had been run over once too many times with a tank.
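
The pipe and fork trick described above can be sketched roughly as follows. This is only an illustration under stated assumptions: start_async() and wait_async() are made-up names, the one-line-per-node "node status" format is invented for the example, and the real libreboot/libosload code may differ in all of these details.

```perl
use strict;
use warnings;

# Hypothetical helper: start the work in a child and return immediately
# with what the parent needs in order to collect the results later.
sub start_async {
    my ($nodelist, $work) = @_;
    pipe(my $reader, my $writer) or die("pipe: $!");
    my $pid = fork();
    die("fork: $!") if (!defined($pid));
    if ($pid == 0) {
        # Child: do the work and write one "node status" line per node.
        close($reader);
        foreach my $node (@{ $nodelist }) {
            my $status = $work->($node) ? 0 : 1;   # 0 = success, 1 = failure
            print $writer "$node $status\n";
        }
        close($writer);
        exit(0);
    }
    # Parent: keep only the read end and carry on with other things.
    close($writer);
    return { 'pid' => $pid, 'reader' => $reader };
}

# Hypothetical helper: block until the child is done, move the per-node
# status into the results hash, and return the number of failed nodes.
sub wait_async {
    my ($handle, $results) = @_;
    my $reader = $handle->{'reader'};
    my $failed = 0;
    while (my $line = <$reader>) {
        my ($node, $status) = split(' ', $line);
        $results->{$node} = $status;
        $failed++ if ($status);
    }
    close($reader);
    waitpid($handle->{'pid'}, 0);
    return $failed;
}

my %results = ();
# The work callback here just pretends that pc2 fails.
my $handle  = start_async([ 'pc1', 'pc2' ], sub { $_[0] ne 'pc2' });
# ... the parent could start more children here before waiting ...
my $nfailed = wait_async($handle, \%results);
print "$nfailed node(s) failed\n";
```

Because the parent closes its copy of the write end, reading until end-of-file is enough to know the child has finished reporting.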

So how did this work out? Well, for os_setup/os_load it works rather
well.

node_reboot is another story. I probably should have left it alone, but
since I had already climbed the curve on osload, I decided to go ahead and
do reboot. The problem is that node_reboot needs to run as root (it's a
setuid script), which means it can only be used as a library from something
that is already setuid. os_setup and os_load run as the user. However,
having a consistent library interface and the ability to cleanly figure out
which individual nodes failed is a very nice thing.

So I came up with a suitable approach that is hidden in the library. When the
library is entered without proper privs, it silently execs an instance of
node_reboot (the setuid script), and then uses the same trick mentioned
above to read back individual node status. I create the pipe in the parent
before the exec, and clear the close-on-exec flag on it. I pass the fileno
along in an environment variable, and the library uses that to write the
results, just like above. The result is that os_setup sees the same
interface for both os_load and node_reboot, without having to worry that
one or the other needs to be run setuid.
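
A rough sketch of passing a pipe across an exec, under stated assumptions: RESULT_FD is a made-up environment variable name, the fork before the exec is assumed, and the exec'ed "perl -e" one-liner merely stands in for the setuid node_reboot script. None of these details are taken from the actual library.

```perl
use strict;
use warnings;
use Fcntl qw(F_GETFD F_SETFD FD_CLOEXEC);

# Stand-in for the exec'ed program: it reopens the inherited descriptor
# named in the (hypothetical) RESULT_FD variable and writes status lines.
my $child_code = q{
    my $fd = $ENV{'RESULT_FD'};
    open(my $out, ">&=", $fd) or die("fdopen $fd: $!");
    print $out "pc1 0\npc2 1\n";
    close($out);
};

pipe(my $reader, my $writer) or die("pipe: $!");

# Perl marks new descriptors above $^F close-on-exec, so clear that flag
# on the write end to keep it open in the exec'ed program.
my $flags = fcntl($writer, F_GETFD, 0);
defined($flags) or die("fcntl F_GETFD: $!");
fcntl($writer, F_SETFD, $flags & ~FD_CLOEXEC) or die("fcntl F_SETFD: $!");

my $pid = fork();
die("fork: $!") if (!defined($pid));
if ($pid == 0) {
    # Child: tell the new program which descriptor to write to, then exec.
    close($reader);
    $ENV{'RESULT_FD'} = fileno($writer);
    exec($^X, "-e", $child_code) or die("exec: $!");
}
close($writer);

# Parent: read back the per-node status exactly as in the plain fork case.
my %results = ();
while (my $line = <$reader>) {
    my ($node, $status) = split(' ', $line);
    $results{$node} = $status;
}
close($reader);
waitpid($pid, 0);
print join(", ", map { "$_=$results{$_}" } sort(keys(%results))), "\n";
```

The caller only ever sees the read end of the pipe, so it does not need to know whether the work was done by a forked copy of itself or by a separately exec'ed program.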
parent 5b23201c
@@ -48,7 +48,7 @@ LIBEXEC_STUFF = rmproj wanlinksolve wanlinkinfo \
LIB_STUFF = \ \ \
# Force dependencies on the scripts so that they will be rerun through
@@ -5,7 +5,6 @@
# Copyright (c) 2000-2004 University of Utah and the Flux Group.
# All rights reserved.
use English;
use Getopt::Std;
require '';
@@ -57,14 +56,15 @@ my $TFTP = "/tftpboot";
use lib "@prefix@/lib";
use libdb;
use libreboot;
use libosload;
use libtestbed;
my $nodereboot = "$TB/bin/node_reboot";
my $os_load = "$TB/bin/os_load";
my $vnode_setup = "$TB/sbin/vnode_setup";
my $osselect = "$TB/bin/os_select";
my $dbg = 0;
my $failed = 0;
my $noretry = 0;
my $failedvnodes= 0;
my $failedplab = 0;
my $canceled = 0;
@@ -76,8 +76,6 @@ my %pnodevcount = ();
my %plabvnodes = ();
my %osids = ();
my %canfail = ();
-my $db_result;
-my @row;
# Ah, Frisbee works so lets do auto reloading for nodes that do not have
@@ -166,7 +164,7 @@ TBDebugTimeStamp("os_setup started");
# Get the set of nodes, as well as the nodes table information for them.
-$db_result =
+my $db_result =
DBQueryFatal("select n.*,,nt.* from reserved as r ".
"left join nodes as n on n.node_id=r.node_id ".
"left join last_reservation as l on n.node_id=l.node_id ".
@@ -464,14 +462,12 @@ foreach my $vnode (keys(%vnodes)) {
TBDebugTimeStamp("rebooting/reloading started");
if (!$TESTMODE) {
-my %pids = ();
-my $count = 0;
-my $cmd;
+my @children = ();
foreach my $imageid ( keys(%reloads) ) {
-my @list = @{ $reloads{$imageid} };
+my @nodelist = @{ $reloads{$imageid} };
-foreach my $node (@list) {
+foreach my $node (@nodelist) {
TBSetNodeAllocState( $node, TBDB_ALLOCSTATE_RES_RELOAD() );
$nodeAllocStates{$node} = TBDB_ALLOCSTATE_RES_RELOAD();
# No point in reboot/reconfig obviously, since node will reboot!
@@ -480,9 +476,18 @@ if (!$TESTMODE) {
$rebooted{$node} = 1;
my %reload_args = ();
my $reload_failures = {};
$reload_args{'debug'} = $dbg;
$reload_args{'asyncmode'} = 1;
$reload_args{'imageid'} = $imageid;
$reload_args{'nodelist'} = [ @nodelist ];
my $pid = osload(\%reload_args, $reload_failures);
push(@children, [ $pid, \&osload_wait,
[ @nodelist ], $reload_failures ]);
$pids{"$os_load -m $imageid @list"} =
ForkCmd("$os_load -m $imageid @list");
@@ -501,61 +506,101 @@ if (!$TESTMODE) {
$rebooted{$node} = 1;
$cmd = "$nodereboot " . join(" ", keys(%reboots));
$pids{$cmd} = ForkCmd($cmd);
my @nodelist = keys(%reboots);
my %reboot_args = ();
my $reboot_failures = {};
$reboot_args{'debug'} = $dbg;
$reboot_args{'waitmode'} = 0;
$reboot_args{'asyncmode'} = 1;
$reboot_args{'nodelist'} = [ @nodelist ];
my $pid = nodereboot(\%reboot_args, $reboot_failures);
push(@children, [ $pid, \&nodereboot_wait,
[ @nodelist ], $reboot_failures ]);
# Fire off the reconfigs.
if (keys(%reconfigs)) {
$cmd = "$nodereboot -c " . join(" ", keys(%reconfigs));
$pids{$cmd} = ForkCmd($cmd);
my @nodelist = keys(%reconfigs);
my %reboot_args = ();
my $reboot_failures = {};
$reboot_args{'debug'} = $dbg;
$reboot_args{'waitmode'} = 0;
$reboot_args{'asyncmode'} = 1;
$reboot_args{'reconfig'} = 1;
$reboot_args{'nodelist'} = [ @nodelist ];
my $pid = nodereboot(\%reboot_args, $reboot_failures);
push(@children, [ $pid, \&nodereboot_wait,
[ @nodelist ], $reboot_failures ]);
foreach $cmd ( keys(%pids) ) {
my $pid = $pids{$cmd};
# Wait for all of the children to exit. We look at the $pid to know if
# command failed/ended immediately; otherwise we need to wait on it.
# For any failures, record the node failures for later so that we do
# not wait for them needlessly.
while (@children) {
my ($pid, $waitfunc, $listref, $hashref) = @{ pop(@children) };
waitpid($pid, 0);
if ($?) {
print "*** Failed: $cmd\n";
TBDebugTimeStamp("rebooting/reloading finished");
# This is not likely to happen.
if ($pid == 0);
# XXX What happens if something above fails? We could exit, but some nodes
# that *are* rebooting would be caught in the middle. For the nodes that
# were reloaded, we can check the state right away (and avoid the wait
# below as well); they should be in the ISUP state when os_load is
# finished. If not, thats a failure and we can save some time below. For
# plain reboot failures, nothing to do but find out below after the wait.
# I do not want to exit right away cause we might end up with a lot more
# power cycles since the nodes are very likely to be in a non responsive
# state if just rebooted!
foreach my $imageid ( keys(%reloads) ) {
my @list = @{ $reloads{$imageid} };
if ($pid > 0) {
if (! &$waitfunc($pid));
# Failure. Record the failures for later. If the $pid<0 then the
# entire list failed. Otherwise, have to scan the return hash to
# find the failures.
my @nodelist = ();
if ($pid < 0) {
@nodelist = @{ $listref };
else {
foreach my $node (keys(%{ $hashref })) {
push(@nodelist, $node)
if ($hashref->{$node});
foreach my $node ( @list ) {
my $mode;
# These errors are unusal enough that we do not want to retry
# or keep going even if canfail is set. Better to stop and let
# someone look at what happened.
$noretry = 1;
if (!TBGetNodeOpMode($node, \$mode)) {
print "*** Error getting operational mode for $node!\n";
print "*** Not waiting for $node since its reload failed!\n";
foreach my $node (@nodelist) {
print "*** Not waiting for $node since its reload/reboot failed!\n";
TBSetNodeAllocState($node, TBDB_ALLOCSTATE_DOWN());
$nodeAllocStates{$node} = TBDB_ALLOCSTATE_DOWN();
# Remaining nodes we need to wait for.
TBDebugTimeStamp("rebooting/reloading finished");
# Remaining nodes we need to wait for. Why do we wait in the face of errors
# above? So that they enter a reasonably known state before we try to tear
# things down. Otherwise we could end up power cycling nodes a lot more often.
# This should probably be handled in other ways, say via stated or the alloc
# state machine.
my @nodelist = keys(%nodes);
@@ -599,7 +644,7 @@ while ( @nodelist ) {
-if ($retries{$node} && !$canceled) {
+if ($retries{$node} && !($canceled || $noretry)) {
$retries{$node} -= 1;
print "*** Rebooting $node and waiting again ...\n";
@@ -617,7 +662,7 @@ while ( @nodelist ) {
print "*** WARNING: $node may be down.\n".
" This has been reported to testbed-ops.\n";
-if ($canfail{$node}) {
+if ($canfail{$node} && !($canceled || $noretry)) {
# Send mail to testbed-ops and to the user about it.
my ($user) = getpwuid($UID);
@@ -803,7 +848,7 @@ TBDebugTimeStamp("os_setup finished");
# No retry if vnodes failed. Indicates a fatal problem.
-if ($failedvnodes || $canceled);
+if ($failedvnodes || $canceled || $noretry);
if ($failed || $failedplab);
exit 0;