Commit 859a9986 authored by Leigh B. Stoller

Address comments on start command and batch system changes, and sync
server additions, made by Eric, Tim and Jay.
parent c747d4eb
......@@ -5,7 +5,7 @@
*/
/*
* This ia program agent to manage programs from the event system.
* This is a program agent to manage programs from the event system.
*
* You can start, stop, and kill (signal) programs.
*/
......
......@@ -28,6 +28,7 @@ ETCDIR = $(DESTDIR)$(CLIENT_ETCDIR)
BINDIR = $(DESTDIR)$(CLIENT_BINDIR)
VARDIR = $(DESTDIR)$(CLIENT_VARDIR)
RCDIR = $(DESTDIR)/usr/local/etc/rc.d
TBBINDIR = $(DESTDIR)/usr/testbed/bin
INSTALL = /usr/bin/install -c
install:
......@@ -36,7 +37,7 @@ install:
@echo "directory afterwards."
local-install: path-install local-script-install
local-install: path-install local-script-install symlinks
remote-install: path-install remote-script-install
control-install: path-install control-script-install
......@@ -55,6 +56,7 @@ dir-install:
$(INSTALL) -m 755 -o root -g wheel -d $(VARDIR)/logs
$(INSTALL) -m 755 -o root -g wheel -d $(VARDIR)/boot
$(INSTALL) -m 755 -o root -g wheel -d $(VARDIR)/lock
$(INSTALL) -m 755 -o root -g wheel -d $(TBBINDIR)
path-install: dir-install
$(INSTALL) -m 755 $(SRCDIR)/paths.pm $(ETCDIR)/paths.pm
......@@ -69,7 +71,13 @@ common-script-install: dir-install vnodesetup
$(INSTALL) -m 755 $(SRCDIR)/update $(BINDIR)/update
$(INSTALL) -m 755 vnodesetup $(BINDIR)/vnodesetup
$(INSTALL) -m 755 $(SRCDIR)/bootvnodes $(BINDIR)/bootvnodes
$(INSTALL) -m 755 $(SRCDIR)/batchcmddone $(BINDIR)/batchcmddone
$(INSTALL) -m 755 $(SRCDIR)/startcmddone $(BINDIR)/startcmddone
symlinks: dir-install
rm -f $(TBBINDIR)/tevc
ln -s $(BINDIR)/tevc $(TBBINDIR)/tevc
rm -f $(TBBINDIR)/emulab-sync
ln -s $(BINDIR)/emulab-sync $(TBBINDIR)/emulab-sync
local-script-install: common-script-install
$(INSTALL) -m 755 $(SRCDIR)/bootsetup $(BINDIR)/bootsetup
......
......@@ -7,11 +7,11 @@
use English;
#
# Report that batch command for this node is done. Report status.
# Report that start command for this node is done. Report status.
#
sub usage()
{
print "Usage: batchcmddone <status>\n";
print "Usage: startcmddone <status>\n";
exit(1);
}
my $stat;
......
......@@ -730,10 +730,8 @@
rerun your experiment from scratch, but without the added expense
of a swapin and swapout. In other words, the nodes that are
currently allocated to your experiment are all rebooted, and the
experiment startup state is cleared. This includes the
<a href="#SWS-6">ready bits</a>, the boot status in the web page,
and the <a href="#SWS-4">startup command status</a>. In addition,
the event scheduler for the experiment is restarted, and your event
experiment startup state is cleared.
The event scheduler for the experiment is restarted, and your event
sequence is replayed again. Note that your rpms and tarfiles are
<b>not</b> installed again. Replay is obviously faster than
swapout/swapin, and has the added benefit that you will not run
......@@ -1062,7 +1060,7 @@
RPM installation) has completed. The exit status of the script (or
program) is reported back and is made available for you to view in
the Experiment Information link in the menu at your left. The Emulab
NS extension <tt>tb-set-node-startup</tt> is used in the NS file
NS extension <tt>tb-set-node-startcmd</tt> is used in the NS file
to specify the path of the script (or program) to run. You may
specify a different program for each node in the experiment.
</p>
......@@ -1090,22 +1088,11 @@
<font size='+1'><b>How does my software determine when other nodes in my
experiment are ready?</b></font>
<p>
If your application requires synchronization to determine when all
of the nodes in your experiment have started up and are ready to
proceed, then you can use the Testbed's <i>ready bits</i>
mechanism. The ready bits are really just a way of determining how
many nodes have issued the <b>ready</b> command, and is returned
to the application as a simple N of M string, where N is the
number that have reported in, and M is the total number of nodes
in the experiment. Applications can use this as a very simplistic
form of barrier synchronization, albeit one that can be used just
once and one that does not actually block!
</p>
<p>
Use of the ready bits is described in more detail in the <a href =
"tutorial/tutorial.php3">Emulab Tutorial</a> and in the <a href =
"doc/docwrapper.php3?docname=tmcd.html"> Testbed Master Control
Daemon</a> documentation.
If your application requires synchronization amongst your nodes,
you may use the Emulab provided synchronization server, which
provides a very simple form of barrier synchronization. Use of the
synchronization server is described in more detail in the <a href =
"tutorial/tutorial.php3#SyncServer">Emulab Tutorial</a>.
</p>
<li><a NAME="SWS-7"></a>
......
......@@ -1014,7 +1014,6 @@ function SHOWNODES($pid, $eid) {
<th>Node<br>Status</th>
<th>Hours<br>Idle[<b>1</b>]</th>
<th>Startup<br>Status[<b>2</b>]</th>
<th>Ready<br>Status[<b>3</b>]</th>
</tr>\n";
$sort = "type,priority";
......@@ -1043,7 +1042,6 @@ function SHOWNODES($pid, $eid) {
$type = $row[type];
$def_boot_osid = $row[def_boot_osid];
$startstatus = $row[startstatus];
$readystatus = $row[ready];
$status = $row[nodestatus];
$bootstate = $row[eventstate];
$idlehours = TBGetNodeIdleTime($node_id);
......@@ -1057,10 +1055,6 @@ function SHOWNODES($pid, $eid) {
if (!$vname)
$vname = "--";
if ($readystatus)
$readylabel = "Yes";
else
$readylabel = "No";
echo "<tr>
<td><A href='shownode.php3?node_id=$node_id'>$node_id</a>
......@@ -1083,7 +1077,6 @@ function SHOWNODES($pid, $eid) {
echo " <td>$idlestr</td>
<td align=center>$startstatus</td>
<td align=center>$readylabel</td>
</tr>\n";
}
echo "</table>\n";
......@@ -1093,7 +1086,6 @@ function SHOWNODES($pid, $eid) {
the node has not reported on its proper schedule.
<li>Exit value of the node startup command. A value of
666 indicates a testbed internal error.
<li>User application ready status, reported via TMCC.
</ol>
</blockquote></blockquote></blockquote></h4>\n";
}
......
......@@ -400,6 +400,7 @@ example:
<p>
<h3>
<a NAME="ProgramObjects"></a>
Program Objects
</h3>
......
set ns [new Simulator]
source tb_compat.tcl
# Two nodes
set nodeA [$ns node]
set nodeB [$ns node]
# A link
$ns duplex-link $nodeA $nodeB 100Mb 0ms DropTail
# Set the OS.
tb-set-node-os $nodeA FBSD-STD
tb-set-node-os $nodeB FBSD-STD
tb-set-node-os $nodeB RHL-STD
tb-set-node-rpms $nodeA /proj/testbed/rpms/silly-1.0-1.i386-freebsd.rpm
tb-set-node-rpms $nodeB /proj/testbed/rpms/silly-1.0-1.i386-freebsd.rpm
# Load our software.
tb-set-node-tarfiles $nodeA /usr/site /proj/testbed/tarfiles/silly.tar.gz
tb-set-node-tarfiles $nodeB /usr/site /proj/testbed/tarfiles/silly.tar.gz
tb-set-node-startup $nodeA /usr/site/bin/run-silly
tb-set-node-startup $nodeB /usr/site/bin/run-silly
# Set the commands to run
tb-set-node-startcmd $nodeA "/usr/site/bin/run-silly >& /tmp/foo.log"
tb-set-node-startcmd $nodeB "/usr/site/bin/run-silly >& /tmp/foo.log"
$ns run
......@@ -323,12 +323,12 @@ tb-set-node-rpms $node0 rpm1 rpm2 rpm3
</ul>
<h4>tb-set-node-startup</h4>
<h4>tb-set-node-startcmd</h4>
<pre>
tb-set-node-startup <i>node</i> <i>startupcmd</i>
tb-set-node-startcmd <i>node</i> <i>startupcmd</i>
tb-set-node-startup $node0 {mystart.sh -a}
tb-set-node-startcmd $node0 {mystart.sh -a >& /tmp/node0.log}
</pre>
<p>Notes:
......
<!--
EMULAB-COPYRIGHT
Copyright (c) 2000-2002 University of Utah and the Flux Group.
Copyright (c) 2000-2003 University of Utah and the Flux Group.
All rights reserved.
-->
<center>
......@@ -8,7 +8,7 @@
</center>
If you <em>really</em> want to setup and manage routing entirely by
yourself, you can use the <tt>tb-set-node-startup</tt> command in your
yourself, you can use the <tt>tb-set-node-startcmd</tt> command in your
NS file to specify a per-node script in your home directory to set up
routing at boot time. You can use this startup script to setup either
static routes or to fire up a routing daemon. We do not recommend this
......@@ -23,7 +23,7 @@ to <i>router</i>. In order to get router to correctly forward packets
between client and server, you would add this line to your NS file:
<code><pre>
tb-set-node-startup $router /proj/pid/router-startup
tb-set-node-startcmd $router /proj/pid/router-startup
</pre></code>
This will make router run the <code>router-startup</code> script
......@@ -49,8 +49,8 @@ Now to get client and server to talk to each other through this router,
you would add these lines to your NS file:
<code><pre>
tb-set-node-startup $client /proj/pid/clientroutecmd
tb-set-node-startup $server /proj/pid/serverroutecmd
tb-set-node-startcmd $client /proj/pid/clientroutecmd
tb-set-node-startcmd $server /proj/pid/serverroutecmd
</pre></code>
This will have the client and the server each call a small script
......@@ -119,8 +119,8 @@ You can make a single script that will handle all end nodes, by replacing
and specifying the router in your NS file:
<code><pre>
tb-set-node-startup $clientA {/proj/pid/router-startup router0}
tb-set-node-startup $clientB {/proj/pid/router-startup router1}
tb-set-node-startcmd $clientA {/proj/pid/router-startup router0}
tb-set-node-startcmd $clientB {/proj/pid/router-startup router1}
</pre></code>
For multiple routers, the startup script for each router will need to
......
......@@ -29,7 +29,7 @@
<li> <a href="#TARBALLS">Installing Tar files automatically</a>
<li> <a href="#Startupcmd">
Starting your application automatically</a>
<li> <a href="#ReadyBits">
<li> <a href="#SyncServer">
How do I know when all my nodes are ready?</a>
<li> <a href="#Routing">
Setting up IP routing between nodes</a>
......@@ -553,7 +553,7 @@ Again, please feel free to contact us.
<li> <a href="#RPMS">Installing RPMS automatically</a>
<li> <a href="#TARBALLS">Installing Tar files automatically</a>
<li> <a href="#Startupcmd">Starting your application automatically</a>
<li> <a href="#ReadyBits">How do I know when all my nodes are ready?</a>
<li> <a href="#SyncServer">How do I know when all my nodes are ready?</a>
<li> <a href="#Routing">Setting up IP routing between nodes</a>
<li> <a href="#Simem">Hybrid Experiments with Simulation and Emulation</a>
<img src="../new.gif" alt="&lt;NEW&gt;">
......@@ -627,15 +627,15 @@ be unpacked in, all separated by spaces.
<p>
You can start your application automatically when your nodes boot by
using the <tt>tb-set-node-startup</tt> NS extension. The argument is
using the <tt>tb-set-node-startcmd</tt> NS extension. The argument is
the pathname of a script or program that is run as the <tt>UID</tt> of
the experiment creator, after the node has reached multiuser mode. You
can specify the same program for each node, or a different program.
For example:
<code><pre>
tb-set-node-startup $nodeA /proj/pid/runme.nodeA
tb-set-node-startup $nodeB /proj/pid/runme.nodeB </code></pre>
tb-set-node-startcmd $nodeA /proj/pid/runme.nodeA
tb-set-node-startcmd $nodeB /proj/pid/runme.nodeB </code></pre>
will run <tt>/proj/pid/runme.nodeA</tt> on nodeA and
<tt>/proj/pid/runme.nodeB</tt> on nodeB. The programs must reside on
......@@ -643,82 +643,87 @@ the node's local filesystem, or in a directory that can be reached via
NFS. This is either the project's <tt>/proj</tt> directory, in the
<tt>/group</tt> directory if the experiment has been created in a
subgroup, or a project member's home directory in <tt>/users</tt>.
If you need to see the output of your command, be sure to redirect the
output into a file. You can place the file on the local node, or in
one of the NFS mounted directories mentioned above. For example:
<p>
The exit value of the startup command is reported back to the Web
<code><pre>
tb-set-node-startcmd $nodeB "/proj/pid/runme >& /tmp/foo.log" </code></pre>
The exit value of the start command is reported back to the Web
Interface, and is made available to you via the "Experiment
Information" link. There is a listing for all of the nodes in the
experiment, and the exit value is recorded in this listing. The
special symbol <tt>none</tt> indicates that the node is still running
the startup command. A log file containing the output of the startup
command is created in the project's <tt>logs</tt> directory
(<tt>/proj/pid/logs</tt>).
the start command.
<p>
The startup command is especially useful when
The start command is especially useful when
combined with <a href="#BatchMode"><i>batch mode</i></a> experiments.
<p>
<li> <a NAME="ReadyBits"></a>
The start command is implemented using
<a href="docwrapper.php3?docname=advanced.html#ProgramObjects">
Program Objects</a>, which are described in more detail in the
<a href="docwrapper.php3?docname=advanced.html">
Advanced Tutorial.</a>
<p>
<li> <a NAME="SyncServer"></a>
<h3>How do I know when all my nodes are ready?</h3>
<p>
It is often necessary for your startup program to determine when all
It is often necessary for your start program to determine when all
of the other nodes in the experiment have started, and are ready to
proceed. Sometimes called a <i>barrier</i>, this allows programs to
wait at a specific point, and then all proceed at once. Emulab
provides a primitive form of this mechanism using experiment <i>ready
bits</i>, which are set and read using the
<a href="../doc/docwrapper.php3?docname=tmcd.html">
TMCD/TMCC</a>. When an experiment is first configured, the ready bit
for each node is cleared. As each node starts its application and
reaches the point where it must be sure that all other nodes have
started up, it issues a TMCC <tt>ready</tt> command:
provides a simple form of this mechanism using a synchronization server
that runs on a node of your choice. You specify the node in your NS
file:
<code><pre>
tmcc ready </code></pre>
tb-set-sync-server $nodeB </code></pre>
which tells Emulab's configuration system that the node is ready to
proceed. The node can then poll for the <i>ready count</i> to
determine how many nodes are ready (have issued a tmcc ready command):
When nodeB boots, the synchronization server will automatically start.
Your software can then synchronize using the <tt>emulab-sync</tt>
program that is installed on your nodes. For example, your node start
command might look like this:
<code><pre>
tmcc readycount </code></pre>
which will return the ready count as a string:
#!/bin/sh
if [ "$1" = "master" ]; then
/usr/testbed/bin/emulab-sync -i 4
else
/usr/testbed/bin/emulab-sync
fi
/usr/site/bin/dosilly </code></pre>
As you can see, one of the nodes must be configured to operate as the
master, initializing the barrier to the number of clients that are
expected to rendezvous at the barrier. The master will by default wait
for all of the clients to reach the barrier. Each client of the
barrier also waits until all of the clients have reached the barrier
(and of course, until the master initializes the barrier to the proper
count). Any number of clients may be specified (any subset of nodes in
your experiment can wait). If the master does not need to wait for the
clients, you may use the <em>async</em> option which releases the
master immediately:
<code><pre>
READY=N TOTAL=M </code></pre>
/usr/testbed/bin/emulab-sync -a -i 4 </code></pre>
where <tt>N</tt> is the number of nodes that are ready, and <tt>M</tt>
is the total number of nodes in the experiment. An application can
poll the ready count with a simple script, or it can encode the ready
bits check directly into its program. For example, here is a simple
Perl fragment that issues the ready command, and then polls for the
ready count, being sure to delay a small amount between each poll.
You may also specify the <em>name</em> of the barrier.
<code><pre>
system("tmcc ready");
while (1) {
my $bits = `tmcc readycount`;
if ($bits =~ /READY=(\d*) TOTAL=(\d*)/) {
if ($1 == $2) {
last;
}
}
#
# Please sleep to avoid swamping the TMCD!
#
sleep(5);
} </code></pre>
<i>Note that the ready count is essentially a use-once feature; The
ready count cannot be reinitialized to zero since there is no actual
synchronization happening. If in the future it appears that a
generalized barrier synchronization would be more useful, we will
investigate the implementation of such a feature.</i>
/usr/testbed/bin/emulab-sync -a -i 4 -n mybarrier
/usr/testbed/bin/emulab-sync -n mybarrier </code></pre>
</p>
This allows multiple barriers to be in use at the same time. Scripts
on NodeA and NodeB can be waiting on a barrier named "foo" while
(other) scripts on NodeA and NodeC can be waiting on a barrier named
"bar." You may reuse an existing barrier (including the default
barrier) once it has been released (all clients arrived and woken up).
<p>
<li> <a NAME="Routing"></a>
<h3>Setting up IP routing between nodes</h3>
......@@ -966,14 +971,13 @@ regular experiment and a batch mode experiment:
notifying you when the experiment has been scheduled and when it
has been terminated.
<p>
<li> Your NS file must define a <i>startup</i> command to run on each
node using the <a href="#Startupcmd"><tt>tb-set-node-startup</tt></a>
NS extension. It is the exit value(s) of the startup command(s) that
<li> Your NS file must define a <i>start</i> command to run on each
node using the <a href="#Startupcmd"><tt>tb-set-node-startcmd</tt></a>
NS extension. It is the exit value(s) of the start command(s) that
indicates that the experiment is completed; when all of the
nodes have run their respective startup commands and exited, the
batch system will then tear down the experiment. The output of
the startup command is stored in the project <tt>logs</tt> directory
(<tt>/proj/pid/logs</tt>) so you can follow what has happened.
nodes have run their respective start commands and reported
their exit values, the batch system will then tear down the
experiment.
</ul>
<p>
......@@ -983,63 +987,58 @@ regular experiment and a batch mode experiment:
Consider example NS file <a href="batch.ns" target=stuff>batch.ns</a>.
First off, we have to arrange for the experimental software to be
automatically installed when the nodes boot. This is done with the <a
href="#RPMS"><tt>tb-set-node-rpms</tt></a> NS extension:
href="#TARBALLS"><tt>tb-set-node-tarfiles</tt></a> NS extension:
<code><pre>
tb-set-node-rpms $nodeA /proj/testbed/rpms/silly-1.0-1.i386-freebsd.rpm
tb-set-node-rpms $nodeB /proj/testbed/rpms/silly-1.0-1.i386-freebsd.rpm
tb-set-node-tarfiles $nodeA /usr/site /proj/testbed/tarfiles/silly.tar.gz
tb-set-node-tarfiles $nodeB /usr/site /proj/testbed/tarfiles/silly.tar.gz
</code></pre>
The next two lines of the NS file specify what program should be run
on each of the nodes. Using the <a href="#Startupcmd">
<tt>tb-set-node-startup</tt></a> NS extension, we say that the program
<tt>run-silly</tt> (installed by the <tt>silly-1.0</tt> RPM) is to be
run on both nodes:
<tt>tb-set-node-startcmd</tt></a> NS extension, we specify the
name of the program to run once all the nodes have booted and are
ready to proceed:
<code><pre>
tb-set-node-startup $nodeA /usr/site/bin/run-silly
tb-set-node-startup $nodeB /usr/site/bin/run-silly
tb-set-node-startcmd $nodeA "/usr/site/bin/run-silly >& /tmp/foo.log"
tb-set-node-startcmd $nodeB "/usr/site/bin/run-silly >& /tmp/foo.log"
</code></pre>
After you have been notified via email that the batch experiment is
running, you can track the progress of your experiment by looking in
the "Experiment Information" page. As each node completes the startup
the "Experiment Information" page. As each node completes the start
command, the listing for that node will be updated to reflect the exit
status of the command (you may need to hit the Reload button to see
the changes). Once all of the nodes have reported in an exit status,
the batch system will tear down the experiment and send you email. If
your experiment is such that one node is the controller, and runs
commands on all the other nodes, then simply run a dummy startup
commands on all the other nodes, then simply run a dummy start
command on the other nodes so that the batch system will receive an
exit value for that node. Since the batch is not terminated until
<em>all</em> nodes have reported in, be sure that the controlling node
does not exit from its startup command until all of the nodes have
finished. A dummy startup command can be setup like this:
does not exit from its start command until all of the nodes have
finished. A dummy start command can be setup like this:
<code><pre>
tb-set-node-startup $nodeC /bin/echo
</code></pre>
tb-set-node-startcmd $nodeC /bin/echo </code></pre>
<p>
The status of your batch experiment can be viewed via the "Experiment
Information" link in the Web Interface Options menu. You may also
cancel a batch after you have submitted it using the "Terminate"
option in the information display. As noted in the section on the <a
href="#Startupcmd">Startupcmd</a>, the output of the startup command
on each node is written to separate files in your project log
directory. You can use these log files to debug your batch experiment.
option in the information display. You may also stop a batch job,
causing it to swap out by using the "Stop" option. The batch may be
reposted at any time.
<p>
<i>
The batch system is still under development. It appears to be
functional, but there are bound to be kinks in the system. Please help
us debug and improve it by letting us know what you think and if you
have problems with it. Currently, the batch system tries every 10
minutes to run your batch. It will send you email every 5 or so
attempts to let you know that it is trying, but that resources are not
available. It is a good idea to glance at the message to make sure
that the problem is lack of resources and not an error in your NS
file.</i>
The batch system is still under development. Currently, the batch
system tries every 10 minutes to run your batch. It will send you
email every 5 or so attempts to let you know that it is trying, but
that resources are not available. It is a good idea to glance at the
message to make sure that the problem is lack of resources and not an
error in your NS file.</i>
<!-- This ends the Basic Tutorial Section -->
</ul>
......
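The master/client split described in the SyncServer hunk above can be sketched as a small shell helper. This is only an illustration, not part of the commit: the function name `sync_cmd` and its arguments are hypothetical, and it uses only the `-i` and `-n` options that appear in the tutorial text. It prints the `emulab-sync` command line for a node rather than running it, so the logic can be checked off-testbed.

```shell
#!/bin/sh
# Path taken from the tutorial text in this commit.
SYNC=/usr/testbed/bin/emulab-sync

# Hypothetical helper: print the emulab-sync command line for one node.
#   $1  role: "master" initializes the barrier, anything else is a client
#   $2  number of clients expected at the barrier (used by the master only)
#   $3  optional barrier name; empty selects the default barrier
sync_cmd() {
    role=$1
    count=$2
    name=$3
    cmd=$SYNC
    if [ "$role" = "master" ]; then
        # The master sets the barrier count with -i.
        cmd="$cmd -i $count"
    fi
    if [ -n "$name" ]; then
        # A named barrier lets several barriers coexist.
        cmd="$cmd -n $name"
    fi
    echo "$cmd"
}
```

A start command could, under these assumptions, run `$(sync_cmd "$1" 4 mybarrier)` and then launch the application, mirroring the `#!/bin/sh` example in the tutorial.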