Commit ff13e555 authored by Robert Ricci

The beginnings of a document giving an overview of how the bits and
pieces of the testbed software fit together. Aimed at sites (like
Kentucky) running their own Emulabs.
parent ef307bd6
#####
##### overview.txt - Overview of the Emulab software, and the way various pieces
##### fit together.
#####
##### Some Key Emulab Programs and Libraries
tmcd - daemon that runs on boss, and is essentially a proxy for the database.
Nodes are not, for security reasons, allowed to contact the database directly.
Also, nodes should not have to know details of the database to get, for example,
a list of accounts to create. Used by nodes to get information such as user
accounts, NFS mounts, hostnames, etc. Uses SSL to authenticate the server and
clients, and to encrypt transmissions. The client side, used on the nodes, is
called tmcc.
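The proxy idea can be illustrated with a minimal sketch. This is not tmcd's
actual protocol or command set: the request names, the HANDLERS table, and the
in-memory DB below are all hypothetical stand-ins. The point is only that a node
names what it wants, and the server decides what to look up on its behalf, so
the node needs neither database access nor knowledge of the schema.

```python
# Hypothetical in-memory stand-in for the database on boss.
DB = {
    "accounts": {"pc1": ["alice", "bob"]},
    "mounts": {"pc1": ["/proj/demo", "/users/alice"]},
}


def get_accounts(node_id):
    """Accounts that should exist on the given node."""
    return DB["accounts"].get(node_id, [])


def get_mounts(node_id):
    """NFS mounts the given node should set up."""
    return DB["mounts"].get(node_id, [])


# Only requests listed here are answered; nodes never send queries.
HANDLERS = {"accounts": get_accounts, "mounts": get_mounts}


def handle_request(node_id, request):
    """Answer one request from a node, scoped to that node's own data."""
    handler = HANDLERS.get(request)
    if handler is None:
        return ["ERROR unknown request"]
    return handler(node_id)
```

In this sketch a node that sends anything outside the table, including raw SQL,
simply gets an error line back.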
snmpit - The program we use to configure switches via SNMP. It gets used on the
experimental net to create VLANs and set port speed and duplex. It is generally
not used on the control net switch.
suexec - Invoked by the web interface to execute commands as other users. All
commands run on boss from the web interface (such as the ones to create and
terminate experiments) go through suexec, and are executed as the user logged
into the webserver, not as the webserver itself.
assign - Our simulated annealing algorithm that maps the user's requested
topology onto available hardware. Its main purposes are to minimize inter-switch
bandwidth in environments with multiple experimental switches, resolve hardware
types, and make sure that any special features of nodes are used efficiently
(i.e., are not used by an experiment that did not request them.) It is called by
assign_wrapper, which does the task of generating the list of available
resources, and reserving the resources picked by assign.
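To give a feel for the approach, here is a toy simulated annealing sketch. It is
not assign's algorithm or cost function: it only counts virtual links that cross
switches, and it ignores bandwidth, hardware types, and node features. The
function names, parameters, and cooling schedule are all assumptions made for
illustration.

```python
import math
import random


def inter_switch_links(mapping, links, switch_of):
    """Count virtual links whose endpoints land on different switches."""
    return sum(1 for a, b in links
               if switch_of[mapping[a]] != switch_of[mapping[b]])


def anneal(vnodes, links, switch_of, steps=5000, temp=2.0, cooling=0.999, seed=0):
    """Map virtual nodes to physical nodes, trying to keep links on one switch."""
    rng = random.Random(seed)
    pnodes = sorted(switch_of)
    # Start from a random one-to-one assignment.
    mapping = dict(zip(vnodes, rng.sample(pnodes, len(vnodes))))
    cost = inter_switch_links(mapping, links, switch_of)
    best, best_cost = dict(mapping), cost
    for _ in range(steps):
        v = rng.choice(vnodes)
        p = rng.choice(pnodes)
        old = mapping[v]
        owner = next((u for u, q in mapping.items() if q == p), None)
        # Propose: v takes p; if another virtual node held p, it takes v's old node.
        mapping[v] = p
        if owner is not None and owner != v:
            mapping[owner] = old
        new_cost = inter_switch_links(mapping, links, switch_of)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(mapping), cost
        else:
            # Rejected: undo the proposal.
            mapping[v] = old
            if owner is not None and owner != v:
                mapping[owner] = p
        temp = max(temp * cooling, 1e-3)
    return best, best_cost
```

For example, with two linked pairs of virtual nodes and two switches holding two
physical nodes each, the search should settle on a mapping that puts each pair
on its own switch.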
parse.tcl - Our NS parser, implemented as a Tcl script that loads libraries that
mimic NS commands, and pulls a few other tricks (such as overriding variable
assignment.) Evaluates the user's input NS script, and places the results into
the database.
frisbeed - Server for our multicast disk loading system. More on this below in
the 'Images' section.
libdb - Big library that is used to interface with the database. It hides
details such as the name of the database from scripts, retries failed
connections, and can send mail and/or terminate the script when queries fail.
It also contains functions for doing permissions checks, getting information
about the state of an experiment, project, or node, and so forth. Almost all
perl scripts use this library.
libtestbed - A small library that contains handy functions for sending mail to
the user, going into the background, and so forth.
##### Some useful administrative programs
sched_reload - Schedules a reload of a node with the default image. If the node
is free, moves it to the reloading experiment, and starts the reload
immediately. If the node is reserved, puts an entry into the scheduled_reloads
table. When the node is freed from an experiment (by 'nfree'), nfree checks this
table to see if the node should be reloaded rather than being released into the
free pool. This is the preferred way to get nodes reloaded when you have a new
version of the default image.
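The decision described above can be sketched as follows. This is an
illustration only, not the real scripts: the dictionary and set stand in for the
reservation data and the scheduled_reloads table, and the assumption that the
scheduled entry is removed once it has been acted on is ours.

```python
RELOADING = "emulab-ops/reloading"


def sched_reload(node, reserved, scheduled_reloads):
    """Reload a free node now, or remember to reload it when it is freed."""
    if node not in reserved:
        reserved[node] = RELOADING
        return "reloading"
    scheduled_reloads.add(node)
    return "scheduled"


def nfree(node, reserved, scheduled_reloads):
    """Free a node, honoring any pending scheduled reload."""
    if node in scheduled_reloads:
        scheduled_reloads.discard(node)  # assumption: entry is consumed here
        reserved[node] = RELOADING
        return "reloading"
    del reserved[node]
    return "free"
```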
sched_reserve - Works like sched_reload, but re-allocates a node to another
experiment when it gets freed (or, immediately, if the node is already free.)
Most often used to move a suspect node to emulab-ops/hwdown when an experimenter
reports something that may be a hardware problem.
##### Node Boot Process
We boot nodes via PXE, which is a feature that allows a network card to download
code to boot from. Thus, the control network card in each node needs to have
PXE, but it's best to have it disabled on the experimental interfaces (because
you'll just waste time at boot, waiting for DHCP to time out.) PXE contacts
dhcpd on boss, which gives it an IP address and so forth, and then hands it off
to 'proxydhcp' (also running on boss.) This daemon looks into the nodes table,
at the pxe_boot_path and next_pxe_boot_path fields, to tell the card where to
load its boot program from. next_pxe_boot_path is intended to be used by the
Emulab software to temporarily override the user's settings. PXE on the NIC
then downloads the boot program via TFTP from boss. Normally, we load something
called 'pxeboot', which is a little custom OSKit boot loader. But, we can also
boot some loaders that load FreeBSD into memory and run it from there - more on
this in the disk image section.
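The override rule for the two boot path fields amounts to the following sketch.
The row is modeled as a plain dictionary, and treating an empty or missing
next_pxe_boot_path as "no override" is an assumption; how and when the real
software clears the override is not covered here.

```python
def choose_boot_path(node_row):
    """Pick the boot program for a node: the temporary override wins if set."""
    override = node_row.get("next_pxe_boot_path")
    if override:
        return override
    return node_row["pxe_boot_path"]
```

For example, a row with pxe_boot_path set to "pxeboot" and next_pxe_boot_path
set to "pxeboot.frisbee" yields "pxeboot.frisbee"; with the override empty, it
yields "pxeboot".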
pxeboot contacts another daemon on boss called 'bootinfo' to find out what to
boot. bootinfo looks at the nodes table to figure this out. Usually, this is
done by looking at the 'def_boot_osid' field, then looking in the partitions table
to discover which partition that OS can be found in. However, pxeboot can also
boot from other sources, such as kernels loaded via TFTP. You can also use
pxeboot interactively, by pressing any key when prompted to do so during boot.
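The usual bootinfo lookup can be pictured with a small sketch. Only the
def_boot_osid field name comes from the description above; the shape of the
partitions data (a mapping from node and partition number to OS id) and the OS
ids used in the example are hypothetical.

```python
def find_boot_partition(node_id, nodes, partitions):
    """Return the partition holding the node's default OS, or None."""
    osid = nodes[node_id]["def_boot_osid"]
    # partitions maps (node_id, partition_number) -> osid installed there.
    for (part_node, part_num), part_osid in sorted(partitions.items()):
        if part_node == node_id and part_osid == osid:
            return part_num
    return None
```

A node whose default OS is not found on any of its partitions returns None in
this sketch; what the real bootinfo does in that case is not described here.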
When the OS booted is our standard FreeBSD or Linux installation,
/etc/testbed/rc.testbed is called to perform Emulab-specific configuration.
First, the nodes contact cvsup on boss to look for incremental updates (we do
this so that we don't have to create a new image every time we update any single
file.) Next, they run scripts that set up things like routes, delay pipes,
accounts, NFS mounts, etc. Most of this information is obtained from tmcd on boss.
##### Images
We create images with a program called 'imagezip' that does filesystem-specific
compression.
To create an image, we boot into a special FreeBSD that is loaded over the
network and run solely out of memory, not touching the disk at all. This way, we
don't depend on specific disk contents, and aren't trying to zip up mounted
filesystems. This is a stripped-down FreeBSD kernel and root filesystem that are
loaded from boss via TFTP. The filesystem is decompressed into a memory file
system (MFS) by the boot loader. You can get a node into and out of the MFS
FreeBSD with the 'node_admin' command. It runs a (usually out-of-date) version
of the node setup software, so it looks a lot like a regular node, with user
accounts, NFS mounts, etc. Inside this MFS FreeBSD, we run imagezip and write
the image via NFS to the project directory on ops.
To load an image, we boot another FreeBSD MFS, but this one is _much_ more
stripped down, as it may get loaded by dozens of reloading nodes at once, and
TFTP is unicast. This MFS contacts tmcd to find out which address (multicast
address and port number) to get its disk image from. A program called 'frisbee'
(think: flying disks) is invoked with this address, and grabs the image from
frisbeed running on boss. The multicast protocol used by Frisbee is designed so
that no global synchronization is required, and nodes can join at any time.
Once frisbee is done, the node reboots into the new OS.
To initiate a disk reload, os_load gets run on boss. It sets the
next_pxe_boot_path so that the node will boot into the reloading MFS
(pxeboot.frisbee) then reboots the node. It also sets the node's default boot OS
to the default for the image (specified in the images table.)
os_load then runs frisbee_launcher. frisbee_launcher is a wrapper around the
real frisbee server, frisbeed. This way, frisbeed itself does not need to know
any Emulab specifics, and can be restarted by frisbee_launcher if it dies. One
frisbee_launcher/frisbeed process is run per image. frisbee_launcher looks at
the database to determine if an instance is already running for this image, and
exits if it is. If not, frisbee_launcher picks a multicast address to use,
registers it in the database (images table), starts frisbeed, and goes into the
background. When frisbeed exits (which it may do after being idle for a long
time), frisbee_launcher updates the database to indicate that no frisbeed is
running for the image anymore. If you change the path to an image in the images
table, or replace its file, check for instances of frisbee_launcher running for
the image (you can tell from the command line shown by 'ps'), and kill the
frisbee_launcher process (NOT the frisbeed process.)
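The wrapper logic can be sketched roughly as below. This is a simplified
illustration, not the real frisbee_launcher: it runs in the foreground rather
than backgrounding itself, the images table is a dictionary, the load_address
column name is hypothetical, and pick_address and run_frisbeed are callables
supplied by the caller.

```python
def frisbee_launcher(image, images, pick_address, run_frisbeed):
    """Run one server per image, recording its address while it is alive."""
    row = images[image]
    if row.get("load_address") is not None:
        return False  # an instance is already serving this image
    row["load_address"] = pick_address()
    try:
        # Blocks until the server exits (for example, after a long idle period).
        run_frisbeed(row["path"], row["load_address"])
    finally:
        # Whatever happened, record that no server is running for the image.
        row["load_address"] = None
    return True
```

The cleanup step is what depends on the wrapper staying alive: if the wrapper
goes away without running it, the table still claims a server is running.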
There is a reload daemon (called reload_daemon) that runs on boss, and does the
job of reloading nodes after they are freed from experiments. Nodes are placed
into the emulab-ops/reloadpending experiment, then moved over to the
emulab-ops/reloading experiment by the reload daemon (this is largely a relic of
when we had a unicast disk loader, and could only reload a few nodes at a time.)
This is a common place to notice hardware and software problems, so the reload
daemon sends mail if any nodes get stuck in the reloading experiment for too long.
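The "stuck for too long" check could look something like this sketch. The time
limit, the timestamp bookkeeping, and the function name are assumptions; the
real reload_daemon may track this differently.

```python
def stuck_in_reloading(entered_at, now, limit_seconds):
    """Return nodes that have been in the reloading experiment too long.

    entered_at maps node name -> time (in seconds) the node entered
    emulab-ops/reloading.
    """
    return sorted(node for node, t in entered_at.items()
                  if now - t > limit_seconds)
```

A daemon loop would call this periodically and mail the resulting list to the
testbed operators whenever it is not empty.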
##### Experiment Lifetime
Once the user hits the 'submit' button on the experiment creation form,
'batchexp' is fired off on the user's behalf to actually start the experiment.
If the experiment was marked as a batch experiment, it is submitted to the batch
queue, to be run by batch_daemon when there are enough free nodes. Otherwise,
'startexp' is run.
The first thing that startexp does is run tbprerun, which sets up the experiment
in the database. In general, this fills out the virt_* tables. At this point,
the experiment has been created, but is not swapped in (if you checked the
'preload' box on the webpage, things stop here.)
Next, startexp calls tbswapin to realize the experiment in hardware. It calls
the programs to do resource assignment (assign_wrapper), set up VLANs (snmpit),
set up NFS exports (exports_setup), set up DNS records (named_setup), etc.
tbswapin waits for nodes to come up - in older versions of our software,
detection is done by pinging them. In newer versions, the nodes report back in
with the event system. If any nodes fail to come up, they are rebooted once, and
if they fail again, they are moved to the emulab-ops/hwdown experiment, and the
experiment swapin fails.
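The retry policy for nodes that do not come up can be sketched as follows. The
is_up and reboot callables and the reserved dictionary are stand-ins for the
real detection mechanism (ping or the event system), the reboot tooling, and the
database.

```python
HWDOWN = "emulab-ops/hwdown"


def wait_for_nodes(nodes, is_up, reboot, reserved):
    """Give each failed node one reboot; move repeat failures to hwdown.

    Returns True if the swapin can proceed, False if it must fail.
    """
    failed = [n for n in nodes if not is_up(n)]
    for n in failed:
        reboot(n)
    # Nodes still down after their single retry are treated as broken.
    dead = [n for n in failed if not is_up(n)]
    for n in dead:
        reserved[n] = HWDOWN
    return not dead
```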
Finally, when the experiment is configured, startexp sends the user email.
If the user swaps the experiment in and out during its lifetime, 'tbswapin' and
'tbswapout' are called to do the job.
When the user terminates the experiment, endexp gets called. It calls tbswapout
to free hardware resources, then calls tbend, which cleans up the experiment's
data in the virt_* tables.