##
## overview.txt - Overview of the Emulab software, and the way various
## pieces fit together.
##

## Some Key Emulab Programs and Libraries

tmcd - daemon that runs on boss, and is essentially a proxy for the
database.  Nodes are not, for security reasons, allowed to contact the
database directly.  Also, nodes should not have to know details of the
database to get, for example, a list of accounts to create. Used by
nodes to get information such as user accounts, NFS mounts, hostnames,
etc. Uses SSL to authenticate the server and clients, and to encrypt
transmissions. The client side, used on the nodes, is called tmcc.
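
As a small illustration, a script on a node could ask tmcd for
information roughly like this (a TCL sketch only; the 'hostnames'
request keyword is taken from the categories above, and the real
client-side scripts do much more than print the raw reply):

    #!/usr/bin/env tclsh
    # Ask tmcd, through tmcc, for this node's hostname mappings and
    # print whatever comes back.  Illustrative only.
    if {[catch {exec tmcc hostnames} reply]} {
        puts stderr "tmcc failed: $reply"
        exit 1
    }
    puts $reply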

snmpit - The program we use to configure switches via SNMP. It gets
used on the experimental net to create VLANs and set port speed and
duplex. It is generally not used on the control net switch.

suexec - Invoked by the web interface to execute commands as other
users. All commands run on boss from the web interface (such as the
ones to create and terminate experiments) go through suexec, and are
executed as the user logged into the webserver, not as the webserver
itself.

assign - Our simulated annealing algorithm that maps the user's
requested topology onto available hardware. Its main purposes are to
minimize inter-switch bandwidth in environments with multiple
experimental switches, resolve hardware types, and make sure that any
special features of nodes are used efficiently (i.e., are not used by an
experiment if not requested.)  It is called by assign_wrapper, which
does the task of generating the list of available resources, and
reserving the resources picked by assign.
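
The core annealing idea can be sketched in a few lines (purely
illustrative; 'cost' and 'perturb' are hypothetical stand-ins for
assign's real scoring function and move generator):

    # Hypothetical simulated-annealing skeleton.  "cost" scores a mapping
    # (e.g. penalizing inter-switch links and unrequested features), and
    # "perturb" remaps one virtual node to a different physical node.
    proc anneal {mapping temp cooling} {
        set current [cost $mapping]
        while {$temp > 0.01} {
            set candidate [perturb $mapping]
            set delta [expr {[cost $candidate] - $current}]
            # Always accept improvements; sometimes accept worse mappings
            # so the search can escape local minima.
            if {$delta < 0 || rand() < exp(-$delta / $temp)} {
                set mapping $candidate
                set current [expr {$current + $delta}]
            }
            set temp [expr {$temp * $cooling}]
        }
        return $mapping
    }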

parse.tcl - Our NS parser, implemented as a TCL script that loads
libraries that mimic NS commands, and pulls a few other tricks (such
as overriding variable assignment.) Evaluates the user's input NS
script, and places the results into the database.
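
For example, a minimal NS file of the kind parse.tcl evaluates looks
like this ('FBSD-STD' is just a placeholder OS name):

    set ns [new Simulator]
    source tb_compat.tcl

    set node0 [$ns node]
    set node1 [$ns node]
    set link0 [$ns duplex-link $node0 $node1 100Mb 10ms DropTail]

    tb-set-node-os $node0 FBSD-STD

    $ns rtproto Static
    $ns run

Because the NS commands are mimicked rather than simulated, evaluating
a script like this ends up filling the virt_* tables in the database
instead of running a simulation.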

frisbeed - Server for our multicast disk loading system. More on this
below in the 'Images' section.

libdb - Big library that is used to interface with the database. It
hides details such as the name of the database from scripts, retries
failed connections, and can send mail and/or terminate the script when
queries fail.  It also contains functions for doing permissions
checks, getting information about the state of an experiment, project,
or node, and so forth. Almost all perl scripts use this library.

libtestbed - A small library that contains handy functions for sending
mail to the user, going into the background, and so forth.

## Some useful administrative programs

sched_reload - Schedules a reload of a node with the default image. If
the node is free, moves it to the reloading experiment, and starts the
reload immediately. If the node is reserved, puts an entry into the
scheduled_reloads table. When the node is freed from an experiment (by
'nfree'), this table is checked to see whether the node should be
reloaded rather than released into the free pool. This is the preferred way to
get nodes reloaded when you have a new version of the default image.

sched_reserve - Works like sched_reload, but re-allocates a node to
another experiment when it gets freed (or, immediately, if the node is
already free.)  Most often used to move a suspect node to
emulab-ops/hwdown when an experimenter reports something that may be a
hardware problem.

## Node Boot Process

We boot nodes via PXE, a network card feature that lets the card
download the code it boots from. Thus, the control network card in each
node needs to have PXE, but it's best to have it disabled on the
experimental interfaces (because you'll just waste time at boot,
waiting for DHCP to time out.) PXE contacts the dhcpd on boss, which
gives it an IP address and so forth, and then hands it off to
'proxydhcp' (also running on boss.) This daemon looks into the nodes
table, at the pxe_boot_path and next_pxe_boot_path fields, to tell the
card where to load its boot program from. next_pxe_boot_path is
intended to be used by the Emulab software to temporarily override the
user's settings.  PXE on the NIC then downloads the boot program via
TFTP from boss.  Normally, we load something called 'pxeboot', which
is a little custom OSKit boot loader.  But, we can also boot some
loaders that load FreeBSD into memory and run it from there - more on
this in the disk image section.

pxeboot contacts another daemon on boss called 'bootinfo' to find out
what to boot. bootinfo looks at the nodes table to figure this
out. Usually, this is done by looking at the 'def_boot_osid' field,
then looking in the partitions table to discover which partition that
OS can be found in. However, pxeboot can also boot from other sources,
such as kernels loaded via TFTP. You can also use pxeboot
interactively, by pressing any key when prompted to do so during boot.
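
The decision bootinfo makes can be sketched roughly as follows (a
hypothetical TCL sketch; 'db-lookup' stands in for the real database
queries, and the real daemon handles many more cases):

    # Hypothetical sketch: decide what a node should boot.
    proc what-to-boot {node} {
        # The nodes table records the node's default OS.
        set osid [db-lookup nodes $node def_boot_osid]
        # The partitions table says which disk partition holds that OS.
        set part [db-lookup partitions $node $osid]
        if {$part ne ""} {
            return [list partition $part]
        }
        # Otherwise boot from another source, e.g. a kernel over TFTP.
        return [list tftp-kernel $osid]
    }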

When the OS booted is our standard FreeBSD or Linux installation,
/etc/testbed/rc.testbed is called to perform Emulab-specific
configuration.  First, the nodes contact cvsup on boss to look for
incremental updates (we do this so that we don't have to create a new
image every time we update any single file.) Next, they run scripts
that set up things like routes, delay pipes, accounts, NFS mounts,
etc. Most of this information is obtained from tmcd on boss.

## Images

We create images with a program called 'imagezip' that does
filesystem-specific compression.

To create an image, we boot into a special FreeBSD that is loaded over
the network and run solely out of memory, not touching the disk at
all. This way, we don't depend on specific disk contents, and aren't
trying to zip up mounted filesystems. This is a stripped-down FreeBSD
kernel and root filesystem that are loaded from boss via TFTP. The
filesystem is decompressed into a memory file system (MFS) by the boot
loader. You can get a node into and out of the MFS FreeBSD with the
'node_admin' command. It runs a (usually out-of-date) version of the
node setup software, so it looks a lot like a regular node, with user
accounts, NFS mounts, etc. Inside this MFS FreeBSD, we run imagezip
and write the image via NFS to the project directory on ops.

To load an image, we boot another FreeBSD MFS, but this one is _much_
more stripped down, as it may get loaded by dozens of reloading nodes
at once, and TFTP is unicast.  This MFS contacts tmcd to find out
which address (multicast address and port number) to get its disk
image from. A program called 'frisbee' (think: flying disks) is
invoked with this address, and grabs the image from frisbeed running
on boss. The multicast protocol used by Frisbee is designed so that no
global synchronization is required, and nodes can join at any time.
Once frisbee is done, the node reboots into the new OS.

To initiate a disk reload, os_load gets run on boss. It sets the
next_pxe_boot_path so that the node will boot into the reloading MFS
(pxeboot.frisbee) then reboots the node. It also sets the node's
default boot OS to the default for the image (specified in the images
table.)

os_load then runs frisbee_launcher.  frisbee_launcher is a wrapper
around the real frisbee server, frisbeed. This way, frisbeed itself
does not need to know any Emulab specifics, and can be restarted by
frisbee_launcher if it dies. One frisbee_launcher/frisbeed process is
run per image.  frisbee_launcher looks at the database to determine if
an instance is already running for this image, and exits if it is. If
not, frisbee_launcher picks a multicast address to use, registers it
in the database (images table), starts frisbeed, and goes into the
background. When frisbeed exits (which it may do after being idle for
a long time,) frisbee_launcher updates the database to indicate that
no frisbeed is running for the image anymore. If you change the path
to an image in the images table, or replace its file, check for
instances of frisbee_launcher running for the image (you can tell from
the command line shown by 'ps'). Kill the frisbee_launcher process
(NOT the frisbeed process.)
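
The control flow described above amounts to something like this (a
hypothetical TCL sketch; the helper procs stand in for the real
database queries and the actual frisbeed invocation):

    # Hypothetical sketch of frisbee_launcher's job for one image.
    proc launch-frisbeed {imageid path} {
        # If another launcher already registered a server, just exit.
        if {[db-get-address $imageid] ne ""} {
            return
        }
        # Pick a multicast address/port and record it in the images
        # table so that tmcd can hand it out to reloading nodes.
        set addr [pick-multicast-address]
        db-set-address $imageid $addr
        # Run the real server; it exits on its own after being idle.
        catch {exec frisbeed-with $addr $path}
        # Clear the registration so the next load starts a fresh server.
        db-set-address $imageid ""
    }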

There is a reload daemon (called reload_daemon) that runs on boss, and
does the job of reloading nodes after they are freed from
experiments. Nodes are placed into the emulab-ops/reloadpending
experiment, then moved over to the emulab-ops/reloading experiment by
the reload daemon (this is largely a relic of when we had a unicast
disk loader, and could only reload a few nodes at a time.)  This is a
common place to notice hardware and software problems, so the reload
daemon sends mail if any nodes get stuck in the reloading experiment
for too long.

## Experiment Lifetime

Once the user hits the 'submit' button on the experiment creation
form, 'batchexp' is fired off on the user's behalf to actually start
the experiment.  If the experiment was marked as a batch experiment,
it is submitted to the batch queue, to be run by batch_daemon when
there are enough free nodes. Otherwise, 'startexp' is run.

The first thing that startexp does is run tbprerun, which sets up the
experiment in the database. At this point, the experiment has been
created, but is not swapped in (if you checked the 'preload' box on
the webpage, things stop here.)  In general, this fills out the virt_*
tables in the database.

Next, startexp calls tbswapin to realize the experiment in
hardware. It calls the programs to do resource assignment
(assign_wrapper), set up VLANs (snmpit), set up NFS exports
(exports_setup), set up DNS records (named_setup), etc.  tbswapin
waits for nodes to come up - in older versions of our software,
detection is done by pinging them. In newer versions, the nodes report
back in with the event system. If any nodes fail to come up, they are
rebooted once, and if they fail again, they are moved to the
emulab-ops/hwdown experiment, and the experiment swapin fails.
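
The wait-and-retry policy can be sketched as follows (hypothetical;
'wait-for-isup', 'reboot-node', and 'move-to-hwdown' are stand-ins for
the real mechanisms):

    # Hypothetical sketch of the node wait policy during swapin.
    proc wait-for-nodes {nodes timeout} {
        foreach node $nodes {
            if {[wait-for-isup $node $timeout]} {
                continue
            }
            # One retry: reboot the node and wait again.
            reboot-node $node
            if {[wait-for-isup $node $timeout]} {
                continue
            }
            # Still down: park it in emulab-ops/hwdown and fail.
            move-to-hwdown $node
            error "swapin failed: $node did not come up"
        }
    }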

Finally, when the experiment is configured, startexp sends the user
email.

If the user swaps the experiment in and out during its lifetime,
'tbswap' is called to do the job - since a failed swapin requires a
lot of the same cleanup as a swapout, one script handles both, called
with 'in' or 'out' as its first argument.

When the user terminates the experiment, endexp gets called. It calls
tbswapout to free hardware resources, then calls tbend, which cleans
up the experiment's data in the virt_* tables.

While an experiment is swapped in, its activity is tracked using
slothd, sdcollectd, etc., and the owners of inactive experiments are
sent email asking them to swap out. After a while, these experiments get
swapped out, either by administrators or automatically by the system
itself (coming soon).

## The Event System

Our event system is implemented as a thin layer on top of the 'elvin'
system.  Elvin is a publish/subscribe system, meaning that programs
that want to receive events connect to the elvin server and subscribe
to all events for a particular node, for a particular traffic
generator, etc.

We run an event scheduler for each experiment. It's this scheduler's
job to keep track of future events for the experiment, and send them
at the appropriate time. The most basic example of this is events
specified through the NS file, but tevc can also be used to schedule
events for some future time. eventsys_control can be used to control
the scheduler, stopping it, starting it, or replaying all events
listed in the NS file.
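
For example, link events can be scheduled from the NS file, and tevc
can inject similar events at run time (the link and experiment names
are just examples; see the tevc documentation for its exact syntax):

    # In the NS file: take link0 down a minute into the experiment and
    # bring it back up a minute later.
    $ns at 60.0 "$link0 down"
    $ns at 120.0 "$link0 up"

    # Roughly the same thing from the command line, while the
    # experiment is running:
    #   tevc -e myproj/myexp now link0 down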

We also use the event system to communicate information about the
state of nodes. At several points during the boot process, nodes
contact tmcd to report their current state. State changes are also
inferred from external events, such as a node DHCPing, or someone
running 'power' to power cycle a stuck node. All of these state
changes are received by a daemon on boss, 'stated'. In addition to
receiving state transition events, stated can generate them for nodes
that cannot do so themselves - for example, it can ping nodes that are
running OSes that do not report in, to detect when they come up. stated
reads state machines from the database, and can detect nodes that time
out in specific states (i.e., they begin to boot, but then hang), or
that somehow make invalid state transitions. During experiment swapin,
when all nodes have reached the ISUP state, a 'time start' event is
sent, so that all agents will have a similar idea of when the
experiment began.
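
The checks stated performs can be sketched like this (hypothetical;
the state machines and timeouts really come from the database, and
'notify-testbed-ops' stands in for however problems get reported):

    # Hypothetical sketch of checking one transition and one timeout.
    # $machine maps "from,to" pairs to 1 for legal transitions;
    # $timeouts maps a state to how long a node may sit in it.
    proc check-transition {machine from to} {
        if {![dict exists $machine "$from,$to"]} {
            notify-testbed-ops "invalid transition $from -> $to"
        }
    }
    proc check-timeout {timeouts state secs_in_state} {
        if {[dict exists $timeouts $state] &&
            $secs_in_state > [dict get $timeouts $state]} {
            notify-testbed-ops "node stuck in state $state"
        }
    }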

On the nodes, the main things that interact with the event system are
delay nodes, traffic generators, and the program agent. Delay nodes
subscribe to events for their link, so that the delay can be changed,
the link brought down, etc. Traffic generators can be turned on and
off, and have their parameters changed through this method. And, if the
user specified any 'Program' objects in the NS file, program agents are
started so that those programs can be started and stopped dynamically.
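
For example (classic syntax; the command and times are placeholders):

    set prog0 [new Program $ns]
    $prog0 set node $node0
    $prog0 set command "/bin/ls -lt >& /tmp/ls.out"

    # Drive it with events: start 10 seconds in, stop at 60 seconds.
    $ns at 10.0 "$prog0 start"
    $ns at 60.0 "$prog0 stop"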