#####
##### overview.txt - Overview of the Emulab software, and the way various pieces
##### fit together.
#####
##### Some Key Emulab Programs and Libraries
tmcd - daemon that runs on boss, and is essentially a proxy for the database.
Nodes are not, for security reasons, allowed to contact the database directly.
Also, nodes should not have to know details of the database to get, for example,
a list of accounts to create. Used by nodes to get information such as user
accounts, NFS mounts, hostnames, etc. Uses SSL to authenticate the server and
clients, and to encrypt transmissions. The client side, used on the nodes, is
called tmcc.
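To make the division of labor concrete, here is a minimal sketch of a
tmcc-style request in Python (the real client is a C program; the port
number, certificate paths, and command name here are assumptions for
illustration, not the exact tmcd wire protocol):

    import socket, ssl

    # Assumed values -- the real tmcd port and certificate locations
    # are site-specific; check your installation.
    BOSS, PORT = "boss.example.net", 7777

    ctx = ssl.create_default_context(cafile="/etc/emulab/emulab.pem")
    ctx.check_hostname = False   # per-site self-signed certs

    with socket.create_connection((BOSS, PORT)) as raw:
        with ctx.wrap_socket(raw) as conn:
            # Ask for the list of accounts to create, as a node
            # would at boot; the server answers one record per line.
            conn.sendall(b"accounts\n")
            print(conn.recv(65536).decode())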
snmpit - The program we use to configure switches via SNMP. It gets used on the
experimental net to create VLANs and set port speed and duplex. It is generally
not used on the control net switch.
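For a flavor of what snmpit does, the following Python sketch shells out to
the net-snmp 'snmpset' tool to change a port's untagged VLAN via the standard
Q-BRIDGE-MIB. The switch name, community string, and port index are
placeholders, and real switches (and snmpit itself) often require
vendor-specific MIBs instead:

    import subprocess

    SWITCH = "exp-switch"    # placeholder hostname
    COMMUNITY = "private"    # placeholder write community
    PORT_INDEX = 24          # bridge port to move
    VLAN = 42

    # Q-BRIDGE-MIB::dot1qPvid -- the untagged VLAN of a bridge port.
    oid = "1.3.6.1.2.1.17.7.1.4.5.1.1.%d" % PORT_INDEX

    subprocess.run(
        ["snmpset", "-v2c", "-c", COMMUNITY, SWITCH, oid, "u", str(VLAN)],
        check=True)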
suexec - Invoked by the web interface to execute commands as other users. All
commands run on boss from the web interface (such as the ones to create and
terminate experiments) go through suexec, and are executed as the user logged
in to the web interface, not as the user the webserver runs as.
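The underlying mechanism is the classic Unix privilege drop. A minimal Python
sketch of the concept (the actual suexec is a C program that performs many
more sanity checks before executing anything):

    import os, pwd

    def run_as(username, argv):
        # Fork, drop root privileges to 'username', then exec argv.
        pw = pwd.getpwnam(username)
        pid = os.fork()
        if pid == 0:
            os.setgid(pw.pw_gid)   # group first, while still root
            os.setuid(pw.pw_uid)   # then give up root for good
            os.execv(argv[0], argv)
        _, status = os.waitpid(pid, 0)
        return os.waitstatus_to_exitcode(status)

    # e.g. run_as("jdoe", ["/bin/ls", "/proj"])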
assign - Our simulated annealing algorithm that maps the user's requested
topology onto available hardware. Its main purposes are to minimize inter-switch
bandwidth in environments with multiple experimental switches, resolve hardware
types, and make sure that any special features of nodes are used efficiently
(i.e., are not used by an experiment that did not request them.) It is called
by assign_wrapper, which generates the list of available resources and
reserves the resources picked by assign.
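For intuition about the approach, here is a toy simulated-annealing mapper in
Python. It maps virtual nodes onto physical nodes so as to minimize the number
of virtual links whose endpoints land on different switches; the real assign
also handles node types, features, link capacities, and much more:

    import math, random

    # Toy inputs: which switch each physical node hangs off, and the
    # virtual topology as a list of (vnode, vnode) links.
    PHYS_SWITCH = {"pc1": "sw0", "pc2": "sw0", "pc3": "sw1", "pc4": "sw1"}
    VLINKS = [("a", "b"), ("b", "c"), ("a", "c")]
    VNODES = ["a", "b", "c"]

    def cost(mapping):
        # Count virtual links that cross between switches.
        return sum(PHYS_SWITCH[mapping[u]] != PHYS_SWITCH[mapping[v]]
                   for u, v in VLINKS)

    def anneal(steps=10000, temp=2.0, cooling=0.999):
        free = list(PHYS_SWITCH)
        random.shuffle(free)
        mapping = dict(zip(VNODES, free))     # one pnode per vnode
        spare = [p for p in free if p not in mapping.values()]
        cur = cost(mapping)
        for _ in range(steps):
            v = random.choice(VNODES)
            old = mapping[v]
            cand = random.choice(spare)       # try a spare pnode
            mapping[v] = cand
            new = cost(mapping)
            # Accept improvements always, regressions with probability
            # exp(-delta/temp) -- the simulated-annealing rule.
            if new <= cur or random.random() < math.exp((cur - new) / temp):
                spare[spare.index(cand)] = old
                cur = new
            else:
                mapping[v] = old              # undo
            temp *= cooling
        return mapping, cur

    print(anneal())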
parse.tcl - Our NS parser, implemented as a TCL script that loads libraries that
mimic NS commands, and pulls a few other tricks (such as overriding variable
assignment.) Evaluates the user's input NS script, and places the results into
the database.
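The trick is to evaluate the user's script in an environment where the NS
commands are reimplemented to record a topology instead of running a
simulation. A rough Python analogue of the idea (the real parser is Tcl, and
the recorded results go into the virt_* tables):

    class Recorder:
        # Stands in for the ns simulator object; records, never simulates.
        def __init__(self):
            self.nodes, self.links = [], []

        def node(self):
            name = "node%d" % len(self.nodes)
            self.nodes.append(name)
            return name

        def duplex_link(self, a, b, bw, delay, queue):
            self.links.append((a, b, bw, delay, queue))

    ns = Recorder()
    # The user's "script" -- in Emulab this is real NS syntax in Tcl.
    n0, n1 = ns.node(), ns.node()
    ns.duplex_link(n0, n1, "100Mb", "10ms", "DropTail")
    print(ns.nodes, ns.links)   # this is what gets written to the DB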
frisbeed - Server for our multicast disk loading system. More on this below in
the 'Images' section.
libdb - Big library that is used to interface with the database. It hides
details such as the name of the database from scripts, retries failed
connections, and can send mail and/or terminate the script when queries fail.
It also contains functions for doing permissions checks, getting information
about the state of an experiment, project, or node, and so forth. Almost all
perl scripts use this library.
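The pattern libdb implements looks roughly like this, transliterated to
Python (the real library is Perl, and the function name and mail address
below are made up for illustration):

    import smtplib, sys, time
    from email.message import EmailMessage

    def query_with_retry(db, sql, retries=3, delay=5):
        # Run a query, retrying failures; mail testbed-ops and die
        # if it never succeeds, as libdb can be asked to do.
        for _ in range(retries):
            try:
                cur = db.cursor()
                cur.execute(sql)
                return cur.fetchall()
            except Exception as err:
                last = err
                time.sleep(delay)
        msg = EmailMessage()
        msg["Subject"] = "DB query failed"
        msg["From"] = msg["To"] = "testbed-ops@example.net"  # placeholder
        msg.set_content("%s\n%s" % (sql, last))
        smtplib.SMTP("localhost").send_message(msg)
        sys.exit("fatal: %s" % last)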
libtestbed - A small library that contains handy functions for sending mail to
the user, going into the background, and so forth.
##### Some useful administrative programs
sched_reload - Schedules a reload of a node with the default image. If the node
is free, moves it to the reloading experiment, and starts the reload
immediately. If the node is reserved, puts an entry into the scheduled_reloads
table. When the node is freed from an experiment, 'nfree' checks this table
to see if the node should be reloaded rather than being released into the free
pool. This is the preferred way to get nodes reloaded when you have a new
version of the default image.
sched_reserve - Works like sched_reload, but re-allocates a node to another
experiment when it gets freed (or immediately, if the node is already free.)
Most often used to move a suspect node to emulab-ops/hwdown when an experimenter
reports something that may be a hardware problem.
##### Node Boot Process
We boot nodes via PXE, a feature that allows a network card to download the
code it boots from. Thus, the control network card in each node needs to have
PXE, but it's best to have it disabled on the experimental interfaces
(otherwise you'll just waste time at boot, waiting for DHCP to time out.) PXE
contacts the dhcpd on boss, which gives it an IP address and so forth, and
then hands it off to 'proxydhcp' (also running on boss.) This daemon looks at
the pxe_boot_path and next_pxe_boot_path fields in the nodes table to tell the
card where to load its boot program from. next_pxe_boot_path is intended to be
used by the Emulab software to temporarily override the user's settings. PXE
on the NIC
then downloads the boot program via TFTP from boss. Normally, we load something
called 'pxeboot', which is a little custom OSKit boot loader. But, we can also
boot some loaders that load FreeBSD into memory and run it from there - more on
this in the disk image section.
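In essence, proxydhcp's decision is a two-field lookup, sketched here in
Python (the db.query_one helper and the fallback path are hypothetical; the
real daemon is not Python):

    def boot_path_for(db, node_id):
        # Return the boot program this node's PXE ROM should TFTP.
        row = db.query_one(
            "SELECT pxe_boot_path, next_pxe_boot_path"
            " FROM nodes WHERE node_id = %s", node_id)
        # The temporary override, when set, wins over the normal path.
        return (row["next_pxe_boot_path"]
                or row["pxe_boot_path"]
                or "/tftpboot/pxeboot")    # assumed default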
pxeboot contacts another daemon on boss called 'bootinfo' to find out what to
boot. bootinfo looks at the nodes table to figure this out. Usually, this is
done by looking at the 'def_boot_osid' field, then looking in the partitions table
to discover which partition that OS can be found in. However, pxeboot can also
boot from other sources, such as kernels loaded via TFTP. You can also use
pxeboot interactively, by pressing any key when prompted to do so during boot.
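bootinfo's common case is likewise a pair of table lookups (again a sketch
using a hypothetical db helper):

    def what_to_boot(db, node_id):
        # Map a node to (partition, OS) for pxeboot's common case.
        osid = db.query_one(
            "SELECT def_boot_osid FROM nodes WHERE node_id = %s",
            node_id)["def_boot_osid"]
        part = db.query_one(
            "SELECT partition FROM partitions"
            " WHERE node_id = %s AND osid = %s",
            node_id, osid)["partition"]
        return part, osid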
When the OS booted is our standard FreeBSD or Linux installation,
/etc/testbed/rc.testbed is called to perform Emulab-specific configuration.
First, the nodes contact cvsup on boss to look for incremental updates (we do
this so that we don't have to create a new image every time we update any single
file.) Next, they run scripts that set up things like routes, delay pipes,
accounts, NFS mounts, etc. Most of this information is obtained from tmcd on boss.
##### Images
We create images with a program called 'imagezip' that does filesystem-specific
compression.
To create an image, we boot into a special FreeBSD that is loaded over the
network and run solely out of memory, not touching the disk at all. This way, we
don't depend on specific disk contents, and aren't trying to zip up mounted
filesystems. This is a stripped-down FreeBSD kernel and root filesystem that are
loaded from boss via TFTP. The filesystem is decompressed into a memory file
system (MFS) by the boot loader. You can get a node into and out of the MFS
FreeBSD with the 'node_admin' command. It runs a (usually out-of-date) version
of the node setup software, so it looks a lot like a regular node, with user
accounts, NFS mounts, etc. Inside this MFS FreeBSD, we run imagezip and write
the image via NFS to the project directory on ops.
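From inside the MFS, image creation boils down to running imagezip against
the raw disk and writing to the NFS-mounted project directory; a sketch (the
disk device and paths vary by site and hardware):

    import subprocess

    DISK = "/dev/ad0"                          # FreeBSD ATA disk 0 (varies)
    IMAGE = "/proj/myproj/images/mynode.ndz"   # NFS mount from ops

    # Whole-disk image; imagezip can also be told to zip a single slice.
    subprocess.run(["imagezip", DISK, IMAGE], check=True)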
To load an image, we boot another FreeBSD MFS, but this one is _much_ more
stripped down, as it may get loaded by dozens of reloading nodes at once, and
TFTP is unicast. This MFS contacts tmcd to find out which address (multicast
address and port number) to get its disk image from. A program called 'frisbee'
(think: flying disks) is invoked with this address, and grabs the image from
frisbeed running on boss. The multicast protocol used by Frisbee is designed so
that no global synchronization is required, and nodes can join at any time.
Once frisbee is done, the node reboots into the new OS.
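The 'join at any time' property falls out of multicast: every client tunes in
to the same group, and any block it overhears counts toward its copy of the
image. A minimal Python sketch of just the join-and-listen step (the address
and port are placeholders that would really come from tmcd, and a real client
also requests the blocks it is missing):

    import socket, struct

    MCAST_ADDR, PORT = "234.5.6.7", 3564   # placeholders from tmcd

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Join the multicast group; from here on we receive whatever
    # blocks frisbeed happens to be sending, mid-stream or not.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_ADDR),
                       socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        block, _ = sock.recvfrom(65536)
        # a real client writes the block to disk and tracks the gaps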
To initiate a disk reload, os_load gets run on boss. It sets the
next_pxe_boot_path so that the node will boot into the reloading MFS
(pxeboot.frisbee), then reboots the node. It also sets the node's default boot OS
to the default for the image (specified in the images table.)
os_load then runs frisbee_launcher. frisbee_launcher is a wrapper around the
real frisbee server, frisbeed. This way, frisbeed itself does not need to know
any Emulab specifics, and can be restarted by frisbee_launcher if it dies. One
frisbee_launcher/frisbeed process is run per image. frisbee_launcher looks at
the database to determine if an instance is already running for this image, and
exits if it is. If not, frisbee_launcher picks a multicast address to use,
registers it in the database (images table), starts frisbeed, and goes into
the background. When frisbeed exits (which it may do after being idle for a
long time), frisbee_launcher updates the database to indicate that no frisbeed
is running for the image anymore. If you change the path to an image in the images
table, or replace its file, check for instances of frisbee_launcher running for
the image (you can tell from the command line shown by 'ps'). Kill the
frisbee_launcher process (NOT the frisbeed process.)
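frisbee_launcher's control flow is roughly the following (a Python sketch;
the db helpers and address allocator are hypothetical, and the frisbeed flags
are from memory, so check your installation):

    import subprocess

    def launch(db, imageid, path):
        # One frisbeed per image, its address recorded in the DB.
        row = db.query_one(
            "SELECT load_address FROM images WHERE imageid = %s", imageid)
        if row["load_address"]:
            return                             # already being served
        addr, port = pick_free_mcast_address() # hypothetical allocator
        db.execute("UPDATE images SET load_address = %s WHERE imageid = %s",
                   "%s:%s" % (addr, port), imageid)
        while True:
            # Supervise frisbeed: restart it if it dies, stop when it
            # exits cleanly (e.g. after being idle for a long time).
            rc = subprocess.run(
                ["frisbeed", "-m", addr, "-p", str(port), path]).returncode
            if rc == 0:
                break
        db.execute("UPDATE images SET load_address = '' WHERE imageid = %s",
                   imageid)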
There is a reload daemon (called reload_daemon) that runs on boss, and does the
job of reloading nodes after they are freed from experiments. Nodes are placed
into the emulab-ops/reloadpending experiment, then moved over to the
emulab-ops/reloading experiment by the reload daemon (this is largely a relic of
when we had a unicast disk loader, and could only reload a few nodes at a time.)
This is a common place to notice hardware and software problems, so the reload
daemon sends mail if any nodes get stuck in the reloading experiment for too long.
##### Experiment Lifetime
Once the user hits the 'submit' button on the experiment creation form,
'batchexp' is fired off on the user's behalf to actually start the experiment.
If the experiment was marked as a batch experiment, it is submitted to the batch
queue, to be run by batch_daemon when there are enough free nodes. Otherwise,
'startexp' is run.
The first thing that startexp does is run tbprerun, which sets up the
experiment in the database; in general, this means filling out the virt_*
tables. At this point, the experiment has been created, but is not swapped in
(if you checked the 'preload' box on the webpage, things stop here.)
Next, startexp calls tbswapin to realize the experiment in hardware. It calls
the programs to do resource assignment (assign_wrapper), set up VLANs (snmpit),
set up NFS exports (exports_setup), set up DNS records (named_setup), etc.
tbswapin waits for nodes to come up - in older versions of our software,
detection is done by pinging them. In newer versions, the nodes report back in
with the event system. If any nodes fail to come up, they are rebooted once;
if they fail again, they are moved to the emulab-ops/hwdown experiment, and
the experiment swapin fails.
Finally, when the experiment is configured, startexp sends the user email.
If the user swaps the experiment in and out during its lifetime, 'tbswapin' and
'tbswapout' are called to do the job.
When the user terminates the experiment, endexp gets called. It calls tbswapout
to free hardware resources, then calls tbend, which cleans up the experiment's
data in the virt_* tables.
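Putting the lifecycle together, the call chain amounts to the following (a
runnable Python summary with print stubs standing in for the real programs of
the same names):

    # Stubs standing in for the real Emulab programs.
    def tbprerun(e):  print(e, ": filling virt_* tables")
    def tbswapin(e):  print(e, ": assign_wrapper, snmpit, exports_setup, named_setup")
    def tbswapout(e): print(e, ": freeing hardware resources")
    def tbend(e):     print(e, ": cleaning virt_* tables")

    def startexp(exp, preload=False):
        tbprerun(exp)        # experiment now exists in the DB
        if preload:
            return           # 'preload' checkbox: created, not swapped in
        tbswapin(exp)        # realize the experiment in hardware

    def endexp(exp):
        tbswapout(exp)
        tbend(exp)

    startexp("testbed/myexp")
    endexp("testbed/myexp")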