Commit 9d2aef7d authored by Mike Hibler

Some incomplete internal doc on vnode setup

parent a462e117
How FreeBSD Jail (and Linux Vserver) based vnodes are set up (and torn down).
1. bootvnodes [-b] [-h] [-k]
Actions are: -b to boot all vnodes, -h to halt them but save their
disk setup, -k to kill them, removing all the virtual disks. Halt
is typically used when the physical host is rebooting. In fact,
the kill option is only used for debugging. Normally when an
experiment is being torn down, we don't bother to kill the vnodes
as they will go away when the physical node does.
If no action is specified, it is a "cold" boot. In this case we query
boss to get the list of vnodes. Then we do any physnode actions
(e.g., make a big FS, create vn devices) and then call vnodesetup for
each node.
If an action is specified, we just look in /var/emulab/jails for all
pcvm* subdirectories and call vnodesetup for each of them.
bootvnodes exits after all calls to vnodesetup have returned.
Actually, bootvnodes performs all actions in the background
(i.e. returns immediately) unless -f is given to do it in the
foreground.
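The "action specified" path above can be sketched as follows. This is only an illustration in Python, not the actual bootvnodes code; the function names and the example vnode names are made up, and the only things taken from the text are the /var/emulab/jails directory, the pcvm* pattern and the vnodesetup -j invocation.

```python
import os

def list_existing_vnodes(jail_root="/var/emulab/jails"):
    """Return the names of pcvm* subdirectories under jail_root, sorted."""
    if not os.path.isdir(jail_root):
        return []
    return sorted(
        name for name in os.listdir(jail_root)
        if name.startswith("pcvm")
        and os.path.isdir(os.path.join(jail_root, name))
    )

def plan_vnodesetup_calls(action, vnodes):
    """Build one vnodesetup command line per vnode (hypothetical helper).

    action is one of the flags from the text: "-b", "-h" or "-k".
    """
    return [["vnodesetup", "-j", action, vnode] for vnode in vnodes]
```

For a halt, something like plan_vnodesetup_calls("-h", list_existing_vnodes()) would give the list of commands to run, one per vnode directory found.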
2. vnodesetup -j [-b] [-h] [-k] <vnodeid>
The -j says this is a jailed vnode. -[bhk] are for boot, halt, kill
as in bootvnodes. Here an action must always be given.
Booting. In theory, the first thing we do is fork a child process
to continue the boot process in the background and the parent exits
immediately. However, this causes bootvnodes to fire off all the
vnodesetups concurrently, which proved to be a bad thing for
reliability. So now, the parent process waits for 1 minute or until
it sees that the vnode has gotten as far as firing up its watchdog
process (whichever comes first) before exiting. This throttles the
concurrency somewhat.
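The "wait for 1 minute or until the watchdog is up" throttle is a plain poll-with-deadline. A minimal sketch in Python, assuming the caller supplies some condition() that reports whether the watchdog has started (how vnodesetup actually detects this is not described here, so that part is left abstract):

```python
import time

def wait_for(condition, timeout=60.0, poll=0.5,
             clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns true or timeout seconds pass.

    Returns True if the condition became true, False on timeout.
    clock and sleep are injectable only to make the helper easy to test.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

Either way the parent exits afterwards; the return value only says which of the two cases ended the wait.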
The child process daemonizes itself (creating a new process group for
it and its descendants). Then it sets up to catch signals, informs
the testbed via tmcc that the vnode is booting and populates the vnode
configuration directory using libsetup::vnodejailsetup. This config
directory (/var/emulab/jails/<vnodeid>) most importantly contains the
jailconfig file which in turn contains the key=value pairs returned
by the tmcc "jailconfig" command. Finally the child forks again. Now
the original child process (now called the "parent") just waits around
until it receives a signal or its child exits. The former case is how
jails are killed off (explained later). The parent's pid is recorded
in /var/run/tbvnode-<vnodeid>.pid.
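Reading the jailconfig file back amounts to parsing key=value lines. A rough sketch in Python, assuming one pair per line, optional double quotes around the value, and "#" comment lines; the exact quoting rules of the real file are not stated here, and the key names in any example are invented:

```python
def parse_jailconfig(text):
    """Parse key=value lines into a dict (assumed format, see lead-in)."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not a pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # Strip one pair of surrounding double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        config[key.strip()] = value
    return config
```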
The new child process now just exec's /usr/local/etc/emulab/
(which on Linux is currently symlinked to
So at this point, there is a watchdog vnodesetup process waiting, and
a worker child mkjail process doing the rest of the work.
Halting and Killing. To halt or kill a vnode, vnodesetup reads the
watchdog pid file (/var/run/tbvnode-<vnodeid>.pid) written when the
vnode was started and sends a TERM (halt) or USR1 (kill) signal to
that process. Then it waits around for up to 30 seconds for the pid
file to be removed, indicating that the vnode was stopped. Currently
if the pid file is not removed in that time, we exit(0) anyway.
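The halt/kill request described above can be sketched like this. It is an illustration in Python, not the vnodesetup code: the function name and the injectable send/sleep parameters are made up, while the TERM/USR1 choice and the 30 second wait come from the text.

```python
import os
import signal
import time

# Signals sent to the watchdog, per the text: TERM to halt, USR1 to kill.
ACTION_SIGNALS = {"halt": signal.SIGTERM, "kill": signal.SIGUSR1}

def stop_vnode(pidfile, action, timeout=30.0, poll=1.0,
               send=os.kill, sleep=time.sleep):
    """Signal the watchdog named in pidfile, then wait for the file to vanish.

    Returns True if the pid file was removed within timeout seconds,
    False otherwise.
    """
    with open(pidfile) as f:
        pid = int(f.read().strip())
    send(pid, ACTION_SIGNALS[action])
    deadline = time.monotonic() + timeout
    while os.path.exists(pidfile):
        if time.monotonic() >= deadline:
            return False
        sleep(poll)
    return True
```

Note that the real script currently exits 0 even on timeout, so a caller mimicking it would ignore the False return.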
When the watchdog process receives the signal, it calls the "cleanup"
function (via "fatal"). Cleanup informs the testbed via tmcc that we
are going down, and sends a signal on to the worker mkjail process
which (as we will see) is still around as well. The signals are
different here however; we send a USR1 if we are just halting the
vnode, a HUP if we are destroying it. Cleanup then waits for the mkjail
process to die and then sends a TERM signal to the whole process group
for good measure [what does this kill off?]. Finally, if this is a
kill operation, it removes the whole /var/emulab/jails/<vnodeid>
hierarchy (carefully--it will fail if there are any loopback mounts
left over) and removes the pid file.
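Since the signal numbers change between layers, it may help to see the watchdog's translation written down. A small sketch in Python; CleanupPlan and plan_cleanup are invented names, and only the signal mapping and the "remove the hierarchy on kill" rule are taken from the text:

```python
import signal
from typing import NamedTuple

class CleanupPlan(NamedTuple):
    forward_signal: signal.Signals  # what the watchdog sends on to mkjail
    remove_jail_dir: bool           # remove /var/emulab/jails/<vnodeid>?

def plan_cleanup(received):
    """Map the signal the watchdog received to what cleanup should do."""
    if received == signal.SIGTERM:   # halt request from vnodesetup -h
        return CleanupPlan(signal.SIGUSR1, False)
    if received == signal.SIGUSR1:   # kill request from vnodesetup -k
        return CleanupPlan(signal.SIGHUP, True)
    raise ValueError("unexpected signal: %r" % (received,))
```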
3a. BSD: mkjail -p <exppid> -h <hostname> <vnodeid>
mkjail is only called to boot a jail-based vnode. Halting vnodes
is handled by catching signals in this creation invocation, not by
re-invoking it.
mkjail sucks in the contents of the jailconfig file created by
vnodesetup and uses that info to configure the local kernel to
allow the jail to be created as necessary (e.g., using a sysctl
to allow jails to use BPF) and to build up a command line for
actually creating the jail.
mkjail then either creates or "restores" the filesystem namespace
for the vnode. Creation involves setting up a "vnode disk" for
mutable local file systems and then loopback mounting /usr and
NFS mounted filesystems. Network interfaces are also set up at this
point, but these aspects of jail setup are covered in the online docs.
Now mkjail starts up a proxy instance of tmcc which listens on a
unix domain socket shared with the soon-to-be jail. The purpose of
this proxy is now lost in the mists of time. It is not a performance
proxy, as all requests from inside are just forwarded on to boss, there
is no caching or aggregation. It appears to be more of a security issue
for remote jails (i.e., the never deployed jails on RON nodes), but I
am not certain about that.
Finally, mkjail forks and once again, the parent sits around waiting
for children to die (tmcc proxy or this child) or for a signal, while
the child goes off and actually execs the "jail" command which starts
up the jail running /etc/jail/injail. Here the parent remembers the
pid of both the tmcc proxy and the forked jail. If the tmcc proxy
terminates, it restarts it. If the jail terminates, we clean up and exit.
If a signal is received, we again differentiate halting (USR1) from
destroying (INT or HUP).
Cleanup consists of killing the tmcc proxy, sending a USR1 to the jail
init process (see below), waiting for it to terminate and then undoing
all the mounts and interface setup. Additionally, if the jail is being
torn down, the per-vnode disk is destroyed as well.
At this point in the creation process, there are now vnodesetup and
mkjail processes both just waiting for vnode termination.
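The signal handling and cleanup order in mkjail, as described above, can be summarized in a few lines. This is a sketch in Python with invented function names; the step strings are descriptive labels only, not real commands, and restarting the tmcc proxy when it dies is left out.

```python
import signal

HALT_SIGNALS = {signal.SIGUSR1}
DESTROY_SIGNALS = {signal.SIGINT, signal.SIGHUP}

def classify_signal(signum):
    """Return "halt", "destroy", or None for a signal mkjail receives."""
    if signum in HALT_SIGNALS:
        return "halt"
    if signum in DESTROY_SIGNALS:
        return "destroy"
    return None

def cleanup_steps(destroy):
    """Ordered cleanup steps; the disk is only destroyed on teardown."""
    steps = [
        "kill tmcc proxy",
        "send USR1 to jail init",
        "wait for jail init to exit",
        "undo mounts and interface setup",
    ]
    if destroy:
        steps.append("destroy per-vnode disk")
    return steps
```

So cleanup_steps(classify_signal(sig) == "destroy") gives the sequence for a given signal.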
3b. BSD: injail
Finally we are actually running in the jail environment. injail is
just a mini-version of /sbin/init whose main job is to fork and fire
off /etc/rc in the child. The parent process then--you guessed it--waits
around for the child to terminate or for a signal in order to shutdown
all jailed processes. A signal can be either sent from outside
(mkjail) or inside (shutdown).
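The fork-and-wait shape of injail looks roughly like the following. This is a toy version in Python, not the real injail: it only terminates the one child on a signal rather than shutting down every process in the jail, and the choice of USR1 as the only shutdown signal is an assumption based on what mkjail sends during cleanup.

```python
import signal
import subprocess

def mini_init(rc_command, shutdown_signals=(signal.SIGUSR1,)):
    """Start rc_command, then wait for it to exit or for a shutdown signal.

    Returns the child's exit status. Must be called from the main thread,
    since it installs signal handlers.
    """
    child = subprocess.Popen(rc_command)

    def on_signal(signum, frame):
        # Toy shutdown: stop the child; the real thing stops all processes.
        child.terminate()

    previous = {s: signal.signal(s, on_signal) for s in shutdown_signals}
    try:
        return child.wait()
    finally:
        # Put the old handlers back so the caller is not affected.
        for s, handler in previous.items():
            signal.signal(s, handler)
```

In the jail the command would be the /etc/rc run mentioned above; anything runnable works for trying it out.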
So at the end of the day there are *three* processes whose sole job
is to wait for termination or signals: vnodesetup and mkjail outside
the jail, and injail inside.
4a. Linux: mkvserver -p <exppid> -h <hostname> <vnodeid>
mkvserver is derived from mkjail but is for vservers (duh!)
It is only called to boot an Emulab cluster vserver-based vnode;
it is not used for planetlab or other remote vserver setups.
As with mkjail, mkvserver is only used to start vnodes.
There is no separate invocation for halting them. Halting is done
by sending a signal to the creation invocation.
mkvserver sucks in the contents of the jailconfig file created by
vnodesetup and uses that info to configure the local kernel to
allow the vserver to be created as necessary. In the vserver case,
we need (first time only) to create a control net bridge to which all
per-vnode cnet devices (tunnels) are attached. We then create the
control net tunnel device for this vnode and attach it.
mkvserver then either creates or "restores" the filesystem namespace
for the vnode. Creation currently involves "building" a vserver and
then creating a copy of the mutable local file systems and loopback
mounting /usr and NFS mounted filesystems.