Commit be548710 authored by Mike Hibler

Incorporate description of jail changes from Leigh's jail.html file.

Fix a few nits.
parent 71385a45
@@ -187,10 +187,9 @@ interfaces and routes, and runs them. A virtual interface's parameters are
assigned as follows. We assign veth virtual MACs based on the interface's IP
address, in the form: 00:00:IP#0:IP#1:IP#2:IP#3, ensuring uniqueness.
We assign veth tags using the subnet part of the interface's IP address.
Since we use the 192.168 space, we need only the third octet of the IP to
identify the subnet, but for future expansion, we use the second and third
octets. A veth's physical interface is determined by assign. We may use
multiple physical interfaces between nodes or we may use no physical
Since we use 10.n.n.h space, where n.n is the subnet, we use the second and
third octets. A veth's physical interface is determined by assign. We may
use multiple physical interfaces between nodes or we may use no physical
interfaces at all. The routing table ID is unique per virtual node on a
physical node, so we simply use a per-physical node counter to assign these
when the vnodes are booted. All veths for a virtual node get the same
@@ -198,7 +197,7 @@ counter value. There is nothing magical in route setup, just an extra
argument to the route command to ensure the routes get added to the correct
table. Likewise for delay setup, ipfw rules are simply applied to veths
rather than physical interfaces. Setting up routes and dummynet outside
the jail is largely historical.
the jail is largely historical; both could be done from inside.
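
The parameter assignment described above can be sketched as follows. This is
only an illustration, not the actual setup script: the function and class
names are hypothetical, packing the second and third octets into a 16-bit
tag is one plausible encoding the text does not spell out, and starting the
routing table counter at 1 is an assumption.

```python
import ipaddress
from itertools import count


def veth_mac(ip: str) -> str:
    """Virtual MAC in the form 00:00:IP#0:IP#1:IP#2:IP#3 (hex octets)."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02x}" for b in (0, 0, *octets))


def veth_tag(ip: str) -> int:
    """Broadcast domain tag from the subnet part (n.n) of a 10.n.n.h address.

    Assumption: the two octets are combined as (second << 8) | third.
    """
    octets = ipaddress.IPv4Address(ip).packed
    return (octets[1] << 8) | octets[2]


class PhysicalNode:
    """Hands out one routing table ID per virtual node booted on this node."""

    def __init__(self) -> None:
        # Assumption: IDs start at 1; the text only says a counter is used.
        self._next_rtabid = count(1)

    def boot_vnode(self, ips):
        # Every veth of the same virtual node shares one routing table ID.
        rtabid = next(self._next_rtabid)
        return [
            {"ip": ip, "mac": veth_mac(ip), "tag": veth_tag(ip), "rtabid": rtabid}
            for ip in ips
        ]
```

For example, veth_mac("10.1.2.3") yields "00:00:0a:01:02:03" and
veth_tag("10.1.2.3") yields 258 under the encoding assumed here.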
Finally, the jail startup is done. Our augmented jail implementation
takes some new parameters in addition to a "primary" IP address, the root
@@ -208,35 +207,142 @@ implicitly define which interfaces are accessible to the jail: those to which
the IP addresses are assigned. This is analogous to the root directory
specification which determines which mounted filesystems are accessible:
those at or below the level of the root directory. The program run by the
jail is yet another perl script, /etc/injail.pl, that is effectively the
/sbin/init of the virtual node. Its primary jobs are to fire off /etc/rc
to bring up the virtual node and then sit around and wait for a signal to
shut down the jail. The startup scripts run by /etc/rc in the jail are
scaled back versions of what would run on a real node. This scaling back
reflects the fact that the node has already been partially initialized and
also that it usually will not run as many services as a real node. A
typical jail runs syslogd, cron and sshd, as well as the Emulab watchdog and
optional agents like trafgen or delay_agent. From the perspective of the
physical node, each jail has at least 8 processes running:
vnodesetup, mkjail.pl and proxy-tmcc outside the jail as well as
injail.pl, syslogd, cron, sshd, and the Emulab watchdog inside the jail.
jail, /etc/injail, is effectively the /sbin/init of the virtual node.
Its primary jobs are to fire off /etc/rc to bring up the virtual node and
then sit around and wait for a signal to shut down the jail. The startup
scripts run by /etc/rc in the jail are scaled back versions of what would
run on a real node. This scaling back reflects the fact that the node has
already been partially initialized and also that it usually will not run as
many services as a real node. A typical jail runs syslogd, cron and sshd, as
well as the Emulab watchdog and optional agents like trafgen or delay_agent.
From the perspective of the physical node, each jail has at least 8 processes
running: vnodesetup, mkjail.pl and proxy-tmcc outside the jail as well as
injail, syslogd, cron, sshd, and the Emulab watchdog inside the jail.
Empirically, it appears that each jail requires 12-16MB of physical memory
for its base processes.
C. Details
FreeBSD jails. Jails provide filesystem and network namespace isolation
and some degree of superuser privilege restriction. To the basic jail
mechanism we have added:
1. Raw socket access.
2. Read-only BPF access.
3. Multiple-IP and INADDR_ANY support
4. Associate jail with a routing table
5. Bug fixes: cannot umount fs not mounted from within jail
[ move this detail somewhere else, a man page maybe? ]
First a few words on veth devices. They are configured with a few parameters:
C. Assorted Details
C1. FreeBSD Jails [ lifted from Leigh's jail.html... ]
Jails provide filesystem and network namespace isolation and some degree of
superuser privilege restriction. The following is a list of the features we
added to, and the bugs we fixed in, FreeBSD jails. All of the new features are optional,
controlled by sysctl MIBs and per-jail flags. This new jail implementation
is backward compatible with the original implementation, meaning all new
features are disabled by default.
1. Allow a jailed process to bind to multiple IP addresses.
The default implementation of jail allows processes inside of a jail
to bind to just one IP, the IP that was specified to the jail command.
In that implementation, if a process specifies INADDR_ANY,
the kernel silently changes it to the jail IP. If however there
are other interfaces on the node, or if tunnels are being used to
construct an overlay for the experiment, it is necessary to allow
processes inside the jail to bind to those interfaces. In our
modified implementation, when the
jail is created, a list of auxiliary IPs can be specified on the
command line, telling the kernel to allow processes inside
the jail to bind to any of those IPs (including the jail IP).
When the bind happens, the kernel checks the jail's list of IPs;
this applies to sockets bound for outgoing traffic, as well as
incoming traffic. Further, the set of accessible IPs determines
the list of interfaces that a jail can see so that, for example,
ifconfig inside a jail will only list the interfaces and IPs
available to the jail.
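
A minimal sketch of the bind check and the interface filtering just
described, written as a user-level model rather than kernel code. The Jail
class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Jail:
    primary_ip: str                    # the IP given to the jail command
    aux_ips: frozenset = frozenset()   # auxiliary IPs listed at jail creation

    @property
    def allowed_ips(self) -> frozenset:
        # The jail IP itself is always part of the allowed set.
        return frozenset({self.primary_ip}) | self.aux_ips


def may_bind(jail: Jail, addr: str) -> bool:
    """A jailed process may bind only to an address in the jail's IP list."""
    return addr in jail.allowed_ips


def visible_interfaces(jail: Jail, interfaces: dict) -> dict:
    """Model of what ifconfig would list inside the jail.

    `interfaces` maps an interface name to the list of IPs configured on it.
    Only interfaces carrying at least one allowed IP are shown, and only
    with the IPs available to the jail.
    """
    visible = {}
    for name, ips in interfaces.items():
        shown = [ip for ip in ips if ip in jail.allowed_ips]
        if shown:
            visible[name] = shown
    return visible
```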
2. Allow jails to bind to INADDR_ANY.
The default behavior (and original implementation) of jail maps
INADDR_ANY to the jail's main IP address. However, when a jail
is allowed to access other IPs, then INADDR_ANY actually means a
subset of all the interfaces on the node that the jail is allowed
to use (which might also be tunnels). There are two situations in
which this matters:
A process is connecting to another address, and has
specified its local address as INADDR_ANY (which is typical).
Instead of binding the local address of packets to the jail IP,
the local address is set to the actual address of the interface
that the packet is routed out of. If there are IP aliases on the
interface, the list of aliases is searched for a match against
one of the allowed prison IPs. If there is a match, the local
address is set to that IP. Otherwise the address is set to the
main address of the interface (this is not correct; it should be
an error). This is to support multiplexing links using IP
aliases. If we were to use IP tunnels or some other form of
virtual interface, there would be no need to search the list of
aliases.
A process is binding a local socket for an incoming
connection. In this case, any of the prison IPs can be the local
target of the connection, but it is not until the connection is
actually made that the address can be checked. This is done in
the pcb lookup routine. For each pcb, if the port matches and the
local address is INADDR_ANY, and the pcb was created within a
jail, then the list of the prison IPs is searched, looking for a
match. If no match is found, the pcb is skipped. This behavior
improves compatibility with existing server applications which
typically specify INADDR_ANY. If the kernel were to continue
binding INADDR_ANY sockets to the main IP address of the jail,
such applications would only be able to receive packets on the
primary jail interface.
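
The two INADDR_ANY cases above can be sketched as follows. This is a
simplified model, not the kernel code: the function names and the dict-based
pcb are hypothetical, and the outgoing case scans every address on the
interface (main address first) rather than only the aliases.

```python
INADDR_ANY = "0.0.0.0"


def select_source(jail_ips, iface_addrs):
    """Pick the local address for an outgoing connection bound to INADDR_ANY.

    `iface_addrs` lists the addresses on the interface the packet is routed
    out of, with the interface's main address first and aliases after it.
    """
    for addr in iface_addrs:
        if addr in jail_ips:
            return addr
    # No allowed IP on this interface: fall back to the main address.
    # The text notes this is not correct and should really be an error.
    return iface_addrs[0]


def pcb_matches(pcb, dst_ip, dst_port):
    """Decide whether a listening pcb should receive an incoming connection.

    `pcb` is a dict with "port", "laddr" and, for sockets created within a
    jail, "jail_ips" (the prison's IP list).
    """
    if pcb["port"] != dst_port:
        return False
    if pcb["laddr"] != INADDR_ANY:
        return pcb["laddr"] == dst_ip
    jail_ips = pcb.get("jail_ips")
    if jail_ips is None:
        return True               # wildcard socket created outside any jail
    return dst_ip in jail_ips     # jailed wildcard: skip unless an IP matches
```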
3. Allow access to raw sockets.
The jail is allowed to both read and write, but is restricted from
accessing the firewall, dummynet, route, and RSVP interfaces. We also
ensure that the packet header reflects a source IP address appropriate
for the jail: INADDR_ANY is mapped to an appropriate address for
the outgoing interface and fixed addresses that are not part of
the jail set are rejected. This feature allows ping, traceroute
and gated to work in jails.
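
The source address rule for raw sockets might look like the sketch below.
The function name is hypothetical, and raising PermissionError stands in for
whatever error the kernel actually returns, which the text does not say.

```python
INADDR_ANY = "0.0.0.0"


def raw_source_address(jail_ips, header_src, outgoing_addr):
    """Return the source IP to place in a raw packet sent from a jail.

    `outgoing_addr` is the address already chosen as appropriate for the
    outgoing interface; how it is chosen is outside this sketch.
    """
    if header_src == INADDR_ANY:
        return outgoing_addr      # map the wildcard to the interface address
    if header_src in jail_ips:
        return header_src         # fixed address that belongs to the jail set
    # Assumption: rejection is modeled as an exception.
    raise PermissionError(f"{header_src} is not one of the jail's addresses")
```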
4. Allow read-only access to BPF devices.
The interface is not put into promiscuous mode, so the jail is not
able to see all of the packets on the wire, but only those addressed
to the node. However, if the interface is already in promiscuous mode
(say, because someone outside the jail is using tcpdump), then the jail
will also be able to see any packet that goes by. Even when not in
promiscuous mode, a jail will see all packets destined for the interface
whether targeted to a valid jail IP address or not. This could be fixed,
and the promiscuous-mode problem avoided, by augmenting the filter given
when the bpf device is set up. Allowing BPF access enables use
of tcpdump and other packet trace tools within jails.
5. Restrict the port range to which a jail can bind.
This allows multiple jails on the same node to safely share the port
space without stepping on each other in environments where jails cannot
be assigned their own IP addresses. Since the ultimate goal is to allow
different experiments to coexist in jails on the same node, the
port space has to be allocated globally, with the same port space
assigned to all jails across an experiment, so as not to conflict
with any other experiments. This assignment is done when the
experiment is swapped in so that swapped experiments are not
holding ranges (16 bits of port space does not go very far).
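
One way to picture the global port allocation is the sketch below. It is
not the Emulab allocator: the class, its fixed block size and the port
bounds are all hypothetical, chosen only to show ranges being handed out at
swap-in and returned at swap-out.

```python
class PortAllocator:
    """Hands each swapped-in experiment one block of the shared port space."""

    def __init__(self, low: int, high: int, block: int) -> None:
        # Carve [low, high] into fixed-size blocks (sizes are assumptions).
        self.free = [(s, s + block - 1) for s in range(low, high - block + 2, block)]
        self.held = {}

    def swap_in(self, experiment: str):
        if not self.free:
            raise RuntimeError("port space exhausted")
        rng = self.free.pop(0)
        self.held[experiment] = rng   # same range for every jail in the experiment
        return rng

    def swap_out(self, experiment: str) -> None:
        # Swapped-out experiments do not keep holding a range.
        self.free.append(self.held.pop(experiment))


def may_bind_port(port_range, port: int) -> bool:
    """A jail may bind only to ports inside its experiment's range."""
    low, high = port_range
    return low <= port <= high
```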
6. Disallow FS unmounts inside a jail unless the mount was done in the jail.
This is a bug fix that prevents a jail from unmounting a filesystem
and exposing the underlying mount point to which it likely shouldn't
have access.
7. Added per-jail flags to control various existing and new jail features.
These are in addition to sysctls which control the global availability
of a given feature. Existing features thus controlled are: access to
SYSV IPC facilities, access to routing sockets and ability to turn on
and off filesystem quotas. New features controlled are: access to raw
sockets, access to read-only BPF and the ability to use INADDR_ANY.
Additionally, there is a new global sysctl to allow jails to be
configured with multiple IP addresses.
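
The interaction between global sysctls and per-jail flags can be sketched as
a simple conjunction. This is an assumed reading of "in addition to": a
feature is usable only if it is globally available and also enabled for the
jail. The flag names below are hypothetical, not the real sysctl or flag
names.

```python
from enum import Flag, auto


class JailFeature(Flag):
    NONE = 0
    SYSV_IPC = auto()        # existing: SYSV IPC facilities
    ROUTE_SOCKET = auto()    # existing: routing sockets
    QUOTAS = auto()          # existing: turning filesystem quotas on and off
    RAW_SOCKET = auto()      # new: raw sockets
    BPF_READ = auto()        # new: read-only BPF
    INADDR_ANY = auto()      # new: ability to use INADDR_ANY


def feature_allowed(global_enabled: JailFeature,
                    jail_flags: JailFeature,
                    feature: JailFeature) -> bool:
    """Assumed rule: the global sysctl and the per-jail flag must both be set."""
    return bool(global_enabled & feature) and bool(jail_flags & feature)
```

With both arguments left at JailFeature.NONE every feature is off, which
matches the statement that new features are disabled by default.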
C2. Virtual ethernet devices.
Virtual ethernet devices (veths) are configured with a few parameters:
a virtual MAC (VMAC) address, a broadcast domain tag, an associated physical
(parent) ethernet interface, and a routing table ID. The VMAC obviously
identifies the interface and needs to be unique within a broadcast domain,
@@ -261,7 +367,7 @@ associated with a virtual node have the same ID, every virtual node has
its own unique ID. The route table ID is a node-local value; different
physical nodes can use the same ID for different purposes.
More about the startup pieces:
C3. More about the startup pieces.
vnodesetup hangs around so that you can signal it and easily
reboot the vnode. I guess the idea is that it is also jail/vserver
@@ -270,15 +376,5 @@ vnodesetup hangs around so that you can signal it and easily
mkjail.pl is jail specific and hangs around so that it can clean up
jail specific things when the jail exits.
injail.pl is the jail's init process. This is the single point of
injail is the jail's init process. This is the single point of
contact for killing the jail.
/bin/sleep is just an artifact.
D. Examples
Consider a topology of two nodes connected via a link. Assume that both
nodes have been mapped to virtual nodes on two different physical nodes.
Each virtual node.