Commit 077583fb authored by Mike Hibler

Alright, I am now officially sick of writing vnode documentation.

I have said all that I want to say.
parent a34582d9
......@@ -12,6 +12,12 @@
<li> <a href="#Overview">Overview</a>
<li> <a href="#Use">Use</a>
<li> <a href="#AdvancedIssues">Advanced Issues</a>
<ul>
<li> <a href="#AI1">Taking advantage of a virtual node host</a>
<li> <a href="#AI2">Controlling virtual node layout</a>
<li> <a href="#AI3">Determining how many nodes to colocate</a>
<li> <a href="#AI4">Mixing virtual and physical nodes</a>
</ul>
<li> <a href="#Limitations">Limitations</a>
<li> <a href="#KnownBugs">Known Bugs</a>
<li> <a href="#TechDetails">Technical Details</a>
......@@ -56,19 +62,27 @@ of the links you are emulating and the desired fidelity of the emulation.
See the <a href="#AdvancedIssues">Advanced Issues</a> section for more info.
</p>
<a NAME="Use"></a><h2>Use</h2>
Multiplexed virtual nodes are specified in an ns description by indicating
Multiplexed virtual nodes are specified in an NS description by indicating
that you want the <b>pcvm</b> node type:
<code><pre>
set nodeA [$ns node]
tb-set-hardware $nodeA pcvm
</code></pre>
or, if you want all virtual nodes to be mapped to the same machine type,
say a pc850:
<code><pre>
set nodeA [$ns node]
tb-set-hardware $nodeA pcvm850
</code></pre>
that is, instead of "pcvm" use "pcvmN" where N is the node type
(600, 850, 1500, 2000).
That's it! With few exceptions, everything you use in an NS file for an
Emulab experiment running on physical nodes will work with virtual nodes.
The most notable exception is that you cannot specify the operating system
for a virtual node; they are limited to running our custom version of
FreeBSD 4.7 (soon to be FreeBSD 4.9).
</p><p>
As a simple example, we could take the <a href="basic.ns">basic ns script</a>
As a simple example, we could take the <a href="basic.ns">basic NS script</a>
used in the
<a href="docwrapper.php3?docname=tutorial.html#Designing">tutorial</a>,
add the following lines:
......@@ -84,7 +98,7 @@ and remove the explicit setting of the OS:
tb-set-node-os $nodeA FBSD-STD
tb-set-node-os $nodeC RHL-STD
</code></pre>
and the <a href="vnode-example.ns">resulting ns file</a>
and the <a href="vnode-example.ns">resulting NS file</a>
can be submitted to produce the very same topology.
Once the experiment has been instantiated, the experiment web page should
include a listing of the reserved nodes that looks something like:
......@@ -188,7 +202,7 @@ anyway so that all virtual links have the same MTU.
</p><p>
<a NAME="AdvancedIssues"></a><h2>Advanced Issues</h2>
<h3>Taking advantage of a virtual node host.</h3>
<a NAME="AI1"></a><h3>Taking advantage of a virtual node host.</h3>
A physical node hosting one or more virtual nodes is not itself part of
the topology; it exists only to host virtual nodes. However, the physical
node is still set up with user accounts and shared filesystems just as a
......@@ -207,10 +221,8 @@ variety of ways:
them by hand after the experiment has been created, and reboot the
virtual nodes. Thereafter, the packages will be available.
<li> The private root filesystem for each virtual node is also accessible
to the host node in
<code>/var/emulab/jails/</code><i>vnodename</i><code>/root</code>
where <i>vnodename</i> is the "pcvmNN-NN" Emulab name. Thus the host
can monitor log files and even change files on the fly.
to the host node (see below). Thus the host can monitor log files and
even change files on the fly.
<li> Other forms of monitoring can be done as well since all processes,
filesystems, network interfaces and routing tables are visible in the
host. For instance, you can run tcpdump on a virtual interface outside
......@@ -222,8 +234,56 @@ We should emphasize however, that virtual nodes are not "performance
isolated" from each other or from the host; i.e., a big CPU hogging
monitor application in the host might affect the performance and behavior
of the hosted virtual nodes.
<p>
Following is a list of the per virtual node resources and how they can be
accessed from the physical host:
<ul>
<li> <b>Processes.</b>
Although FreeBSD does not show which jail a process belongs to,
you can tell that a process belongs to some jail from the
'J' in a ps listing. The "injail" process for each jail does identify
itself on the ps command line, so you can trace parent/child
relationships from there.
<li> <b>Filesystems.</b>
The private "file" disk for each virtual node is mounted as
<code>/var/emulab/jails/</code><i>vnodename</i><code>/root</code>
where <i>vnodename</i> is the "pcvmNN-NN" Emulab name. The regular
file that is the disk itself is in the per virtual node directory
as <code>root.vnode</code>. The /bin, /sbin, /usr directories
are read-only loopback mounted from the parent as are the normal
shared directories in /users and /proj.
<li> <b>Network Interfaces.</b>
All virtual network interfaces are visible using <code>ifconfig</code>.
Identifying which interfaces belong to a particular virtual node must
be done by hand, most easily by first logging into the virtual node in
question and doing <code>ifconfig</code>. You can also look at
<code>/var/emulab/jails/</code><i>vnodename</i><code>/rc.ifc</code>
which is the startup script used to configure the node's interfaces.
In addition to the usual information, <code>ifconfig</code> on a
virtual device also shows which route table (rtabid), broadcast
domain (vethtag) and parent device (parent) it is associated with.
See <a href="#TechDetails">Technical Details</a> below for what
these mean.
<li> <b>Routing tables.</b>
Every virtual node has its own IP routing table. Each table is
identified by an ID, the "rtabid." Tables can be viewed in the
parent using <code>netstat</code> with the enhanced '-f inet' option:
<code><pre>
netstat -ran -f inet
netstat -ran -f inet:3
netstat -ran -f inet:-1
</code></pre>
The first form shows IPv4 routes in the "main" (physical host's)
routing table. The second would show routing table 3, and the last
shows all active routing tables. Routing tables may be modified using
the <code>route</code> command with the new '-rtabid N' option, where
N is the rtabid:
<code><pre>
route add -rtabid 3 -net 192.168/16 -interface lo0
</code></pre>
</ul>
<h3>Controlling virtual node layout.</h3>
<a NAME="AI2"></a><h3>Controlling virtual node layout.</h3>
<p>
Normally, the Emulab resource mapper, <code>assign</code>,
will map virtual nodes onto physical
......@@ -234,63 +294,224 @@ can without exceeding a node's internal or external network bandwidth
capabilities and without exceeding a node-type specific static packing
factor. Internal network bandwidth is an empirically derived value for
how much network data can be moved through internally connected virtual
ethernet interfaces. External network bandwidth is based on the number
ethernet interfaces. External network bandwidth is determined by the number
of physical interfaces available on the node. The static packing factor is
intended as a coarse metric of CPU and memory load that a physical node
can support; currently it is based strictly on the amount of physical memory.
The current values for these constraints are:
can support; currently it is based strictly on the amount of physical memory
in each node type. The current values for these constraints are:
<ul>
<li>Internal network bandwidth: 400Mb/sec for all node types
<li>External network bandwidth: 400Mb/sec for all node types
<li>External network bandwidth: 400Mb/sec (4 x 100Mb NICs) for all node types
<li>Packing factor: 10 for pc600s and pc1500s, 20 for pc850s and pc2000s
</ul>
</p><p>
The mapper generally produces an "unsurprising" mapping of virtual nodes
to physical nodes (e.g., mapping small LANs all on the same physical host)
and where it doesn't, it is usually because doing so would violate one
of the constraints. However, there are circumstances in which you might
want to modify or even override the way in which mapping is done.
Currently there are only limited ways in which to do this, and none of
these will allow you to violate the constraints above.
of the constraints. One exception involves LANs.
</p><p>
Using the NS-extension <code>tb-set-colocate-factor</code> command, you
can globally reduce (not increase!) the maximum number of virtual nodes
One might think that an entire 100Mb LAN, regardless of the number of
members, could be located on a single physical host since the internal
bandwidth of a host is 400Mb/sec. Alas, this is not the case. A LAN
is modeled in Emulab as a set of point-to-point links to a "LAN node."
The LAN node will then see 100Mb/sec from every LAN member. For the
purposes of bandwidth allocation, a LAN node must be mapped to a physical
host just as any other node. The difference is that a LAN node may be
mapped to a switch, which has "unlimited" internal bandwidth,
as well as to a node. Now consider the case of a 100Mb/sec LAN with 5 members.
If the LAN node is colocated with the other nodes on the same physical
host, it is a violation as 500Mb/sec of bandwidth is required for
the LAN node. If instead the LAN node is mapped to a switch, it is
still a violation because now we need 500Mb/sec from the physical node
to the switch, but there is only 400Mb/sec available there as well.
Thus you can only have 4 members of a 100Mb/sec LAN on any single physical
host. You can however have 4 members on each of many physical hosts to
form a large LAN; in this case the LAN node will be located on the switch.
Note that this discussion applies equally to a 50Mb/sec LAN (at most 8
members per host), a 20Mb/sec LAN (at most 20 members per host), or any LAN
whose members on one host would need more than 400Mb/sec in aggregate.
And of course, you must take into consideration
the bandwidth of all other links and LANs on a node.
Now you know why we have a complex program to do this!
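As a rough illustration of the arithmetic above, the following plain Tcl
sketch computes how many members of a single LAN fit on one physical host.
The <code>max_lan_members</code> procedure is a hypothetical helper, not an
Emulab command, and it assumes the 400Mb/sec per-host limit is the only
constraint in play (it ignores any other links or LANs on the host):
<code><pre>
# Hypothetical helper: largest number of members of one LAN that can share
# a physical host without exceeding the per-host bandwidth limit.
# Both arguments are integers in Mb/sec; the 400 default is the limit above.
proc max_lan_members {lan_mbps {host_mbps 400}} {
    # Every member needs lan_mbps to reach the LAN node (integer division).
    return [expr {$host_mbps / $lan_mbps}]
}

puts [max_lan_members 100]  ;# prints 4
puts [max_lan_members 50]   ;# prints 8
puts [max_lan_members 20]   ;# prints 20
</pre></code>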
</p><p>
Anyway, if you are still not deterred and feel you can do a better job
of virtual to physical node mapping yourself, there are a few ways to
do this. Note carefully though that none of
these will allow you to violate the bandwidth and packing constraints
listed above.
</p><p>
The NS-extension <code>tb-set-colocate-factor</code> command allows you
to globally decrease (not increase!) the maximum number of virtual nodes
per physical node. This command is useful if you know the application
load you are running in the vnodes is going to require more resources
per instance (e.g., a Java DHT).
per instance (e.g., a Java DHT), and that the Emulab-picked values of
10-20 per physical node are just too high.
Note that currently, this is not really a "factor,"
it is an absolute value. Setting it to 5 will reduce the capacity of
all node types to 5, whether they were 10 or 20 by default.
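For example, assuming the command takes the new per-host limit as its only
argument, an NS file that wants at most 5 virtual nodes on any physical
node might contain:
<code><pre>
# Cap every physical host at 5 virtual nodes, whatever its node type.
# (Argument form is an assumption; the value is absolute, not a multiplier.)
tb-set-colocate-factor 5
</pre></code>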
</p><p>
Since <code>assign</code> uses a heuristic algorithm at its core,
sometimes it just doesn't find the best solution that you might think
is obvious. If assign just won't colocate virtual nodes that you want
colocated, you can resort to trying to do the mapping by hand using
<code>tb-fix-node</code>.
<i>TODO:
using tb-set-jail-os,
using tb-set-noshaping,
understanding how bandwidth affects layout.
How do I know what the right colocate factor is?
ENDTODO</i>
<h3>Mixing virtual and physical nodes.</h3>
If the packing factor is ok, but <code>assign</code>
just won't colocate virtual nodes the way you want,
you can resort to trying to do the mapping by hand using
<code>tb-fix-node</code>. This technique is not for the faint of heart
(or weak of stomach) as it involves mapping virtual nodes to specific
physical nodes, which you must determine in advance are available.
For example, the following code snippet will allocate 8 nodes in a LAN
and force them all onto the same physical host (pc41):
<code><pre>
set phost pc41 ;# physical node to use
set phosttype 850 ;# type of physical node, e.g. pc850
# Force virtual nodes in a LAN to one physical host
set lanstr ""
for {set j 1} {$j <= 8} {incr j} {
set n($j) [$ns node]
append lanstr "$n($j) "
tb-set-hardware $n($j) pcvm${phosttype}
tb-fix-node $n($j) $phost
}
set lan [$ns make-lan "$lanstr" 10Mb 0ms]
</code></pre>
If the host is not available, this will fail. Note again that "fixing"
nodes will still not allow you to violate any of the fundamental
mapping constraints.
</p><p>
There is one final technique that will allow you to circumvent
<code>assign</code> and the bandwidth constraints above.
The NS-extension <code>tb-set-noshaping</code> can be used to turn off
link shaping for a specific link or LAN, e.g.:
<code><pre>
tb-set-noshaping $lan 1
</code></pre>
added to the NS snippet above would allow you to specify "1Mb" for the
LAN bandwidth and map 20 virtual nodes to the same physical host,
but then not be bound by the bandwidth constraint later.
In this way <code>assign</code> would map your topology, but no enforcement
would be done at runtime. Specifically, this tells Emulab not to set
up ipfw rules and dummynet pipes on the specified interfaces.
One semi-legitimate use
of this command is in the case where you know that your applications
will not exceed a certain bandwidth, and you don't want to incur the
ipfw/dummynet overhead associated with explicitly enforcing the limits.
Note that, as implied by the name, this turns off all shaping of a link,
not just the bandwidth constraint. So if you need delays or packet loss,
don't use this.
<a NAME="AI3"></a><h3>How do I know what the right colocate factor is?</h3>
The hardest issue when using virtual nodes is determining how many
virtual nodes you can colocate on a physical node, without affecting the
fidelity of the experiment. Ultimately, the experimenter must make
this decision, based on the nature of the applications run and what exactly
is being measured. We provide some simple limits (e.g., network bandwidth
caps) and coarse-grained aggregate limits (e.g., the default colocation factor),
but these are hardly adequate.
<p>
One thing to try is to allocate a modest sized version of your experiment,
say 40-50 nodes, using just physical nodes and compare that to the same
experiment with 40-50 virtual nodes with various packing factors.
</p><p>
We are currently working on techniques that will allow you to specify
some performance constraints in some fashion, and have the experiment
run and self-adjust until it reaches a packing factor that doesn't violate
those constraints.
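</p><p>
One way to set up the physical-versus-virtual comparison suggested above is
to drive the node type and packing limit from variables at the top of the
NS file, so the same topology can be resubmitted with different settings.
This is only a sketch: the <code>usevnodes</code> and <code>colocate</code>
variables are hypothetical names, the 40-node 10Mb LAN is an arbitrary
stand-in for your real topology, and <code>tb-set-colocate-factor</code> is
assumed to take the limit as its only argument:
<code><pre>
set usevnodes 1  ;# hypothetical switch: 1 = virtual nodes, 0 = physical
set colocate 10  ;# hypothetical packing limit to try on this run

set lanstr ""
for {set j 1} {$j <= 40} {incr j} {
    set n($j) [$ns node]
    append lanstr "$n($j) "
    if {$usevnodes} {
        tb-set-hardware $n($j) pcvm
    } else {
        tb-set-hardware $n($j) pc
    }
}
# The packing limit only matters when virtual nodes are in use.
if {$usevnodes} {
    tb-set-colocate-factor $colocate
}
set lan [$ns make-lan "$lanstr" 10Mb 0ms]
</pre></code>
Running the application once with <code>usevnodes</code> set to 0 and then
with several values of <code>colocate</code> gives a baseline against which
to judge each packing factor.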
<a NAME="AI4"></a><h3>Mixing virtual and physical nodes.</h3>
It is possible to mix virtual nodes and physical nodes in the same
experiment. For example, we could set up a LAN, similar to the above example,
such that half the nodes were virtual (pcvm) and half physical (pc):
<code><pre>
set lanstr ""
for {set j 1} {$j <= 8} {incr j} {
set n($j) [$ns node]
append lanstr "$n($j) "
if {$j & 1} {
tb-set-hardware $n($j) pcvm
} else {
tb-set-hardware $n($j) pc
tb-set-node-os $n($j) FBSD-STD
}
}
set lan [$ns make-lan "$lanstr" 10Mb 0ms]
</code></pre>
The current limitation is that the physical nodes must run FreeBSD because
of the use of the custom encapsulation on virtual ethernet devices. Note
that this also implies that the physical nodes use virtual ethernet devices
and thus the MTU is likewise reduced.
<p>
We have implemented, but not yet deployed, a non-encapsulating version of the
virtual ethernet interface that will allow virtual nodes to talk directly to
physical ethernet interfaces and thus remove the FreeBSD-only and reduced-MTU
restrictions.
<a NAME="Limitations"></a><h2>Limitations</h2>
<i>TODO:
Must run FreeBSD and a particular version at that.
No resource guarantees for CPU and memory.
veth encapsulation reduces MTU.
Vnode control net not externally visible.
400Mb internal "network" bandwidth.
Only scale to low 1000s of nodes due to various bottlenecks
(assign, NFS, routing).
No consoles.
Always use linkdelays (more overhead, requires 1000Hz kernel).
Not a complete virtualization, many commands "see through".
ENDTODO</i>
Following are the primary limitations of the Emulab virtual node
implementation.
<ul>
<li> <b>Not a complete virtualization of a node.</b>
We make no claims about being a true x86 or even BSD/Linux
virtual machine. We build on an existing mechanism (jail) with
the primary goal of providing functional transparency to applications.
We are even more lax in that we assume that all virtual nodes on a
physical host belong to the same experiment. This reduces the security
concerns considerably. For example, if a virtual node is able to crash
the physical machine or is able to see data outside its scope, it only
affects the particular experiment. This is not to say that we are
egregious in our violation. A particular example is that virtual nodes
are allowed to read /dev/mem. This made it much easier as we did not
have to either virtualize /dev/mem or rewrite lots of system utilities
that use it. The consequence is that virtual nodes can spy on each
other if they want. But then, if you cannot trust yourself, who can
you trust?
<li> <b>No resource guarantees for CPU and memory on nodes.</b>
We also don't provide complete performance isolation. We currently
have no virtual node aware CPU scheduling mechanisms. Processes in
virtual nodes are just processes on the real machine and are scheduled
according to the standard BSD scheduler. There are also no limits
on virtual or physical memory consumption by a virtual node.
<li> <b>Nodes must run a specific version of FreeBSD.</b>
We have hacked the FreeBSD 4.7 kernel mightily to support virtual nodes.
See
<a href="../doc/docwrapper.php3?docname=jail.html">this document</a>
for details, but suffice it to say, making these changes to Linux
or even other versions of FreeBSD would be a huge task. That said,
the changes have been made to FreeBSD 4.9, which will become the default
"jail kernel" at some point. We will also be providing a virtual node
environment for Linux, likely using Xen.
<li> <b>Will only scale to low 1000s of nodes.</b>
We currently have a number of scaling issues that make it impractical
to run experiments of more than 1000-2000 nodes. These range from
algorithmic issues in the resource mapper and route calculator, to
physical issues like too few and too feeble physical nodes, to
user interface issues like how to present a listing or visualization
of thousands of nodes in a useful way.
<li> <b>Virtual nodes are not externally visible.</b>
Due to a lack of routable IP space, virtual nodes are given non-routable
control net addresses and thus cannot be accessed directly from outside
Emulab. You must use a suitable proxy or access them from the Emulab
user-login server.
<li> <b>Virtual ethernet encapsulation reduces the MTU.</b>
This is a detail, but of possible importance to people since they
are doing network experiments. The veth device reduces the MTU by
16 bytes to 1484. As mentioned, we have a version of the interface
which does not use encapsulation.
<li> <b>Only 400Mb of internal "network" bandwidth.</b>
This falls in the rinky-dink node category. As most of our nodes
are based on ancient 100MHz FSB, sub-GHz technology, they cannot
host many virtual nodes or high capacity virtual links. The next
wave of cluster machines will be much better in this regard.
<li> <b>No node consoles.</b>
Virtual nodes do not have a virtual console. If we discover a need
for one, we will implement it.
<li> <b>Must use "linkdelays."</b>
To enable topology-on-a-single-node configurations and to conserve
physical resources in the face of large topologies, we use on-node
traffic shaping rather than dedicated traffic shaping nodes. This
increases the overhead on the host machine slightly. To improve
the fidelity of delays and bandwidth shaping, virtual node hosts
run their kernel at 1000Hz rather than 100Hz. This is unlikely to
affect anything, but is mentioned for completeness.
</ul>
<a NAME="KnownBugs"></a><h2>Known Bugs</h2>
<i>TODO:
Deadlocks in loopback mounts.
ENDTODO</i>
There is currently a problem with the "loopback" (nullfs) filesystem
mechanism we use to export filesystems to virtual nodes. It is prone to
deadlock under load. To be safe, you should do all your logging and
heavy file activity inside the "file" disk (e.g., in /var).
<a NAME="TechDetails"></a><h2>Technical Details</h2>
There is an
<a href="../doc/docwrapper.php3?docname=jail.html">online document</a>
......