Multiplexed Virtual Nodes in Emulab

Overview

In order to allow experiments with a very large number of nodes, we provide a multiplexed virtual node implementation. If an experiment application's CPU, memory, and network requirements are modest, multiplexed virtual nodes (hereafter just "virtual nodes") allow an experiment to use 10-20 times as many nodes as there are available physical machines in Emulab. These virtual nodes can currently only run FreeBSD, but Linux support is coming.

Virtual nodes fall between simulated nodes (a la ns) and real, dedicated machines in terms of how accurately they model the real world. A virtual node is just a lightweight virtual machine running on top of a regular operating system. In particular, our virtual nodes are based on the FreeBSD jail mechanism, which allows groups of processes to be isolated from each other while running on the same physical machine. Emulab virtual nodes provide isolation of the filesystem, process, network, and account namespaces. That is to say, each virtual node has its own private filesystem, process hierarchy, network interfaces and IP addresses, and set of users and groups. This level of virtualization allows unmodified applications to run as though they were on a real machine. Virtual network interfaces are used to form an arbitrary number of virtual network links. These links may be individually shaped and may be multiplexed over physical links or used to connect virtual nodes within a single physical node.

With some limitations, virtual nodes can act in any role that a normal Emulab node can: end node, router, or traffic generator. You can run startup commands, ssh into them, run as root, use tcpdump or traceroute, modify routing tables, and even reboot them. You can construct arbitrary topologies of links and LANs, even mixing virtual and real nodes.

The number of virtual nodes that can be multiplexed on a single physical node depends on a variety of factors including the resource requirements of the application, the type of the underlying node, the bandwidths of the links you are emulating and the desired fidelity of the emulation. See the Advanced Issues section for more info.

Use

Multiplexed virtual nodes are specified in an NS description by indicating that you want the pcvm node type:
	set nodeA [$ns node]
	tb-set-hardware $nodeA pcvm
	
or, if you want all virtual nodes to be mapped to the same machine type, say a pc850:
	set nodeA [$ns node]
	tb-set-hardware $nodeA pcvm850
	
that is, instead of "pcvm" use "pcvmN", where N is the node type (600, 850, 1500, 2000). That's it! With few exceptions, everything you use in an NS file for an Emulab experiment running on physical nodes will work with virtual nodes. The most notable exception is that you cannot specify the operating system for a virtual node; they are limited to running our custom version of FreeBSD 4.10.
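
For reference, a complete, minimal NS file using virtual nodes might look like the following sketch (the two-node topology, names, and link parameters are purely illustrative):

        set ns [new Simulator]
        source tb_compat.tcl

        # Two virtual nodes connected by a shaped link.
        set nodeA [$ns node]
        set nodeB [$ns node]
        tb-set-hardware $nodeA pcvm
        tb-set-hardware $nodeB pcvm

        set link0 [$ns duplex-link $nodeA $nodeB 1Mb 10ms DropTail]

        $ns rtproto Static
        $ns run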

As a simple example, we could take the basic NS script used in the tutorial and add the following lines:

	tb-set-hardware $nodeA pcvm
	tb-set-hardware $nodeB pcvm
	tb-set-hardware $nodeC pcvm
	tb-set-hardware $nodeD pcvm
	
and remove the explicit setting of the OS:
	# Set the OS on a couple.
	tb-set-node-os $nodeA FBSD-STD
	tb-set-node-os $nodeC RHL-STD         
	
and the resulting NS file can be submitted to produce the very same topology. Once the experiment has been instantiated, the experiment web page should include a listing of the reserved nodes that looks something like:

        [Reserved node listing: four virtual nodes with NodeIDs pcvm36-NN, plus their hosting physical node pc36]

By looking at the NodeIDs (pcvm36-NN), you can see that all four virtual nodes were assigned to the same physical node (pc36). (At the moment, control over virtual node to physical node mapping is limited. The Advanced Issues section discusses ways in which you can affect the mapping.) Clicking on the ssh icon will log you in to the virtual node. Virtual nodes do not have consoles, so there is no corresponding icon. Note that there is also an entry for the "hosting" physical node. You can log in to it as well, either with ssh or via the console. See the Advanced Issues section for how you can use the physical host. Finally, note that there is no "delay node" associated with the shaped link. This is because virtual links always use end node shaping.

Logging into a virtual node, you see only the processes associated with your jail:

        PID  TT  STAT      TIME COMMAND
        1846  ??  IJ     0:00.01 injail: pcvm36-5 (injail)
        1883  ??  SsJ    0:00.03 /usr/sbin/syslogd -ss
        1890  ??  SsJ    0:00.01 /usr/sbin/cron
        1892  ??  SsJ    0:00.28 /usr/sbin/sshd
        1903  ??  IJ     0:00.01 /usr/bin/perl -w /usr/local/etc/emulab/watchdog start
        5386  ??  SJ     0:00.04 sshd: mike@ttyp1 (sshd)
        5387  p1  SsJ    0:00.06 -tcsh (tcsh)
        5401  p1  R+J    0:00.00 ps ax
	
The injail process serves the same function as init on a regular node: it is the "root" of the process namespace. Killing it will kill the entire virtual node. Other standard FreeBSD processes include syslogd, cron, and sshd, along with the Emulab watchdog process. Note that process IDs are in fact not virtualized; they are in the physical machine's namespace. However, a virtual node still cannot kill a process that is part of another jail.

Doing a df, you see:

        Filesystem                      1K-blocks      Used   Avail Capacity  Mounted on
        /dev/vn5c                          507999      1484  496356     0%    /
        /var/emulab/jails/local/testbed   6903614     73544 6277782     1%    /local/testbed
        /users/mike                      14081094   7657502 5297105    59%    /users/mike
        ...
	
/dev/vn5c is your private root filesystem, which is a FreeBSD vnode disk (i.e., a regular file in the physical machine's filesystem). /local/projname is "loopback" mounted from the physical host and provides some disk space that is shared among all virtual nodes on the same physical node. Also mounted are the usual Emulab-provided shared filesystems. Thus you have considerable flexibility in sharing, ranging from filesystems shared by all nodes (/users/yourname and /proj/projname), to those shared by all virtual nodes on a physical node (/local/projname), to those private to a virtual node (/local).

Doing ifconfig reveals:

        fxp4: flags=8843 mtu 1500 rtabid 0
                inet 172.17.36.5 netmask 0xffffffff broadcast 172.17.36.5
                ether 00:d0:b7:14:0f:e2
                media: Ethernet autoselect (100baseTX )
                status: active
        lo0: flags=8049 mtu 16384 rtabid 0
                inet 127.0.0.1 netmask 0xff000000 
        veth3: flags=8843 mtu 1484 rtabid 5
                inet 10.1.2.3 netmask 0xffffff00 broadcast 10.1.2.255
                ether 00:00:0a:01:02:03
                vethtag: 513 parent interface: 
	
Here fxp4 is the control net interface. Due to limited routable IP address space, Emulab uses the unroutable 172.16/12 address range to assign control net addresses to virtual nodes. These addresses are routed within Emulab but are not exposed externally. This means that you can access this node (including using the DNS name "nodeC.vtest.testbed.emulab.net") from ops.emulab.net or from other nodes in your experiment, but not from outside Emulab. If you need to access a virtual node from outside Emulab, you will have to proxy the access via ops or a physical node (that is what the ssh icon in the web page does).

veth3 is a virtual ethernet device (not part of standard FreeBSD; we wrote it at Utah) and is the experimental interface for this node. There will be one veth device for every experimental interface. Note the reduced MTU (1484) on the veth interface. This is because the veth device uses encapsulation to identify packets that are multiplexed on physical links. Even though this particular virtual link does not cross a physical wire, the MTU is reduced anyway so that all virtual links have the same MTU.

Advanced Issues

Taking advantage of a virtual node host.

A physical node hosting one or more virtual nodes is not itself part of the topology; it exists only to host virtual nodes. However, the physical node is still set up with user accounts and shared filesystems just as a regular node is, so you can log in to and use the physical node in a variety of ways. We should emphasize, however, that virtual nodes are not "performance isolated" from each other or from the host; i.e., a CPU-hogging monitoring application running on the host might affect the performance and behavior of the hosted virtual nodes.

Following is a list of the per-virtual-node resources and how they can be accessed from the physical host:

Controlling virtual node layout.

Normally, the Emulab resource mapper, assign, maps virtual nodes onto physical nodes in such a way as to achieve the best overall use of physical resources without violating any of the constraints of the virtual nodes or links. In a nutshell, it packs as many virtual nodes onto a physical node as it can without exceeding the node's internal or external network bandwidth capabilities and without exceeding a node-type-specific static packing factor. Internal network bandwidth is an empirically derived value for how much network data can be moved through internally connected virtual ethernet interfaces. External network bandwidth is determined by the number of physical interfaces available on the node. The static packing factor is intended as a coarse metric of the CPU and memory load that a physical node can support; currently it is based strictly on the amount of physical memory in each node type. The current values for these constraints are:

The mapper generally produces an "unsurprising" mapping of virtual nodes to physical nodes (e.g., mapping small LANs entirely onto the same physical host); where it doesn't, it is usually because the obvious mapping would violate one of the constraints. One exception involves LANs.

One might think that an entire 100Mb LAN, regardless of the number of members, could be located on a single physical host, since the internal bandwidth of a host is 400Mb/sec. Alas, this is not the case. A LAN is modeled in Emulab as a set of point-to-point links to a "LAN node." The LAN node will then see 100Mb/sec from every LAN member. For the purposes of bandwidth allocation, a LAN node must be mapped to a physical host just as any other node. The difference is that a LAN node may be mapped to a switch, which has "unlimited" internal bandwidth, as well as to a node.

Now consider the case of a 100Mb/sec LAN with 5 members. If the LAN node is colocated with the other nodes on the same physical host, this is a violation, since the LAN node requires 500Mb/sec of bandwidth. If instead the LAN node is mapped to a switch, it is still a violation, because now we need 500Mb/sec from the physical node to the switch, but there is only 400Mb/sec available there as well. Thus you can have only 4 members of a 100Mb/sec LAN on any single physical host. You can, however, have 4 members on each of many physical hosts to form a large LAN; in this case the LAN node will be located on the switch. Note that this discussion applies equally to 8 members of a 50Mb/sec LAN, 20 members of a 20Mb/sec LAN, or any LAN where the aggregate bandwidth exceeds 400Mb/sec. And of course, you must take into consideration the bandwidth of all other links and LANs on a node. Now you know why we have a complex program to do this!
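
To make the arithmetic above concrete, here is a tiny, purely illustrative Tcl helper (not an Emulab command, just a sketch of the calculation) that computes how many members of a LAN of a given per-member bandwidth can share one physical host, assuming the 400Mb/sec internal bandwidth figure used above:

        # Illustrative only: members-per-host limit implied by internal bandwidth.
        proc max-lan-members-per-host {bw {internal 400}} {
                return [expr {int($internal / $bw)}]
        }
        puts [max-lan-members-per-host 100]    ;# => 4 members of a 100Mb/sec LAN per host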

Anyway, if you are still not deterred and feel you can do a better job of virtual-to-physical node mapping yourself, there are a few ways to do this. Note carefully, though, that none of these will allow you to violate the bandwidth and packing constraints listed above.

The NS-extension tb-set-colocate-factor command allows you to globally decrease (not increase!) the maximum number of virtual nodes per physical node. This command is useful if you know that the application you are running in the vnodes will require more resources per instance (e.g., a Java DHT), and that the Emulab-picked values of 10-20 per physical node are just too high. Note that currently this is not really a "factor"; it is an absolute value. Setting it to 5 will reduce the capacity of all node types to 5, whether they were 10 or 20 by default.
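
For example, to cap the packing at five virtual nodes per physical host for the entire experiment, add the following to your NS file:

        tb-set-colocate-factor 5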

If the packing factor is ok, but assign just won't colocate virtual nodes the way you want, you can resort to trying to do the mapping by hand using tb-fix-node. This technique is not for the faint of heart (or weak of stomach) as it involves mapping virtual nodes to specific physical nodes, which you must determine in advance are available. For example, the following code snippet will allocate 8 nodes in a LAN and force them all onto the same physical host (pc41):

        set phost       pc41    ;# physical node to use
        set phosttype   850     ;# type of physical node, e.g. pc850

        # Force virtual nodes in a LAN to one physical host
        set lanstr ""
        for {set j 1} {$j <= 8} {incr j} {
                set n($j) [$ns node]
                append lanstr "$n($j) "
                tb-set-hardware $n($j) pcvm${phosttype}
                tb-fix-node $n($j) $phost
        }
        set lan [$ns make-lan "$lanstr" 10Mb 0ms]
	
If the host is not available, this will fail. Note again, that "fixing" nodes will still not allow you to violate any of the fundamental mapping constraints.
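
As a sketch of how fixing can be combined with the LAN bandwidth discussion above, the following splits an 8-member 100Mb/sec LAN four-and-four across two physical hosts (pc41 and pc42 here are hypothetical pc850s that you must determine in advance are free):

        # Hypothetical hosts; both must be available at swap-in time.
        set phosts(0)   pc41
        set phosts(1)   pc42

        set lanstr ""
        for {set j 1} {$j <= 8} {incr j} {
                set n($j) [$ns node]
                append lanstr "$n($j) "
                tb-set-hardware $n($j) pcvm850
                # First four members on pc41, the remaining four on pc42.
                tb-fix-node $n($j) $phosts([expr {$j > 4}])
        }
        set lan [$ns make-lan "$lanstr" 100Mb 0ms]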

There is one final technique that will allow you to circumvent assign and the bandwidth constraints above. The NS-extension tb-set-noshaping can be used to turn off link shaping for a specific link or LAN, e.g.:

	tb-set-noshaping $lan 1
	
added to the NS snippet above would allow you to specify "1Mb" for the LAN bandwidth and map 20 virtual nodes to the same physical host, without being bound by the bandwidth constraint later. In this way assign would map your topology, but no enforcement would be done at runtime. Specifically, this tells Emulab not to set up ipfw rules and dummynet pipes on the specified interfaces. One semi-legitimate use of this command is the case where you know that your applications will not exceed a certain bandwidth and you don't want to incur the ipfw/dummynet overhead associated with explicitly enforcing the limits. Note that, as implied by the name, this turns off all shaping of a link, not just the bandwidth constraint, so if you need delays or packet loss, don't use this.
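
Concretely, combining this with the earlier tb-fix-node loop might look like the following sketch (as before, pc41 is a hypothetical host that must be free; the 1Mb figure is only the nominal bandwidth used for mapping and will not be enforced):

        set phost       pc41    ;# hypothetical physical node
        set phosttype   850     ;# its type, e.g. pc850

        # 20 virtual nodes in a nominally 1Mb LAN, all on one physical host.
        set lanstr ""
        for {set j 1} {$j <= 20} {incr j} {
                set n($j) [$ns node]
                append lanstr "$n($j) "
                tb-set-hardware $n($j) pcvm${phosttype}
                tb-fix-node $n($j) $phost
        }
        set lan [$ns make-lan "$lanstr" 1Mb 0ms]

        # Map using the 1Mb figure, but do not set up ipfw/dummynet at runtime.
        tb-set-noshaping $lan 1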

How do I know what the right colocate factor is?

The hardest issue when using virtual nodes is determining how many virtual nodes you can colocate on a physical node, without affecting the fidelity of the experiment. Ultimately, the experimenter must make this decision, based on the nature of the applications run and what exactly is being measured. We provide some simple limits (e.g., network bandwidth caps) and coarse-grained aggregate limits (e.g., the default colocation factor) but these are hardly adequate.

One thing to try is to allocate a modest-sized version of your experiment, say 40-50 nodes, using just physical nodes, and compare it to the same experiment run on 40-50 virtual nodes with various packing factors.
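
One hypothetical way to structure such a comparison is to parameterize a single NS file so that the identical topology can be instantiated on either physical or virtual nodes (the variable name and node count below are illustrative):

        set usevirtual 1        ;# 1 = multiplexed virtual nodes, 0 = physical nodes

        set lanstr ""
        for {set j 1} {$j <= 40} {incr j} {
                set n($j) [$ns node]
                append lanstr "$n($j) "
                if {$usevirtual} {
                        tb-set-hardware $n($j) pcvm
                } else {
                        tb-set-node-os $n($j) FBSD-STD
                }
        }
        set lan [$ns make-lan "$lanstr" 1Mb 0ms]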

We are currently working on techniques that will allow you to specify performance constraints in some fashion, and have the experiment run and self-adjust until it reaches a packing factor that does not violate those constraints.

Mixing virtual and physical nodes.

It is possible to mix virtual nodes and physical nodes in the same experiment. For example, we could set up a LAN, similar to the above example, such that half the nodes are virtual (pcvm) and half are physical (pc):
        set lanstr ""
        for {set j 1} {$j <= 8} {incr j} {
                set n($j) [$ns node]
                append lanstr "$n($j) "
                if {$j & 1} {
                        tb-set-hardware $n($j) pcvm
                } else {
                        tb-set-hardware $n($j) pc
                        tb-set-node-os $n($j) FBSD-STD
                }
        }
        set lan [$ns make-lan "$lanstr" 10Mb 0ms]
	
The current limitation is that the physical nodes must run FreeBSD because of the use of the custom encapsulation on virtual ethernet devices. Note that this also implies that the physical nodes use virtual ethernet devices and thus the MTU is likewise reduced.

We have also implemented a non-encapsulating version of the virtual ethernet interface that allows virtual nodes to talk directly to physical ethernet interfaces and thus removes the reduced-MTU restriction. To use the non-encapsulating version, put:

	tb-set-encapsulate 0
	
in your NS file.

Limitations

Following are the primary limitations of the Emulab virtual node implementation.

Known Bugs

There is currently a problem with the "loopback" (nullfs) filesystem mechanism we use to export filesystems to virtual nodes: it is prone to deadlock under load. To be safe, you should do all your logging and heavy file activity inside the private, file-backed disk (e.g., in /var).

Technical Details

There is an online document covering some of the details of the FreeBSD implementation of virtual nodes. There is a more detailed document in the Emulab source code in the file doc/vnode-impl.txt.