From b8c0b937e1d06c842e2283fa0a81c0d7cf925b3c Mon Sep 17 00:00:00 2001
From: Mike Hibler <mike@flux.utah.edu>
Date: Thu, 13 Sep 2007 23:23:29 +0000
Subject: [PATCH] Last dump for now (almost 60K, sheesh!)

---
 doc/delay-implementation.txt | 364 +++++++++++++++++++++++------------
 1 file changed, 236 insertions(+), 128 deletions(-)

diff --git a/doc/delay-implementation.txt b/doc/delay-implementation.txt
index 9819b25386..dc85428e67 100644
--- a/doc/delay-implementation.txt
+++ b/doc/delay-implementation.txt
@@ -758,8 +758,6 @@ This translates into:
   |       |             +-----+       +-----+             |       |
   +-------+                   +-------+                   +-------+
 
-
-
 4b. delay-agent on end nodes.
 
 Fill me in...
@@ -880,17 +878,26 @@ soon.
 
 The CREATE event is sent to all nodes in the cloud (rather, to the shaping
 node responsible for each node's connection to the underlying LAN) and
-creates "node pair" pipes for each node to all other nodes on the LAN.
-Each node-to-LAN connection has two pipes associated with each possible
-destination on the LAN (destinations determined from /etc/hosts file).
-The first pipe is used for most situations and contains BW/delay values
-for the pair.  The second pipe is used when operating in Flexlab hybrid
-mode as described below.  Characteristics of these per-pair pipes cannot
-be set/modified unless a CREATE command has first been executed.
+creates, internal to the delay-agent, "node pair" pipes for each node to
+all other nodes on the LAN.  Actual IPFW rules and dummynet pipes are only
+created the first time a per-pair pipe's characteristics are set via the
+MODIFY event.  This behavior is in part an optimization, but is also
+essential for the hybrid model described later.
+
 There is a corresponding CLEAR event which will destroy all the per-pair
 pipes, leaving only the standard delayed LAN setup (node to LAN pipes).
 
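+The exact form of these events for the node-pair case is not shown in this
+document; by analogy with the per-flow CREATE and MODIFY examples below,
+they are presumably of the form (the addressing here is an assumption):
+
+    # create the internal per-pair pipes for n1's connection to the cloud
+    tevc -e pid/eid now cloud-n1 CREATE
+
+    # destroy all per-pair pipes, leaving only the default pipes
+    tevc -e pid/eid now cloud-n1 CLEAR
+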
-The cloud snippet above would translate into a setup of:
+Each node-to-LAN connection has two pipes associated with each possible
+destination on the LAN (destinations determined from /etc/hosts file).
+The first pipe is used for shaping bandwidth for the pair.  The second
+pipe is used for shaping delay (and eventually packet loss).  While it
+might seem that the single pipe from a node to the LAN would be sufficient
+for shaping both, the split is needed when operating in the hybrid mode
+as described below.  Characteristics of these per-pair pipes cannot be
+modified unless a CREATE command has first been executed.
+
+Assuming all IPFW/dummynet pipes have been modified, the cloud snippet
+above would translate into a physical setup of:
 
   +----+                        +-------+                        +-------+
   |    |--- to n2 pipe -->+-----+       +-----+<- from n2 pipe --|       |
@@ -914,7 +921,16 @@ The cloud snippet above would translate into a setup of:
 where the top two pipes in each set of three are the new, per-pair pipes
 and the final pipe is the standard shaping pipe which can be thought of
 as the "default" pipe through which any traffic flows for which there is
-not a specific per-pair setup.
+not a specific per-pair setup.  In IPFW, the rules associated with the
+per-pair pipes are numbered starting at 60000 and counting down.  This gives
+them higher priority than the default pipes, which are numbered above 60000.
+
+One important thing to note is that while bandwidth is shaped on the
+outgoing pipe, when a delay value is set on n1 for destination n2, it is
+imposed on the link *into* n1.  This is different than for regular LAN
+shaping (and for the ACIM model below), where bandwidth, delay and loss
+are all applied in one direction.  The reason for the split is explained
+in the hybrid-model discussion below.
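+
+As a rough sketch (rule labels, pipe names and the configured values here
+are illustrative, not taken from a real shaping node), the rules for the
+n1/n2 pair on n1's shaping interfaces might look like:
+
+    # per-pair rules, numbered just below 60000
+    <n2-bw>  pipe <pipe0a> ip from any to 10.0.0.2 in recv <if0>
+             pipe <pipe0a> config bw 1000Kbit/sec
+    <n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
+             pipe <pipe1a> config delay 10ms
+    # default rules, numbered above 60000
+    <def>    pipe <pipe0>  ip from any to any in recv <if0>
+    <def>    pipe <pipe1>  ip from any to any in recv <if1>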
 
 5a. Simple mode setup:
 
@@ -929,10 +945,15 @@ command).  If the DEST parameter is not given, then the modification is
 applied to the "default" pipe (i.e., the normal shaping behavior).  For
 example:
 
-    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BW=1000 DELAY=10 PLR=0
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=1000 DELAY=10
+
+Assuming 10.0.0.2 is "n2" in the diagram above, this would change n1's
+"to n2 pipe" to shape the bandwidth, and change n1's "from n2 pipe" to
+handle the delay.  If a more "balanced" shaping is desired, half of each
+characteristic could be applied to both sides via:
 
-Assuming 10.0.0.2 is "n2", this would change the "n1 to n2 pipe" and
-possibly the "n1 from n2 pipe."
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=1000 DELAY=5
+    tevc -e pid/eid now cloud-n2 MODIFY DEST=10.0.0.1 BANDWIDTH=1000 DELAY=5
 
 5b. ACIM mode setup:
 
@@ -941,16 +962,22 @@ were not enough, here we further add per-flow pipes!  For example, in the
 diagram above, the six pipes for n1 might also have a seventh pipe for
 "n1 TCP port 10345 to n2 TCP port 80" if a monitored web application running
 on n1 were to connect to the web server on n2.  That pipe could then have
-specific BW, delay and loss characteristics.  It should be noted that only
-one pipe is created here to serve BW/delay/loss, unlike the split of BW
-from the others on per-pair pipes.  The one pipe is in the node-to-lan
-outgoing direction (i.e., on the left hand side in the diagram above).
+specific BW, delay and loss characteristics.
 
-For an application being monitored with ACIM, these more specific pipes
-are created for each flow on the fly as connections are formed.  Flows
-from unmonitored applications will use the node pair pipes.  Note that
-this would include return traffic to the monitored application unless the
-other end were also monitored.
+Note that only one pipe is created here to serve bandwidth, delay and loss,
+unlike the per-pair pipes where bandwidth is split off from the others.
+The one pipe is in the node-to-LAN outgoing direction (i.e., on the
+left-hand side in the diagram above).
+
+Higher priority is given to per-flow pipes by numbering the IPFW rules
+starting from 100 and working up.  Thus the priority is: per-flow pipe,
+per-pair pipe, default pipe.
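+
+For instance, for the web flow mentioned above, the ordering of rules on
+n1's node-side interface might look roughly like this (rule numbers and
+pipe names are illustrative):
+
+    # per-flow rule, numbered from 100 up -- matched first
+    100   pipe <pipeF>  tcp from 10.0.0.1 10345 to 10.0.0.2 80 in recv <if0>
+    # per-pair rule, numbered just below 60000
+    59999 pipe <pipe0a> ip from any to 10.0.0.2 in recv <if0>
+    # default rule, numbered above 60000
+    60100 pipe <pipe0>  ip from any to any in recv <if0>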
+
+For an application being monitored with ACIM, the flow pipes are created
+for each flow on the fly as connections are formed.  Flows from unmonitored
+applications will use the node pair pipes.  Note that this would include
+return traffic to the monitored application unless the other end were also
+monitored.
 
 The tevc command sports even more parameters to support per-flow pipes.
 In addition to the DEST parameter, there are three others needed:
@@ -964,11 +991,24 @@ SRCPORT:
 DSTPORT:
     The destination UDP or TCP port number.
 
-An example:
+An example follows.  First, a flow pipe must be explicitly created:
+
+    tevc -e pid/eid now cloud-n1 CREATE \
+	DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80
+
+Note that unlike per-pair pipes, the CREATE call here immediately creates
+the associated IPFW rule and dummynet pipe.  A flow pipe will inherit its
+initial characteristics from the "parent" per-pair pipe.  Those
+characteristics can be changed with:
 
     tevc -e pid/eid now cloud-n1 MODIFY \
 	DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80 \
-	BW=1000 DELAY=10 PLR=0
+	BANDWIDTH=1000 DELAY=10
+
+When finished, the flow pipe is destroyed with:
+
+    tevc -e pid/eid now cloud-n1 CLEAR \
+	DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80
 
 5c. Hybrid mode setup:
 
@@ -978,135 +1018,203 @@ form.  For a given node, it allows full per-destination delay settings and
 partial per-destination bandwidth settings.  All destinations that do not
 have individual bandwidth pipes will share a single, default bandwidth pipe.
 
-This is where the seperate pipes for BW and delay/plr described above
-come into play.  In the current implementation, every node pair has
-individual delay and loss characteristics.  These are implemented on the
-"from node" pipes (i.e., the right-hand side of the diagram above).  Thus
-for a LAN of N nodes, each node will have N-1 "from node" pipes 
-Nodes
-may then also have per-node pair BW pipes to some, but possibly not all,
-of the other nodes.
+This is where the separate pipes for bandwidth and delay/plr described above
+come into play.  Recall that the CREATE call only creates a full NxN set of
+pipes internally, and that actual dummynet pipes are only created when the
+first MODIFY event for the pipe is received.  This allows for having only
+a subset of per-pair pipes active.  Hence, for a given node, by explicitly
+setting the characteristics for only some destination nodes, all other
+destinations will use the default pipe and its characteristics.  This is
+how hybrid mode achieves a shared destination bandwidth.
 
+Specifically, in the current Flexlab hybrid-model implementation, every
+node pair is set with individual delay and loss characteristics via MODIFY
+events.  These are the "from node" pipes (i.e., the right-hand side of
+the diagram above).  Thus for a LAN of N nodes, each node will have N-1
+such "from node" pipes active.  Nodes may then also have per-node pair
+bandwidth pipes to some, but possibly not all, of the other nodes.  These
+are the "to node" (left-hand side) pipes.  Where specific bandwidth per-pair
+pipes are not setup with MODIFY, the default pipe will then be used and
+thus its bandwidth shared by traffic to all unnamed destinations.
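+
+For example, if n1 has been given an explicit bandwidth pipe only for n2,
+the active "to node" rules on n1's node-side interface would be just
+(labels are illustrative):
+
+    <n2-bw> pipe <pipe0a> ip from any to 10.0.0.2 in recv <if0>
+    <def>   pipe <pipe0>  ip from any to any in recv <if0>
+
+Traffic to every other destination falls through to the default rule and
+shares its bandwidth.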
 
-To setup unique characteristics per pair, the event should specify a DEST
-parameter:
+This mechanism allows only a single set of shared destination bandwidth
+nodes.  The implementation will have to be modified to allow multiple
+shared destination bandwidth sets or shared source bandwidth sets.
 
-  tevc -e pid/eid now link-node DEST=10.0.0.2 DELAY=10 PLR=0
+The tevc commands to set up unique delay characteristics per pair use the
+DEST parameter:
 
-would say that the link "link-node" from us to 10.0.0.2 should have the
-indicated characteristics.  To setup a shared bandwidth, omit the DEST:
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 DELAY=10
 
-  tevc -e pid/eid now link-node BANDWIDTH=1000
+would say that traffic from us to 10.0.0.2 should have a 10ms round-trip
+delay.  Likewise for setting up unique per-pair bandwidth:
 
-which says that all traffic to all hosts reachable on link-node should share
-a 1000Kb *outgoing* bandwidth.  To allow some hosts to have per-pair
-bandwidth while all others share, then use a command with DEST and BANDWIDTH:
-
-  tevc -e pid/eid now link-node DEST=10.0.0.2 BANDWIDTH=5000
-  tevc -e pid/eid now link-node BANDWIDTH=1000
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=5000
 
 which says that traffic between us and 10.0.0.2 has an outgoing "private"
-BW of 5000Kb while traffic from us to all other nodes in the cloud shares
-a 1000Kb outgoing bandwidth.
+BW of 5000Kb.  To establish the "default" shared bandwidth, we simply
+omit the DEST:
 
-5d. Flexlab shaping implementation.
+    tevc -e pid/eid now cloud-n1 MODIFY BANDWIDTH=1000
 
-At the current time, a Flexlab experiment must have all nodes in a "cloud"
-created via the "make-cloud" method instead of "make-lan."  Make-cloud is
-just syntactic sugar for creating an unshaped LAN with mustdelay set, e.g.:
+to say that traffic from us to all other nodes in the cloud shares a 1000Kb
+outgoing bandwidth.
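+
+Putting these pieces together for n1 in a three-node cloud (10.0.0.3 is
+assumed here to be a third node "n3"; all values are illustrative), a
+hybrid setup for n1 might look like:
+
+    # per-pair delay from each other node into n1
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 DELAY=10
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.3 DELAY=20
+    # private outgoing bandwidth from n1 to n2
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=5000
+    # shared outgoing bandwidth from n1 to all remaining destinations
+    tevc -e pid/eid now cloud-n1 MODIFY BANDWIDTH=1000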
 
-    set link [$ns duplex-link n1 n2 100Mbps 0ms DropTail]
-    $link mustdelay
+5d. Late additions to Flexlab shaping.
 
-This cloud must have at least three nodes as LANs of two nodes are optimized
-into a link and links do not give us all the pipes we need, as we will see
-soon.
+A later, quick hack added the ability to specify multiple sets of shared
+outgoing bandwidth nodes.  A specification like:
 
-This whole thing is implemented using the two shaping pipes that connect
-every node to a LAN.  Since delay and packet loss are per-node pair but
-bandwidth may be applied to sets of nodes
-The delay and PLR are set on the incoming (lan-to-node)
-pipe, while the BW is applied to the outgoing (node-to-lan) pipe.  Note that
-this is completely different than the normal shaping done on a LAN node.
-Normally, the delay/plr are divided up between the incoming and outgoing pipes.
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2,10.0.0.3 BANDWIDTH=5000
 
-So it looks like:
+creates a "per node pair" style pipe for which the destination is a list
+of nodes rather than a single node.  This directly translates into an IPFW
+command:
 
-  +-------+                   +-------+                   +-------+
-  |       |             +-----+       +-----+             |       |
-  | node0 |--- pipe0 -->| if0 |       | if1 |<-- pipe1 ---|       |
-  |       |    (BW)     +-----+       +-----+  (del/plr)  |       |
-  +-------+                   |       |                   |       |
-                              |       |                   |       |
-  +-------+                   |       |                   |       |
-  |       |             +-----+       +-----+             |       |
-  | node1 |--- pipe2 -->| if2 | delay | if3 |<-- pipe3 ---| "lan" |
-  |       |    (BW)     +-----+       +-----+  (del/plr)  |       |
-  +-------+                   |       |                   |       |
-                              |       |                   |       |
-  +-------+                   |       |                   |       |
-  |       |             +-----+       +-----+             |       |
-  | node2 |--- pipe4 -->| if4 |       | if5 |<-- pipe5 ---|       |
-  |       |    (BW)     +-----+       +-----+  (del/plr)  |       |
-  +-------+                   +-------+                   +-------+
+    ipfw add <pipe> pipe <pipe> ip from any to 10.0.0.2,10.0.0.3 in recv <if>
 
-This means that, for any pair of nodes n1 and n2, packets from n1 to n2
-have the BW shaped leaving n1 but the delay applied when arriving at n2
+so it was straightforward, though hacky, to implement in the current
+delay-agent.  This is clearly more general than the single "default rule"
+bandwidth, but would be less efficient in the case where there is only
+one set.
 
-    NOTE: In both the link and LAN case, we have only a single pipe
-    on each side of the shaping node.  While this is sufficient for
-    implementing basic delays, it causes some grief for the Flexlab
-    modifications (described later), where we want to potentially run
-    packets through multiple rules in each direction (e.g., once for
-    BW shaping, once for delay shaping).  With IPFW, you can only
-    apply a single rule to a packet passing through.  In order to
-    apply multiple rules, you would have to run through IPFW multiple
-    times.  However, when using IPFW in combination with bridging,
-    packets are only passed through once (as opposed to with IP
-    forwarding, where packets pass through once on input and once on
-    output. 
+A final variation is a mechanism for specifying an "incoming" delay from
+a particular node:
 
-There are additional event parameters for hybrid pipes.
+    tevc -e pid/eid now cloud-n1 MODIFY SRC=10.0.0.2 DELAY=10
 
-EVENTTYPE: CREATE, CLEAR
+This would appear to be equivalent to:
+
+    tevc -e pid/eid now cloud-n2 MODIFY DEST=10.0.0.1 DELAY=10
 
-# "flow" pipe events
-CREATE: create "flow" pipes.  Each link has two pipes associated with each
-        possible destination (destinations determined from /etc/hosts file).
-	The first pipe is used for most situations and contains BW/delay
-	values.  The second pipe is used when operating in Flexlab hybrid mode.
-	In that case the first pipe is used for delay, the second for BW.
+and for round-trip traffic they will produce the same result.  However,
+they will perform differently for one-way traffic.  For the SRC= rule,
+traffic from n2 to n1 will see 10ms of delay, but for the DEST= rule
+traffic from n2 to n1 will see no delay since the shaping is on the
+return path.  This is really an implementation artifact though.
 
-CLEAR: destroy all "flow" pipes
+So why are there both forms?  I do not recall if there was supposed to
+be a functional difference, or whether it was just a convenience issue
+depending on which object handle you had readily available.
 
+5e. Future additions to Flexlab shaping.
 
-Additional MODIFY arguments:
-BWQUANTUM, BWQUANTABLE, BWMEAN, BWSTDDEV, BWDIST, BWTABLE,
-DELAYQUANTUM, DELAYQUANTABLE, DELAYMEAN, DELAYSTDDEV, DELAYDIST, DELAYTABLE,
-PLRQUANTUM, PLRQUANTABLE, PLRMEAN, PLRSTDDEV, PLRDIST, PLRTABLE,
-MAXINQ
+Thus far, the only additional feature that has been requested is the
+ability to specify a "shared source" bandwidth.  For example, with:
+
+    set cloud [$ns make-cloud "n1 n2 n3 n4" 100Mbps 0ms]
+
+we might want to say: "on n1 I want 1Mbps from {n2,n3}" which would
+presumably translate into a tevc command like:
+
+    tevc -e pid/eid now cloud-n1 MODIFY SRC=10.0.0.2,10.0.0.3 BANDWIDTH=1000
+
+So why is this a problem?  Going back to the base diagram for a cloud
+(for simplicity assuming a shaping node that could handle shaping four links):
+
+  +-------+                   +-------+                     +-------+
+  |       |             +-----+       +-----+               |       |
+  |  n1   |- to pipes ->| if0 |       | if1 |<- from pipes -|       |
+  |       |    (BW)     +-----+       +-----+    (del)      |       |
+  +-------+                   |       |                     |       |
+                              |       |                     |       |
+  +-------+                   |       |                     |       |
+  |       |             +-----+       +-----+               |       |
+  |  n2   |- to pipes ->| if2 |       | if3 |<- from pipes -|       |
+  |       |    (BW)     +-----+       +-----+    (del)      |       |
+  +-------+                   |       |                     |       |
+                              | delay |                     | "lan" |
+  +-------+                   |       |                     |       |
+  |       |             +-----+       +-----+               |       |
+  |  n3   |- to pipes ->| if4 |       | if5 |<- from pipes -|       |
+  |       |    (BW)     +-----+       +-----+    (del)      |       |
+  +-------+                   |       |                     |       |
+                              |       |                     |       |
+  +-------+                   |       |                     |       |
+  |       |             +-----+       +-----+               |       |
+  |  n4   |- to pipes ->| if6 |       | if7 |<- from pipes -|       |
+  |       |    (BW)     +-----+       +-----+    (del)      |       |
+  +-------+                   +-------+                     +-------+
+
+So the shaping would need to be applied in the "from pipes" for "cloud-n1"
+(i.e., the upper right).  However, the from pipes already include one pipe
+per source node for adding the per-pair delay from each other node to n1:
+
+    <n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
+             pipe <pipe1a> config delay 10ms
+    <n3-del> pipe <pipe1b> ip from 10.0.0.3 to any in recv <if1>
+             pipe <pipe1b> config delay 20ms
+    <n4-del> pipe <pipe1c> ip from 10.0.0.4 to any in recv <if1>
+             pipe <pipe1c> config delay 30ms
 
-5b. Hybrid model mods
+to which we would need to add a rule for shared bandwidth:
 
-We want to be able to specify, at a destination, a source delay from a
-specific node.  For example with nodes H1-H5 we might issue commands:
+    <n1-bw> pipe <pipe1d> ip from 10.0.0.2,10.0.0.3 to any in recv <if1>
+            pipe <pipe1d> config bw 1000Kbit/sec
 
-to H1: "10ms from H2 to me, 20ms from H3 to me"
-	tevc ... elabc-h1 SRC=10.0.0.2 DELAY=10ms
-	tevc ... elabc-h1 SRC=10.0.0.3 DELAY=20ms
-delay from H4 to H1 and H5 to H1 will be the "default" (zero?)
+but only one of these rules can trigger for each packet coming in on <if1>.
+In this case, packets from 10.0.0.2 and .3 will go through the delay pipes
+(pipe1a or pipe1b) and not the bandwidth pipe (pipe1d).  Putting the
+bandwidth pipe first won't help; now packets will pass through it and
+not the delay pipes!
 
-We want to be able to specify, at a source, that some set of destinations
-will share outgoing BW.  Currently we support a single, implied set of
-destinations in the sense that you can specify individual host-host links
-with specific outgoing bandwidth, and then all remaining destinations can
-share the "default" BW.  We want to be able to support multiple, explicit
-sets.  For example, with hosts H1-H5 we might issue:
+We could apply the appropriate bandwidth and delay to each of the from
+pipes from .2 and .3 so that there is only one pipe from each node:
+
+    <n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
+             pipe <pipe1a> config delay 10ms bw 1000Kbit/sec
+    <n3-del> pipe <pipe1b> ip from 10.0.0.3 to any in recv <if1>
+             pipe <pipe1b> config delay 20ms bw 1000Kbit/sec
+
+but now the bandwidth of 1000Kbit/sec is no longer shared.
+
+We could instead augment the left-hand "to pipes", adding a rule for
+traffic "incoming" to n1, so that we had:
+
+    # to pipes
+    <n2-bw> pipe <pipe0a> ip from any to 10.0.0.2 in recv <if0>
+    <n3-bw> pipe <pipe0b> ip from any to 10.0.0.3 in recv <if0>
+    <n4-bw> pipe <pipe0c> ip from any to 10.0.0.4 in recv <if0>
+    # new rule
+    <n1-bw> pipe <pipe0d> ip from 10.0.0.2,10.0.0.3 to any out xmit <if0>
+
+However, when combining bridging (recall, <if0> and <if1> are bridged)
+with IPFW, packets traveling in either direction will only pass through
+IPFW once.  This means that a packet coming from the LAN to n1 will
+trigger the appropriate "in recv <if1>" rule (one of pipe1a-c) and then
+be immediately placed on the outgoing interface <if0> with no further
+filtering.  Hence, the "out xmit <if0>" rule (aka pipe0d) will never be
+triggered.
+
+So we can neither hang a "shared source bandwidth" pipe in either place
+nor achieve the effect by modifying any of the existing pipes.
+
+In the big picture, what we might want to be able to support in a shaping
+node is, for each of BW, delay and loss and for each node in an N-node cloud:
+
+ * shaping from node to {node set}
+ * shaping to node from {node set}
+
+Here a {node set} might be all N-1 other nodes in the LAN, in which case
+we have two shaping pipes for a node, to and from the LAN (aka the current
+asymmetric shaped LAN).  A set might instead contain a single node, in
+which case we have N-1 shaping pipes, one per other node (aka the current
+Flexlab per node pair pipes).  Or there might be multiple pipes with
+subsets of 2 to N-1 nodes (aka shared-source and shared-destination
+bandwidth pipes, as well as possibly useless shared-source and
+shared-destination delay and PLR pipes).  The only requirement on a set
+would be that it be disjoint from any other set.
+
+
+6. Assorted dummynet mods
 
-to H3: "1Mbs to {H1,H2}, 2Mbs to H4"
-	tevc ... elabc-h3 DEST=10.0.0.1,10.0.0.2 BANDWIDTH=1000
-	tevc ... elabc-h3 DEST=10.0.0.4 BANDWIDTH=2000
+Additional MODIFY arguments:
+BWQUANTUM, BWQUANTABLE, BWMEAN, BWSTDDEV, BWDIST, BWTABLE,
+DELAYQUANTUM, DELAYQUANTABLE, DELAYMEAN, DELAYSTDDEV, DELAYDIST, DELAYTABLE,
+PLRQUANTUM, PLRQUANTABLE, PLRMEAN, PLRSTDDEV, PLRDIST, PLRTABLE
+
+MAXINQ
+    Define a maximum time for packets to be in a queue before they
+    are dropped.  This is the way in which ACIM models the queue
+    length of the bottleneck router.
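+
+A MODIFY using these arguments would presumably look something like the
+following (the values, units and distribution name below are assumptions
+for illustration; only the argument names themselves are documented above):
+
+    # hypothetical values and distribution name
+    tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 \
+	DELAYMEAN=20 DELAYSTDDEV=5 DELAYDIST=normal MAXINQ=50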
 
-The "default" in this case will be whatever was setup with an earlier
-	tevc ... elabc-h3 BANDWIDTH=2000
-or unlimited if there was no such command.
-- 
GitLab