*** Major:
* Fix the entire nalloc/nfree/reloading mechanism. The state control
stuff for it, which is scattered around nfree, tmcd, stated, and the
reload daemon, needs a complete overhaul. Many races, many
opportunities to fail. Mac is thinking about this.
* DB optimization, as per Dave's email of Sun, 19 May 2002 10:35:54.
Good one for Mac, who is good at this DB stuff! Also check cache
sizes in the mysql config file.
Related: Transient lost connections to DB. Requires finding the last
of the queries that do not go through the common interface, and
putting in retry code.
* Event system startup cost. Abhijeet reported that after ISUP, it
could take a very long time for events to start. This is because it
takes a really long time to process the event stream in event-sched
using Ian's original binary tree stuff. I hacked in a fix, but need
to look at that algorithm and perhaps change it. Need to decide
whether insertion should be optimized over deletion.
* Specify maximum running time of a batch experiment.
Ron Oldfield requested this for the case where batches fail and
get stuck in the system. An alternative is an NS "at" event to stop
the simulation (which we would have to convert into something, not
sure what).
* Move flest to another machine?
Related: Change flest to ignore certain tables, like idle stats, to
reduce DB churning. Could do it as a table of table names.
* Jail setup on wide area nodes. LBS: send email to Dave about using
tunnels and our discussion the other day.
* Need to default the OS id version (4.3, 7.1) if we are going to
delay reloading, or else people can get old versions of the OS
when in the same project (last_reservation). This might be moot
depending on what we do wrt reloading when experiments are done.
* tmcd does not appear to be scaling with the advent of ssl. Rob
suggested a combined tmcd command to return the entire node
configuration in one message. We would still keep the individual
calls, but provide a way to get all the data at once and save on a
dozen connections per boot.
* Complete event system overhaul (per-exp elvind, secure elvind,
per-node elvind, distribution of event lists to nodes).
Cannot multicast events to multiple agents at a time.
* Get the program agent working on ron nodes. This is related to (and
dependent on) securing the event system.
* Deal with two ends of a remote link being allocated to the same ron
node! Need to catch the situation for now (and error), but
eventually make sure it does not happen since setting up a tunnel
from a node to itself sounds rather silly.
* Fix Dummynet crashes as reported by Parveen.
* Dummynet validation. Error rates from Table 2 in the paper.
* Switch to ipsec AH tunnels for remote nodes. Faster than user mode
ip in UDP.
* Fix mountd-invalidating-current-mounts problem.
* Fix the DHCP/TFTP bootstrap path: we talked about how to work around
the BIOS DHCP failing, but I also had a couple of cases of corrupt
MFSes while trying to do disk loads en masse.
* Swapping support.
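
For the "Transient lost connections to DB" item above, the retry code
might look roughly like the sketch below. It is only an illustration in
Python, not the testbed's actual DB interface: TransientDBError,
query_with_retry, and the run_query callable are hypothetical names
standing in for whatever the common interface and DB driver provide.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a 'lost connection' error from the DB driver."""


def query_with_retry(run_query, sql, retries=3, delay=0.5, sleep=time.sleep):
    """Run run_query(sql), retrying when the connection is transiently lost.

    run_query is assumed to be the single entry point of the common DB
    interface; it is passed in as a callable so the sketch does not depend
    on any particular driver.
    """
    attempt = 0
    while True:
        try:
            return run_query(sql)
        except TransientDBError:
            attempt += 1
            if attempt > retries:
                raise  # out of retries: let the caller see the failure
            sleep(delay * attempt)  # linear backoff before the next try
```

Putting the retry loop in one place only helps once every query goes
through the common interface, which is why finding the stragglers comes
first.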
*** Medium:
* Front end support for changing delay/bw/plr asymmetrically in
events. Currently, we can do queue params, but the basic delay, bw,
and plr can only be changed symmetrically.
Related: Add NS event support for some of the tb- commands. For
example, tb-set-lan-simplex-params. This is actually harder, since
lans are not directly controllable at the link level anywhere in the
system. delay_config chokes on lans at the moment for this reason.
* Fix up staticroutes to use lan nodes. Actually, I did most of this,
but there was some question that Mike needed to answer about it.
* Change pipe specification on links so that it is clearer. We
currently use pipe0,pipe1 in some places, and the actual pipe
numbers other places (the numbers reported by experiment setup).
This makes it confusing to control pipes via the event system.
Perhaps have the front end generate the pipe numbers so that
everyone agrees on them (frontend, backend, tevc, delayagent,
* Web page to control delay nodes (well, links). Would go nicely with
Chad's new vis tool that shows you the link characteristics.
* Scripts to age idle data!
* Support images with more than one slice (but not the entire disk).
At present, people can make use of the 4th slice, but cannot save it
with an image, unless they create an entire disk image, and we do
not allow mere users to do that. The current disk image stuff is not
flexible enough to support arbitrary slice definitions (save slice
Related: Kill deltas!
* Add MBR initialization to all images, perhaps as a special Frisbee
operation, like Mike did for slice 4. This is to prevent problems
with people messing up the MBR.
* Web Interface Problem with unverified users who forgot their passwords.
The verification stuff needs some work, as does the password stuff,
which retains some of its structure from when we had to deal with
frames. This is mostly a cleanup operation that I can do in a day or
Dave also requested that we get rid of . from the verification key
when it falls at the end so as not to be confused with period.
Also includes suggestion from Dave to make verification a web link
instead of form. So, send them a URL that when clicked verifies
them. This would be a good user interface addition, and not too hard
after the above cleanup.
* Daily experiment stats report sent by email. To include such things
#expts-created     success/fail  #PCs  Avg#PCs/expt?
#expts-terminated  ""            ""    ""
#swapped-in        ""            ""    ""
#swapped-out       ""            ""    ""
See Jay's message to tbops of Fri, 19 Apr 2002.
* Web interface to "preload" experiment. Sorta like a syntax check
that saves the virt state so it can be visualized, and later swapped
in. Or perhaps this is an experiment create option.
* Change hardwired degree 4 for vrons->rons to more flexible DB
management. Related would be dynamic creation of virtual nodes
instead of hardwired entries in the nodes table, but that's a lot of
work, and might not be worth it.
* Script to remove old log files from the mysql directory. Also remove
old backups from the mysql backup dir. These files are taking up a
huge amount of space, and /usr/testbed/log has been filling up a
* Change exp failed email to report the actual failure mode based on
result of assign_wrapper. Most users have no idea why an experiment
* Fix RPC power controller, which was failing on long strings, as
reported by Mike.
(Done by Rob on May 21).
* Add some kind of host table support to RON nodes so that programs
can figure out IPs. This is going to be a pain.
* Support for protocols other than IP. Mike reported some issues
related to this in email of Fri, 17 May 2002 10:05:41.
* Look at ganglia ( and other cluster
management tools to see if we can leverage something from them,
especially for widearea nodes.
* Add frontend syntax to control (widearea) solver weights. delay,
bandwidth, plr.
* Setup the other RON nodes at MIT.
* Find/Fix mysterious capture deaths.
* Web option to become another user (su). Might be possible after I
clean up the auth stuff in the web page.
* Get the tape library up and running. This involves setting up a
Linux box to run the library, installing the software, etc., etc.
* Bring in a bug tracking system we can use from the web interface.
Need someone to look around for this. I hate GNATS!
Rob mentioned RT (
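
For the "Script to remove old log files" item above, a minimal sketch of
an age-based cleanup follows. It is written in Python purely as an
illustration; the function name remove_old_files, the retention period,
and the dry_run default are assumptions, and the real script would need
the actual mysql log and backup directories filled in.

```python
import os
import time


def remove_old_files(directory, max_age_days, now=None, dry_run=True):
    """Remove regular files in directory whose mtime is older than max_age_days.

    Returns the paths that were removed (or, with dry_run=True, the paths
    that would have been removed). Subdirectories are not descended into.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip directories and other non-regular entries
        if os.path.getmtime(path) < cutoff:
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Defaulting to a dry run makes it easy to check the list of candidates
before letting a nightly job delete anything.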
*** Minor:
* Fix up updown page. Get rid of virtual nodes. Move Sharks to the
bottom, or get rid of them altogether. Mike, what do you think?
* Frisbee work. Mike/Rob reported that Frisbee "rocks" up to about
50-60 nodes, and then goes south in a hurry. I reported a couple of
optimizations in email that we could apply.
* Move log files into experiment directory so that we can retain them
for debugging. Right now they go to /tmp and get deleted when done.
Could also add a web link to view the most recent log file. Related
to current "view in real time" option.
* Add permission check to eventsys_control and install on ops so that
users can start/stop/replay the event stream.
* Change approve/verify stuff so that project leader does not see
email or new users on approval page until they verify themselves.
Would need to change email too.
* Link on web pages to pop up an ssh to a node. Perhaps do this by
usurping the telnet client.
* Macrofy (remove hardwired) Utah Network Testbed string throughout
the system (perl and php).
* Macrofy the signature of the email (currently "Testbed Ops").
* Add periodic account update to RON nodes, for when they are off the
net and do not get the node_update command.
* FAQ entry for lilo:
To access partitions on the disk outside of the C:H:S tuple limit (8.4
GB), you must add 'lba32' to the global options section.
Not a big deal, but requires someone who knows lilo to verify and to
test it.
* Change perl daemons to clean the environment so that email does not
come from the person who ran it with sul!
* Fix "no networks link warning" to deal with remote node links.
* DB consistency checker; to run at night and as part of flest.
* Documentation page reorg. Need some reference material.
* Fix tbcmd test that broke comparing loss rate of 0.000 expected to
0.013 obtained.
* node_reboot doesn't check if nodes exist before trying to ssh reboot
and IPOD them.
* I'm sitting here looking at the "details" page for an experiment and
nowhere obvious on this page does it show the name of the
experiment. If I scroll all the way down to experiment details,
there it is. How about putting it over the vis image or making it
part of the vis image? Somewhere right at the top.
* Copyright notices before we give code.
Pat has a script.
* Save off ntp.drift problem.
* event system. skew and delay. unix domain socket to local proxy.
* Possible Cookie problem:
1) Logged into mini using Opera (5)
2) Without logging out, went to
This got me the message "The document contained no data" from Netscape.
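
For the node_reboot item above, the missing existence check could be as
simple as the sketch below. It is an illustration in Python only:
split_known_nodes is a hypothetical helper, and known_nodes stands in
for whatever lookup against the nodes table the real script would do.

```python
def split_known_nodes(requested, known_nodes):
    """Partition requested node names into (known, unknown), keeping order.

    known_nodes is any container supporting 'in', e.g. a set of node ids
    loaded from the DB before any reboot is attempted.
    """
    known, unknown = [], []
    for node in requested:
        if node in known_nodes:
            known.append(node)
        else:
            unknown.append(node)
    return known, unknown
```

Only the first list would then be handed to the ssh reboot and IPOD
steps, with the second reported back to the user as an error.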
