Commit my big TODO list.

Commit 30fa4cb7 authored 22 years ago by Leigh B. Stoller
--- a/TODO
+++ b/TODO
+*** Major:
+* Fix the entire nalloc/nfree/reloading mechanism. The state control
+  stuff for it, which is scattered around nfree, tmcd, stated, and
+  the reload daemon, needs a complete overhaul. Many races, many
+  opportunities to fail. Mac is thinking about this.
+* DB optimization, as per Dave's email of Sun, 19 May 2002 10:35:54.
+  Good one for Mac, who is good at this DB stuff! Also check cache
+  sizes in the mysql config file.
+  Related: Transient lost connections to DB. Requires finding the last
+  of the queries that do not go through the common interface, and
+  putting in retry code.
+* Event system startup cost. Abhijeet reported that after ISUP, it
+  could take a very long time for events to start. This is because it
+  takes a really long time to process the event stream in event-sched
+  using Ian's original binary tree stuff. I hacked in a fix, but need
+  to look at that algorithm and perhaps change it. Need to decide if
+  insertion needs to be optimized over deletion.
+* Specify maximum running time of a batch experiment.
+  Ron Oldfield requested this, for the case where batches fail and
+  get stuck in the system. An alternative is an NS "at" event to stop
+  the simulation (which we would have to convert into something, not
+  sure what).
+* Move flest to another machine?
+  Related: Change flest to ignore certain tables, like idle stats to
+  reduce DB churning. Could do it as a table of table names.
+* Jail setup on wide area nodes. LBS: send email to Dave about using
+  tunnels and our discussion the other day. 
+* Need to default the OS id version (4.3, 7.1) if we are going to
+  delay reloading, or else people can get old versions of the OS
+  when in the same project (last_reservation). This might be moot
+  depending on what we do wrt reloading when experiments are done.
+* tmcd does not appear to be scaling with the advent of ssl. Rob
+  suggested a combined tmcd command to return the entire node
+  configuration in one message. We would still keep the individual
+  calls, but provide a way to get all the data at once and save on a
+  dozen connections per boot.
+* Complete event system overhaul (per-exp elvind, secure elvind,
+  per-node elvind, distribution of event lists to nodes).
+  Cannot multicast events to multiple agents at a time.
+* Get the program agent working on ron nodes. This is related to (and
+  dependent on) securing the event system.
+* Deal with two ends of a remote link being allocated to the same ron
+  node! Need to catch the situation for now (and error), but
+  eventually make sure it does not happen since setting up a tunnel
+  from a node to itself sounds rather silly.
+* Fix Dummynet crashes as reported by Parveen.
+* Dummynet validation. Error rates from Table 2 in the paper.
+* Switch to ipsec AH tunnels for remote nodes. Faster than user mode
+  ip in UDP.
+* Fix mountd-invalidating-current-mounts problem.
+* Fix the DHCP/TFTP bootstrap path: we talked about how to work around
+  the BIOS DHCP failing, but I also had a couple of cases of corrupt
+  MFSes while trying to do disk loads en masse.
+* Swapping support.
+*** Medium:
+* Front end support for changing delay/bw/plr asymmetrically in
+  events. Currently, we can do queue params, but the basic delay, bw,
+  and plr can only be done symmetrically.
+  Related: Add NS event support for some of the tb- commands. For
+  example, tb-set-lan-simplex-params. This is actually harder, since
+  lans are not directly controllable at the link level anywhere in the
+  system. delay_config chokes on lans at the moment for this reason.
+* Fix up staticroutes to use lan nodes. Actually, I did most of this,
+  but there was some question that Mike needed to answer about it.
+* Change pipe specification on links so that it's clearer. We
+  currently use pipe0,pipe1 in some places, and the actual pipe
+  numbers in other places (the numbers reported by experiment setup).
+  This makes it confusing to control pipes via the event system.
+  Perhaps have the front end generate the pipe numbers so that
+  everyone agrees on them (frontend, backend, tevc, delayagent,
+  etc.). 
+* Web page to control delay nodes (well, links). Would go nicely with
+  Chad's new vis tool that shows you the link characteristics.
+* Scripts to age idle data!
+* Support images with more than one slice (but not the entire disk).
+  At present, people can make use of the 4th slice, but cannot save it
+  with an image, unless they create an entire disk image, and we do
+  not allow mere users to do that. The current disk image stuff is not
+  flexible enough to support arbitrary slice definitions (save slice
+  2-4).
+  Related: Kill deltas!
+* Add MBR initialization to all images, perhaps as a special Frisbee
+  operation, like Mike did for slice 4. This is to prevent problems
+  with people messing up the MBR.
+* Web Interface Problem with unverified users who forgot their passwords.
+  The verification stuff needs some work, as does the password stuff,
+  which retains some of its structure from when we had to deal with
+  frames. This is mostly a cleanup operation that I can do in a day or
+  two.
+  Dave also requested that we get rid of "." from the verification key
+  when it falls at the end, so that it is not confused with a period.
+  Also includes a suggestion from Dave to make verification a web link
+  instead of a form. So, send them a URL that when clicked verifies
+  them. This would be a good user interface addition, and not too hard
+  after the above cleanup.
+* Daily experiment stats report sent by email. To include such things
+  as:
+ 	#expts-created		success/fail	#PCs	Avg#PCs/expt?
+ 	#expts-terminated	""		""	""
+ 	#swapped-in		""		""	""
+ 	#swapped-out		""		""	""
+  See Jay's message to tbops of Fri, 19 Apr 2002.
+* Web interface to "preload" experiment. Sorta like a syntax check
+  that saves the virt state so it can be visualized, and later swapped
+  in. Or perhaps this is an experiment create option.
+* Change hardwired degree 4 for vrons->rons to more flexible DB
+  management. Related would be dynamic creation of virtual nodes
+  instead of hardwired entries in the nodes table, but that's a lot of
+  work, and might not be worth it.
+* Script to remove old log files from the mysql directory. Also remove
+  old backups from the mysql backup dir. These files are taking up a
+  huge amount of space, and /usr/testbed/log has been filling up a
+  lot.
+* Change exp failed email to report the actual failure mode based on
+  result of assign_wrapper. Most users have no idea why an experiment
+  failed.
+* Fix RPC power controller, which was failing on long strings, as
+  reported by Mike.
+  (Done by Rob on May 21).
+* Add some kind of host table support to RON nodes so that programs
+  can figure out IPs. This is going to be a pain.
+* Support for protocols other than IP. Mike reported some issues
+  related to this in email of Fri, 17 May 2002 10:05:41.
+* Look at ganglia (http://ganglia.sourceforge.net) and other cluster
+  management tools to see if we can leverage something from them,
+  especially for widearea nodes.
+* Add frontend syntax to control (widearea) solver weights. delay,
+  bandwidth, plr.
+* Setup the other RON nodes at MIT.
+* Find/Fix mysterious capture deaths.
+* Web option to become another user (su). Might be possible after I
+  clean up the auth stuff in the web page.
+* Get the tape library up and running.  This involves setting up a
+  Linux box to run the library, installing the software, etc., etc.
+* Bring in a bug tracking system we can use from the web interface.
+  Need someone to look around for this. I hate GNATS!
+  Rob mentioned RT (http://www.fsck.com/projects/rt).
+*** Minor:
+* Fix up updown page. Get rid of virtual nodes. Move Sharks to the
+  bottom, or get rid of them altogether. Mike, what do you think?
+* Frisbee work. Mike/Rob reported that Frisbee "rocks" up to about
+  50-60 nodes, and then goes south in a hurry. I reported a couple of
+  optimizations in email that we could apply.
+* Move log files into experiment directory so that we can retain them
+  for debugging. Right now they go to /tmp and get deleted when done.
+  Could also add a web link to view the most recent log file. Related
+  to current "view in real time" option.
+* Add permission check to eventsys_control and install on ops so that
+  users can start/stop/replay the event stream.
+* Change approve/verify stuff so that project leader does not see
+  email or new users on approval page until they verify themselves.
+  Would need to change email too.
+* Link on web pages to pop up an ssh to a node. Perhaps do this by
+  usurping the telnet client.
+* Macrofy (remove hardwired) Utah Network Testbed string throughout
+  the system (perl and php).
+* Macrofy the signature of the email (currently "Testbed Ops").
+* Add periodic account update to RON nodes, for when they are off the
+  net and do not get the node_update command.
+* FAQ entry for lilo:
+  To access partitions on the disk outside of the C:H:S tuple limit (8.4
+  GB), you must add 'lba32' to the global options section.
+  Not a big deal, but requires someone who knows lilo to verify and to
+  test it.
+* Change perl daemons to clean the environment so that email does not
+  come from the person who ran it with sul!
+* Fix "no networks link warning" to deal with remote node links.
+* DB consistency checker; to run at night and as part of flest.
+* Documentation page reorg. Need some reference material.
+* Fix tbcmd test that broke comparing loss rate of 0.000 expected to
+  0.013 obtained.
+* node_reboot doesn't check if nodes exist before trying to ssh reboot
+  and IPOD them.
+* I'm sitting here looking at the "details" page for an experiment and
+  nowhere obvious on this page does it show the name of the
+  experiment.  If I scroll all the way down to experiment details,
+  there it is. How about putting it over the vis image or making it
+  part of the vis image? Somewhere right at the top.
+* Copyright notices before we give code.
+  Pat has a script.
+* Save off ntp.drift problem.
+* event system. skew and delay. unix domain socket to local proxy.
+* Possible Cookie problem:
+  1) Logged into mini using Opera (5)
+  2) Without logging out, went to https://www.mini.emulab.net
+  This got me the message "The document contained no data" from Netscape.