From 30fa4cb7881fe08f37717f4ad047f4cf263ab405 Mon Sep 17 00:00:00 2001
From: "Leigh B. Stoller" <stoller@flux.utah.edu>
Date: Wed, 22 May 2002 22:13:30 +0000
Subject: [PATCH] Commit my big TODO list.

---
 TODO | 259 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 259 insertions(+)
 create mode 100644 TODO

diff --git a/TODO b/TODO
new file mode 100644
index 0000000000..bb92f3950c
--- /dev/null
+++ b/TODO
@@ -0,0 +1,259 @@
+*** Major:
+
+* Fix the entire nalloc/nfree/reloading mechanism. The state
+  control stuff for it, which is scattered around nfree, tmcd, stated,
+  and the reload daemon, needs a complete overhaul. Many races, many
+  opportunities to fail. Mac is thinking about this.
+
+* DB optimization, as per Dave's email of Sun, 19 May 2002 10:35:54.
+  Good one for Mac, who is good at this DB stuff! Also check cache
+  sizes in the mysql config file.
+
+  Related: Transient lost connections to DB. Requires finding the last of
+  the queries that do not go through the common interface, and putting
+  in retry code.
+
+* Event system startup cost. Abhijeet reported that after ISUP, it
+  could take a very long time for events to start. This is because it
+  takes a really long time to process the event stream in event-sched
+  using Ian's original binary tree stuff. I hacked in a fix, but need
+  to look at that algorithm and perhaps change it. Need to decide if
+  insertion needs to be optimized, over deletion.
+
+* Specify maximum running time of a batch experiment.
+  Ron Oldfield requested this, for the case when batches fail, they
+  get stuck in the system. An alternative is an NS "at" event to stop
+  the simulation (which we would have to convert into something, not
+  sure what).
+
+* Move flest to another machine?
+  Related: Change flest to ignore certain tables, like idle stats to
+  reduce DB churning. Could do it as a table of table names.
+
+* Jail setup on wide area nodes. LBS: send email to Dave about using
+  tunnels and our discussion the other day.
+
+* Need to default the OS id version (4.3, 7.1) if we are going to
+  delay reloading, or else people can get old versions of the OS
+  when in the same project (last_reservation). This might be moot
+  depending on what we do wrt reloading when experiments are done.
+
+* tmcd does not appear to be scaling with the advent of ssl. Rob
+  suggested a combined tmcd command to return the entire node
+  configuration in one message. We would still keep the individual
+  calls, but provide a way to get all the data at once and save on a
+  dozen connections per boot.
+
+* Complete event system overhaul (per-exp elvind, secure elvind,
+  per-node elvind, distribution of event lists to nodes).
+  Cannot multicast events to multiple agents at a time.
+
+* Get the program agent working on ron nodes. This is related (and
+  dependent) on securing the event system.
+
+* Deal with two ends of a remote link being allocated to the same ron
+  node! Need to catch the situation for now (and error), but
+  eventually make sure it does not happen since setting up a tunnel
+  from a node to itself sounds rather silly.
+
+* Fix Dummynet crashes as reported by Parveen.
+
+* Dummynet validation. Error rates from Table 2 in the paper.
+
+* Switch to ipsec AH tunnels for remote nodes. Faster than user mode
+  ip in UDP.
+
+* Fix mountd-invalidating-current-mounts problem.
+
+* Fix the DHCP/TFTP bootstrap path: we talked about how to work around
+  the BIOS DHCP failing, but I also had a couple of cases of corrupt
+  MFSes while trying to do disk loads en masse.
+
+* Swapping support.
+
+*** Medium:
+
+* Front end support for changing delay/bw/plr asymmetrically in
+  events. Currently, we can do queue params, but the basic delay, bw,
+  and plr can only be done symmetrically.
+
+  Related: Add NS event support for some of the tb- commands. For
+  example, tb-set-lan-simplex-params. This is actually harder, since
+  lans are not directly controllable at the link level anywhere in the
+  system. delay_config chokes on lans at the moment for this reason.
+
+* Fix up staticroutes to use lan nodes. Actually, I did most of this,
+  but there was some question that Mike needed to answer about it.
+
+* Change pipe specification on links so that it's more clear. We
+  currently use pipe0,pipe1 in some places, and the actual pipe
+  numbers other places (the numbers reported by experiment setup).
+  This makes it confusing to control pipes via the event system.
+  Perhaps have the front end generate the pipe numbers so that
+  everyone agrees on them (frontend, backend, tevc, delayagent,
+  etc.).
+
+* Web page to control delay nodes (well, links). Would go nicely with
+  Chad's new vis tool that shows you the link characteristics.
+
+* Scripts to age idle data!
+
+* Support images with more than one slice (but not the entire disk).
+  At present, people can make use of the 4th slice, but cannot save it
+  with an image, unless they create an entire disk image, and we do
+  not allow mere users to do that. The current disk image stuff is not
+  flexible enough to support arbitrary slice definitions (save slice
+  2-4).
+
+  Related: Kill deltas!
+
+* Add MBR initialization to all images, perhaps as a special Frisbee
+  operation, like Mike did for slice 4. This is to prevent problems
+  with people messing up the MBR.
+
+* Web Interface Problem with unverified users who forgot their passwords.
+  The verification stuff needs some work, as does the password stuff,
+  which retains some of its structure from when we had to deal with
+  frames. This is mostly a cleanup operation that I can do in a day or
+  two.
+
+  Dave also requested that we get rid of . from the verification key
+  when it falls at the end so as not to be confused with period.
+
+  Also includes suggestion from Dave to make verification a web link
+  instead of form. So, send them a URL that when clicked verifies
+  them. This would be a good user interface addition, and not too hard
+  after the above cleanup.
+
+* Daily experiment stats report sent by email. To include such things
+  as:
+
+  #expts-created     success/fail  #PCs  Avg#PCs/expt?
+  #expts-terminated  ""            ""    ""
+  #swapped-in        ""            ""    ""
+  #swapped-out       ""            ""    ""
+
+  See Jay's message to tbops of Fri, 19 Apr 2002.
+
+* Web interface to "preload" experiment. Sorta like a syntax check
+  that saves the virt state so it can be visualized, and later swapped
+  in. Or perhaps this is an experiment create option.
+
+* Change hardwired degree 4 for vrons->rons to more flexible DB
+  management. Related would be dynamic creation of virtual nodes
+  instead of hardwired entries in the nodes table, but that's a lot of
+  work, and might not be worth it.
+
+* Script to remove old log files from the mysql directory. Also remove
+  old backups from the mysql backup dir. These files are taking up a
+  huge amount of space, and /usr/testbed/log has been filling up a
+  lot.
+
+* Change exp failed email to report the actual failure mode based on
+  result of assign_wrapper. Most users have no idea why an experiment
+  failed.
+
+* Fix RPC power controller, which was failing on long strings, as
+  reported by Mike.
+  (Done by Rob on May 21).
+
+* Add some kind of host table support to RON nodes so that programs
+  can figure out IPs. This is going to be a pain.
+
+* Support for protocols other than IP. Mike reported some issues
+  related to this in email of Fri, 17 May 2002 10:05:41.
+
+* Look at ganglia (http://ganglia.sourceforge.net) and other cluster
+  management tools to see if we can leverage something from them,
+  especially for widearea nodes.
+
+* Add frontend syntax to control (widearea) solver weights. delay,
+  bandwidth, plr.
+
+* Set up the other RON nodes at MIT.
+
+* Find/Fix mysterious capture deaths.
+
+* Web option to become another user (su). Might be possible after I
+  clean up the auth stuff in the web page.
+
+* Get the tape library up and running. This involves setting up a
+  Linux box to run the library, installing the software, etc., etc.
+
+* Bring in a bug tracking system we can use from the web interface.
+  Need someone to look around for this. I hate GNATS!
+  Rob mentioned RT (http://www.fsck.com/projects/rt).
+
+*** Minor:
+
+* Fix up updown page. Get rid of virtual nodes. Move Sharks to the
+  bottom, or get rid of them altogether. Mike, what do you think?
+
+* Frisbee work. Mike/Rob reported that Frisbee "rocks" up to about
+  50-60 nodes, and then goes south in a hurry. I reported a couple of
+  optimizations in email that we could apply.
+
+* Move log files into experiment directory so that we can retain them
+  for debugging. Right now they go to /tmp and get deleted when done.
+  Could also add a web link to view the most recent log file. Related
+  to current "view in real time" option.
+
+* Add permission check to eventsys_control and install on ops so that
+  users can start/stop/replay the event stream.
+
+* Change approve/verify stuff so that project leader does not see
+  email or new users on approval page until they verify themselves.
+  Would need to change email too.
+
+* Link on web pages to pop up an ssh to a node. Perhaps do this by
+  usurping the telnet client.
+
+* Macrofy (remove hardwired) Utah Network Testbed string throughout
+  the system (perl and php).
+
+* Macrofy the signature of the email (currently "Testbed Ops").
+
+* Add periodic account update to RON nodes, for when they are off the
+  net and do not get the node_update command.
+
+* FAQ entry for lilo:
+  To access partitions on the disk outside of the C:H:S tuple limit (8.4
+  GB), you must add 'lba32' to the global options section.
+
+  Not a big deal, but requires someone who knows lilo to verify and to
+  test it.
+
+* Change perl daemons to clean the environment so that email does not
+  come from the person who ran it with sul!
+
+* Fix "no networks link warning" to deal with remote node links.
+
+* DB consistency checker; to run at night and as part of flest.
+
+* Documentation page reorg. Need some reference material.
+
+* Fix tbcmd test that broke comparing loss rate of 0.000 expected to
+  0.013 obtained.
+
+* node_reboot doesn't check if nodes exist before trying to ssh reboot
+  and IPOD them.
+
+* I'm sitting here looking at the "details" page for an experiment and
+  nowhere obvious on this page does it show the name of the
+  experiment. If I scroll all the way down to experiment details,
+  there it is. How about putting it over the vis image or making it
+  part of the vis image? Somewhere right at the top.
+
+* Copyright notices before we give code.
+  Pat has a script.
+
+* Save off ntp.drift problem.
+
+* event system. skew and delay. unix domain socket to local proxy.
+
+* Possible Cookie problem:
+
+  1) Logged into mini using Opera (5)
+  2) Without logging out, went to https://www.mini.emulab.net
+  This got me the message "The document contained no data" from Netscape.
+
-- 
GitLab