*** Major:

* Fix the entire nalloc/nfree/reloading mechanism. The state control
  for it, which is scattered around nfree, tmcd, stated, and the
  reload daemon, needs a complete overhaul. Many races, many
  opportunities to fail. Mac is thinking about this.

* DB optimization, as per Dave's email of Sun, 19 May 2002 10:35:54.
  Good one for Mac, who is good at this DB stuff! Also check cache
  sizes in the mysql config file.

  Related: Transient lost connections to DB. Requires finding the last
  of the queries that do not go through the common interface, and
  putting in retry code.

  LBS: Thu May 23 - Added some of Dave's suggestions from his email.
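
  Sketch of the retry idea above, in Python for illustration only (the
  real queries live in the perl/php common interface). The names
  TransientDBError, query_with_retry, and run_query are hypothetical
  stand-ins, not existing testbed code. It assumes the driver raises a
  distinguishable error when the connection drops.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a 'lost connection' error from the DB driver."""


def query_with_retry(run_query, sql, retries=3, delay=0.5, sleep=time.sleep):
    """Call run_query(sql); on TransientDBError retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return run_query(sql)
        except TransientDBError:
            if attempt >= retries:
                raise  # out of retries: let the caller see the failure
            attempt += 1
            sleep(delay * attempt)  # linear backoff before the next try
```

  Routing every query through one wrapper like this is what makes the
  retry policy easy to change later, which is why the stray queries
  need to be found first.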

* Event system startup cost. Abhijeet reported that after ISUP, it
  could take a very long time for events to start. This is because it
  takes a really long time to process the event stream in event-sched
  using Ian's original binary tree stuff. I hacked in a fix, but need
  to look at that algorithm and perhaps change. Need to decide if
  insertion needs to be optimized, over deletion.
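
  One option to weigh, sketched in Python for illustration only (this
  is not the event-sched code, and EventQueue is a hypothetical name):
  a binary min-heap keyed on event time. Insert and remove-earliest
  are both O(log n), and the initial event stream can be bulk loaded
  in O(n), so neither operation has to be favored over the other.

```python
import heapq
import itertools


class EventQueue:
    """Time-ordered event queue backed by a binary min-heap."""

    def __init__(self, events=()):
        # The sequence number breaks ties so same-time events stay FIFO
        # and the event payloads themselves are never compared.
        self._seq = itertools.count()
        self._heap = [(when, next(self._seq), ev) for when, ev in events]
        heapq.heapify(self._heap)  # bulk load in O(n)

    def insert(self, when, event):
        heapq.heappush(self._heap, (when, next(self._seq), event))

    def pop_next(self):
        """Remove and return the earliest (when, event) pair."""
        when, _, event = heapq.heappop(self._heap)
        return when, event

    def __len__(self):
        return len(self._heap)
```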

* Specify maximum running time of a batch experiment.
  Ron Oldfield requested this, for the case when batches fail, they
  get stuck in the system. An alternative is an NS "at" event to stop
  the simulation (which we would have to convert into something, not
  sure what).

* Move flest to another machine?
  Related: Change flest to ignore certain tables, like idle stats to
  reduce DB churning. Could do it as a table of table names.

* Jail setup on wide area nodes. LBS: send email to Dave about using
  tunnels and our discussion the other day. 

* Need to default the OS id version (4.3, 7.1) if we are going to
  delay reloading, or else people can get old versions of the OS
  when in the same project (last_reservation). This might be moot
  depending on what we do wrt reloading when experiments are done.

* tmcd does not appear to be scaling with the advent of ssl. Rob
  suggested a combined tmcd command to return the entire node
  configuration in one message. We would still keep the individual
  calls, but provide a way to get all the data at once and save on a
  dozen connections per boot.

* Complete event system overhaul (per-exp elvind, secure elvind,
  per-node elvind, distribution of event lists to nodes).
  Cannot multicast events to multiple agents at a time.

* Get the program agent working on ron nodes. This is related to
  (and dependent on) securing the event system.

* Deal with two ends of a remote link being allocated to the same ron
  node! Need to catch the situation for now (and error), but
  eventually make sure it does not happen since setting up a tunnel
  from a node to itself sounds rather silly.

* Fix Dummynet crashes as reported by Parveen.

* Dummynet validation. Error rates from Table 2 in the paper.

* Switch to ipsec AH tunnels for remote nodes. Faster than user mode
  ip in UDP.

* Fix mountd-invalidating-current-mounts problem.

* Fix the DHCP/TFTP bootstrap path: we talked about how to work around
  the BIOS DHCP failing, but I also had a couple of cases of corrupt
  MFSes while trying to do disk loads en masse.

* Swapping support.

*** Medium:

* Front end support for changing delay/bw/plr asymmetrically in
  events. Currently, we can do queue params, but the basic delay, bw,
  and plr can only be changed symmetrically.

  Related: Add NS event support for some of the tb- commands. For
  example, tb-set-lan-simplex-params. This is actually harder, since
  lans are not directly controllable at the link level anywhere in the
  system. delay_config chokes on lans at the moment for this reason.

* Fix up staticroutes to use lan nodes. Actually, I did most of this,
  but there was some question that Mike needed to answer about it.

* Change pipe specification on links so that it is clearer. We
  currently use pipe0,pipe1 in some places, and the actual pipe
  numbers other places (the numbers reported by experiment setup).
  This makes it confusing to control pipes via the event system.
  Perhaps have the front end generate the pipe numbers so that
  everyone agrees on them (frontend, backend, tevc, delayagent,
  etc.). 

* Web page to control delay nodes (well, links). Would go nicely with
  Chad's new vis tool that shows you the link characteristics.

* Support images with more than one slice (but not the entire disk).
  At present, people can make use of the 4th slice, but cannot save it
  with an image, unless they create an entire disk image, and we do
  not allow mere users to do that. The current disk image stuff is not
  flexible enough to support arbitrary slice definitions (save slice
  2-4).

  Related: Kill deltas!

* Add MBR initialization to all images, perhaps as a special Frisbee
  operation, like Mike did for slice 4. This is to prevent problems
  with people messing up the MBR.

* Dave requested that we drop a trailing '.' from the verification key
  so that it is not confused with a sentence-ending period.

* Daily experiment stats report sent by email. To include such things
  as:

 	#expts-created		success/fail	#PCs	Avg#PCs/expt?
 	#expts-terminated	""		""	""
 	#swapped-in		""		""	""
 	#swapped-out		""		""	""

  See Jay's message to tbops of Fri, 19 Apr 2002.

* Web interface to "preload" experiment. Sorta like a syntax check
  that saves the virt state so it can be visualized, and later swapped
  in. Or perhaps this is an experiment create option.

  LBS: I added this, but it's still an admin option.

* Change hardwired degree 4 for vrons->rons to more flexible DB
  management. Related would be dynamic creation of virtual nodes
  instead of hardwired entries in the nodes table, but that's a lot of
  work, and might not be worth it.

* Script to remove old log files from the mysql directory. Also remove
  old backups from the mysql backup dir. These files are taking up a
  huge amount of space, and /usr/testbed/log has been filling up a
  lot.
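
  A minimal sketch of such a cleanup script, in Python for
  illustration only. prune_old_files is a hypothetical helper; which
  directories and which age limit are safe is an assumption left to
  the caller, and it defaults to a dry run that only lists candidates.

```python
import os
import time


def prune_old_files(directory, max_age_days, now=None, dry_run=True):
    """List (and, unless dry_run, delete) regular files older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit as an mtime threshold
    removed = []
    for entry in os.scandir(directory):
        # Skip subdirectories and symlinks; only plain files are candidates.
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            removed.append(entry.path)
            if not dry_run:
                os.remove(entry.path)
    return sorted(removed)
```

  Run it once with dry_run=True from cron and mail the list to
  whoever owns the directory before turning deletion on.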

* Change exp failed email to report the actual failure mode based on
  result of assign_wrapper. Most users have no idea why an experiment
  failed.

* Add some kind of host table support to RON nodes so that programs
  can figure out IPs. This is going to be a pain.

* Support for protocols other than IP. Mike reported some issues
  related to this in email of Fri, 17 May 2002 10:05:41.

* Look at ganglia (http://ganglia.sourceforge.net) and other cluster
  management tools to see if we can leverage something from them,
  especially for widearea nodes.

  LBS: I did this. See message to tbops. 

* Add frontend syntax to control (widearea) solver weights. delay,
  bandwidth, plr.

  LBS: I did this, but Jay wants normalized numbers 

* Setup the other RON nodes at MIT.

* Find/Fix mysterious capture deaths.

* Web option to become another user (su). Might be possible after I
  clean up the auth stuff in the web page.

* Bring in a bug tracking system we can use from the web interface.
  Need someone to look around for this. I hate GNATS!
  Rob mentioned RT (http://www.fsck.com/projects/rt).

* CDROM installation of nodes.

* Retry/reliability to tmcd from ron nodes.

  LBS: I have been working on this.

*** Minor:

* Frisbee work. Mike/Rob reported that Frisbee "rocks" up to about
  50-60 nodes, and then goes south in a hurry. I reported a couple of
  optimizations in email that we could apply.

* Add support for ssh protocol 2 rsa/dsa keys. Requires minor changes
  to mkacct, and the three web pages that parse the keys.

* Move log files into experiment directory so that we can retain them
  for debugging. Right now they go to /tmp and get deleted when done.
  Could also add a web link to view the most recent log file. Related
  to current "view in real time" option.

* Link on web pages to pop up an ssh to a node. Perhaps do this by
  usurping the telnet client.

* Macrofy (remove hardwired) Utah Network Testbed string throughout
  the system (perl and php).

* Macrofy the signature of the email (currently "Testbed Ops").

* FAQ entry for lilo:
  To access partitions on the disk outside of the C:H:S tuple limit (8.4
  GB), you must add 'lba32' to the global options section.

  Not a big deal, but requires someone who knows lilo to verify and to
  test it.

* Change perl daemons to clean the environment so that email does not
  come from the person who ran it with sul!

* Fix "no networks link warning" to deal with remote node links.

* DB consistency checker; to run at night and as part of flest.

* Documentation page reorg. Need some reference material.

* Fix tbcmd test that broke comparing loss rate of 0.000 expected to
  0.013 obtained.
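
  If the failure turns out to be a too-strict comparison rather than
  real packet loss, an absolute tolerance is one possible fix. A
  relative tolerance cannot work here, because any nonzero result is
  infinitely far from an expected 0.000. Python sketch for
  illustration only; loss_rate_ok and the 0.02 default are
  hypothetical, not values taken from the test suite.

```python
def loss_rate_ok(expected, obtained, abs_tol=0.02):
    """True if the measured loss rate is within abs_tol of the expected rate."""
    return abs(obtained - expected) <= abs_tol
```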

* node_reboot doesn't check if nodes exist before trying to ssh reboot
  and IPOD them.

* I'm sitting here looking at the "details" page for an experiment and
  nowhere obvious on this page does it show the name of the
  experiment.  If I scroll all the way down to experiment details,
  there it is. How about putting it over the vis image or making it
  part of the vis image? Somewhere right at the top.

* Copyright notices before we give code.
  Pat has a script.

* Save off ntp.drift problem.

* event system. skew and delay. unix domain socket to local proxy.

* When you're only a part of one project, could that project be the default
  value for the "choose a project" dropdown in "begin an experiment"?

* Add cleanup error handling (send email) in tb scripts.

* Sort osids in newimage pages.

* For example, boot up complains about no rc.route script.

* allow user to specify OSIDs for their delay nodes. Not entirely sure
  how, since delays are chosen late in the game, but at the moment
  it's difficult for people to customize delay nodes.

* Fix up hostnames generation from tmcd for lans. Currently, if you
  have a link and a lan to the same node, you get two aliases.

* Add web interface for generating simple hardwired topologies. ie:
  "Just give me some nodes in a lan."

> From: Jay Lepreau <lepreau@fast.cs.utah.edu>
> To: stoller@fast.cs.utah.edu
> Subject: For the 'todo' list -- project and user "archiving"
> Date: Tue, 11 Jun 2002 20:49:44 -0600 (MDT)
> 
> I don't want to delete projects and probably not users (typically).
> I want to "retire" them, or move them to "alumni"/archive status.  I
> want to keep them in the DB for analysis and statistics reasons, but
> not keep them around forever as active because they clutter things
> up.
> 
> Also, the users might be reactivated if they start a new project.
> 
> These take a little thought because of name space issues, at least.
> Maybe more issues (eg how does an inactive user get reactivated
> unless their password is still valid?).
> 
> Anyway, just something for the list, and keep this message in the
> details part.

This is some not well thought out RON stuff:
* Add startup command for RON nodes. How to stop them? Jail?
* Add Jail.
* Add node_reboot of ron nodes.
* Add retry in wanassign.