TODO 19.8 KB
Newer Older
1 2
[This file is not kept entirely up to date.]

Leigh Stoller's avatar
Leigh Stoller committed
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
*  Fix capserver to require reserved port number. It was always
   supposed to be that (its why I picked port 855) but I forgot to
   actually finish it! 

*  Look at using CISCO NAM modules.

*  Beef up ntpstart to deal with timesys version.

*  Add cron job to poll db for unverified users.

*  Fix length of osname and osid to equal imageid and imagename. The
   EZ image page assumes they are the same, and they aren't. They should
   be. 

*  Fix /opt and software install inside a jail. Perhaps point /opt to
   /local (outside the jail).

*  Allow bw=0 in delay_config for unlimited bw.

*  Fix staticroutes for 2000x2000 topos! Too many giant hashes.

*  Eric wants to be able to specify units of `rate' in events to trafgen.

*  Get experiment restart released, also with batch experiments. 

*  Tiptunnel build environment for windows.

*  Analyze mysqld usage wuth swapping in big vnode experiment (using 50% CPU
   as reported by Jay).

*  Upgrade SFS to version 7. Try with rexd only in jails. Work with DM?
   Use DNS SRV records for port?

*  Windows version of ssh-mime.pl and instructions for installing.

*  Fix tbacl files with IE (Jay says they do not work).

*  Fix sshdport in nodes table to default to 22 instead of 11000, even
   though the port is not used unless its a jailed node without its
   own IP address.

*  From: Mike Hibler <mike@flux.utah.edu>
   To: testbed-ops@flux.utah.edu
   Subject: "hardware off" experiment
   Date: Fri, 17 Oct 2003 09:35:34 -0600 (MDT)
    
   Chad Lake sent some mail today about high-temp alarms in the machine room
   and how they need to adjust the threshold that triggers an alert to the
   campus police.
    
   Anyway, that got me thinking about ways that we can help out if temps
   start to climb and we need to start shutting things down.  While I know
   that our heat generation pales compared to the SCI cluster, and that
   shutting it down is a prerequisite to lowering the temps in that room,
   it still wouldn't hurt to at least put this on the todo list.
    
   One easy, not too invasive way would be to shutdown all free nodes.
   At the same time, we would want to remove them from the free list so that
   experiment setup didn't power cycle them and have them wind up on again.
    
   With a couple of options it seems we could do this.  At some level of
   allocation (nalloc?) we have an option to allocate all free nodes to an
   experiment.  Then we add an option to power to work on all nodes in an
   experiment (currently this is only an option in node_reboot).
   Then we could do something like:
    
   	nalloc -A emulab-ops hwoff
    	power off emulab-ops/hwoff
    
   Another useful, but more invasive, alternative is to shutdown all
   experimental nodes.  We could do this by adding an option to power
   to operate on all nodes, for some appropriate definition of "all":
    
    	power -A off
    
   which of course would prompt out the yazoo to make sure you really
   mean it.  Or maybe more explicitly, a "type" option which tells power
   to interpret its arguments as types of machines:
    
    	power -T off pc600 pc850 pc1500 pc2000
    
    
   This is mostly so that, if I get a call at 3am sometime about how
   temps-are-rising-can-you-shut-stuff-down-quick, I can login and do something
   quick without having to try to figure out a query to list all free nodes
   or how to construct a list of all nodes and then do it.  Ideally, we would
   have some command that someone (not us) in the machine room could run to
   do it.  Maybe they login to snake's console as "killswitch" and it takes
   everything down.


*  Fix problem Eric reported with instance limited images and swapmod:

    	Eric> *** Update aborted; old state restored.
    
    This appears to have happened because I wanted to move the instance-limited
    image UAV-TSL31-NET from one node of this experiment to another.  This
    experiment has only one instance of UAV-TSL31-NET; the other is currently
    allocated by another experiment, so there are no spares lying around.
    
    	Eric> *** /usr/testbed/libexec/assign_wrapper:
    	Eric>     Cannot load pces-UAV-TSL31-NET on one or more nodes.
    	Eric>     Too many nodes are already running this OSID!
    	Eric> *** Failed (67) to map to reality.
    


Leigh Stoller's avatar
Leigh Stoller committed
110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
*  From: Robert P Ricci <ricci@cs.utah.edu>
   Subject: Re: Disabled Web
   Date: Fri, 8 Aug 2003 11:12:35 -0600
    
   > Seems it might be good to have a way to just disable swap in/out/mod
   > instead of the whole interface, sine most of the time, that's really all
   > we need to disable.
    
   Acutally, it occurs to me that we should have some way to lock tbswap
   from running at all - for example, if we lock the web interface 'cause
   things are broken, the batch daemon and/or idleswap can go behind our
   backs and do potentially harmful things.
    
*  From: Jay Lepreau <lepreau@flux.utah.edu>
   Subject: Re: "Queued" experiments
   Date: Tue, 05 Aug 2003 19:41:54 MDT
    
   Seems to me it's mostly a UI matter, and reporting from assign.
   As Mac suggests, we would probably just change the "batch" checkbox to, say
     [ ] Queue and retry if not enough resources.
    
   And possibly change the idle-swap time to something low, as I suggested.
    
   However, can assign_wrapper be sure it's simply not enough resources,
   and that they haven't (say) typoed (eg, asking for 1000Mb), or done
   something more complex wrong?
    
   If there is no penalty for doing this auto-retrying, everyone will do it.
   For example, currently we only allow one concurrent batch job per user.
    
*  Fix "auto" problem on CDROM, and allow override of DHCP'ed hostname.
   Add USB support. (keyboard, mass storage, mouse, hci, ohci, generic) 

143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
*  Fix tmcd/fork problem (events out of order).

*  Supply form args to beginexp page, and allow inline NS files.

*  From: Jay Lepreau <lepreau@flux.utah.edu>
   Subject: expt phase transition durations
   Date: Fri, 18 Jul 2003 14:03:48 MDT

   Some sub-times would be really useful too,
   like how long assign took to run.

*  Feed back jail changes to FreeBSD.org.

*  I think these swap/modify errors should include the ns file.
   Well, certainly if it's a parsing error.

159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
*  From: Jay Lepreau <lepreau@cs.utah.edu>
   Date: Thu, 10 Jul 2003 10:56:37 MDT

   1. I think most of us are in agreement that in the big visualizer the
   host/lan icons should be replaced by plain boxes and circles.  I
   think Chad might have even been convinced before he left, after
   seeing how nice the little ones looked.

   2. We definitely need some smaller scale options for the big topos
   arising from vnodes, the ones that Mike was playing with.  There
   should be some heuristic to decide what the default should be when the
   big viz page is first displayed.  Ie,  doing the really big one
   first (100%) is often just a waste of time and annoys the user
   while it loads.

   3. Non-tweaks (ie, significant work, but maybe not hard): Really big
   topos need something better.  The Tamara Munzner/then-CAIDA tool
   that lets you zoom in on an area would have been just the thing for
   that really dense/overlapping one that was out there a week ago.
   To support that we need some kind of export and viewer
   configuration capability.

   (This fits into my vision for a one-stop shop for network expts, where
   people will use Elab even for big pure simuations, because it will be
   linked in to great computational resources and have all sorts of handy
   tools.)

*  From: Eric Eide <eeide@cs.utah.edu>
   Date: Thu, 10 Jul 2003 11:12:19 -0600

   Something that I would like to be able to do is to run a batch
   experiment that builds a disk image, and and at the end of the
   batch script, have the disk image dumped.  Any clever ideas about
   how one might go about doing this?

*  Fix /etc/named.conf problem on RON image and build new image.

*  Run neato and the visualization code on ops cause of CPU load on
   big experiments.

*  Change ISO version number.

*  Adjust web page width to actual display width the user is using. 

*  From: Mac Newbold <newbold@flux.utah.edu>
   Date: Tue, 8 Jul 2003 10:57:22 -0600 (MDT)
    
   >> mangled because the regular output goes to stdout and the error
   >> messages go to stderr. I've fixed that problem.
   >
   >Well, we list them in the up/down and reservation page, so people probably
   >see the faster ID and try to get them. Maybe they should not be public
   >(add a "public" bit to the nodes table?).
    
   If you're going to add one, I'd probably vote for doing it per-type
   instead of per-node. So in node_types you could set what things are public
   and what aren't. Then we also wouldn't need holding expts for nodes of odd
   types, and it would be okay to have some of them sitting around free.
    
*  From: Timothy Stack <stack@cs.utah.edu>
   Subject: shell env for startup scripts
   Date: Mon, 7 Jul 2003 10:43:38 -0600 (MDT)
    
   The shell environment that the tb-set-node-startup script uses can be
   quite a bit different from the one you get when logged in remotely.  This
   can make things kinda complicated when trying to automate things, since
   you don't quite know what you gonna get.  I'm particularly interested in
   the ACE_ROOT/TAO_ROOT variables that are set by the scripts in
   /etc/profile.d/.
    
* The routes table is a problem. Currently 129000 rows, averaging 86 bytes,
  for a total of almost 12megs. That on main, with no virtnodes (1000 node
  experiments!). The representation is also grossly inefficient (like,
  pid/eid in every one of those rows!), and using text representation of the
  IP addresses (src,dst,nexthop,mask). We could probably cut that back by
  60%, but its still too much stuff to keep around for every swapped out
  experiment. We should run the route setup at swapin, and if the user
  wants to see them when its swapped out, we can run the route
  calculator on the fly and dump the output to the web page.

* Add routes/events option to web page, since those are not in the
  email any longer.

Leigh Stoller's avatar
Leigh Stoller committed
242 243 244 245 246
* Other items. We better start saving the thumbnails in the experiment
  archive directory too (expinfo). We should also be re-rendering after a
  modify. We should also save the XML representation to avoid having to
  reparse old NS files, although there is some versioning issues with this. 

247 248 249 250 251
* Auto discovery of new nodes.

  * Rob is working on this.

* Viz can't handle multiple links between nodes.
Leigh Stoller's avatar
Leigh Stoller committed
252 253 254 255 256

* Change netbuild to speak XML (in both directions).

  Related: Might require addition of a DTD or Schema to our XML format.

257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
* From: Jay Lepreau <lepreau@cs.utah.edu>
  Subject: Snapping an image - physnode menu item
  Date: Tue, 01 Jul 2003 17:39:36 MDT
  
  It's always seemed logical to me to have such a menu item on the
  (physical) "Node Information" page.
  In fact, several times I've looked for it there,
  only to remember it doesn't exist, and I need to go to
  "ImageIdS and OSIDs" (which is not an "action" item,
  and is not associated with an experiment).

* Email archive for project email lists. 

* Add NS file history, and links to view/create from old NS files.

	I think this might be pretty cool to do, and its not that
	hard. Change the nsfiles table to index by the unique
	experiment idx, instead of pid/eid, and add a copy of the
	original description field to it. Then link it off the users
	"Show History" link, which has his experiment log. We could
	load up the old NS files, but I think thats enough of a pain
	that I would not nother too.

	From Dave:

	> key by the md5 hash of the ns file and store it by that:
	> 
	> table 1:
	> exp pid date md5_of_file
	> 
	> table 2:
	> md5_of_file  contents_of_file
	> 
	> That has the advantage of keeping the commonly accessed 
	> table (table 1) small and statically sized, whereas the
	> slow and dynamic TEXT records are then in table 2.

	Could also hang it off the experiment_resources record, which
	would address the swapmod issue, since a new resource record
	is created at each swapmod.

* Secure the event system so that experiments cannot spoof/snoop each other.

* Fix case sensitivity of event system (and agents, scheduler).

* Fix idleswap auditing. 

* Aren't users required to fill in their first and last names?
  Also note the state.

Leigh Stoller's avatar
Leigh Stoller committed
307 308
*** Major:

309 310 311
* Fix event scheduler for experiment modify so that it can add new
  events from the current time index, instead of from time 0. 

312 313
* Break up emulab into smaller components (for example, split of
  account and group stuff so its independent.
Leigh Stoller's avatar
Leigh Stoller committed
314

Leigh Stoller's avatar
Leigh Stoller committed
315 316 317 318 319 320 321 322 323 324 325 326
* Fix the entire nalloc/nfree/reloading mechansism and the state
  control stuff for it that is scattered around nfree, tmcd, stated,
  and the reload daemon needs a complete overhaul. Many races, many
  oppotunities to fail. Mac is thinking about this.

* Event system startup cost. Abhijeet reported that after ISUP, it
  could take a very long time for events to start. This is because it
  takes a really long time to process the event stream in event-sched
  using Ian's original binary tree stuff. I hacked in a fix, but need
  to look at that algorithm and perhaps change. Need to decide if
  insertion needs to be optimized, over deletion.

327
* Continuing work on jails for both local and remote nodes.
Leigh Stoller's avatar
Leigh Stoller committed
328 329 330 331 332 333 334 335 336 337 338 339

* Need to default the OS id version (4.3, 7.1) if we are going to
  delay reloading, or else people can get old versions of the OS
  when in the same project (last_reservation). This might be moot
  depending on what we do wrt reloading when experiments are done.

* tmcd does not appear to be scaling with the advent of ssl. Rob
  suggested a combined tmcd command to return the entire node
  configuration in one message. We would still keep the individual
  calls, but provide a way to get all the data at once and save on a
  dozen connections per boot.

340 341 342
  LBS: local nodes using tmcc-nossl since we get security via the
  switch. Wideare nodes *do* use ssl.

Leigh Stoller's avatar
Leigh Stoller committed
343 344 345 346 347
* Complete event system overhaul (per-exp elvind, secure elvind,
  per-node elvind, distribution of event lists to nodes).
  Cannot multicast events to multiple agents at a time.

* Get the program agent working on ron nodes. This is related (and
348 349
  dependent) on securing the event system since we do not want anyone
  to be able to send ab event to the program agent from some random node!
Leigh Stoller's avatar
Leigh Stoller committed
350 351 352 353 354 355

* Deal with two ends of a remote link being allocated to the same ron
  node! Need to catch the situation for now (and error), but
  eventually make sure it does not happen since setting up a tunnel
  from a node to itself sound rather silly.

356 357
* Investigate other types of tunnels for wideare nodes. Perhaps ipsec
  AH tunnels.
Leigh Stoller's avatar
Leigh Stoller committed
358 359 360

* Fix mountd-invalidating-current-mounts problem.

361 362 363
* Automated swapping (with disk saving) support so that we can swap
  out experiments and save per-node images for people. Requires a lot
  of disk space.
Leigh Stoller's avatar
Leigh Stoller committed
364 365 366

*** Medium:

367 368
* Change batch system to handle limited-use disk images like timesys.

369
* CDROM changes:
Leigh Stoller's avatar
Leigh Stoller committed
370
    1) Add per host certificates.
371

372 373 374 375
* Switch rpm/tar file to non-nfs solution. Perhaps a ftpd like daemon
  which does some of the same checks that tmcd does. Or maybe a tmcd
  variant that does nothing but serve up files according. Maybe it
  does not need to be separate, but seems like putting this directly
Leigh Stoller's avatar
Leigh Stoller committed
376 377 378 379
  into tmcd is a bad idea. Maybe not.

  LBS: This is now done for tar files on widearea nodes, and in jails.
       Needs to be done with RPMs too.
380 381 382

* Clean up osid/imageid mess. 

383 384
* Need to add a "kill runnin frisbee" function so that creating new
  images does get frisbee messed up.
Leigh Stoller's avatar
Leigh Stoller committed
385

Leigh Stoller's avatar
Leigh Stoller committed
386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425
* Front end support for changing delay/bw/plr asymmetrically in
  events. Currently, we can do queue params, but the basic delay, bw,
  and plr and can be only be done symmetrically.

  Related: Add NS event support for some of the tb- commands. For
  example, tb-set-lan-simplex-params. This is actually harder, since
  lans are not directly controllable at the link level anywhere in the
  system. delay_config chokes on lans at the moment for this reason.

* Support images with more than once slice (but not the entire disk).
  At present, people can make use of the 4th slice, but cannot save it
  with an image, unless they create an entire disk image, and we do
  not allow mere users to do that. The current disk image stuff is not
  flexible enough to support arbitrary slice definitions (save slice
  2-4).

  Related: Kill deltas!

* Add MBR initialization to all images, perhaps as a special Frisbee
  operation, like Mike did for slice 4. This is to prevent problems
  with people messing up the MBR.

* Change hardwired degree 4 for vrons->rons to more flexible DB
  management. Related would be dynamic creation of virtual nodes
  instead of hardwired entries in the nodes table, but thats a lot of
  work, and might not be worth it.

* Script to remove old log files from the mysql directory. Also remove
  old backups from the mysql backup dir. These files are taking up a
  huge amount of space, and /usr/testbed/log has been filling up a
  lot.

* Add some kind of host table support to RON nodes so that programs
  can figure out IPs. This is going to be a pain.

* Support for protocols other than IP. Mike reported some issues
  related to this in email of Fri, 17 May 2002 10:05:41.

* Bring in a bug tracking system we can use from the web interface.
  Need someone to look around for this. I hate GNATS!
Leigh Stoller's avatar
Leigh Stoller committed
426 427
  Rob mentioned RT (http://www.fsck.com/projects/rt). Eric mentioned
  Bugzilla and Jitterbug
Leigh Stoller's avatar
Leigh Stoller committed
428

429 430 431
* Noswap bit to prevent users from swapping special experiments that
  have things like SPAN turned on.

Leigh Stoller's avatar
Leigh Stoller committed
432
*** Minor:
Leigh Stoller's avatar
Leigh Stoller committed
433

434 435 436 437 438 439
* When I syntax check an ns file, and it fails, it would be handy to have a
  one-click way to check the same file again.  (My Tcl isn't so good.)

* Change logs for group experiments from /proj/<project>/logs/' to
  `/groups/<proj>/<group>/logs/'.

440 441
* Clean up ISADMIN() and ISADMININISTRATOR() calls in php pages.

Leigh Stoller's avatar
Leigh Stoller committed
442 443 444 445 446 447
* > Mapping RHL-STD on pc92 to emulab-ops-RHL73-STD.
  > *** Tarfile /usr/local for node pc96 does not exist!

  Can't we check the validity of these paths during the parse phase
  and fail a lot sooner?

Leigh Stoller's avatar
Leigh Stoller committed
448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466
* Macrofy the signature of the email (currently "Testbed Ops").

* FAQ entry for lilo:
  To access partitions on the disk outside of the C:H:S tuple limit (8.4
  GB), you must add 'lba32' to the global options section.

  Not a big deal, but requires someone who knows lilo to verify and to
  test it.

* Fix "no networks link warning" to deal with remote node links.

* DB consistency checker; to run at night and as part of flest.

* I'm sitting here looking at the "details" page for an experiment and
  no where obvious on this page does it show the name of the
  experiment.  If I scroll all the way down to experiment details,
  there it is. How about putting it over the vis image or making it
  part of the vis image? Somewhere right at the top.

Leigh Stoller's avatar
Leigh Stoller committed
467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
* allow user to specify OSIDs for their delay nodes. Not entirely sure
  how, since delays are chosen late in the game, but at the moment its
  difficuly for people to customize delay nodes.

* Fix up hostnames generation from tmcd for lans. Currently, if you
  have a link and a lan to the same node, you get two aliases.

* Add web interface for generating simple hardwired topologies. ie:
  "Just give me some nodes in a lan."

> From: Jay Lepreau <lepreau@fast.cs.utah.edu>
> To: stoller@fast.cs.utah.edu
> Subject: For the 'todo' list -- project and user "archiving"
> Date: Tue, 11 Jun 2002 20:49:44 -0600 (MDT)
> 
> I don't want to delete projects and probably not users (typically).
> I want to "retire" them, or move them to "alumni"/archive status.  I
> want to keep them in the DB for analysis and statistics reasons, but
> not keep them around forever as active because they clutter things
> up.
> 
> Also, the users might be reactivated if they start a new project.
> 
> These take a little thought because of name space issues, at least.
> Maybe more issues (eg how does an inactive user get reactivated
> unless their password is still valid?).
> 
> ANyway, just something for the list, and keep this message in the
> details part.