TODO.plab 2.25 KB
Jay Lepreau committed
For Elab interface to Plab

-Need more randomness in plab node selection; otherwise it keeps allocating
 the bad nodes that have low loadavgs.  Mike/Rob's idea is to add a little
 to a node's load when it fails or is allocated, but that gets wiped out
 every 5 mins: not good enough.
 THIS IS A BIG PRACTICAL PROBLEM!  When "fatal" is set, the retry gets
 the same physnodes every time, and therefore never succeeds.  Growing
 an expt to its full size via modify often fails because of this, too.
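
 One possible shape for this, as a sketch only: pick nodes at random with
 weights that fall as load rises, and keep failure penalties in a separate
 table with their own expiry, so that whatever wipes out the load adjustment
 every 5 mins does not touch them.  Everything named here (NodePicker,
 PENALTY, PENALTY_TTL, the 1/(1+load) weighting) is hypothetical and is not
 taken from the existing plab code.

```python
import random
import time

PENALTY = 5.0         # hypothetical: load added per recent failure
PENALTY_TTL = 3600.0  # hypothetical: seconds a failure keeps counting


class NodePicker:
    """Randomized node selection with failure penalties that expire on their own."""

    def __init__(self, rng=None, clock=time.time):
        self.rng = rng or random.Random()
        self.clock = clock
        self.failures = {}  # node name -> list of failure timestamps

    def record_failure(self, node):
        """Remember that an allocation or setup on this node failed just now."""
        self.failures.setdefault(node, []).append(self.clock())

    def effective_load(self, node, loadavg):
        """Reported loadavg plus a penalty for each failure newer than PENALTY_TTL."""
        now = self.clock()
        recent = [t for t in self.failures.get(node, []) if now - t < PENALTY_TTL]
        if recent:
            self.failures[node] = recent
        else:
            self.failures.pop(node, None)
        return loadavg + PENALTY * len(recent)

    def pick(self, loadavgs, count):
        """Pick up to `count` distinct nodes, favoring low effective load."""
        candidates = dict(loadavgs)
        chosen = []
        while candidates and len(chosen) < count:
            nodes = list(candidates)
            weights = [1.0 / (1.0 + self.effective_load(n, candidates[n]))
                       for n in nodes]
            node = self.rng.choices(nodes, weights=weights, k=1)[0]
            chosen.append(node)
            del candidates[node]
        return chosen
```

 Because the choice is weighted rather than sorted, a retry with "fatal"
 set would not necessarily land on the same physnodes, and a node that
 just failed becomes much less likely to be picked until its penalty ages
 out.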

-vnodesetup on a failed node is not releasing the lease

-vnode setup on 122-4 in 'chknodes' expt 9/20: it thinks it succeeded,
but it didn't.  Got permission denied when trying to ssh in.  Are we not
checking the return codes on the closes?
In the teardown portion of chknodes it discovers this:
	vnode plab122-4 teardown on plab122 returned 65280.
	*** /usr/testbed/sbin/vnode_setup:
	    Virtual node plab122-4 teardown failure!
	Node plab122-4 wasn't really allocated
This really appears to be a DB consistency problem:
it alloc'ed, then freed the node, and never did the full setup.
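
On the return code question: assuming the 65280 in the log above is a raw
16-bit wait status in the traditional Unix encoding (the form Perl's $?
uses), it decodes to exit code 255 with no signal.  ssh itself exits with
255 when it hits an error, which would fit the permission-denied symptom,
though that link is a guess.  A small sketch of the decoding;
decode_wait_status is a hypothetical helper, not part of vnode_setup:

```python
def decode_wait_status(status):
    """Split a traditional 16-bit wait status into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF  # high byte: the child's exit code
    signal_number = status & 0x7F     # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


def succeeded(status):
    """A command only succeeded if the whole status word is zero."""
    return status == 0


print(decode_wait_status(65280))  # (255, 0)
print(succeeded(65280))           # False
```

Checking the whole status word after every close, rather than only the
exit-code byte, also catches a child that was killed by a signal.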

-Ask for "all nodes"

-Shrink the virtnodes table when a node fails?
	Problem: It's unintuitive
	that swapping out and swapping back in won't bring back the failed nodes.
	Note that "modify" will bring them back.

-First class or good support for 'site'
	- Ability to spread the nodes around
	- Intuit when new nodes join
	- Sortable Web page displaying them hierarchically
	- Interaction with load metrics if not .99, and UI to it?
	- Probably also the other criteria that Brent mentioned in his mail of 9/9
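
	As a rough illustration of "spread the nodes around", one simple
	policy is to take nodes round-robin across sites, so that no site
	contributes a second node until every site has contributed a first.
	This is only a sketch: spread_across_sites and the site names in the
	usage line are hypothetical, and it ignores load entirely.

```python
from collections import defaultdict
from itertools import zip_longest


def spread_across_sites(node_sites, count):
    """Return up to `count` nodes, taken round-robin across sites.

    node_sites maps node name -> site name.
    """
    by_site = defaultdict(list)
    for node, site in node_sites.items():
        by_site[site].append(node)

    # Sort sites and nodes so the result is deterministic.
    per_site = [sorted(nodes) for _, nodes in sorted(by_site.items())]

    # Each "round" takes at most one node from every site.
    ordered = [node
               for round_ in zip_longest(*per_site)
               for node in round_
               if node is not None]
    return ordered[:count]


print(spread_across_sites(
    {"plab1": "siteA", "plab2": "siteA", "plab3": "siteB", "plab4": "siteC"}, 3))
# ['plab1', 'plab3', 'plab4']
```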

-Early abort if "cannot fail" is set and we get a failed node

-Merge ron/pcwas/plab as much as is possible and makes sense (e.g. showsites).

(-Modify when no nodes change?
	fix-node ok hack may not be a good idea in the long run)

-When we can't map because fix-nodes are not available, tell the user
why. (Maybe this won't occur once the modify fix is done.)

-Document all the installation/maintenance issues.
	Formalize the log of hostname/aux_type manual overrides.

-Build (regression) tests

-Document Plab support.  A few things aren't in the email and proto file,
e.g. how to queue, what cpu-usage and admission control map to.

-Describe/outline the internals of plab support, e.g. the process involved.