1. 23 Oct, 2003 1 commit
    • 5b52831c
      Kirk Webb authored
      Well, here it is:  The checkin implementing robust recovery/retry and
      asynchronous safe termination in plab allocation/deallocation/setup.
      
      Here are some of the more prominent changes/additions:
      
      * Bounded plab agent communication
        Scripts should never hang waiting for plab xmlrpc commands to complete;
        they have their own internal timeouts.  Node.create() in libplab is an
        exception, but is always run under a timeout constraint in vnode_setup
        and can be changed easily if the need arises.
      
      * Wrote functions in libplab to do the retry/recovery/timeout of remote
        command execution.
      
      * Wrapped critical sections with a signal watcher.
      
      * Added code to handle various error conditions properly.
      
      * Added a libtestbed function, TBForkCmd, which runs a given program in
        a child process, and can optionally catch incoming SIGTERMs and terminate
        the child (then exit itself).
      
      * Fixed up vnode_setup to batch the 'plabnode free' operation along with
        a few other cleanups.  This should alleviate Jay's concern about how
        long it used to take to tear down a plab expt.
      
      * Whacked plabmonitord into better shape; fixed a couple bugs, taught it how
        to daemonize, and implemented a priority list for testing broken plab nodes.
        This list causes new (as yet unseen) nodes to be tried first over ones that
        have been tested already.
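
A minimal sketch of the bounded retry idea described in this commit, not the actual libplab code. The names `retry_with_deadline` and `CommandTimeout` are hypothetical, and the sketch uses modern Python (`time.monotonic`). It assumes, as the message states, that each remote command enforces its own internal timeout, so the helper only bounds the number of attempts and the total elapsed time; it does not interrupt a call that hangs.

```python
import time


class CommandTimeout(Exception):
    """Raised when all attempts failed or the overall deadline passed."""


def retry_with_deadline(func, max_tries=3, deadline_secs=30.0, delay_secs=0.0):
    """Call func() until it succeeds, tries run out, or the deadline passes.

    Hypothetical helper: func is expected to raise an exception on failure
    and to enforce its own per-call timeout.
    """
    start = time.monotonic()
    last_error = None
    for attempt in range(1, max_tries + 1):
        # Stop starting new attempts once the overall time budget is spent.
        if time.monotonic() - start >= deadline_secs:
            break
        try:
            return func()
        except Exception as err:
            last_error = err
            # Pause between attempts, but not after the final one.
            if attempt < max_tries and delay_secs > 0:
                time.sleep(delay_secs)
    raise CommandTimeout("gave up after retries: %r" % (last_error,))
```

A caller would wrap a single remote command in a function and pass it in, treating `CommandTimeout` as the signal to start recovery or cleanup.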
  2. 23 Sep, 2003 3 commits
  3. 17 Sep, 2003 1 commit
    • Several updates to libplab.py and plabnode.in · 56e67515
      Kirk Webb authored
      - The getfree daemon no longer dies when communication with the plab dslice
        agent fails.
      
      - the link classifier logic has been changed slightly to allow nodes
        to be classified as inet2 even if they don't reverse resolve.  The problem
        here is that intl nodes that don't resolve, but which go through abilene
        will look like inet2 nodes, which is wrong.  Manual verification of the
        node_auxtypes table is still recommended.
      
      - The fping verifier has been disabled for now (since some plab nodes
        block ICMP traffic).
      
      - made some error messages more descriptive
      
      - plabnodes script now handles more agent communication errors gracefully
        (retries when it encounters them).
      
      - rearranged plabnode's retry loops to be a little easier to read, and
        more general.
  4. 15 Sep, 2003 1 commit
    • e1a2fabc
      Kirk Webb authored
      Some PLAB dslice manager updates:
      
      - in addition to asking the dslice agent (on plab) for a list of available
        nodes, we now also fping them all to weed out unresponsive ones.  One problem
        here is that several plab nodes block ICMP; this could be solved by pinging with
        nmap (which tries both an ICMP and a TCP ping).  This affects the plabdaemon getfree
        command, and subsequently which plab nodes appear as "up" in the DB.
      
      - Changed slice naming scheme:  we now append the experiment index onto the
        slice name to try to ensure uniqueness (emulab_<pid>_<eid>_<idx>)
      
      - Modified plabnode to try to cope with flaky nodes - there is some retry
        code in there now
      
      - Added the "fixsudo" shell script, which is run first, as root (via the
        cumbersome "su" command), to fix sudoers for later sudo use on plab nodes.
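
As an illustration of the slice naming scheme in this commit (emulab_<pid>_<eid>_<idx>), here is a small sketch. The function name `slice_name` and the sample values are hypothetical; only the name format comes from the commit message.

```python
def slice_name(pid, eid, idx):
    """Build a slice name of the form emulab_<pid>_<eid>_<idx>.

    Hypothetical helper: pid and eid are the project and experiment IDs,
    and idx is the experiment index appended to help keep names unique.
    """
    return "emulab_%s_%s_%d" % (pid, eid, idx)
```

For example, `slice_name("testbed", "myexp", 42)` gives `emulab_testbed_myexp_42`. Two experiments that reuse the same pid/eid pair over time still get distinct slice names as long as their indexes differ.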
  5. 22 Aug, 2003 1 commit
    • 6348a02e
      Austin Clements authored
      * Rewrote argument handling code to use getopt.

      * Various improvements to new node stuff, including reworking node
        status updates so that they use the right table, and don't update
        vnodes that are alive (since their watchdog will do this).
      
      * Added renewal code to automatically renew all leases that are going
        to expire within two days.
      
      * Moved Emulabification directly into the node abstraction.  Now the
        libplab wrapper scripts are all just plain wrapper scripts, instead
        of having the knowledge spread out.
      
      * Switched from using a Plab-specific keypair to using the normal
        Emulab one, which makes it possible to use sshtb to Plab nodes.
      
      * Removed node booting code, since vnode_setup takes care of this.
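
The lease renewal rule in this commit (renew anything going to expire within two days) could be sketched roughly as follows. This is not the actual libplab code: the function name `leases_to_renew`, the dict layout, and the choice to skip leases that have already expired are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Window taken from the commit message: renew leases expiring within two days.
RENEW_WINDOW = timedelta(days=2)


def leases_to_renew(leases, now):
    """Return the sorted names of leases expiring within RENEW_WINDOW.

    Hypothetical helper: leases maps a lease name to its expiry datetime.
    Leases that have already expired are skipped (an assumption).
    """
    return sorted(
        name
        for name, expires in leases.items()
        if now <= expires <= now + RENEW_WINDOW
    )
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the selection easy to test against fixed dates.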
  6. 19 Aug, 2003 1 commit
    • e6ce08a1
      Austin Clements authored
      This is the Planetlab manager. It includes a combination dslice
      service manager and resource broker that works closely with the
      control flow through the Emulab experiment swap process.  It keeps all
      slice and node data in the DB.  Node allocation automatically unpacks
      and configures the node to come up as an Emulab/Plab node when it is
      booted (later, via vnode_setup).  It also takes care of other
      necessary bits of interfacing with Planetlab, including discovering
      which nodes are available, adding new Plab nodes to the DB, and
      maintaining status information on Plab nodes.