1. 01 Nov, 2003 1 commit
    • Couple important, but small fixes: · 92eb1d5e
      Kirk Webb authored
      1) properly disable alarm before exiting ForkCmd
         - this was causing SIGALRM to get sent when it shouldn't have, and
           probably caused the renewal failures.
         - was introduced accidentally yesterday when I unwittingly committed
           some beta libplab code along with the rootball version string fix.
      
      2) Changed semantics of the renew daemon so that it sends only a single message
         for each invocation of the renewal loop, summarizing the renewals that failed.
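
      The alarm fix in item 1 can be illustrated with a minimal Python sketch. It
      is not the actual ForkCmd code: run_with_alarm and its arguments are
      hypothetical names, and it assumes a Unix host where SIGALRM is available.
      It shows only the pattern of cancelling a pending alarm on every exit path.

```python
import signal
import subprocess
import sys


def run_with_alarm(argv, timeout_secs):
    """Run argv under a SIGALRM timeout and always cancel the alarm afterwards.

    Hypothetical helper for illustration; Unix only, main thread only.
    """
    def on_alarm(signum, frame):
        raise TimeoutError("command timed out")

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout_secs)
    try:
        return subprocess.call(argv)
    finally:
        # Without this, a stale alarm could fire later in unrelated code.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)


if __name__ == "__main__":
    print(run_with_alarm([sys.executable, "-c", "pass"], 5))
```

      Putting the cancel in a finally clause covers both the normal return and the
      error paths, which is the case the fix above addresses.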
      
      The rest of the code I committed accidentally yesterday seems to be working
      just fine.  It all looks sane on perusal.
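
      The single-summary behaviour described in item 2 might look roughly like the
      sketch below. The names renew_all, renew_one and send_mail are hypothetical,
      and the sketch assumes that no message is sent when nothing failed, which
      the commit message does not state.

```python
def renew_all(slices, renew_one, send_mail):
    """Try to renew every slice, then send at most one summary message.

    renew_one(name) and send_mail(subject, body) are caller-supplied,
    hypothetical callables; this is not the renew daemon's real interface.
    """
    failed = []
    for name in slices:
        try:
            renew_one(name)
        except Exception as exc:
            # Collect the failure instead of mailing immediately.
            failed.append((name, str(exc)))
    if failed:
        body = "\n".join(f"{name}: {reason}" for name, reason in failed)
        send_mail(f"{len(failed)} renewal(s) failed", body)
    return failed
```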
  2. 31 Oct, 2003 2 commits
  3. 24 Oct, 2003 2 commits
    • Commit the stuff necessary to copy out new plab rootballs, versions of · d12f9b61
      Robert Ricci authored
      which had been hanging around in my home directory for a while.
      
      There are a few new things in plab/etc/netbed_files that set up a
      directory of the same name in @prefix@. This will get rsync'ed with
      netbed_files/ on each planetlab node.
      log/  - just needs to exist for the httpd server
      sbin/ - contains thttpd, and scripts to manipulate it
      www/  - the directory served by thttpd. Contains symlinks to the 'real'
              location of the rootballs (etc/plab)
      
      I've committed a binary of thttpd - this is simply because it'd be a
      PITA to compile a Linux binary for every devel tree, etc.
      
      PLAB_ROOTBALL has now become a configure option. The idea is that we
      will keep the latest version number in configure.in, but you can
      override it in your defs file. This way, we don't have to update every
      defs file when there's a new version, but people can still play around
      with their own version if they want.
      
      The two scripts that interact with the plab nodes skip ones that are
      down. They ssh in as 'utah1', meaning that one of us who has access to
      that account needs to run them, so that they can have access to our
      keys. We can put boss's public key (or something) out there to remove
      this requirement.
      
      plabdist runs an rsync between @prefix@/etc/plab/netbed_files and a
      directory of the same name on the planetlab nodes. It's intended to be run
      from the main install tree - the local rsync directory is not normally
      set up in devel trees. It runs in parallel, but is limited to 4 to
      avoid beating up boss too much. Takes about 1:40 with the current set
      of plab nodes (took > 10 minutes doing one at a time).
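
      The bounded parallelism described above can be sketched in Python as
      follows. This is not the plabdist script itself: the function names, the
      rsync flags, and the relative netbed_files/ paths are assumptions made for
      illustration. Only the cap of 4 and the 'utah1' login come from the text.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # cap taken from the commit message


def rsync_to(node, src="netbed_files/", dest="netbed_files/", user="utah1"):
    """Push src to one node over ssh; returns rsync's exit status.

    Paths and flags are illustrative assumptions, not plabdist's real ones.
    """
    return subprocess.call(["rsync", "-a", "-e", "ssh", src, f"{user}@{node}:{dest}"])


def distribute(nodes, push=rsync_to, max_parallel=MAX_PARALLEL):
    """Run push(node) for every node, at most max_parallel at a time."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        statuses = list(pool.map(push, nodes))
    return dict(zip(nodes, statuses))
```

      Passing a different push callable makes it possible to try the scheduling
      without touching any real node.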
      
      plabhttpd (re)starts the mini web server on all plab nodes
    • Fix minor syntax error in SENDMAIL() call. · 0c8442dd
      Leigh Stoller authored
  4. 23 Oct, 2003 2 commits
    • Plab link data retrieval program. This little number gets the latency · 98d2488c
      Leigh Stoller authored
      and bandwidth data from the various plab websites and parses the ad-hoc
      files into something that can be inserted into the widearea_recent
      table.
      
      Not a real daemon at the moment; it will run from crontab until
      I have a chance to fully daemonize.
    • · 5b52831c
      Kirk Webb authored
      Well, here it is:  The checkin implementing robust recovery/retry and
      asynchronous safe termination in plab allocation/deallocation/setup.
      
      Here are some of the more prominent changes/additions:
      
      * Bounded plab agent communication
        Scripts should never hang waiting for plab xmlrpc commands to complete;
        they have their own internal timeouts.  Node.create() in libplab is an
        exception, but is always run under a timeout constraint in vnode_setup
        and can be changed easily if the need arises.
      
      * Wrote functions in libplab to do the retry/recovery/timeout of remote
        command execution.
      
      * Wrapped critical sections with a signal watcher.
      
      * Added code to handle various error conditions properly
      
      * Added a libtestbed function, TBForkCmd, which runs a given program in
        a child process, and can optionally catch incoming SIGTERMs and terminate
        the child (then exit itself).
      
      * Fixed up vnode_setup to batch the 'plabnode free' operation along with
        a few other cleanups.  This should alleviate Jay's concern about how
        long it used to take to tear down a plab expt.
      
      * Whacked plabmonitord into better shape; fixed a couple bugs, taught it how
        to daemonize, and implemented a priority list for testing broken plab nodes.
        This list causes new (as yet unseen) nodes to be tried first over ones that
        have been tested already.
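
      The retry/recovery idea from the list above can be sketched in Python as
      shown here. The libplab functions themselves are not reproduced:
      call_with_retry and its parameters are hypothetical, and the wrapped
      callable is assumed to enforce its own timeout, as the first bullet
      describes.

```python
import time


def call_with_retry(func, attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, pausing between failed tries.

    func is assumed to carry its own internal timeout, so each try is
    bounded and the whole call cannot hang indefinitely.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_secs)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

      Chaining the last error onto the final exception keeps the underlying
      cause visible to whoever reads the log.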
  5. 20 Oct, 2003 1 commit
  6. 15 Oct, 2003 1 commit
  7. 14 Oct, 2003 1 commit
    • · 4deac149
      Kirk Webb authored
      Update to libplab.plab.renew:
      
        * Make renewal robust against various kinds of failures.  These changes
          will augment my larger set of libplab and plab* updates/fixes coming
          soon to an Emulab near you.
  8. 30 Sep, 2003 1 commit
  9. 25 Sep, 2003 2 commits
  10. 24 Sep, 2003 1 commit
    • Commit my daemon to monitor the status of plab physnodes in hwdown, · 59c5d5bb
      Leigh Stoller authored
      trying to bring them back from the dead periodically by trying to
      instantiate a vserver/vnode on them, and then tearing it down. If we
      can do that, then the node is usable, and it gets moved back into the
      normal holding experiment so that ptopgen will add it to ptop files.
      
      This daemon is not turned on yet; waiting for other little bits and
      pieces to be done.
      
      There is an equiv change in os_setup that moves physnodes into hwdown
      when a setup on a vnode fails.
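
      One pass of the recovery check described above could be sketched like this
      in Python. All four callables are hypothetical stand-ins supplied by the
      caller; the real daemon's interfaces are not shown in this log.

```python
def probe_hwdown_nodes(nodes, create_vnode, destroy_vnode, move_to_holding):
    """Probe each node in hwdown once and return the ones that recovered.

    A node counts as usable only if a vnode can be both created and torn
    down on it; every callable here is an illustrative assumption.
    """
    recovered = []
    for node in nodes:
        try:
            vnode = create_vnode(node)
            destroy_vnode(vnode)
        except Exception:
            continue  # still broken: leave it in hwdown for the next pass
        move_to_holding(node)
        recovered.append(node)
    return recovered
```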
      
      Lbs
  11. 23 Sep, 2003 3 commits
  12. 22 Sep, 2003 3 commits
  13. 19 Sep, 2003 6 commits
  14. 18 Sep, 2003 3 commits
    • First whack at a per-node stats program that will make a stats web page go. · 4bc8c274
      Mike Hibler authored
      Usage: plabstats [-dfh] [-CDHILMS]
        -d    print debug diagnostics
        -f    fetch new data, else use what is in /tmp/plabxml
        -h    this help message
        -i    print IP address along with metrics
        -n    do not print hostname with metrics
      
        -C    print Ganglia CPU metrics, sorted by %CPU usage
        -D    print Ganglia disk metrics, sorted by %disk usage
        -L    print Ganglia load metrics, sorted by one minute load
        -M    print Ganglia memory metrics, sorted by %mem usage
        -S    print Emulab state info, summarizing per-node availability
      
      Default is to print a terse summary of per-node resource usage.
      Use "plabstats -f" to get fresh data or try something whacky like:
      	plabstats -S | grep accept_3
      to get the list of nodes which are currently available for mapping
      by a "level 3" (aka, average) resource consuming experiment, or:
      	plabstats -S | grep reject
      to get info about the nodes that cannot be used along with the reason(s) why.
      
      Needs some refinement:
        plabmetrics should store raw info into the DB where plabstats can get it
        presentation of Emulab state info should be improved
    • A couple of enhancements to libplab: · d8b5e83b
      Kirk Webb authored
      - new SENDMAIL function that mirrors the perl lib's
      - retry logic for communicating with the dslice agent when a slice/sliver is
        being deleted (plabnode free)
      
      One thing I'd like to do is write a wrapper class around agent and node
      manager communication since it can be flaky - would clean up and simplify
      things.  Maybe just wait since we're going to have to port over to dynamic
      slices soon enough.
    • Bump the rootball to version 4 · c66fd66c
      Robert Ricci authored
  15. 17 Sep, 2003 4 commits
  16. 16 Sep, 2003 2 commits
  17. 15 Sep, 2003 4 commits
    • Small changes to plab "etc" Makefile · 17ac2f75
      Kirk Webb authored
      - Not installing plabroot.tgz anymore
        - grrr, kept clobbering the one I was making by hand.  We should install
          and update this tarball out-of-band anyway
      
      - Installs new fixsudo shell script.
    • Oops: small bug fix. · 92c6eb66
      Kirk Webb authored
    • · e1a2fabc
      Kirk Webb authored
      Some PLAB dslice manager updates:
      
      - in addition to asking the dslice agent (on plab) for a list of available
        nodes, we now also fping them all to weed out unresponsive ones.  One problem
        here is that several plab nodes block ICMP; could be solved by pinging with
        nmap (tries both an ICMP and a TCP ping).  This affects the plabdaemon getfree
        command, and subsequently which plab nodes appear as "up" in the DB.
      
      - Changed slice naming scheme:  we now append the experiment index onto the
        slice name to try to ensure uniqueness (emulab_<pid>_<eid>_<idx>)
      
      - Modified plabnode to try to cope with flaky nodes - there is some retry
        code in there now
      
      - Added the "fixsudo" shell script, which is run first as root (via the
        cumbersome "su" command) to fix sudoers for later sudo use on plab nodes.
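
      The slice naming scheme from the list above (emulab_<pid>_<eid>_<idx>) is
      simple enough to show directly. The function name below is hypothetical;
      only the pattern comes from the commit message.

```python
def slice_name(pid, eid, idx):
    """Build a slice name following the emulab_<pid>_<eid>_<idx> pattern.

    pid and eid are the project and experiment IDs; idx is the experiment
    index appended to help keep names unique.
    """
    return f"emulab_{pid}_{eid}_{idx}"
```

      For example, slice_name("testbed", "myexp", 42) yields
      "emulab_testbed_myexp_42".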
    • Default to using 15-minute loadave and a cap of 5.0. · ef79a466
      Mike Hibler authored
      Change to make nodes with stale data unusable (give em a loadave of 999)
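
      The load rule above can be sketched in Python as follows. The 5.0 cap and
      the 999 placeholder come from the commit message; the function names, the
      600-second staleness threshold, and treating a load exactly at the cap as
      acceptable are assumptions made for illustration.

```python
LOAD_CAP = 5.0      # cap on the 15-minute load average (from the commit)
STALE_LOAD = 999.0  # load assigned to nodes with stale data (from the commit)


def effective_load(load15, age_secs, max_age_secs=600):
    """Return the load to judge a node by; stale data counts as 999.

    max_age_secs is a hypothetical threshold, not taken from the source.
    """
    if age_secs > max_age_secs:
        return STALE_LOAD
    return load15


def is_usable(load15, age_secs, max_age_secs=600):
    """A node is usable when its effective load does not exceed the cap."""
    return effective_load(load15, age_secs, max_age_secs) <= LOAD_CAP
```

      Mapping stale data onto a very high load means the ordinary cap check
      rejects such nodes without needing a separate code path.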
  18. 12 Sep, 2003 1 commit