1. 01 Nov, 2003 1 commit
    • Couple important, but small fixes: · 92eb1d5e
      Kirk Webb authored
      1) properly disable alarm before exiting ForkCmd
         - this was causing SIGALRM to get sent when it shouldn't have, and
           probably caused the renewal failures.
         - was introduced accidentally yesterday when I unwittingly committed
           some beta libplab code along with the rootball version string fix.
      
      2) Changed semantics of the renew daemon s.t. it only sends a single message
         for each invocation of the renewal loop - summarizes the ones that failed.
      
      The rest of the code I committed accidentally yesterday seems to be working
      just fine.  It all looks sane on perusal.
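The alarm fix in item 1 can be illustrated with a minimal Python sketch. This is not the actual ForkCmd code: the function name `fork_cmd`, the `CommandTimeout` exception, and the use of `subprocess` are assumptions made for illustration. The point is only that the pending alarm is cancelled on every exit path, so a stray SIGALRM cannot fire after the function has returned.

```python
import signal
import subprocess


class CommandTimeout(Exception):
    """Raised when the child does not finish before the alarm fires."""


def _on_alarm(signum, frame):
    raise CommandTimeout("command timed out")


def fork_cmd(argv, timeout_secs):
    """Run argv in a child process and return its exit status.

    Hypothetical stand-in for a ForkCmd-style helper (Unix only, must be
    called from the main thread because it installs a signal handler).
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    child = subprocess.Popen(argv)
    signal.alarm(timeout_secs)
    try:
        return child.wait()
    except CommandTimeout:
        # The alarm fired while waiting: kill and reap the child.
        child.kill()
        child.wait()
        raise
    finally:
        # Cancel any pending alarm before leaving, whatever the exit path,
        # and put the previous SIGALRM handler back.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
```

Without the `signal.alarm(0)` in the `finally` clause, a command that finishes early would leave the timer armed, and the signal would arrive later in unrelated code.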
  2. 31 Oct, 2003 5 commits
  3. 29 Oct, 2003 1 commit
  4. 24 Oct, 2003 2 commits
    • Commit the stuff necessary to copy out new plab rootballs, versions of
      which had been hanging around in my home directory for a while. · d12f9b61
      Robert Ricci authored
      
      There are a few new things in plab/etc/netbed_files that set up a
      directory of the same name in @prefix@. This will get rsync'ed with
      netbed_files/ on each planetlab node.
      log/  - just needs to exist for the httpd server
      sbin/ - contains thttpd, and scripts to manipulate it
      www/  - the directory served by thttpd. Contains symlinks to the 'real'
              location of the rootballs (etc/plab)
      
      I've committed a binary of thttpd - this is simply because it'd be a
      PITA to compile a Linux binary for every devel tree, etc.
      
      PLAB_ROOTBALL is now a configure option. The idea is that we will keep
      the latest version number in configure.in, but you can override it in
      your defs file. This way, we don't have to update every defs file when
      there's a new version, but people can still play around with their own
      version if they want.
      
      The two scripts that interact with the plab nodes skip ones that are
      down. They ssh in as 'utah1', meaning that one of us who has access to
      that account needs to run them, so that they can have access to our
      keys. We can put boss's public key (or something) out there to remove
      this requirement.
      
      plabdist runs an rsync between @prefix@/etc/plab/netbed_files and a
      directory of the same name on the planetlab nodes. It's intended to be run
      from the main install tree - the local rsync directory is not normally
      set up in devel trees. It runs in parallel, but is limited to 4 to
      avoid beating up boss too much. Takes about 1:40 with the current set
      of plab nodes (took > 10 minutes doing one at a time).
      
      plabhttpd (re)starts the mini web server on all plab nodes
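The bounded parallelism described for plabdist (several nodes at once, but no more than 4) can be sketched in Python as follows. This is an illustration, not the plabdist script: `run_on_nodes` and `make_cmd` are hypothetical names, and a thread pool is just one way to cap the number of concurrent child processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # the limit mentioned in the commit message


def run_on_nodes(nodes, make_cmd, max_parallel=MAX_PARALLEL):
    """Run make_cmd(node) for every node, at most max_parallel at a time.

    make_cmd is a caller-supplied (hypothetical) function that returns the
    argv list to execute for a node. Returns {node: exit_status}.
    """
    nodes = list(nodes)

    def run_one(node):
        return subprocess.call(make_cmd(node))

    # The pool size is what bounds the number of simultaneous commands.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        statuses = list(pool.map(run_one, nodes))
    return dict(zip(nodes, statuses))
```

For an rsync run, `make_cmd` might return something like `["rsync", "-a", local_dir, "utah1@" + node + ":netbed_files/"]`; the flag and the remote path there are guesses for illustration, not taken from the script.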
    • Fix minor syntax error in SENDMAIL() call. · 0c8442dd
      Leigh Stoller authored
  5. 23 Oct, 2003 3 commits
    • Plab link data retrieval program. · 98d2488c
      Leigh Stoller authored
      This little number gets the latency and bandwidth data from the various
      plab websites and parses the ad-hoc files into something that can be
      inserted into the widearea_recent table.
      
      Not a real daemon at the moment; it will run from crontab until
      I have a chance to fully daemonize.
    • Kirk Webb authored · 5b52831c
      Well, here it is:  The checkin implementing robust recovery/retry and
      asynchronous safe termination in plab allocation/deallocation/setup.
      
      Here are some of the more prominent changes/additions:
      
      * Bounded plab agent communication
        Scripts should never hang waiting for plab xmlrpc commands to complete;
        they have their own internal timeouts.  Node.create() in libplab is an
        exception, but is always run under a timeout constraint in vnode_setup
        and can be changed easily if the need arises.
      
      * Wrote functions in libplab to do the retry/recovery/timeout of remote
        command execution.
      
      * Wrapped critical sections with a signal watcher.
      
      * Added code to handle various error conditions properly
      
      * Added a libtestbed function, TBForkCmd, which runs a given program in
        a child process, and can optionally catch incoming SIGTERMs and terminate
        the child (then exit itself).
      
      * Fixed up vnode_setup to batch the 'plabnode free' operation along with
        a few other cleanups.  This should alleviate Jay's concern about how
        long it used to take to tear down a plab expt.
      
      * Whacked plabmonitord into better shape; fixed a couple bugs, taught it how
        to daemonize, and implemented a priority list for testing broken plab nodes.
        This list causes new (as yet unseen) nodes to be tried first over ones that
        have been tested already.
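The retry/timeout pattern mentioned in the list above (bounded waits on remote commands, with a limited number of retries) might look roughly like this in Python. It is a sketch under assumptions, not the libplab functions: `run_with_retry` and its parameters are hypothetical, and a local subprocess stands in for the remote command.

```python
import subprocess
import time


def run_with_retry(argv, timeout_secs, max_tries=3, retry_delay_secs=0.0):
    """Return True if argv exits with status 0 within timeout_secs on
    one of at most max_tries attempts, otherwise False.

    Hypothetical helper; the real code may classify errors and recover
    differently.
    """
    for attempt in range(1, max_tries + 1):
        try:
            result = subprocess.run(argv, timeout=timeout_secs)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            # subprocess.run kills and reaps the child before raising,
            # so a hung command cannot stall the caller indefinitely.
            pass
        if attempt < max_tries:
            time.sleep(retry_delay_secs)
    return False
```

The caller never blocks longer than roughly `max_tries * (timeout_secs + retry_delay_secs)`, which is the property the "bounded plab agent communication" item is after.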
    • Spit out -1 for bw/latency when no entry exists in the widearea table. · 8e9a2957
      Leigh Stoller authored
      This will happen for plab nodes because of the way that we can get the
      data. Not ideal, but not sure what to do about it.
  6. 22 Oct, 2003 2 commits
  7. 20 Oct, 2003 3 commits
    • Bump to rev 8. · 9d0a21d9
      Leigh Stoller authored
    • Bring wanassign back from the bit rot abyss. Three changes. · fe9eba11
      Leigh Stoller authored
      * Remove all of the code that dealt with allocating unconnected nodes.
        It used to be assign_wrapper passed all widearea node allocation
        decisions to wanassign, those in links and those that were
        unconnected. assign_wrapper now handles all unconnected nodes since
        assign is much better with features/desires and node type stuff.
      
      * Do not modify any database state in wanassign. It used to do the
        actual nalloc calls, but now it just returns the mapping to
        assign_wrapper so that we can more easily track "recoverability" and
        because there is existing code in assign_wrapper to allocate vnodes
        on the selected pnodes. No point in duplication.
      
      * Switch from mapping to vnodes, to mapping to pnodes. We made this
        change for other virtual nodes; instead of "fixing" to a vnode on a
        pnode, fix to the pnode. The resulting mappings are also given as
        pnodes, and assign_wrapper does the allocation on those selected
        nodes.
      
      Now all we need is up-to-date widearea data!
  8. 19 Oct, 2003 1 commit
  9. 18 Oct, 2003 1 commit
  10. 17 Oct, 2003 1 commit
  11. 16 Oct, 2003 2 commits
  12. 15 Oct, 2003 5 commits
  13. 14 Oct, 2003 1 commit
    • Kirk Webb authored · 4deac149
      Update to libplab.plab.renew:
      
        * Make renewal robust against various kinds of failures.  These changes
          will augment my larger set of libplab and plab* updates/fixes coming
          soon to an Emulab near you.
  14. 13 Oct, 2003 2 commits
    • Aside from another round of cleanup, there is a significant change. · a70aef53
      Leigh Stoller authored
      I have implemented the suggestion Jay made a couple of weeks ago
      about allowing partial allocation in assign_wrapper, and retrying with a
      modified set of "fixed" nodes.
      
      My basic approach was to change nalloc to optionally allow partial
      allocations, returning the number of nodes that could not be allocated as
      its return value. In assign_wrapper, I determine which nodes we were able
      to get (in each loop), set their allocstate to INIT_DIRTY, augment the
      fixed_node set, and recreate the top file. Then I try again, up to the
      current number of maxtries. If assign fails with an unretryable error, or
      if we could not nalloc a user directed fixed node, then I stop right away
      since the experiment is not going to map (in the near term) if the fixed
      node list cannot be allocated.
      
      I am confident that this works okay, although testing is a little
      difficult. The main problem is how this interacts with experiment modify.
      Chad's implementation is that a modify can be reverted (recovered from)
      only as long as the DB is not modified by assign_wrapper. Well, a partial
      allocation, followed by failure, obviously modifies the DB, and so is
      deemed not recoverable. I am still trying to figure out the effects of
      this, and whether I can relax this requirement, but in the meantime
      let's install it and see what happens (won't affect many people).
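The partial-allocation retry loop described above can be sketched in Python. This is a simplified model, not assign_wrapper: `run_assign` and `nalloc` are hypothetical callables, the sketch has `nalloc` return the set of nodes it obtained (the message says the real nalloc returns the number it could not allocate), and the INIT_DIRTY bookkeeping and top file regeneration are left out.

```python
def allocate_with_retries(run_assign, nalloc, user_fixed, maxtries):
    """Map and allocate nodes, keeping partial allocations between tries.

    run_assign(fixed) -> iterable of physical nodes to use (expected to
        include every node in fixed), or None on an unretryable error.
    nalloc(nodes) -> set of the requested nodes that were allocated.
    Returns the set of allocated nodes on success, None on failure.
    """
    user_fixed = set(user_fixed)
    fixed = set(user_fixed)
    held = set()  # nodes obtained in earlier passes
    for _ in range(maxtries):
        chosen = run_assign(fixed)
        if chosen is None:
            return None  # unretryable assign error: stop right away
        needed = set(chosen) - held
        got = nalloc(needed)
        held |= got
        failed = needed - got
        if not failed:
            return held
        if failed & user_fixed:
            # A user-directed fixed node is unavailable, so the
            # experiment will not map in the near term.
            return None
        # Keep what we got as fixed nodes and remap the rest.
        fixed |= got
    return None
```

Because `held` is modified before a possible failure, a failed run leaves allocation state behind, which is the recoverability concern with experiment modify raised in the message.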
    • Rollback to prestatewait for now. · 3b210b7b
      Mac Newbold authored
  15. 10 Oct, 2003 2 commits
    • Fix a nit for Mike. · b71f5f90
      Mac Newbold authored
    • New StateWait changes - the main point of all this is to move to our new
      model of waiting for state changes. · 2b2a306d
      Mac Newbold authored
      Before, we were watching the database
      (which means we can only watch for terminal/stable/long-lived states, and
      have to poll the db). Now things that are waiting for states to change
      become event listeners, and watch the stream of events flow by, and don't
      have to do any polling. They can now watch for any state, and even
      sequences of states (e.g. a Shutdown followed by an Isup).
      
      To do this, there is now a cool StateWait.pm library that encapsulates the
      functionality needed. To use it, you call initStateWait before you start
      the chain of events (e.g. before you call node reboot). Then do your stuff,
      and call waitForState() when you're ready to wait. It can be told to
      return periodically with the results so far, and you can cancel waiting
      for things. An example program called waitForState is in
      testbed/event/stated/ , and can also be used nicely as a command line tool
      that wraps up the library functionality.
      
      This also required the introduction of a TBFAILED event that can be sent
      when a node isn't going to make it to the state that someone may be
      waiting for. I.e., if it gets wedged coming up, and stated retries, but
      eventually gives up on it, it sends this to let things know that the node
      is hosed and won't ever come up.
      
      Another thing that is part of this is that node_reboot moves (back) to the
      fully-event-driven model, where users call node reboot, and it does some
      checks and sends some events. Then stated calls node_reboot in "real mode"
      to actually do the work, and handles doing the appropriate retries until
      the node either comes up or is deemed "failed" and stated gives up on it.
      This means stated is also the gatekeeper of when you can and cannot reboot
      a node. (See mail archives for extensive discussions of the details.)
      
      A big part of the motivation for this was to get uninformed timeouts and
      retries out of os_load/os_setup and put them in stated where we can make a
      wiser choice. So os_load and os_setup now use this new stuff and don't
      have to worry about timing out on nodes and rebooting. Stated makes sure
      that they either come up, get retried, or fail to boot. tbrestart also
      underwent a similar change.
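The event-listener model described above (watch a stream of state events, match sequences of states per node, and treat TBFAILED as a terminal failure) can be illustrated with a small Python sketch. This is not the StateWait.pm interface: the `StateWaiter` class, its method names, and the state strings are assumptions for illustration only.

```python
class StateWaiter:
    """Track, per node, an ordered sequence of states still to be seen."""

    def __init__(self, expected):
        # expected: {node: [state, state, ...]}, observed in order.
        self.remaining = {}
        self.done = []
        self.failed = []
        for node, states in expected.items():
            if states:
                self.remaining[node] = list(states)
            else:
                self.done.append(node)  # nothing to wait for

    def handle_event(self, node, state):
        """Feed one (node, state) event from the event stream."""
        seq = self.remaining.get(node)
        if seq is None:
            return  # not waiting on this node
        if state == "TBFAILED":
            # The node will never reach the awaited state: stop waiting.
            del self.remaining[node]
            self.failed.append(node)
        elif state == seq[0]:
            seq.pop(0)
            if not seq:
                del self.remaining[node]
                self.done.append(node)
        # Any other state is ignored; the sequence must match in order.

    def finished(self):
        return not self.remaining
```

Since the waiter consumes events instead of polling the database, it can observe short-lived states and ordered sequences such as a Shutdown followed by an Isup.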
  16. 09 Oct, 2003 2 commits
    • Reorg of two aspects of node update. · 2641af4d
      Leigh Stoller authored
      * install-rpm, install-tarfile, spewrpmtar.php3, spewrpmtar.in: Pumped
        up even more! The db file we store in /var/db now records both the
        timestamp (of the file, or if remote the install time) and the MD5
        of the file that was installed. Locally, we can get this info when
        accessing the file via NFS (copymode on or off). Remote, we use wget
        to get the file, and so pass the timestamp along in the URL request,
        and let spewrpmtar.in determine if the file has changed. If the
        timestamp it gets is >= the timestamp of the file, a status code
        of 304 (Not Modified) is returned. Otherwise the file is returned.
      
        If the timestamps are different (remote, server sends back an actual
        file), the MD5 of the file is compared against the value stored. If
        they are equal, update the timestamp in the db file to avoid
        repeated MD5s (or server downloads) in the future. If the MD5 is
        different, then reinstall the tarball or rpm, and update the db file
        with the new timestamp and MD5. Presto, we have auto update capability!
      
        Caveat: I pass along the old MD5 in the URL, but it is currently
        ignored. I do not know if doing the MD5 on the server is a good
        idea, but obviously it is easy to add later. At the moment it
        happens on the node, which means wasted bandwidth when the timestamp
        has changed, but the file has not (probably not something that will
        happen in typical usage).
      
        Caveat: The timestamp used on remote nodes is the time the tarfile
        is installed (GM time of course). We could arrange to return the
        timestamp of the local file back to the node, but that would mean
        complicating the protocol (or using an http header) and I was not in
        the mood for that. In typical usage, I do not think that people will
        be changing tarfiles and rpms so rapidly that this will make a
        difference, but if it does, we can change it.
      
      * node_update.in, client side watchdog, and various web pages:
        Deflated node_update, removing all of the older ssh code. We now
        assume that all nodes will auto update on a periodic basis, via the
        watchdog that runs on all client nodes, including plab nodes.
      
        Changed the permission check to look for new UPDATE permission (used
        to be UPDATEACCOUNT). As before, it requires local_root or better.
        The reason for this is that node_update now implies more than just
        updating the accounts/mounts. The web pages have been changed to
        explain that in addition to mounts/accounts, rpms and tarfiles will
        also be updated. At the moment, this is still tied to a single
        variable (update_accounts) in the nodes table, but as Kirk requested
        at the meeting, it will probably be nice to split these out in the
        future.
      
        Added the ability to node_update a single node in an experiment (in
        addition to all nodes option on the showexp page). This has been
        added to the shownode webpage menu options.
      
        Changed locking code to use the newer wrapper states, and to move
        the experiment to RUNNING_LOCKED until the update completes. This is
        to prevent mayhem in the rest of the system (which could be dealt
        with, but is not worth the trouble; people have to wait until their
        initiated update is complete, before they can swap out the
        experiment).
      
        Added "short" mode to shownode routine, equiv to the recently added
        short mode for showexp. I use this on the confirmation page for
        updating a single node, giving the user a couple of pertinent (feel
        good) facts before they confirm.
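The timestamp and MD5 check described in the first item above (install-rpm, install-tarfile, spewrpmtar) can be sketched in Python. This is an illustration of the decision logic only, not the scripts themselves: the function names and the record layout (`stamp`, `md5`) are hypothetical.

```python
import hashlib


def server_should_send(client_stamp, file_mtime):
    """Server side: send the file only if it is newer than the client's
    timestamp. False corresponds to replying 304 (Not Modified)."""
    return client_stamp < file_mtime


def client_decide(record, new_stamp, new_bytes):
    """Client side, run once a file has actually been fetched.

    record is the stored {'stamp': ..., 'md5': ...} entry for this file.
    Returns (action, updated_record), where action is 'touch' when only
    the timestamp needs refreshing and 'reinstall' when the content
    changed.
    """
    digest = hashlib.md5(new_bytes).hexdigest()
    if digest == record["md5"]:
        # Same content: update the timestamp to avoid repeated MD5s
        # (or downloads) in the future.
        return "touch", {"stamp": new_stamp, "md5": record["md5"]}
    return "reinstall", {"stamp": new_stamp, "md5": digest}
```

As the first caveat notes, computing the MD5 on the node means the file is downloaded even when only its timestamp changed; moving the comparison to the server would trade that bandwidth for server-side hashing.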
    • tbsetup/node_reboot.in · 4bc03e0b
      Mac Newbold authored
  17. 07 Oct, 2003 1 commit
  18. 06 Oct, 2003 1 commit
  19. 02 Oct, 2003 4 commits