1. 20 Oct, 2006 1 commit
    • Mike Hibler's avatar
      Wow, this should make me look important! · afa5e919
      Mike Hibler authored
      Two-day boondoggle to support "/scratch", an optional large, shared filesystem
      for users.  To do this, I needed to find all the instances where /proj is used
      and behave accordingly.  The boondoggle part was the decision to gather up all
      the hardwired instances of shared directory names ("/proj", "/users", etc.)
      so that they are set in a common place (via unexposed configure variables).
      This is a boondoggle because:
      
      1. I didn't change the client-side scripts.  They need a different mechanism
         (e.g., tmcd) to get the info; configure is the wrong way.
      
      2. Even if I had done #1 it is likely--no, certain--that something would
         fail if you tried to rename "/proj" to be "/mike".  These names are just
         too ingrained.
      
      3. We may not even use "/scratch" as it turns out.
      
      Note, I also didn't fix any of the .html documentation.  Anyway, it is done.
      To maintain my illusion in the future you should:
      
      1. Have perl scripts include "use libtestbed" and use the defined PROJROOT(),
         et al. functions where possible.  If not possible, make sure they run
         through configure and use @PROJROOT_DIR@, etc.
      
      2. Use the configure method for python, C, php and other languages.
      
      3. There are perl (TBValidUserDir) and php (VALIDUSERPATH) functions which
         you should call to determine if an NS, template parameter, tarball or
         other file is in "an acceptable location."  Use these functions where
         possible.  They know about the optional "scratch" filesystem.  Note that
         the perl function is over-engineered to handle cases that don't occur
         in nature.
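
The path check described in item 3 can be pictured roughly as follows. This is a Python sketch, not the actual TBValidUserDir or VALIDUSERPATH code; the `valid_user_path` name, the `/groups` root, and the `have_scratch` flag are assumptions made for illustration.

```python
import posixpath

# Hypothetical shared roots; in the testbed these come from configure.
PROJROOT = "/proj"
USERROOT = "/users"
GROUPROOT = "/groups"      # assumed, not named in the commit message
SCRATCHROOT = "/scratch"   # the optional filesystem


def valid_user_path(path, have_scratch=False):
    """Return True if path lies strictly under one of the shared roots."""
    roots = [PROJROOT, USERROOT, GROUPROOT]
    if have_scratch:
        roots.append(SCRATCHROOT)
    # Collapse "..", "." and duplicate slashes before comparing prefixes.
    norm = posixpath.normpath(path)
    if not norm.startswith("/"):
        return False
    return any(norm.startswith(root + "/") for root in roots)
```

Normalizing before the prefix test is what keeps something like "/proj/../etc/passwd" from passing.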
  2. 27 Sep, 2006 1 commit
    • Kevin Atkinson's avatar
      · 7293bbc0
      Kevin Atkinson authored
      Second attempt to fix the problem of duplicate log entries.  I am
      99.99% sure this will get 100% of the cases, and 99.999% sure it won't
      break anything.
      
      It basically detects when the DB handle is in a child process and, if so,
      sets "InactiveDestroy" before the database handle DESTROY method is called.
      Since the DB handle can be closed in several different places I created a
      new class to override the DB handle (the Mysql class) DESTROY method. The
      other alternative is to add special code anywhere the database handle
      could be destroyed, which is whenever a reconnect is done and when the
      module exits.  The latter would have involved putting code in the END block.
      I think the new class method is simpler for that reason.
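
A minimal Python sketch of the idea, assuming the child is detected by comparing process IDs (the commit does not say how detection is done). `FakeConnection` and `ForkSafeHandle` are hypothetical names; the real change is a Perl class overriding DESTROY.

```python
import os


class FakeConnection:
    """Stand-in for a real database connection (hypothetical)."""

    def __init__(self):
        self.disconnect_sent = False

    def disconnect(self):
        self.disconnect_sent = True


class ForkSafeHandle:
    """Wrap a connection and remember which process opened it."""

    def __init__(self, conn):
        self.conn = conn
        self.owner_pid = os.getpid()
        self.inactive_destroy = False

    def close(self, current_pid=None):
        # current_pid is only a testing hook; normally we ask the OS.
        pid = os.getpid() if current_pid is None else current_pid
        if pid != self.owner_pid:
            # We are a forked child: drop our reference only.
            self.inactive_destroy = True
        if not self.inactive_destroy:
            self.conn.disconnect()
        self.conn = None
```

Putting the check in one close path mirrors the argument above: there is a single place to get right, rather than one per reconnect site plus an END block.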
      
      
      Also, add a note about patching Mysql.pm in doc/UPDATING.
  3. 10 Sep, 2006 1 commit
    • Leigh B. Stoller's avatar
      The bulk of this commit adds the ability to run the program agent on ops · e8bb6bca
      Leigh B. Stoller authored
      so that users can schedule program events to run there. For example:
      
      	set myprog [new Program $ns]
      	$myprog set node "ops"
      	$myprog set command "/usr/bin/env >& /tmp/foo"
      
      	$ns at 10 "$myprog start"
      or
      	tevc -e pid/eid now myprog start
      
      Since the program agent cannot talk to tmcd from ops, there are new
      routines to create the config files that the program agent uses, in
      the experiment tbdata directory.
      
      I also rewrote the eventsys.proxy script that starts the event
      scheduler on ops; I rolled the startup of the program agent into this
      script, via a new -a option which is passed over from boss when an ops
      program agent is detected in the virt topology. This keeps the number
      of new processes on ops to a small number.
      
      Also part of the above rewrite is that we now catch when the event
      scheduler (or the program agent) exits abnormally, sending email to
      tbops and the swapper of the experiment. We have been seeing abnormal
      exits of the scheduler a...
  4. 07 Sep, 2006 1 commit
    • Leigh B. Stoller's avatar
      Some changes to how log files are handled; this took way too long to · c01f7b3e
      Leigh B. Stoller authored
      do!
      
      The original operation was to save up every log file forever in the
      work directory, and copy that out to both the user directory and the
      info directory (long term archive). When I cleaned /proj on ops
      yesterday of all this old cruft, I recovered 17GB of disk space. Yow!
      
      So, the new operation is:
      
      * Only files that end in .log are copied to the user directory. No
        longer copying out .top, .ptop, and a couple of other logs; 99% of
        users never look at these things. We still have them available to us
        though, on boss.
      
      * At the beginning of each swap operation, clean out the work
        directory of all the old log files. These are named a variety of
        ways, so I use some pattern matches to do this.
      
      * Jigger the names a little so that we do not name things in the form
        "$$.log", to avoid copying out different named files to the user
        directory each time; instead link the .log file to the real output
        file so that it gets overwritten eac...
  5. 31 Aug, 2006 1 commit
    • Kevin Atkinson's avatar
      · 964b8d11
      Kevin Atkinson authored
      Add patch to modify Mysql.pm to allow setting the "InactiveDestroy" in
      the underlying DB handle.  Also avoid disconnecting the file handle
      explicitly on DESTROY as that will be taken care of in the DESTROY
      method for the DB handle.
      
      Override perl version of fork() to set InactiveDestroy in all open
      database handles in the child so that it won't send a disconnect when
      the handle is destroyed as this will also close the database handle
      for the parent.  It will also call tblog_new_child_process in the
      child process to properly inform tblog of the new process. This will
      be a NoOp if the libtblog module is not loaded.
  6. 18 Aug, 2006 1 commit
    • Kirk Webb's avatar
      · 5c798e51
      Kirk Webb authored
      Left in debugging values by accident...
  7. 17 Aug, 2006 1 commit
    • Kirk Webb's avatar
      · f1fa5a51
      Kirk Webb authored
      New plab vnode monitor framework, now with proactive node checking action!
      
      The old monitor has been completely replaced.  The new one uses modular pools
      to test and track plab nodes.  There are currently two pool modules:
      good and bad.  The good pool tests nodes that are not known to have
      issues to proactively find problems and push nodes into the "bad" pool
      when necessary.  The bad pool acts similarly to the old plabmonitor; it
      does an end-to-end test on nodes, and if and when they finally come up,
      moves them to the good pool.  Both pools have a testing backoff mechanism
      that works as follows:
      
        * The node is tested right away upon entering either pool
        * Node fails to setup:
          * goodpool: node is sent to bad pool (hwdown)
          * badpool:  node is scheduled to be retested according to
                      an additive backoff function, maxing out at 1 hour.
        * Node setup succeeds:
          * goodpool: node is scheduled to be retested according to
                      an additive backoff function, maxing out at 1 hour.
          * badpool:  node is moved to good pool.
      
      The backoff thing may be bogus, we'll see.  It seems like a reasonable thing
      to do though - no need to hammer a node with tests if it consistently
      succeeds or fails.  Nodes that flop back and forth will get the most
      testing punishment.  A future enhancement will be to watch for flopping
      and force nodes that exhibit this behavior to pass several consecutive
      tests before being eligible for return back into the good pool.
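
The pool transitions and backoff listed above can be written down as a small state function. A sketch in Python, not the monitor's Perl code: the 1 hour cap is from the description, while the 300 second step and the function names are assumptions.

```python
MAX_INTERVAL = 3600      # cap stated above: 1 hour
STEP = 300               # assumed increment per repeated result (seconds)


def next_interval(current, step=STEP, cap=MAX_INTERVAL):
    """Additive backoff: grow the retest interval by a fixed step, up to cap."""
    return min(current + step, cap)


def handle_result(pool, succeeded, interval):
    """Return (new_pool, new_interval) after one test of a node."""
    if pool == "good":
        if succeeded:
            return "good", next_interval(interval)
        return "bad", 0          # tested right away on entering the pool
    if succeeded:
        return "good", 0
    return "bad", next_interval(interval)
```

A node that keeps flopping resets its interval to zero on every pool change, which is why such nodes get the most testing.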
      
      The monitor only allows a configurable window's worth of outstanding
      tests to go on at once.  When tests finish, more node tests are allowed
      to start up right away.
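
One way to picture the window, as a Python sketch with hypothetical names (`OutstandingWindow`, `submit`, `finish`); the monitor's actual bookkeeping is not shown in this commit message.

```python
from collections import deque


class OutstandingWindow:
    """Allow at most `size` tests to be outstanding at once."""

    def __init__(self, size):
        self.size = size
        self.pending = deque()     # nodes waiting for a slot
        self.running = set()       # nodes currently under test

    def submit(self, node):
        """Queue a node for testing; return the nodes started as a result."""
        self.pending.append(node)
        return self._fill()

    def finish(self, node):
        """Record a finished test; return the nodes started in its place."""
        self.running.discard(node)
        return self._fill()

    def _fill(self):
        """Start queued tests until the window is full."""
        started = []
        while self.pending and len(self.running) < self.size:
            node = self.pending.popleft()
            self.running.add(node)
            started.append(node)
        return started
```

Refilling from `finish` is what lets new tests start right away instead of waiting for a periodic scan.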
      
      Some refactoring needs to be done.  Currently the good and bad pools share
      quite a bit of duplicated code.  I don't know if I dare venture into
      inheritance with perl, but that would be a good way to approach this.
      
      Some other pool module ideas:
      
      * dynamic setup pools
      
      When experiments w/ plab vnodes are swapped in, use the plab monitor to
      manage setting up the vnodes by dynamically creating pools on a per-experiment
      basis.  This has the advantage that the monitor can keep a global cap on
      the number of outstanding setup operations.  These pools might also try to
      bring up vnodes that failed to setup during swapin later on, along with other
      vnode monitoring tasks.
      
      * "all nodes" pools
      
      Similar to the dynamic pools just mentioned, but with the mission to extend
      experiments to all plab nodes possible (as nodes come and go).  Useful for
      services.
  8. 14 Aug, 2006 1 commit
    • Kevin Atkinson's avatar
      · 07dda0d8
      Kevin Atkinson authored
      Prep for Mike Kasick report code.  Updated database schema and
      installed hooks for his code.
      
      Cleaned up how errors were handled in tblog(...).
      
      Allow SENDMAIL to be called before the path is untainted in '-T' scripts.
      
      Other small changes.
  9. 07 Aug, 2006 1 commit
    • Kirk Webb's avatar
      · d6cf21fe
      Kirk Webb authored
      Add plab_mapping to the list of tables containing physical node data.
  10. 20 Jul, 2006 1 commit
    • Leigh B. Stoller's avatar
      This started out as: · 5dfa4a47
      Leigh B. Stoller authored
      * add an "active" flag to the template record, which will be used by
        the user to indicate what templates he wants listed (rather than the
        roots). Basically, the current working templates, rather than a big
        graph.
      
      But I never actually finished that cause it sorta morphed into:
      
      * Added a vis_graphs table to cache the last generated visualization
        rendering in the database so that we do not have to wait so long for:
      
      * Add new buttons to showexp and template_show pages, to display in the
        same page either the settings (current view), the NS file, or the
        visualization (along with zoom in/out buttons).
      
      And now I can go back to that "active" thing I mentioned up above ...
  11. 18 Jul, 2006 2 commits
    • Leigh B. Stoller's avatar
      fdcf08df
    • Leigh B. Stoller's avatar
      Changes necessary for moving most of the stuff in the node_types · 624a0364
      Leigh B. Stoller authored
      table, into a new table called node_type_attributes, which is intended
      to be a more extensible way of describing nodes.
      
      The only things left in the node_types table will be type,class and the
      various isXXX boolean flags, since we use those in numerous joins all over
      the system (ie: when discriminating amongst nodes).
      
      For the most part, all of that other stuff is rarely used, or used in
      contexts where the information is needed, but not for type discrimination.
      Still, it made for a lot of queries to change!
      
      Along the way I added a NodeType library module that represents the type
      info as a perl object. I also beefed up the existing Node module, and
      started using it in more places. I also added an Interfaces module, but I
      have not done much with that yet.
      
      I have not yet removed all the slots from the node_types table; I plan to
      run the new code for a few days and then remove the slots.
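
To illustrate the table split, here is a self-contained Python/sqlite3 sketch. The table names are from this message; the column names (`attrkey`, `attrvalue`, `isremotenode`) and the sample rows are assumptions for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Column names and sample rows are assumptions for this sketch.
    CREATE TABLE node_types (
        type TEXT PRIMARY KEY, class TEXT, isremotenode INTEGER);
    CREATE TABLE node_type_attributes (
        type TEXT, attrkey TEXT, attrvalue TEXT,
        PRIMARY KEY (type, attrkey));
    INSERT INTO node_types VALUES ('pc850', 'pc', 0);
    INSERT INTO node_type_attributes VALUES ('pc850', 'processor', 'PIII');
    INSERT INTO node_type_attributes VALUES ('pc850', 'frequency', '850');
""")


def type_attributes(conn, node_type):
    """Return all attributes of a node type as a dict."""
    rows = conn.execute(
        "SELECT attrkey, attrvalue FROM node_type_attributes WHERE type = ?",
        (node_type,))
    return dict(rows.fetchall())
```

Adding a new kind of attribute then means inserting rows rather than altering the node_types schema, while joins on type, class and the boolean flags stay as they were.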
      
      Example using the new NodeType object:
      
      	use NodeType;
      
      ...
  12. 05 Jul, 2006 1 commit
  13. 03 Jul, 2006 1 commit
    • Mike Hibler's avatar
      Framework for supporting 802.1q tagged VLANs as a form of multiplexed link. · 3f1c15e2
      Mike Hibler authored
      Actually, most of the changes here were just to generalize the "virtual
      interface" state in the DB.  Other than the client-side scripts, there
      is very little here specific to tagged VLANs.
      
      In fact, you cannot specify "vlan" as a type yet as we haven't done the
      snmpit support for setting up the switches.
      
      For more info see bas:~mike/flux/doc/testbed-virtinterfaces.txt, which I
      will integrate into the knowledge base and the Emulab doc at some point.
  14. 22 May, 2006 1 commit
  15. 15 May, 2006 1 commit
    • Mike Hibler's avatar
      Initial "Inner Plab" support. In your NS file, you declare one node: · 9512772e
      Mike Hibler authored
      tb-set-node-plab-role $plc plc
      
      to make it the PLC node.  Then any number of other nodes are declared as:
      
      tb-set-node-plab-role $plab1 node
      
      to make them inner plab nodes.  Unlike elabinelab, there is no magic
      "tb-plab-in-elab" command which implies the topology, you put all the
      plab nodes in a LAN or whatever yourself.  This may or may not be a good idea.
      
      Anyway, these NS commands set DB state in virt_nodes and reserved much like
      elabinelab.  During swapin, the dhcpd.conf file is rewritten so that
      inner plab nodes have their "filename" set to "pxelinux.0" and their
      "next-server" set to the designated PLC node.  The PLC node will then be
      loaded/booted before anything is done to the inner-plab nodes.  After
      it comes up, the inner plab nodes are rebooted and declared as up.
      There is a new tmcd command "eplabconfig" (suggestions for a new name
      welcome!), which returns info like:
      
          NAME=plc ROLE=plc IP=155.98.36.3 MAC=00d0b713f57d
          N...
  16. 05 May, 2006 1 commit
  17. 02 May, 2006 1 commit
  18. 27 Apr, 2006 1 commit
    • Leigh B. Stoller's avatar
      Change the handling for when mysqld goes away (CR_SERVER_LOST || · 8aa8098d
      Leigh B. Stoller authored
      CR_SERVER_GONE_ERROR). Instead of bailing, sit and loop trying to
      reconnect, given that this is known to be a transient error, and we do
      not really want our daemons to go belly up during that brief time when
      the watchdog is getting it restarted. The query is then resent.
      
      For the perl version of this change, I was a lot more pedantic since
      we use this library from a zillion places. Also, there is some special
      handling cause of the mysqld watchdog which would become useless if
      the test query hung trying to reconnect to the server forever.
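
The reconnect-and-resend behavior can be sketched as below. This is Python rather than the library's Perl or PHP, and `query_with_retry`, `DBError`, and the `max_tries` parameter are hypothetical; the bounded-retry option is only a guess at what the watchdog special case needs. The two error numbers are the standard MySQL client codes.

```python
import time

CR_SERVER_GONE_ERROR = 2006   # MySQL client error codes
CR_SERVER_LOST = 2013
TRANSIENT = {CR_SERVER_GONE_ERROR, CR_SERVER_LOST}


class DBError(Exception):
    """Hypothetical exception carrying the client error number."""

    def __init__(self, errno):
        super().__init__(f"database error {errno}")
        self.errno = errno


def query_with_retry(run_query, reconnect, max_tries=None, delay=1.0,
                     sleep=time.sleep):
    """Run a query; on a lost connection, reconnect and resend it.

    max_tries=None loops forever, which suits a daemon; a watchdog's
    test query would pass a small max_tries so that it cannot hang.
    """
    tries = 0
    while True:
        try:
            return run_query()
        except DBError as err:
            if err.errno not in TRANSIENT:
                raise                     # a real error: do not mask it
            tries += 1
            if max_tries is not None and tries >= max_tries:
                raise
            sleep(delay)
            reconnect()
```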
      
      As a side effect of this change, we should see way less email when
      mysqld goes catatonic since the new code will just loop instead of
      generating tons of errors.
      
      Might actually increase overall robustness. On the other hand, could
      end up being a total disaster!
  19. 30 Mar, 2006 1 commit
  20. 28 Mar, 2006 1 commit
  21. 21 Mar, 2006 1 commit
    • Kevin Atkinson's avatar
      · d258dde6
      Kevin Atkinson authored
      Changed format of email sent to user on errors.  The error will now
      appear instead of the generic message when I am confident it is
      accurate.  The subject line will also change to reflect the cause of
      an error.
      
      Avoid sending mail to testbed-ops during failed swap related events
      in some cases.  It will instead be sent to a new mailing list
      testbed-errors.
      
      Added a new row in the experiment info table "Last Error:" which
      states the cause of the error, and links to a new page displaying the
      error.
      
      Made some assign/assign_wrapper errors more informative.
      
      The error (as determined by tblog) is now stored in the database in a
      more structured fashion.  This includes adding a column for the session
      (in the log table) to testbed_stats to link each swap event with the
      logs and possibly the error.
      
      Other changes to the database, see sql/database-migrate.txt
  22. 15 Mar, 2006 1 commit
  23. 15 Feb, 2006 1 commit
    • David Johnson's avatar
      * Makeconf.in, configure, configure.in, defs-default, defs-johnsond-emulab: · 4982b9cd
      David Johnson authored
          - added a new defs var, TBROBOCOPSEMAIL
      
        * tbsetup/power_mail.pm.in:
          - add some new info to robot powerup mails
      
        * db/libdb.pm.in:
          - add a new function to determine if an experiment contains nodes of a
            given class/type
      
        * tbsetup/swapexp.in:
          - check if exp is a robot exp; that is, if it has robots or motes; if
            so, cc error msgs to TBROBOCOPSEMAIL in addition to TBOPS
  24. 07 Feb, 2006 1 commit
  25. 02 Feb, 2006 1 commit
    • Timothy Stack's avatar
      · 11f6065f
      Timothy Stack authored
      Finish off changes to consult the os_boot_cmd table when setting
      def_boot_cmd_line.
      
      	* db/libdb.pm.in: Add TBGetOSBootCmd function that returns the
      	boot command line for an osid/role.
      
      	* sql/database-fill-supplemental.sql: Add some queries that
      	initialize the os_boot_cmd table.
      
      	* tbsetup/assign_wrapper.in: For nodes that need linkdelays or
      	inner elab boss nodes, set their command line using
      	TBGetOSBootCmd.  Since the TBGetOSBootCmd function needs a "real"
      	OSID to work, we resolve the nextosid for any meta OSIDs.
      
      	* tbsetup/elabinelab.in: Add os_boot_cmd to the list of full
      	tables to be dumped.
      
      	* tbsetup/ns2ir/elabinelab-withfsnode.ns,
      	tbsetup/ns2ir/elabinelab.ns: Remove boot command lines,
      	assign_wrapper does it instead.
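
The lookup described for TBGetOSBootCmd, including resolving meta OSIDs through nextosid first, might look like this in outline. A Python sketch with made-up data: the dictionaries stand in for database tables, and every OSID, role value and command line shown is a placeholder.

```python
# Placeholder data standing in for the OS info and os_boot_cmd tables.
OS_INFO = {
    "META-OS": {"nextosid": "REAL-OS"},   # a meta OSID pointing at a real one
    "REAL-OS": {"nextosid": None},
}
OS_BOOT_CMD = {
    ("REAL-OS", "linkdelay"): "/boot/kernel.example",
}


def resolve_osid(osid, os_info):
    """Follow nextosid links until a real (non-meta) OSID is reached."""
    seen = set()
    while os_info[osid]["nextosid"] is not None:
        if osid in seen:
            raise ValueError(f"nextosid loop at {osid}")
        seen.add(osid)
        osid = os_info[osid]["nextosid"]
    return osid


def get_os_boot_cmd(osid, role, os_info, boot_cmds):
    """Return the boot command line for an osid/role pair, or None."""
    return boot_cmds.get((resolve_osid(osid, os_info), role))
```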
  26. 30 Jan, 2006 1 commit
  27. 26 Jan, 2006 1 commit
    • Kevin Atkinson's avatar
      · 05015359
      Kevin Atkinson authored
      Merged in changes from tblog-2-branch:
      
                Move parts of libtblog into libtblog_simple.  libtblog_simple
                provides the basic logging functions but doesn't touch anything.
                Moreover including libtblog_simple doesn't automatically start
                the logging subsystem.  It also doesn't have testbed dependencies
                which means 1) it can be used in the core testbed libraries (such
                as libdb, libtestbed) without introducing a circular dependency
                and 2) can be used independently.
      
                Reworked DBFatal and DBWarn to use tblog.  It will still email
                testbed-ops, however.
      
                Make use of the "cause" field to determine the cause of the bug.
                In particular tblog_find_error will look at the value of this
                field and report the "cause".  In the future different actions
                can be taken based on the ultimate "cause" of the bug, such as if
                testbed-ops should be notified.
      
                Change format of Error Message reported by libtblog.  As per the
                email "Format or Error Messages" to testbed-dev.
      
                Have libtblog use its own Database handle to avoid problems with
                locked tables.
      
                Also set DBCONN_MAXTRIES to 3 for most important queries.  For
                queries that are not important don't send mail on error.
  28. 17 Jan, 2006 1 commit
  29. 27 Dec, 2005 1 commit
    • Mike Hibler's avatar
      More tightly connect the notion of a firewall and the security level. · f1206314
      Mike Hibler authored
      If you specify an explicit firewall, you are implicitly assigned security
      level 2 and you cannot explicitly specify the security level.  Likewise,
      if you specify a security level, you cannot also specify a firewall.
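
Stated as code, the rule is a small validation step. A Python sketch with a hypothetical function name; treating "neither specified" as level 0 is an assumption, not something this message states.

```python
FIREWALL_SECURITY_LEVEL = 2   # level implied by an explicit firewall


def effective_security_level(has_firewall, security_level=None):
    """Apply the rule: a firewall and a security level are mutually exclusive."""
    if has_firewall and security_level is not None:
        raise ValueError("cannot specify both a firewall and a security level")
    if has_firewall:
        return FIREWALL_SECURITY_LEVEL
    # Assumed default when the experiment specifies neither.
    return 0 if security_level is None else security_level
```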
      
      The reason for this is that security level 1 (aka "Blue") now has a slightly
      different meaning.  It is intended for protecting the inside from the outside
      rather than vice versa.  The only practical implication of this is that for
      level 1, we don't do all the fancy power-off-boot-into-MFS-zapbootblock stuff
      that we do for higher levels.
      
      Anyway, I wanted to make sure that if you specify your own firewall, you
      DO have to go through the full cleansing swapout since we can't trust a
      firewall that the Average Joe sets up.
  30. 21 Dec, 2005 1 commit
  31. 19 Dec, 2005 1 commit
    • Leigh B. Stoller's avatar
      Add support for moving deleted users to a deleted users table. This · b4231fbf
      Leigh B. Stoller authored
      would be no big deal, except that we want to retain user_stats for
      deleted users, and rather than a deleted_user_stats table, I want to
      retain stats for deleted users in the user_stats table, since that
      is a more natural place for them.
      
      The main problem is that we use the login (uid) as the cross table
      reference slot all over the DB, which is fundamentally incorrect, if
      we want to be able to reuse uids and still know what historical data
      refers to.
      
      So, I have taken a few baby steps towards weaning us off the uid, and
      towards a permanently unique key for users, using the unix_uid integer
      for now, but probably something slightly different later.
      
      The user_stats table is now indexed on this new key (called uid_idx in the
      user_stats table) instead of the plain uid.
      
      The unix_uid slot in the users table is no longer an auto_increment
      field, but instead uses the emulab_indicies table for the next
      available index.
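
Why a permanent index matters for reusing logins can be shown with a small Python/sqlite3 sketch. The table names user_stats and emulab_indicies and the uid_idx column are from this message; the other columns, the counter name, and the helper functions are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Columns other than uid_idx are assumptions for this sketch.
    CREATE TABLE emulab_indicies (name TEXT PRIMARY KEY, idx INTEGER);
    CREATE TABLE users (uid TEXT PRIMARY KEY, uid_idx INTEGER UNIQUE);
    CREATE TABLE user_stats (
        uid_idx INTEGER PRIMARY KEY, uid TEXT, swapins INTEGER);
    INSERT INTO emulab_indicies VALUES ('next_uid', 10000);
""")


def next_index(conn, name):
    """Hand out the next value of a named counter and advance it."""
    (idx,) = conn.execute(
        "SELECT idx FROM emulab_indicies WHERE name = ?", (name,)).fetchone()
    conn.execute(
        "UPDATE emulab_indicies SET idx = idx + 1 WHERE name = ?", (name,))
    return idx


def add_user(conn, uid):
    """Create a user with a fresh index and an empty stats row."""
    idx = next_index(conn, "next_uid")
    conn.execute("INSERT INTO users VALUES (?, ?)", (uid, idx))
    conn.execute("INSERT INTO user_stats VALUES (?, ?, 0)", (idx, uid))
    return idx


def delete_user(conn, uid):
    """Remove the login but keep its user_stats row."""
    conn.execute("DELETE FROM users WHERE uid = ?", (uid,))
```

If a login is deleted and later handed to someone else, the two people get different uid_idx values, so their rows in user_stats never collide.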
  32. 15 Dec, 2005 1 commit
  33. 14 Dec, 2005 1 commit
  34. 12 Dec, 2005 1 commit
  35. 07 Dec, 2005 1 commit
  36. 06 Dec, 2005 1 commit
    • Mike Hibler's avatar
      Phase II in disk state saving for swapout. · ed0d25b4
      Mike Hibler authored
      Exec summary: after this checkin, the infrastructure exists (once enabled)
      to create swapout-time "delta" images for all machines in experiments.
      There is only a single, cumulative swap image per node (i.e., all diffs
      are from the base image, not from the previous swap).
      
      What doesn't yet exist, is the mechanism for reloading the delta at
      swapin time.  That is Phase III.
      
      The nitty-gritty:
      
      1. Keep disk image signature files for all nodes in an experiment.
      
         New fields in the DB to track, for each disk partition, what image the
         partition was loaded from.  This enables us at swapin or os_load time to
         create signature files in /proj/<pid>/exp/<eid>/swapinfo for the current
         contents of a node disk/partition.  All nodes with the same image loaded
         will share (via symlink) the same signature file.  TODO: no longer
         referenced signature files should be removed.
      
         Signature info is only collected in the swapinfo directory if the
         experiment is set to have disk state saving enabled (see #5 below).
         Info consists of the <vname>.sig file, which is the file created
         by imagehash, and <vname>.part which says what the root disk is
         for the node and whether to look at the whole disk or just a single
         partition when crafting the delta image.
      
      2. Swapout-time hook for creating swapout image.
      
         If the experiment is marked as allowing disk state saving, tbswap
         will arrange to run and then monitor the create-swapimage command
         on each node.  This script will run the modified version of imagezip
         which uses the signature file to create a delta image.
      
         The command to run and maximum timeout are specified via sitevars
         (previously checked in).  Note that the tbswap script currently has
         special knowledge of /usr/local/bin/create-swapimage as a swapout
         time script.  If the swap/swapout_command sitevar is set to that,
         Magic Stuff shall occur (i.e. it will monitor the command and make
         periodic reports of progress).  The sitevars are a total hack and
         will disappear at some point.
      
      3. Client-side script for creating swapout image.
      
         os/create-swapimage, very similar to create-image.  Uses the info
         stashed in /proj/..blahblah../swapinfo to create a delta image.
      
         XXX fer now hack: the script first looks in /proj/<pid>/bin for an
         imagezip binary to use.  Failing that, it uses the one in the MFS.
         This allows for easier development of the imagezip changes (i.e.,
         don't have to update the MFS every time).
      
      4. Auto creation of signature files for new images.
      
         The create_image script (the one that runs on boss when creating images
         for users) has been modified to automatically create a signature via
         imagehash.  The .sig file winds up in /usr/testbed/images/sigs or
         in /proj/<pid>/images/sigs.  From there it will be copied at swapin/os_load
         time to the per-expt swapinfo directory for any node that uses the images.
      
         The process for creating standard system images (aka, "Mike") has not
         yet been modified.  When the image creation/installation procedure
         is formalized into a script, this will be done.
      
      5. Web changes to set/clear saving of disk state at swapout time.
      
         Add a checkbox to the experiment create page to allow setting "save
         swap state".  Also added to the experiment modify page, but currently
         "if (0)"ed out as it will need some additional support.  The showstuff
         page will show it.
      
         Taking a page from Leigh's hack book, if EXPOSESTATESAVE in defs.php3
         is set to zero (as it is now), then the checkbox doesn't appear in the
         create experiment page except for STUDLY users.
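
The signature-plus-delta idea from items 1 and 2 can be illustrated in a few lines of Python. This is only a sketch of the general technique: the real imagehash and imagezip formats are not described here, and the fixed 4096 byte block size, SHA-1 hashing, and all function names are assumptions. `apply_delta` is included just to show that a delta against the base image is enough to rebuild the disk; the actual reload mechanism is Phase III and does not exist yet.

```python
import hashlib

BLOCK_SIZE = 4096   # assumed granularity for this sketch


def make_signature(image, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the base image."""
    return [hashlib.sha1(image[off:off + block_size]).hexdigest()
            for off in range(0, len(image), block_size)]


def make_delta(disk, signature, block_size=BLOCK_SIZE):
    """Return {offset: data} for blocks whose hash differs from the signature."""
    delta = {}
    for i, off in enumerate(range(0, len(disk), block_size)):
        block = disk[off:off + block_size]
        changed = (i >= len(signature)
                   or hashlib.sha1(block).hexdigest() != signature[i])
        if changed:
            delta[off] = block
    return delta


def apply_delta(base, delta):
    """Rebuild the disk contents from the base image plus a delta."""
    out = bytearray(base)
    for off, block in delta.items():
        out[off:off + len(block)] = block
    return bytes(out)
```

Because every delta is computed against the base image's signature, the result is cumulative, matching the single swap image per node described above.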
  37. 29 Nov, 2005 1 commit
  38. 17 Nov, 2005 2 commits
    • Mike Hibler's avatar
      Minor fixes: add another level of panic that we set when swapout fails. · 32560429
      Mike Hibler authored
      Produces a different message in the web page.
      
      Also fix up a couple of minor firewalled elabinelab issues.
    • Mike Hibler's avatar
      1. Beef up "admin mode" support. · 4ec701e7
      Mike Hibler authored
      * Add libadminmfs.pm with routines for entering/exiting and executing
        commands in, the admin MFS.  Node admin and firewall swapout (see
        below) now use this, the image creation process does not yet.
      
      * Add swapout time hooks for running an admin mode process, likely to
        be used to collect swapout time state.  Currently controlled globally
        by two new sitevars.
      
      * Modified node_admin to use the library and added a "-c <command>"
        option to have nodes go into admin mode and run a command.  I don't
        really expect this to be useful, it was just a testing vehicle for
        the library.
      
      2. Improved the swapout process for firewalled experiments.  Largely
         just generalized what we already did for panicked experiments.
         At swapout, firewalled nodes are:
      
         - powered off
         - set to boot into admin mode and run a disk zapper
         - powered on
      
        The swapout process then waits for all nodes to successfully complete
        disk zapage, at which point the nodes are nfree'ed as usual....