  1. 20 Oct, 2006 1 commit
    • Mike Hibler authored · afa5e919
      Wow, this should make me look important!
      Two-day boondoggle to support "/scratch", an optional large, shared filesystem
      for users.  To do this, I needed to find all the instances where /proj is used
      and behave accordingly.  The boondoggle part was the decision to gather up all
      the hardwired instances of shared directory names ("/proj", "/users", etc.)
      so that they are set in a common place (via unexposed configure variables).
      This is a boondoggle because:
      
      1. I didn't change the client-side scripts.  They need a different mechanism
         (e.g., tmcd) to get the info; configure is the wrong way.
      
      2. Even if I had done #1 it is likely--no, certain--that something would
         fail if you tried to rename "/proj" to be "/mike".  These names are just
         too ingrained.
      
      3. We may not even use "/scratch" as it turns out.
      
      Note, I also didn't fix any of the .html documentation.  Anyway, it is done.
      To maintain my illusion in the future you should:
      
      1. Have perl scripts include "use libtestbed" and use the defined PROJROOT(),
         et al. functions where possible.  If that is not possible, make sure they run
         through configure and use @PROJROOT_DIR@, etc.
      
      2. Use the configure method for python, C, php and other languages.
      
      3. There are perl (TBValidUserDir) and php (VALIDUSERPATH) functions which
         you should call to determine if an NS, template parameter, tarball or
         other file is in "an acceptable location."  Use these functions where
         possible.  They know about the optional "scratch" filesystem.  Note that
         the perl function is over-engineered to handle cases that don't occur
         in nature.
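As an illustration of item 2 and the "acceptable location" check in item 3, here is a minimal C sketch. It assumes configure substitutes @PROJROOT_DIR@ and similar variables into the source; the USERSROOT_DIR and SCRATCHROOT_DIR names, their fallback values, and valid_user_path() itself are hypothetical, not the actual TBValidUserDir or VALIDUSERPATH logic. It only matches whole path components and does not resolve symlinks.

```c
#include <stdbool.h>
#include <string.h>

/* In a configured C source these would be "@PROJROOT_DIR@" and friends,
 * filled in by configure. The fallback values and the USERSROOT_DIR and
 * SCRATCHROOT_DIR names are assumptions for this sketch. */
#ifndef PROJROOT_DIR
#define PROJROOT_DIR    "/proj"
#endif
#ifndef USERSROOT_DIR
#define USERSROOT_DIR   "/users"
#endif
#ifndef SCRATCHROOT_DIR
#define SCRATCHROOT_DIR "/scratch"
#endif

/* True if path is root itself or lies beneath it ("/projects" is not under "/proj"). */
static bool under_root(const char *path, const char *root)
{
    size_t n = strlen(root);

    return strncmp(path, root, n) == 0 && (path[n] == '\0' || path[n] == '/');
}

/* Hypothetical check: is an NS file, tarball, etc. in an acceptable location?
 * The optional scratch filesystem only counts if the site has one. */
bool valid_user_path(const char *path, bool have_scratch)
{
    /* Require an absolute path and reject any "..", even in harmless names;
     * over-strict, but it rules out escapes like "/proj/../etc/passwd". */
    if (path == NULL || path[0] != '/' || strstr(path, "..") != NULL)
        return false;
    if (under_root(path, PROJROOT_DIR) || under_root(path, USERSROOT_DIR))
        return true;
    return have_scratch && under_root(path, SCRATCHROOT_DIR);
}
```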
  2. 02 Oct, 2006 1 commit
  3. 25 Aug, 2006 1 commit
    • Leigh B. Stoller authored · 73102ef8
      Add support for dynamic registration of ports on experimental nodes so
      that clients and servers can avoid using hardwired ports on those
      experimental nodes. I have added the following tmcd operation:
      
      	tmcc portregister <service> [<port>]
      
      where we assume it's the control network IP (from the DB) and the pid/eid
      of the experiment the node belongs to. The given port is entered into
      the port_registration table for the experiment, using the service as the
      tag. Supplying port=0 clears the registration from the table.
      
      When called like:
      
      	tmcc portregister <service>
      
      we return the registered port, or nothing.
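The register/clear/lookup semantics described above can be modeled with a small in-memory table. This is a hypothetical C sketch: the real port_registration table lives in the database and also records the node's control network IP (omitted here), and a return of 0 stands in for "nothing" on lookup.

```c
#include <stdio.h>
#include <string.h>

#define MAX_REGS 64

/* One row of a hypothetical in-memory stand-in for port_registration. */
struct portreg {
    char pid[32], eid[32], service[32];
    int  port;                       /* 0 marks an unused slot */
};

static struct portreg table[MAX_REGS];

static struct portreg *find_reg(const char *pid, const char *eid, const char *svc)
{
    for (int i = 0; i < MAX_REGS; i++)
        if (table[i].port != 0 && strcmp(table[i].pid, pid) == 0 &&
            strcmp(table[i].eid, eid) == 0 &&
            strcmp(table[i].service, svc) == 0)
            return &table[i];
    return NULL;
}

/* "tmcc portregister <service> <port>" for the node's pid/eid. */
int portregister(const char *pid, const char *eid, const char *svc, int port)
{
    struct portreg *r = find_reg(pid, eid, svc);

    if (port < 0 || port > 65535)
        return -1;
    if (port == 0) {                 /* port=0 clears the registration */
        if (r != NULL)
            r->port = 0;
        return 0;
    }
    for (int i = 0; r == NULL && i < MAX_REGS; i++)
        if (table[i].port == 0)
            r = &table[i];
    if (r == NULL)
        return -1;                   /* table full */
    snprintf(r->pid, sizeof r->pid, "%s", pid);
    snprintf(r->eid, sizeof r->eid, "%s", eid);
    snprintf(r->service, sizeof r->service, "%s", svc);
    r->port = port;
    return 0;
}

/* "tmcc portregister <service>": the registered port, or 0 for "nothing". */
int portlookup(const char *pid, const char *eid, const char *svc)
{
    struct portreg *r = find_reg(pid, eid, svc);

    return r != NULL ? r->port : 0;
}
```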
      
      I hacked up a little C library module in libtb so that there is something
      that looks like a C interface to this:
      
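      	/* Register port for service; a port of 0 clears the registration. */
      	int
      	PortRegister(char *service, int port);
      
      	/* Look up service; fills in hostname (at most namelen bytes) and port. */
      	int
      	PortLookup(char *service, char *hostname, int namelen, int *port);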
      
      The above routines call out to tmcc of course.
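A rough idea of how a PortRegister-style routine might call out to tmcc, sketched in C. The function names, a bare "tmcc" found on the PATH, and discarding tmcc's output are assumptions; the real libtb code may differ, and parsing the lookup reply is left out because its format is not described here. Service names are restricted to simple tags so they are safe to pass through popen()'s shell.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed tmcc invocation; adjust for the local install. */
#define TMCC_PATH "tmcc"

/* Allow only simple service tags so the name is safe on a shell command line. */
static int valid_service(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_' && *s != '-')
            return 0;
    return 1;
}

/* Build "tmcc portregister <service> [<port>]"; port < 0 selects the lookup
 * form. Returns 0 on success, -1 on a bad name or a too-small buffer. */
int portregister_cmd(char *buf, size_t len, const char *service, int port)
{
    int n;

    if (!valid_service(service))
        return -1;
    if (port >= 0)
        n = snprintf(buf, len, "%s portregister %s %d", TMCC_PATH, service, port);
    else
        n = snprintf(buf, len, "%s portregister %s", TMCC_PATH, service);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical PortRegister-style wrapper: run tmcc, report its exit status. */
int port_register_sketch(const char *service, int port)
{
    char cmd[256];
    FILE *fp;

    if (port < 0 || portregister_cmd(cmd, sizeof cmd, service, port) < 0)
        return -1;
    if ((fp = popen(cmd, "r")) == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        ;                            /* any output is discarded in this sketch */
    return pclose(fp) == 0 ? 0 : -1;
}
```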
      
      Lastly, I changed the sync server and client to use the new port
      registration, via the library calls above.
      
      There are other emulab services that need to be changed as well, but
      they can be done on an as needed basis.
  4. 16 Aug, 2006 1 commit
    • Kevin Atkinson authored · 9c5d3308
      - Added tbreport database schema (added three tables), storage for
        tbreport errors & context.
      
      - Modified fatal() in swapexp, batchexp, and tbprerun, and die_noretry()
        in os_setup to pass hash parameter to tblog functions.
      
      - Added tbreport error & context information for select errors in
        swapexp, tbswap, assign_wrapper2, snmpit_lib, snmpit, batchexp,
        assign_wrapper, os_setup, parse-ns, & tbprerun.
      
      - Added assign error parser in assign_wrapper2.
      
      - Added parse.tcl error parser in parse-ns.
      
      - Added severity constants for tbreport in libtblog_simple.
      
      - Added tbreport() function & context table mapping for reporting
        discrete error types to libtblog.
  5. 05 Jul, 2006 1 commit
    • Kevin Atkinson authored · 183040de
      Many changes to tblog code.  Database update needed:
      
      1) Added a summary of failed nodes in os_setup.  The cause of the error is now
      classified as "user" if it is only user images that failed and the user
      image failed on every pc of a particular type.  Otherwise I leave the cause
      as "unknown" since it is really hard to tell what the real cause is.
      
      2) Raised the confidence threshold for most errors so that they will appear
      on the top.
      
      3) Added a special error when an experiment is canceled.  The cause is
      "canceled" and testbed-ops won't see these errors.
      
      4) Fixed a bug in assign_wrapper where it would incorrectly report "This
      experiment cannot be instantiated on this testbed..." when really the user
      canceled the swapin.
      
      5) Fixed a bug where os_setup errors were being incorrectly reported as
      assign errors.  This happens when os_setup fails for some reason and
      tbswap tries again, but the second time around there are not enough nodes.
      So the last error is coming from assign even though the true cause of the
      error is due to failed nodes.  The fix for this involved adding a new column
      to the log table, "attempt", which will be 1 for the first attempt and then
      incremented for each new attempt.  tblog_find_error will then simply ignore
      any errors with "attempt > 1".
      
      6) Also fixed a potential problem when there is an error during the cleanup
      phase by adding another column "cleanup".  tblog_find_error will
      also ignore any errors with the cleanup bit set.
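Items 1, 5 and 6 boil down to two small rules, sketched here in C with hypothetical struct and function names (the actual tblog code is Perl working on database rows). Treating the most confident surviving error as the one to report is an assumption based on item 2, and the classification follows one reading of item 1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a tblog row; the real table has more columns. */
struct logrow {
    int         attempt;     /* 1 for the first swapin attempt, then 2, 3, ... */
    int         cleanup;     /* nonzero if logged during the cleanup phase */
    double      confidence;  /* assumed: higher means more likely the real cause */
    const char *cause;
};

/* Items 5 and 6: ignore retries and cleanup-phase errors, then report the
 * most confident remaining error (the earliest one wins a tie). */
const struct logrow *find_error(const struct logrow *rows, size_t n)
{
    const struct logrow *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (rows[i].attempt > 1 || rows[i].cleanup)
            continue;
        if (best == NULL || rows[i].confidence > best->confidence)
            best = &rows[i];
    }
    return best;
}

/* Item 1, as read here: blame the user only if every failed node ran a user
 * image and every node of that type loaded with that image also failed. */
struct osnode {
    const char *type, *image;
    int         user_image;  /* nonzero for a user (non-standard) image */
    int         failed;
};

const char *classify_failure(const struct osnode *nodes, size_t n)
{
    int any_failed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].failed)
            continue;
        any_failed = 1;
        if (!nodes[i].user_image)
            return "unknown";
        for (size_t j = 0; j < n; j++)
            if (!nodes[j].failed &&
                strcmp(nodes[j].type, nodes[i].type) == 0 &&
                strcmp(nodes[j].image, nodes[i].image) == 0)
                return "unknown";   /* same type and image booted fine elsewhere */
    }
    return any_failed ? "user" : "unknown";
}
```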
  6. 15 May, 2006 1 commit
    • Mike Hibler authored · 9512772e
      Initial "Inner Plab" support. In your NS file, you declare one node:
      
      tb-set-node-plab-role $plc plc
      
      to make it the PLC node.  Then any number of other nodes are declared as:
      
      tb-set-node-plab-role $plab1 node
      
      to make them inner plab nodes.  Unlike elabinelab, there is no magic
      "tb-plab-in-elab" command which implies the topology, you put all the
      plab nodes in a LAN or whatever yourself.  This may or may not be a good idea.
      
      Anyway, these NS commands set DB state in virt_nodes and reserved much like
      elabinelab.  During swapin, the dhcpd.conf file is rewritten so that
      inner plab nodes have their "filename" set to "pxelinux.0" and their
      "next-server" set to the designated PLC node.  The PLC node will then be
      loaded/booted before anything is done to the inner-plab nodes.  After
      it comes up, the inner plab nodes are rebooted and declared as up.
      There is a new tmcd command "eplabconfig" (suggestions for a new name
      welcome!), which returns info like:
      
          NAME=plc ROLE=plc IP=155.98.36.3 MAC=00d0b713f57d
          NAME=plab1 ROLE=node IP=155.98.36.10 MAC=0002b3877a4f
          NAME=plab2 ROLE=node IP=155.98.36.34 MAC=00d0b7141057
      
      to just the PLC node (returns nothing to any other node).
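The dhcpd.conf rewrite for an inner plab node amounts to pointing its "filename" at pxelinux.0 and its "next-server" at the PLC node. A hypothetical C sketch that emits such an entry, assuming ISC dhcpd host-declaration syntax and the 12-hex-digit MAC format shown in the eplabconfig output; the testbed's own dhcpd.conf generator may be structured quite differently.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* "0002b3877a4f" -> "00:02:b3:87:7a:4f"; out must hold at least 18 bytes. */
static int colon_mac(char *out, size_t len, const char *mac)
{
    if (len < 18 || strlen(mac) != 12)
        return -1;
    for (int i = 0; i < 6; i++) {
        if (!isxdigit((unsigned char)mac[2 * i]) ||
            !isxdigit((unsigned char)mac[2 * i + 1]))
            return -1;
        out[3 * i] = mac[2 * i];
        out[3 * i + 1] = mac[2 * i + 1];
        out[3 * i + 2] = (i < 5) ? ':' : '\0';
    }
    return 0;
}

/* Emit a host entry that sends an inner plab node to pxelinux on the PLC. */
int plab_host_entry(char *buf, size_t len, const char *name, const char *mac,
                    const char *ip, const char *plc_ip)
{
    char cmac[18];
    int n;

    if (colon_mac(cmac, sizeof cmac, mac) < 0)
        return -1;
    n = snprintf(buf, len,
                 "host %s {\n"
                 "\thardware ethernet %s;\n"
                 "\tfixed-address %s;\n"
                 "\tfilename \"pxelinux.0\";\n"
                 "\tnext-server %s;\n"
                 "}\n", name, cmac, ip, plc_ip);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```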
      
      The implications of this setup are:
      
       * The PLC node must act as a TFTP server as we have discussed in the past.
         The TMCC info above is hopefully enough to configure pxelinux, if not
         we can change it.
      
       * The PLC node is responsible for loading the disks of inner plab nodes.
         This is implied by the setup, where we change the dhcpd.conf file before
         doing anything to the inner nodes.  Thus, once the inner nodes are
         rebooted, they will be talking pxelinux with PLC rather than boss.
         This step is dubious, as we could no doubt load the disks faster than
         whatever plab uses can.  But it simplified the setup (and is more
         realistic!).  The alternative, which might be useful anyway, is to
         introduce a "state" in which nodes have been reloaded but not yet
         rebooted.  With that, we can reload the plab nodes and then change
         the dhcpd.conf file so that when they reboot they start talking to
         the PLC.
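On the PLC side, the eplabconfig lines are probably enough to drive pxelinux: PXELINUX looks for a per-client file named pxelinux.cfg/01- followed by the client's MAC address as lowercase, dash-separated hex. A C sketch with hypothetical names that parses one eplabconfig line and derives that file name; what goes inside the file is left to the PLC setup.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* One parsed eplabconfig line (field sizes are assumptions). */
struct eplab_node {
    char name[32], role[16], ip[16], mac[13];
};

/* Parse "NAME=... ROLE=... IP=... MAC=..."; returns 0 on success. */
int parse_eplab_line(const char *line, struct eplab_node *n)
{
    if (sscanf(line, " NAME=%31s ROLE=%15s IP=%15s MAC=%12s",
               n->name, n->role, n->ip, n->mac) != 4)
        return -1;
    return strlen(n->mac) == 12 ? 0 : -1;
}

/* Build "pxelinux.cfg/01-aa-bb-cc-dd-ee-ff" ("01" is the Ethernet ARP type). */
int pxelinux_cfg_name(char *buf, size_t len, const char *mac)
{
    int n = snprintf(buf, len, "pxelinux.cfg/01");

    if (n < 0 || (size_t)n >= len || strlen(mac) != 12)
        return -1;
    size_t pos = (size_t)n;
    for (int i = 0; i < 12; i += 2) {
        if (pos + 3 >= len || !isxdigit((unsigned char)mac[i]) ||
            !isxdigit((unsigned char)mac[i + 1]))
            return -1;
        buf[pos++] = '-';
        buf[pos++] = (char)tolower((unsigned char)mac[i]);
        buf[pos++] = (char)tolower((unsigned char)mac[i + 1]);
    }
    buf[pos] = '\0';
    return 0;
}
```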
  7. 04 May, 2006 1 commit
  8. 02 May, 2006 1 commit
  9. 30 Mar, 2006 1 commit
  10. 28 Mar, 2006 1 commit
    • Mike Hibler authored · 69b90e79
      Attempt to make firewall experiment swapout more robust by addressing a
      couple of MFS booting problems:
       * in the RPC power controller, make sure that an "on" command succeeds
         by checking the status, retrying if it failed (we already did this for
         "off")
       * if nodes fail to boot up the MFS after a power on, try again with a
         power cycle.  I have seen "power on" leave pc600s hung, and a power
         cycle seems to cure it.
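The two MFS booting fixes above can be sketched as a verify-and-retry loop plus a power-cycle fallback. This C version is illustrative only: struct power_ops, its callbacks, the retry count, and the toy controller (which drops the first "on" and hangs on the first boot) are all hypothetical, and node and outlet numbers are treated as the same thing for brevity.

```c
#include <stdbool.h>

/* Hypothetical controller hooks; not the real RPC power controller code. */
struct power_ops {
    bool (*on)(int node);
    bool (*off)(int node);
    bool (*is_on)(int node);         /* status query */
    bool (*in_mfs)(int node);        /* came up in the MFS? (real code waits) */
};

/* Send "on", then trust the status check rather than the command's return
 * code, retrying up to tries times (the "off" path already did this). */
bool power_on_verified(const struct power_ops *ops, int node, int tries)
{
    for (int i = 0; i < tries; i++) {
        (void)ops->on(node);
        if (ops->is_on(node))
            return true;
    }
    return false;
}

/* If the MFS does not come up after a plain power on, try a power cycle. */
bool boot_into_mfs(const struct power_ops *ops, int node, int tries)
{
    if (power_on_verified(ops, node, tries) && ops->in_mfs(node))
        return true;
    (void)ops->off(node);
    return power_on_verified(ops, node, tries) && ops->in_mfs(node);
}

/* Toy controller: the first "on" is silently lost and the first boot hangs. */
static int on_calls, boots;
static bool toy_power;
static bool toy_on(int n)     { (void)n; if (++on_calls > 1) toy_power = true; return true; }
static bool toy_off(int n)    { (void)n; toy_power = false; return true; }
static bool toy_is_on(int n)  { (void)n; return toy_power; }
static bool toy_in_mfs(int n) { (void)n; return ++boots > 1; }
static const struct power_ops toy_ops = { toy_on, toy_off, toy_is_on, toy_in_mfs };
```

With the toy controller, boot_into_mfs(&toy_ops, 1, 3) succeeds, but only after the retried "on" and the fallback power cycle.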
  11. 20 Mar, 2006 1 commit
  12. 02 Mar, 2006 1 commit
  13. 16 Feb, 2006 1 commit
  14. 02 Feb, 2006 1 commit
    • Timothy Stack authored · 0a4176c1
      Various nfstrace changes that have been sitting in my tree for a while.
      
      	* GNUmakefile.in: Do fs-install in the sensors subdir so the
      	nfstracer gets installed.
      
      	* sensors/and/and-emulab.priorities: Add some more daemon uids to
      	be excluded from auto-nicing.
      
      	* sensors/and/and.c: Ignore invalid uids/gids in the config file
      	instead of dying.
      
      	* sensors/nfstrace/GNUmakefile: Makefile used to generate
      	nfsdb-create.sql.
      
      	* sensors/nfstrace/GNUmakefile.in: Some more installation stuff.
      
      	* sensors/nfstrace/nfsdb-create.sql: SQL used to create the nfsdb
      	database.
      
      	* sensors/nfstrace/nfsdump2db: Bunch of bug fixes and cleanup.
      
      	* sensors/nfstrace/nfsdump2db.8, sensors/nfstrace/nfstrace.7,
      	sensors/nfstrace/nfstrace.proxy.8: A start on some man pages.
      
      	* sensors/nfstrace/nfstrace.init.in: Try to detect the interface
      	to listen on, not perfect though.  Add a restart handler that just
      	restarts nfsdump2db.  Some other cleanup.
      
      	* sensors/nfstrace/nfstrace.proxy: Some optimizations for
      	resolving file names.
      
      	* sensors/nfstrace/nfsdump2/*: Only print summaries of read/write
      	packets and start a separate thread to read from the bpf socket.
      
      	* tbsetup/tbswap.in: Stop transferring nfs accesses to boss' db
      	until we figure out what we want to do with it.
  15. 19 Jan, 2006 1 commit
  16. 28 Dec, 2005 1 commit
    • Leigh B. Stoller authored · 564d958d
      A fair number of changes.
      * The rest of the backend support for simplistic experiment duplication,
        either from an existing (current) experiment, or from a specific
        archive revision of a current or terminated experiment.
      
      * Rework the swapmod code so that the archive is committed at the exact
        end of the swapout phase. This required adding (and moving) some code
        from swapexp to tbswap since that is where the actual swapout/swapin
        happens during a swapmod.
      
      * Add a special directory called "archive" to the experiment
        directory, which is a place where users can store stuff they want
        saved away. This will eventually be a user defined set of
        directories, but this was good for getting the basic mechanism in
        place. Note that when the contents of this directory are copied
        out for placement into the archive, it is an exact copy made with
        rsync.
      
      * No longer "clean" the contents of the temporary store between
        commits of the archive. This was creating a lot of headaches, and
        was also causing the revision history to get messed up. The downside
        of this is that we have to be more careful to explicitly delete
        files that the user no longer uses. I have not solved all these
        issues yet, so in the meantime files will get left in the archive
        even if the user no longer references them.
  17. 27 Dec, 2005 1 commit
    • Mike Hibler authored · f1206314
      More tightly connect the notion of a firewall and the security level.
      If you specify an explicit firewall, you are implicitly assigned security
      level 2 and you cannot explicitly specify the security level.  Likewise,
      if you specify a security level, you cannot also specify a firewall.
      
      The reason for this is that security level 1 (aka "Blue") now has a slightly
      different meaning.  It is intended for protecting the inside from the outside
      rather than vice versa.  The only practical implication of this is that for
      level 1, we don't do all the fancy power-off-boot-into-MFS-zapbootblock stuff
      that we do for higher levels.
      
      Anyway, I wanted to make sure that if you specify your own firewall, you
      DO have to go through the full cleansing swapout since we can't trust a
      firewall that the Average Joe sets up.
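The rules above are small enough to write down directly. A hypothetical C sketch: struct secreq and both functions are illustrative names, and treating "no firewall, no level" as level 0 is an assumption.

```c
#include <stdbool.h>

/* Hypothetical summary of what an experiment asked for. */
struct secreq {
    bool has_firewall;               /* an explicit firewall was specified */
    bool level_given;                /* a security level was specified */
    int  level;
};

/* Effective security level, or -1 if the request is not allowed. */
int effective_security_level(const struct secreq *r)
{
    if (r->has_firewall)             /* firewall implies level 2, exclusively */
        return r->level_given ? -1 : 2;
    if (r->level_given)
        return r->level >= 0 ? r->level : -1;
    return 0;                        /* assumed default: no security level */
}

/* Level 1 ("Blue") skips the power-off/boot-into-MFS/zap-bootblock swapout;
 * higher levels, including every user-specified firewall, get it. */
bool needs_cleansing_swapout(int level)
{
    return level >= 2;
}
```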
  18. 21 Dec, 2005 2 commits
  19. 19 Dec, 2005 4 commits
    • Mike Hibler authored · 1c15b2f3
      Disable my last hack til I figure out vnodes
    • Mike Hibler
    • Kevin Atkinson authored · 45f997fd
      Updates to the Error Logging API code.
      
      You should start seeing much better error messages coming from my
      system.  Errors coming from parse.proxy and assign (the two most
      frequent sources of errors) should now be concise and to the point.
      Errors coming from libosload/libreboot (the next most frequent source
      of errors) should now also be much better, but not perfect.  Getting
      perfect errors will likely require a rework of how errors are handled in
      libosload/libreboot; just adding tberror/tbwarn/tbnotice calls is not
      enough.  I can do this at a later date if necessary.
      
      A few minor database changes.
      
      Some changes to the API.  A few bug fixes. Lots of tberror/tbwarn/tbnotice
      added to scripts.
      
      Since assign is a C program, and at this time my API is perl only, I wrote a
      second wrapper around assign, assign_wrapper2.  When assign fails errors are
      now parsed in assign_wrapper2, sent to stderr and logged.  This means that
      RunAssign() just returns when assign fails rather than echoing some of
      assign.log output and then quitting.  The output to the activity log remains
      unchanged.
      
      Since "parse.proxy" is run from ops I couldn't use my API in it, even though
      it is a perl program.  Instead I parse the errors coming from it in
      parse-ns.
    • Mike Hibler authored · d29f8858
      Subject all "testbed" experiments to swapout state saving
      (actually just stats gathering right now, no images are produced)
  20. 15 Dec, 2005 1 commit
  21. 12 Dec, 2005 1 commit
  22. 08 Dec, 2005 1 commit
  23. 07 Dec, 2005 1 commit
  24. 06 Dec, 2005 2 commits
    • Mike Hibler authored · ed0d25b4
      Phase II in disk state saving for swapout.
      Exec summary: after this checkin, the infrastructure exists (once enabled)
      to create swapout-time "delta" images for all machines in experiments.
      There is only a single, cumulative swap image per node (i.e., all diffs
      are from the base image, not from the previous swap).
      
      What doesn't yet exist is the mechanism for reloading the delta at
      swapin time.  That is Phase III.
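The key property here is that each delta is computed against the base image's signature, so it is cumulative. A toy C sketch of that idea: the signature is one hash per block of the base image, and a block belongs in the delta if its current hash differs. FNV-1a and the 4-byte block size are stand-ins chosen for brevity, not what imagehash or imagezip actually use.

```c
#include <stddef.h>
#include <stdint.h>

#define SIG_BLKSIZE 4   /* tiny for illustration; real tools hash far larger ranges */

/* Toy block hash (FNV-1a). A collision would hide a change; real signature
 * files use a stronger digest. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* The "signature": one hash per block of the base image. */
void make_signature(const unsigned char *img, size_t nblocks, uint32_t *sig)
{
    for (size_t i = 0; i < nblocks; i++)
        sig[i] = fnv1a(img + i * SIG_BLKSIZE, SIG_BLKSIZE);
}

/* Collect indices of blocks on the current disk that differ from the base
 * signature; returns how many were found. */
size_t changed_blocks(const unsigned char *disk, size_t nblocks,
                      const uint32_t *sig, size_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nblocks; i++)
        if (fnv1a(disk + i * SIG_BLKSIZE, SIG_BLKSIZE) != sig[i])
            out[n++] = i;
    return n;
}
```

Because changed_blocks() always compares against the base signature, a block changed during an earlier swap and untouched since still appears in every later delta.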
      
      The nitty-gritty:
      
      1. Keep disk image signature files for all nodes in an experiment.
      
         New fields in the DB to track, for each disk partition, what image the
         partition was loaded from.  This enables us at swapin or os_load time to
         create signature files in /proj/<pid>/exp/<eid>/swapinfo for the current
         contents of a node disk/partition.  All nodes with the same image loaded
         will share (via symlink) the same signature file.  TODO: no longer
         referenced signature files should be removed.
      
         Signature info is only collected in the swapinfo directory if the
         experiment is set to have disk state saving enabled (see #5 below).
         Info consists of the <vname>.sig file, which is the file created
         by imagehash, and <vname>.part which says what the root disk is
         for the node and whether to look at the whole disk or just a single
         partition when crafting the delta image.
      
      2. Swapout-time hook for creating swapout image.
      
         If the experiment is marked as allowing disk state saving, tbswap
         will arrange to run and then monitor the create-swapimage command
         on each node.  This script will run the modified version of imagezip
         which uses the signature file to create a delta image.
      
         The command to run and maximum timeout are specified via sitevars
         (previously checked in).  Note that the tbswap script currently has
         special knowledge of /usr/local/bin/create-swapimage as a swapout
         time script.  If the swap/swapout_command sitevar is set to that,
         Magic Stuff shall occur (i.e. it will monitor the command and make
         periodic reports of progress).  The sitevars are a total hack and
         will disappear at some point.
      
      3. Client-side script for creating swapout image.
      
         os/create-swapimage, very similar to create-image.  Uses the info
         stashed in /proj/..blahblah../swapinfo to create a delta image.
      
         XXX fer now hack: the script first looks in /proj/<pid>/bin for an
         imagezip binary to use.  Failing that, it uses the one in the MFS.
         This allows for easier development of the imagezip changes (i.e.,
         we don't have to update the MFS every time).
      
      4. Auto creation of signature files for new images.
      
         The create_image script (the one that runs on boss when creating images
         for users) has been modified to automatically create a signature via
         imagehash.  The .sig file winds up in /usr/testbed/images/sigs or
         in /proj/<pid>/images/sigs.  From there it will be copied at swapin/os_load
         time to the per-expt swapinfo directory for any node that uses the image.
      
         The process for creating standard system images (aka, "Mike") has not
         yet been modified.  When the image creation/installation procedure
         is formalized into a script, this will be done.
      
      5. Web changes to set/clear saving of disk state at swapout time.
      
         Add a checkbox to the experiment create page to allow setting "save
         swap state".  Also added to the experiment modify page, but currently
         "if (0)"ed out as it will need some additional support.  The showstuff
         page will show it.
      
         Taking a page from Leigh's hack book, if EXPOSESTATESAVE in defs.php3
         is set to zero (as it is now), then the checkbox doesn't appear in the
         create experiment page except for STUDLY users.
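Items 1 and 3 involve two small filesystem steps: sharing one signature file among nodes via a symlink, and preferring a development imagezip in /proj/<pid>/bin over the MFS copy. A C sketch under assumptions: link_node_sig() and pick_imagezip() are hypothetical, the per-image signature path is supplied by the caller, and the MFS imagezip location is passed in rather than guessed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Make <dir>/<vname>.sig a symlink to sigpath, the shared signature for the
 * image the node was loaded with, replacing any previous link. */
int link_node_sig(const char *dir, const char *vname, const char *sigpath)
{
    char lpath[1024];
    int n = snprintf(lpath, sizeof lpath, "%s/%s.sig", dir, vname);

    if (n < 0 || (size_t)n >= sizeof lpath)
        return -1;
    if (unlink(lpath) < 0 && errno != ENOENT)
        return -1;
    return symlink(sigpath, lpath) == 0 ? 0 : -1;
}

/* Prefer an executable /proj/<pid>/bin/imagezip (handy while developing
 * imagezip), otherwise fall back to the copy in the MFS. */
const char *pick_imagezip(const char *pid, char *buf, size_t len,
                          const char *mfs_imagezip)
{
    int n = snprintf(buf, len, "/proj/%s/bin/imagezip", pid);

    if (n >= 0 && (size_t)n < len && access(buf, X_OK) == 0)
        return buf;
    return mfs_imagezip;
}
```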
    • Leigh B. Stoller authored · 428c5121
      Temporary change to linktest while we continue to debug: always run
      linktest at level 3 if a mere user. Studly users still have control
      though. Note that errors are no longer mailed to user by linktest_control.
      
      Also moved duplicated code to get dbuid (and email address) to top of
      file.
  25. 05 Dec, 2005 1 commit
  26. 01 Dec, 2005 1 commit
  27. 29 Nov, 2005 1 commit
  28. 28 Nov, 2005 1 commit
  29. 22 Nov, 2005 1 commit
  30. 17 Nov, 2005 2 commits
    • Mike Hibler authored · 32560429
      Minor fixes: add another level of panic that we set when swapout fails.
      This produces a different message on the web page.
      
      Also fix up a couple of minor firewalled elabinelab issues.
    • Mike Hibler authored · 4ec701e7
      1. Beef up "admin mode" support.
      * Add libadminmfs.pm with routines for entering, exiting, and executing
        commands in the admin MFS.  Node admin and firewall swapout (see
        below) now use this, the image creation process does not yet.
      
      * Add swapout time hooks for running an admin mode process, likely to
        be used to collect swapout time state.  Currently controlled globally
        by two new sitevars.
      
      * Modified node_admin to use the library and added a "-c <command>"
        option to have nodes go into admin mode and run a command.  I don't
        really expect this to be useful, it was just a testing vehicle for
        the library.
      
      2. Improved the swapout process for firewalled experiments.  Largely
         just generalized what we already did for panicked experiments.
         At swapout, firewalled nodes are:
      
         - powered off
         - set to boot into admin mode and run a disk zapper
         - powered on
      
        The swapout process then waits for all nodes to successfully complete
        disk zapage, at which point the nodes are nfree'd as usual.  Any
        failure in the above process marks the experiment as panic'ed (to
        ensure that we are involved in cleanup) and sends mail to testbed-ops
        describing the state of the nodes.
      
      3. Added the aforementioned disk zapper, a little C program in the MFS
         which zeroes out the MBR and partition boot blocks (but not the MBR
         partition table or FS superblocks).  This is added insurance that if
         a node somehow gets diverted after being nfree'd but before getting
         the disk reloaded (e.g., goes to hwdown), that we cannot accidentally
         boot from the disk.  This program gets installed in the admin MFS.
      
      4. Related to firewalls, modified swapin to use the newly documented
         "snmpit -N" to get the firewall VLAN number rather than parsing the
         output that was a side-effect of VLAN creation.
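The MBR half of the disk zapper in item 3 can be sketched in a few lines of C: a classic MBR holds 446 bytes of boot code, then the 64-byte partition table, then the 0x55AA signature, so zeroing only the first 446 bytes keeps the disk from booting while leaving the partition table alone. zap_mbr_bootcode() is a hypothetical name, partition boot blocks are not handled here, and pointing it at a real disk destroys that disk's boot code.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

#define MBR_CODE_BYTES 446   /* boot code precedes the 64-byte partition table */

/* Zero the MBR boot code of a disk (or disk image file), leaving the
 * partition table (bytes 446-509) and the 0x55AA signature intact.
 * Returns 0 on success, -1 on any error. */
int zap_mbr_bootcode(const char *dev)
{
    unsigned char zeros[MBR_CODE_BYTES] = {0};
    int fd = open(dev, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pwrite(fd, zeros, sizeof zeros, 0);
    if (close(fd) < 0 || n != (ssize_t)sizeof zeros)
        return -1;
    return 0;
}
```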
  31. 04 Nov, 2005 1 commit
    • Kevin Atkinson authored · a2aba279
      Added error logging API.  See tbsetup/libtblog.pm.in and tbsetup/libtblog.sql.
  32. 20 Oct, 2005 2 commits
  33. 19 Oct, 2005 1 commit
    • Timothy Stack authored · bd627836
      Some event system changes for linktest and any future things we want to
      run with the event system in the swapin path:
      
      	* event/linktest/linktest_control.in: Let linktest be run while
      	the experiment is activating.
      
      	* event/sched/event-sched.c, event/sched/rpc.h,
      	event/sched/rpc.cc: Don't wait for the experiment to become active
      	before loading the eventlist so any system defined agents are
      	available to use.  Don't start time unless the experiment is
      	already active, let boss do it otherwise.  Send out COMPLETE
      	events so 'tevc' can listen for them.
      
      	* event/sched/timeline-agent.c: Send a complete event even if the
      	timeline is empty.
      
      	* event/tbgen/tevc.c: Add a '-w' option so that tevc can wait for
      	an event that sends back a COMPLETE.
      
      	* tbsetup/tbswap.in: Explicitly send an event to start event time.