1. 19 Dec, 2005 1 commit
  2. 15 Dec, 2005 1 commit
  3. 12 Dec, 2005 1 commit
  4. 08 Dec, 2005 1 commit
  5. 07 Dec, 2005 1 commit
  6. 06 Dec, 2005 2 commits
    • Mike Hibler's avatar
      Phase II in disk state saving for swapout. · ed0d25b4
      Mike Hibler authored
      Exec summary: after this checkin, the infrastructure exists (once enabled)
      to create swapout-time "delta" images for all machines in experiments.
      There is only a single, cumulative swap image per node (i.e., all diffs
      are from the base image, not from the previous swap).
      
      What doesn't yet exist, is the mechanism for reloading the delta at
      swapin time.  That is Phase III.
      
      The nitty-gritty:
      
      1. Keep disk image signature files for all nodes in an experiment.
      
         New fields in the DB to track, for each disk partition, what image the
         partition was loaded from.  This enables us at swapin or os_load time to
         create signature files in /proj/<pid>/exp/<eid>/swapinfo for the current
         contents of a node disk/partition.  All nodes with the same image loaded
         will share (via symlink) the same signature file.  TODO: no longer
         referenced signature files should be removed.
      
         Signature info is only collected in the swapinfo directory if the
         experiment is set to have disk state saving enabled (see #5 below).
         Info consists of the <vname>.sig file, which is the file created
         by imagehash, and <vname>.part which says what the root disk is
         for the node and whether to look at the whole disk or just a single
         partition when crafting the delta image.
      
      2. Swapout-time hook for creating swapout image.
      
         If the experiment is marked as allowing disk state saving, tbswap
         will arrange to run and then monitor the create-swapimage command
         on each node.  This script will run the modified version of imagezip
         which uses the signature file to create a delta image.
      
         The command to run and maximum timeout are specified via sitevars
         (previously checked in).  Note that the tbswap script currently has
         special knowledge of /usr/local/bin/create-swapimage as a swapout
         time script.  If the swap/swapout_command sitevar is set to that,
         Magic Stuff shall occur (i.e. it will monitor the command and make
         periodic reports of progress).  The sitevars are a total hack and
         will disappear at some point.
      
      3. Client-side script for creating swapout image.
      
         os/create-swapimage, very similar to create-image.  Uses the info
         stashed in /proj/..blahblah../swapinfo to create a delta image.
      
         XXX fer now hack: the script first looks in /proj/<pid>/bin for an
         imagezip binary to use.  Failing that, it uses the one in the MFS.
         This allows for easier development of the imagezip changes (i.e.,
         don't have to update the MFS every time.
      
      4. Auto creation of signature files for new images.
      
         The create_image script (the one that runs on boss when creating images
         for users) has been modified to automatically create a signature via
         imagehash.  The .sig file winds up in /usr/testbed/images/sigs or
         in /proj/<pid>/images/sigs.  From there it will be copied at swapin/os_load
         time to the per-expt swapinfo directory for any node that uses the images.
      
         The process for creating standard system images (aka, "Mike") has not
         yet been modified.  When the image creation/installation procedure
         is formalized into a script, this will be done.
      
      5. Web changes to set/clear saving of disk state at swapout time.
      
         Add a checkbox to the experiment create page to allow setting "save
         swap state".  Also added to the experiment modify page, but currently
         "if (0)"ed out as it will need some additional support.  The showstuff
         page will show it.
      
         Taking a page from Leigh's hack book, if EXPOSESTATESAVE in defs.php3
         is set to zero (as it is now), then the checkbox doesn't appear in the
         create experiment page except for STUDLY users.
      ed0d25b4
    • Leigh B. Stoller's avatar
      Temporary change to linktest while we continue to debug; Always run · 428c5121
      Leigh B. Stoller authored
      linktest at level 3 if a mere user. Studly users still have control
      though. Note that errors are no longer mailed to user by linktest_control.
      
      Also moved duplicated code to get dbuid (and email address) to top of
      file.
      428c5121
  7. 05 Dec, 2005 1 commit
  8. 01 Dec, 2005 1 commit
  9. 29 Nov, 2005 1 commit
  10. 28 Nov, 2005 1 commit
  11. 22 Nov, 2005 1 commit
  12. 17 Nov, 2005 2 commits
    • Mike Hibler's avatar
      Minor fixes: add another level of panic that we set when swapout fails. · 32560429
      Mike Hibler authored
      Produces a different message in the web page.
      
      Also fix up a couple of minor firewalled elabinelab issues.
      32560429
    • Mike Hibler's avatar
      1. Beef up "admin mode" support. · 4ec701e7
      Mike Hibler authored
      * Add libadminmfs.pm with routines for entering/exiting and executing
        commands in, the admin MFS.  Node admin and firewall swapout (see
        below) now use this, the image creation process does not yet.
      
      * Add swapout time hooks for running an admin mode process, likely to
        be used to collect swapout time state.  Currently controlled globally
        by two new sitevars.
      
      * Modified node_admin to use the library and added a "-c <command>"
        option to have nodes go into admin mode and run a command.  I don't
        really expect this to be useful, it was just a testing vehicle for
        the library.
      
      2. Improved the swapout process for firewalled experiments.  Largely
         just generalized what we already did for paniced experiments.
         At swapout, firewalled nodes are:
      
         - powered off
         - set to boot into admin mode and run a disk zapper
         - powered on
      
        The swapout process then waits for all nodes to successfully complete
        disk zapage, at which point the nodes are nfree'ed as usual.  Any
        failure of the above process, marks the experiment as panic'ed (to
        ensure that we are involved in cleanup) and sends mail to testbed-ops
        describing the state of the nodes.
      
      3. Added the aforementioned disk zapper, a little C program in the MFS
         which zeroes out the MBR and partition boot blocks (but not the MBR
         partition table or FS superblocks).  This is added insurance that if
         a node somehow gets diverted after being nfree'd but before getting
         the disk reloaded (e.g., goes to hwdown), that we cannot accidentally
         boot from the disk.  This program gets installed in the admin MFS.
      
      4. Related to firewalls, modified swapin to use the new documented
         "snmpit -N" to get the firewall VLAN number rather than parsing the
         output that was a side-effect of VLAN creation.
      4ec701e7
  13. 04 Nov, 2005 1 commit
    • Kevin Atkinson's avatar
      · a2aba279
      Kevin Atkinson authored
      Added error logging API.  See tbsetup/libtblog.pm.in and tbsetup/libtblog.sql.
      a2aba279
  14. 20 Oct, 2005 2 commits
  15. 19 Oct, 2005 2 commits
    • Timothy Stack's avatar
      · bd627836
      Timothy Stack authored
      Some event system changes for linktest and any future things we want to
      run with the event system in the swapin path:
      
      	* event/linktest/linktest_control.in: Let linktest be run while
      	the experiment is activating.
      
      	* event/sched/event-sched.c, event/sched/rpc.h,
      	event/sched/rpc.cc: Don't wait for the experiment to become active
      	before loading the eventlist so any system defined agents are
      	available to use.  Don't start time unless the experiment is
      	already active, let boss do it otherwise.  Send out COMPLETE
      	events so 'tevc' can listen for them.
      
      	* event/sched/timeline-agent.c: Send a complete event even if the
      	timeline is empty.
      
      	* event/tbgen/tevc.c: Add a '-w' option so that tevc can wait for
      	an event that sends back a COMPLETE.
      
      	* tbsetup/tbswap.in: Explicitly send an event to start event time.
      bd627836
    • Leigh B. Stoller's avatar
      As per Jay's request, change the message at the top of the activity log · aeed9520
      Leigh B. Stoller authored
      to include the date and the experiment index.
      aeed9520
  16. 06 Oct, 2005 2 commits
  17. 20 May, 2005 1 commit
    • Leigh B. Stoller's avatar
      Checkpoint some robot changes. · 5f67fe09
      Leigh B. Stoller authored
      * New robot event listener:
      
          * It is intended to be started and stopped from the experiment
            swapin path instead of as a global daemon. It takes the pid/eid
            of the experiment, and will deal with events only for those
            nodes that are allocated to the experiment. We have some long
            range plans of time sharing the robot lab, so I figured we might
            as get a little bit of a start on that.
      
          * Once it fires up, it subscribes to the usual assortment of
            events, just like the loclistener does.
      
          * It then binds a socket on which to listen for connections from
            the web server.
      
          * Then it loops, looking for events and for connections from the
            web server. Connections from the web server are for forwarding
            the event stream in real time to whatever applets are currently
            viewing the robot lab.
      
          * As each event comes in, it is parsed, entered into the DB (nodes
            and location_info table), and fired out (in a textual form) to
            all the applet listeners. The web interface just acts as pipe in
            this case for the data.
      
          * The event stream is also duplicated to a file in the experiment
            directory (the same stuff that is piped to the applet), named by
            the current resource record ID. I hope to use this stream to
            playback the motion in the applet (coupled with webcam images
            once I figure out how to sync them).
      
      * tbswap: Start and stop the new listener.
      
      * Robotrack: I changed the interface for how we actually communicate
        the event data. Much more reasonable then it was (a comma separated
        string of numbers!).
      
      * new database fields in the experiments table to hold the process ID
        of the listener and the port it is listening on. The port is not
        used yet, as the robot lab is still not shared. Will need to revist
        this later. Currently uses a fixed port number.
      
      * www/robotrack/robopipe.php3: Killed most of the old code and replace
        with simple socket connect to the listener.
      5f67fe09
  18. 12 May, 2005 1 commit
    • Leigh B. Stoller's avatar
      Checkpoint the rest of my changes to support swapmod of both ElabInElab and · 6eff9de6
      Leigh B. Stoller authored
      Firewalled experiments (see tbsetup/elabinelab.in for the other stuff).
      
      * To support firewalled experiments, needed to add a new virt_firewalls
        table to split the existing firewalls table up, which included both
        virtual and physical stuff. There are the usual frontend changes and a
        few other things scattered around, including tmcd.c.
      
      * The firewall code in tbswap got some beefing up to support adding and
        deleting nodes from the its special control net vlan. Note that I have
        not made any progress on containment of deleted nodes, just as we do not
        do anything now for teardown (unless its paniced, in which case the
        experiment cannot be modified anyway).
      
      * ptopgen and assign_wrapper got some interesting modifications: Unlike
        regular swapmod, we cannot just tear down all the vlans since that would
        interrupt everything inside the inner elab. Instead, leave the vlans as
        is. The problem is that when assign runs, it can just as easily pick
        different interfaces on the same nodes, which would be a royal pain in
        the ass to deal with! So, ptopgen got a new option (-u) that assign
        wrapper uses to tell ptopgen that it should prune out unused interfaces
        from nodes that are already allocated to the experiment. This is, at
        best, as pathetically gross hack, but it makes sure that all the
        interfaces stay the same across swapmods.
      
      * The unrelated revision of elabinelab has a bunch of new code for adding
        and deleting nodes from the inner elab. Mostly it deals with dhcpd (inner
        and outer, waiting for nodes to reboot, etc). It also deals with updating
        the vlans table in the DB, pruning out any nodes (ports) that are deleted
        but for which there are still interfaces in existing vlans. Said ports
        are them moved back to the default vlan with calls to snmpit. Also under
        another revision a a couple of weeks ago are the web interface changes to
        support the newnode MFS inside an inner Emulab.
      
      * swapexp and endexp got some more checks for firewalled and paniced
        experiments, which were missing.
      6eff9de6
  19. 27 Jan, 2005 1 commit
    • Leigh B. Stoller's avatar
      Add admission control checks to both tbswap and to assign_wrapper.in. · b5c775be
      Leigh B. Stoller authored
      There are two different kinds of admission control checks, hence two
      locations.
      
      * From tbswap, easy checks are made (like, too many nodes of a real
        type), or trying to the use robot lab when its closed.
      
      * From assign_wrapper.in. Because of virtual types, and because assign
        is the ultimate descision maker, we need to pass into assign per
        physical type limits, and let it deal with how virtual types map to
        physical types, and whether it exceeds the limits. So, when called
        from assign_wrapper, libadminctrl returns a hash of types and the
        counts the current experiment is limited to. That is put into the
        ptop file where assign deals with it.
      
      If admission control goes sour, then just turn off the sitevar for it:
      'swap/use_admission_control'
      b5c775be
  20. 22 Dec, 2004 1 commit
  21. 16 Dec, 2004 1 commit
    • Leigh B. Stoller's avatar
      The panic button ... · 87dd2e60
      Leigh B. Stoller authored
      * tbsetup/panic.in: New backend script to implement the panic button
        feature. When used, it will cut the severe the connection to the
        firewall node by using snmpit to disable the port. Sets the panic
        bit (and date) in the experiments table, and changes the state of
        the experiment from "active" to "paniced" to ensure that the
        experiment cannot be messed with (swapped out or modified). Sends
        email to tbops when the panic button is pressed.
      
        Used with -r option, reverses the above. State is set back to
        active, the panic bit is cleared, and the port is renabled with
        snmpit.
      
      * tbsetup/tbswap.in: During swapout, a firewalled experiment that has
        been paniced will get a cleaning; The nodes are powered off, then
        the osids for all the nodes are reset (with os_select) so that they
        will boot the MFS, and then the nodes are powered on. Then the
        control network is turned back on, and then I wait for the nodes to
        reboot (this is simply cause we do not record in the DB that a node
        is turned off, and if I do not wait, the reload daemon will end
        hitting the power button again if they do not reboot in time. We can
        fix this later.
      
        I am not planning to apply this to general firewalled experiments
        yet as the power cycling is going to be hard on the nodes, so would
        rather that we at least have a 1/2 baked plan before we do that.
      
      * www/showexp.php3: If experiment is firewalled, show the Panic
        Button, linked to the panic button web script. If the experiment has
        already had the panic button pressed, show a big warning message and
        explain that user must talk to tbops to swap the experiment out.
        Also fiddle with menu options so that the terminate link is gone,
        and the swap link is visible only in admin mode. In other words, only
        an admin person can swap an experiment once it is paniced. And of
        course, an admin person can the backend panic script above with the
        -r option, but thats not something to be done lightly.
      
      * db/libdb.pm.in: Add "paniced" as an experiment state (EXPTSTATE_PANICED).
        Add utility functions: TBExptSetPanicBit(), TBExptGetPanicBit(), and
        TBExptClearPanicBit().
      
      * tbsetup/swapexp.in: Minor state fiddling so that an experiment can
        be swapped while in paniced state, but only when in admin mode. Also
        clear the panic bit when experiment is swapped out.
      
      * www/dbdefs.php3.in: Add "paniced" as an experiment state. Add a
        utility function TBExptFirewall() to see if experiment is firewalled.
      
      * www/panicbutton.php3: New web script to invoke the backend panic
        script mentioned above, after the usual confirm song and dance.
      
      * www/panicbutton.gif: New gif of a red panic button that I stole off
        the net. If anyone has sees/has a better one, feel free to replace
        this one.
      
      * utils/node_statewait.in: Add -s option so that I can pass in the
        state I want to wait for (used from tbswap above to wait for nodes
        to reach ISUP after power on).
      87dd2e60
  22. 09 Dec, 2004 1 commit
  23. 07 Dec, 2004 1 commit
  24. 06 Dec, 2004 1 commit
  25. 03 Dec, 2004 1 commit
    • Mike Hibler's avatar
      First past at integration of VLAN firewall: · 719794dc
      Mike Hibler authored
      1. Do all the magic snmpit-speak at the beginning of a swapin.
      
      2. Undo all the magic snmpit-speak right before nfree in a swapout.
         THIS IS CURRENTLY UNSAFE because I don't yet put all the firewalled
         nodes into "purgatory" before dropping the firewall.  That is yet
         to come.
      
      3. Disallow modification of a firewalled experiment til I figure out
         how to do it safely.
      719794dc
  26. 15 Nov, 2004 1 commit
    • Leigh B. Stoller's avatar
      A bunch of ElabInELab changes. · 10b116e0
      Leigh B. Stoller authored
      * snmpit: When ElabInELabis true, use the routines in the new
        snmpit_remote.pm library for setting up and tearing down vlans for an
        experiment. At present, only these two operations are proxied out to
        the outer emulab.
      
      * snmpit_remote.pm: A new little library that uses the XMLPRC server on
        the outer emulab to setup and destroy vlans for an inner experiment.
        This code is used from snmpit (see above).
      
      * snmpit_lib.pm: A couple of minor changes for the server side of the
        proxy operation.
      
      * snmpit.proxy.in: A new perl module that is invoked from the RPC
        server.  This proxy sets up and tears down vlans for an inner elab.
        The basic model is that the container experiment will have lots of
        vlans for various individual experiments running on the inner emulab.
      
      * swapexp: A couple of minor elabinelab hacks.
      
      * tbswap: For elabinelab experiments, reconfig/restart dhcpd when
        tearing down the experiment, and call out to new elabinelab script
        when setting up an elabinelab experiment. There is no provision for
        swapmod at this time.
      
      * elabinelab: A new script to create the inner emulab. Does all kinds of
        gross DB stuff then more gross stuff on the inner ops and boss.
      10b116e0
  27. 04 Oct, 2004 1 commit
  28. 09 Aug, 2004 1 commit
    • Leigh B. Stoller's avatar
      Some cleanups and performance improvements: · f604dc33
      Leigh B. Stoller authored
      * Be more selective about what lists are regenerated; we were generating
        way too many lists each time called. When calling from tbswap, use new
        -t option to generate just the active lists. When called from setgroups,
        use -p option to generate lists just for the project. Add update option
        for when user changes email address (and all lists really do need to be
        regenerated).
      
      * Add "diff" processing. Instead of blindly firing each new list over to
        ops with ssh, store a copy of all of the lists in
        /usr/testbed/lists. After we generate the new list, diff it against the
        stored copy. If the same, skip it. Otherwise stash new copy and fire it
        over. This should reduce the wait times by quite a bit since the lists
        rarely change (except for the activity lists of course).
      
      * Add -n (impotent) option for debugging; skips the ssh over to ops.
      
      * Reorg a lot of stuff; it was getting hard to follow.
      f604dc33
  29. 29 Jul, 2004 1 commit
    • Leigh B. Stoller's avatar
      Two unrelated bug fixes (with some related cleanups and tweaks) · 9f4edbba
      Leigh B. Stoller authored
      * The first involves swapmod. When a swapmod on an active experiment fails,
        tbswap will reswap the experiment back to the original configuration. The
        problem is that it is reswapping it with the *new* virtual state of the
        experiment in the DB. It is not until later when control returns to
        swapexp that the virtual state is restored. This is plainly wrong, and in
        fact was causing the event scheduler grief cause it was starting up,
        reading the the virtual topo, which was different, wrong, and about to be
        blown away.
      
        I reorganized the modify section of swapexp so that virtual state is
        restored only when its a swapmod on a swapped experiment. On an active
        experiment, I moved that code down into tbswap, which will now does all
        of the virtual and physical state retore before it does the reswap back
        to the original experiment. Just for kicks, its also done if tbswap
        decides to swap the experiment cause of a fatal error.
      
        Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
        deal with !$NoRecover. I know, two knots make a wright for most people.
      
        Renderer: I was annoyed by the fact that we rerun the renderer on a
        failed swapmod. The original reason is that the renderer runs in the
        background and so vis_nodes cannot be saved with the rest of the virtual
        state tables cause the renderer might still be running when the user
        fires off the swapmod. Well, the hell with that. We lock the vis_nodes
        table anyway in the renderer during update, so we are certain to get a
        consistent snapshot. We store the renderer pid in the experiments table,
        so if the renderer was running, just fire off another one; mostly this is
        not going to happen. In addition, tbprerun no longer starts a new
        renderer when doing the swapmod; I start the new renderer later after
        swapmod succeeds. I might end up tweaking this a bit depending on what
        people notice as being different.
      
      * Termination changes to batchexp and swapexp: I've rearranged the
        termination code using an END block so that any uncontrolled exit from
        either batchexp or swapexp will go through the cleanup code, and
        hopefully insert a stats record, as well as not leave the experiment in
        some inbetween state. I've set the max DB retry count to zero in both
        cases, which means infinite retry. I've also added SIGTERM handlers to
        both so that again, we can kill a hung batch/swap and have it clean up
        things more or less. Note that END blocks are not caught when a signal
        causes the program to die; you have to catch it and then die() so that
        the END block is executed.
      
        Eventually, we need to clean up the various libraries so that we do not
        use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
        Ditto for event system interface.
      9f4edbba
  30. 28 May, 2004 1 commit
  31. 26 Apr, 2004 1 commit
    • Leigh B. Stoller's avatar
      Changes to exit status stuff to reflect recent changes by Rob to how · 1c4a613c
      Leigh B. Stoller authored
      assign exits (exit codes).
      
      * in assign_wrapper, no longer return any status from assign to the
        caller. This was pointless. Instead, return 0 on success, 1 on
        controlled error, and -1 on uncontrolled error (die() called
        someplace). Add in CANRECOVER bit whenever the wrapper exits, even
        if uncontrolled, by putting in an END block to catch the die. This
        should prevent certain cases where a swapmod error would be flagged
        as not recoverable.
      
      * Remove most of the assign output processing since we no longer
        return its codes. Still print a portion of it to the log though.
      
      * Change call to fatal() in assign_wrapper; do not pass an exitcode
        since in every case it was the same damn thing!
      
      * Change tbswap to no longer carry assign_wrapper exit code to its
        exit.
      
      * Change the batch daemon to treat all errors as continuable (keep
        batch queued) unless exit code is -1. We will need to revisit this a
        bit perhaps, when Rob adds precheck code.
      1c4a613c
  32. 16 Feb, 2004 1 commit
  33. 10 Feb, 2004 1 commit
  34. 09 Jan, 2004 1 commit
    • Shashi Guruprasad's avatar
      Changes to do auto re-swap of expts with simnodes when an nse on a simhost · 375f87c1
      Shashi Guruprasad authored
      (or more than one simhost) is unable to keep up with real-time. It includes
      changes to assign_wrapper to handle swap modify for simnodes, the simple
      algorithm in nseswap that bumps up the nodeweight of simnodes being hosted
      on a simhost that reports "can't keep up with real-time" (aka nse violation),
      ptopgen and sim.tcl to prefer nodes that already have the FBSD-NSE image.
      Also, changes to other files to send out NSESWAP event.
      
      One unrelated change: We now have per-swapin .top files and assign.log
      files along with .ptop files. This helps in debugging across multiple
      swapins since files remain in the form of
      <pid>-<eid>-<process_id>.{top,ptop} and assign-<pid>-<eid>-<process_id>.log
      Also useful for archiving.
      375f87c1
  35. 08 Jan, 2004 1 commit