1. 31 Mar, 2006 1 commit
  2. 30 Mar, 2006 1 commit
  3. 28 Mar, 2006 1 commit
  4. 24 Mar, 2006 1 commit
    • bcbd18aa
      Kevin Atkinson authored
      Have swapexp/batchexp dump the error when the "-w" option is
      specified.  The error will look something like:
      
        ERROR:: <cause desc>
      
        <text of the error>
      
        Cause: <cause>
        Confidence: <confidence>
      
      This will be the last thing printed.  The "::" is there to make
      the error easy for scripts to recognize, since they can just look
      for "ERROR::".
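
      A script consuming this output might pick the block apart roughly as
      follows. This is a minimal Python sketch, not code from the testbed:
      the function and field names are hypothetical, and it assumes the
      layout shown above, with every field value treated as an opaque string.

```python
from typing import NamedTuple, Optional


class SwapError(NamedTuple):
    cause_desc: str   # text after "ERROR::" on the first line
    text: str         # free-form error text
    cause: str        # value of the "Cause:" line
    confidence: str   # value of the "Confidence:" line, kept as a string


def parse_error_block(output: str) -> Optional[SwapError]:
    """Return the trailing ERROR:: block of `output`, or None if absent."""
    # The block is the last thing printed, so start at the last marker.
    start = output.rfind("ERROR::")
    if start == -1:
        return None
    lines = [line.strip() for line in output[start:].splitlines()]
    cause_desc = lines[0][len("ERROR::"):].strip()
    cause = confidence = ""
    body = []
    for line in lines[1:]:
        if line.startswith("Cause:"):
            cause = line[len("Cause:"):].strip()
        elif line.startswith("Confidence:"):
            confidence = line[len("Confidence:"):].strip()
        else:
            body.append(line)
    return SwapError(cause_desc, "\n".join(body).strip(), cause, confidence)
```

      A caller would pass the captured output of a -w run and treat a None
      result as "no structured error was printed".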
      bcbd18aa
  5. 21 Mar, 2006 1 commit
    • d258dde6
      Kevin Atkinson authored
      Changed format of email sent to user on errors.  The error will now
      appear instead of the generic message when I am confident it is
      accurate.  The subject line will also change to reflect the cause of
      an error.
      
      Avoid sending mail to testbed-ops during failed swap-related events
      in some cases.  It will instead be sent to a new mailing list
      testbed-errors.
      
      Added a new row in the experiment info table "Last Error:" which
      states the cause of the error, and links to a new page displaying the
      error.
      
      Made some assign/assign_wrapper errors more informative.
      
      The error (as determined by tblog) is now stored in the database in a
      more structured fashion.  This includes adding a column for the session
      (in the log table) to testbed_stats to link each swap event with the
      logs and possibly the error.
      
      Other changes to the database, see sql/database-migrate.txt
      d258dde6
  6. 07 Mar, 2006 1 commit
  7. 18 Jan, 2006 1 commit
  8. 17 Jan, 2006 1 commit
  9. 12 Jan, 2006 1 commit
    • Checkpoint changes to the Archive code. · aca4f452
      Leigh B. Stoller authored
      * Add support for linking to the NS file that will be used, from the
        begin experiment page, when duplicating or branching an experiment.
        Ultimately we want to separate things so that the user can first edit
        the NS file and then proceed to branching.
      
      * In discussion we agreed to use the convention that a directory called
        "archive" in the experiment directory will always be saved and restored.
        This has been implemented.
      
      * Add more of the support for branching an experiment (the archive).
        Batchexp takes a couple of new arguments:
      
      	-c pid,eid[:tag]  or
      	-c exptidx[:tag]
      
        The above specifies what and where to duplicate or branch. Simply
        giving pid,eid does not use the archive, but just copies right out
        of the existing experiment directory.
      
        Adding the -b option says to branch instead of duplicate.
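
      The two -c forms above could be split apart along these lines. This is
      a hedged Python sketch only (batchexp itself is not Python): the names
      are hypothetical, and it assumes exptidx is a decimal integer and that
      the tag is everything after the first ":".

```python
import re
from typing import NamedTuple, Optional


class CopySpec(NamedTuple):
    pid: Optional[str]      # project ID, for the pid,eid form
    eid: Optional[str]      # experiment ID, for the pid,eid form
    exptidx: Optional[int]  # experiment index, for the exptidx form
    tag: Optional[str]      # optional tag; None when no ":tag" was given


def parse_copy_arg(arg: str) -> CopySpec:
    """Parse 'pid,eid[:tag]' or 'exptidx[:tag]'."""
    spec, sep, tag = arg.partition(":")
    if sep and not tag:
        raise ValueError(f"empty tag in {arg!r}")
    if "," in spec:
        pid, _, eid = spec.partition(",")
        if not pid or not eid:
            raise ValueError(f"expected pid,eid in {arg!r}")
        return CopySpec(pid, eid, None, tag or None)
    if re.fullmatch(r"[0-9]+", spec):
        return CopySpec(None, None, int(spec), tag or None)
    raise ValueError(f"cannot parse -c argument {arg!r}")
```

      With this shape, a pid,eid spec whose tag is None corresponds to the
      "copy right out of the existing experiment directory" case.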
      aca4f452
  10. 06 Jan, 2006 1 commit
  11. 28 Dec, 2005 1 commit
    • A fair number of changes. · 564d958d
      Leigh B. Stoller authored
      * The rest of the backend support for simplistic experiment duplication,
        either from an existing (current) experiment, or from a specific
        archive revision of a current or terminated experiment.
      
      * Rework the swapmod code so that the archive is committed at the exact
        end of the swapout phase. This required adding (and moving) some code
        from swapexp to tbswap, since that is where the actual swapout/swapin
        happens during a swapmod.
      
      * Add a special directory called "archive" to the experiment
        directory, which is a place where users can store stuff they want
        saved away. This will eventually be a user defined set of
        directories, but this was good for getting the basic mechanism in
        place. Note that when the contents of this directory are copied
        out for placement into the archive, it is an exact copy made with
        rsync.
      
      * No longer "clean" the contents of the temporary store between
        commits of the archive. This was creating a lot of headaches, and
        was also causing the revision history to get messed up. The downside
        of this is that we have to be more careful to explicitly delete
        files that the user no longer uses. I have not solved all these
        issues yet, so in the meantime files will get left in the archive
        even if the user no longer references them.
      564d958d
  12. 19 Dec, 2005 1 commit
    • 45f997fd
      Kevin Atkinson authored
      Updates to Error Logging API Code.
      
      You should start seeing much better error messages coming from my
      system.  Errors coming from parse.proxy and assign (the two most
      frequent sources of errors) should now be concise and to the point.
      Errors coming from libosload/libreboot (the next most frequent source
      of errors) should now also be much better, but not perfect.  Getting
      perfect errors will likely require a rework of how errors are handled in
      libosload/libreboot; just adding tberror/tbwarn/tbnotice calls is not
      enough.  I can do this at a later date if necessary.
      
      A few minor database changes.
      
      Some changes to the API.  A few bug fixes. Lots of tberror/tbwarn/tbnotice
      added to scripts.
      
      Since assign is a C program, and at this time my API is perl only, I wrote a
      second wrapper around assign, assign_wrapper2.  When assign fails errors are
      now parsed in assign_wrapper2, sent to stderr and logged.  This means that
      RunAssign() just returns when assign fails rather than echoing some of
      assign.log output and then quitting.  The output to the activity log remains
      unchanged.
      
      Since "parse.proxy" is run from ops I couldn't use my API in it, even though
      it is a perl program.  Instead I parse the errors coming from it in
      parse-ns.
      45f997fd
  13. 15 Dec, 2005 1 commit
  14. 06 Dec, 2005 1 commit
    • Phase II in disk state saving for swapout. · ed0d25b4
      Mike Hibler authored
      Exec summary: after this checkin, the infrastructure exists (once enabled)
      to create swapout-time "delta" images for all machines in experiments.
      There is only a single, cumulative swap image per node (i.e., all diffs
      are from the base image, not from the previous swap).
      
      What doesn't yet exist, is the mechanism for reloading the delta at
      swapin time.  That is Phase III.
      
      The nitty-gritty:
      
      1. Keep disk image signature files for all nodes in an experiment.
      
         New fields in the DB to track, for each disk partition, what image the
         partition was loaded from.  This enables us at swapin or os_load time to
         create signature files in /proj/<pid>/exp/<eid>/swapinfo for the current
         contents of a node disk/partition.  All nodes with the same image loaded
         will share (via symlink) the same signature file.  TODO: no longer
         referenced signature files should be removed.
      
         Signature info is only collected in the swapinfo directory if the
         experiment is set to have disk state saving enabled (see #5 below).
         Info consists of the <vname>.sig file, which is the file created
         by imagehash, and <vname>.part which says what the root disk is
         for the node and whether to look at the whole disk or just a single
         partition when crafting the delta image.
      
      2. Swapout-time hook for creating swapout image.
      
         If the experiment is marked as allowing disk state saving, tbswap
         will arrange to run and then monitor the create-swapimage command
         on each node.  This script will run the modified version of imagezip
         which uses the signature file to create a delta image.
      
         The command to run and maximum timeout are specified via sitevars
         (previously checked in).  Note that the tbswap script currently has
         special knowledge of /usr/local/bin/create-swapimage as a swapout
         time script.  If the swap/swapout_command sitevar is set to that,
         Magic Stuff shall occur (i.e. it will monitor the command and make
         periodic reports of progress).  The sitevars are a total hack and
         will disappear at some point.
      
      3. Client-side script for creating swapout image.
      
         os/create-swapimage, very similar to create-image.  Uses the info
         stashed in /proj/..blahblah../swapinfo to create a delta image.
      
         XXX fer now hack: the script first looks in /proj/<pid>/bin for an
         imagezip binary to use.  Failing that, it uses the one in the MFS.
         This allows for easier development of the imagezip changes (i.e.,
         don't have to update the MFS every time).
      
      4. Auto creation of signature files for new images.
      
         The create_image script (the one that runs on boss when creating images
         for users) has been modified to automatically create a signature via
         imagehash.  The .sig file winds up in /usr/testbed/images/sigs or
         in /proj/<pid>/images/sigs.  From there it will be copied at swapin/os_load
         time to the per-expt swapinfo directory for any node that uses the images.
      
         The process for creating standard system images (aka, "Mike") has not
         yet been modified.  When the image creation/installation procedure
         is formalized into a script, this will be done.
      
      5. Web changes to set/clear saving of disk state at swapout time.
      
         Add a checkbox to the experiment create page to allow setting "save
         swap state".  Also added to the experiment modify page, but currently
         "if (0)"ed out as it will need some additional support.  The showstuff
         page will show it.
      
         Taking a page from Leigh's hack book, if EXPOSESTATESAVE in defs.php3
         is set to zero (as it is now), then the checkbox doesn't appear in the
         create experiment page except for STUDLY users.
      ed0d25b4
  15. 04 Nov, 2005 1 commit
  16. 19 Oct, 2005 1 commit
  17. 30 Sep, 2005 1 commit
  18. 22 Sep, 2005 1 commit
  19. 13 Jul, 2005 1 commit
  20. 31 May, 2005 1 commit
  21. 27 May, 2005 1 commit
  22. 18 May, 2005 1 commit
  23. 19 Apr, 2005 1 commit
  24. 22 Feb, 2005 1 commit
  25. 05 Nov, 2004 1 commit
  26. 30 Aug, 2004 1 commit
  27. 29 Jul, 2004 1 commit
    • Two unrelated bug fixes (with some related cleanups and tweaks) · 9f4edbba
      Leigh B. Stoller authored
      * The first involves swapmod. When a swapmod on an active experiment fails,
        tbswap will reswap the experiment back to the original configuration. The
        problem is that it is reswapping it with the *new* virtual state of the
        experiment in the DB. It is not until later when control returns to
        swapexp that the virtual state is restored. This is plainly wrong, and in
        fact was causing the event scheduler grief cause it was starting up,
        reading the virtual topo, which was different, wrong, and about to be
        blown away.
      
        I reorganized the modify section of swapexp so that virtual state is
        restored only when it's a swapmod on a swapped experiment. On an active
        experiment, I moved that code down into tbswap, which now does all
        of the virtual and physical state restore before it does the reswap back
        to the original experiment. Just for kicks, it's also done if tbswap
        decides to swap the experiment cause of a fatal error.
      
        Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
        deal with !$NoRecover. I know, two knots make a wright for most people.
      
        Renderer: I was annoyed by the fact that we rerun the renderer on a
        failed swapmod. The original reason is that the renderer runs in the
        background and so vis_nodes cannot be saved with the rest of the virtual
        state tables cause the renderer might still be running when the user
        fires off the swapmod. Well, the hell with that. We lock the vis_nodes
        table anyway in the renderer during update, so we are certain to get a
        consistent snapshot. We store the renderer pid in the experiments table,
        so if the renderer was running, just fire off another one; mostly this is
        not going to happen. In addition, tbprerun no longer starts a new
        renderer when doing the swapmod; I start the new renderer later after
        swapmod succeeds. I might end up tweaking this a bit depending on what
        people notice as being different.
      
      * Termination changes to batchexp and swapexp: I've rearranged the
        termination code using an END block so that any uncontrolled exit from
        either batchexp or swapexp will go through the cleanup code, and
        hopefully insert a stats record, as well as not leave the experiment in
        some in-between state. I've set the max DB retry count to zero in both
        cases, which means infinite retry. I've also added SIGTERM handlers to
        both so that again, we can kill a hung batch/swap and have it clean up
        things more or less. Note that END blocks are not run when a signal
        causes the program to die; you have to catch the signal and then die() so that
        the END block is executed.
      
        Eventually, we need to clean up the various libraries so that we do not
        use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
        Ditto for event system interface.
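
      The END block and SIGTERM handling described in this item have a close
      analogue in Python, shown here purely as an illustration (batchexp and
      swapexp are not Python). In this sketch atexit stands in for the END
      block, and cleanup() is a hypothetical placeholder for the real
      cleanup code. As with END blocks, atexit handlers are skipped when an
      unhandled signal kills the process, so the handler converts the signal
      into an ordinary exit.

```python
import atexit
import signal
import sys


def cleanup() -> None:
    # Hypothetical stand-in for inserting a stats record and moving the
    # experiment out of any in-between state.
    print("cleanup ran", flush=True)


def on_sigterm(signum, frame) -> None:
    # Turn the signal into a normal interpreter exit so that the
    # atexit handler (the END block analogue) gets a chance to run.
    sys.exit(128 + signum)


def install_cleanup() -> None:
    atexit.register(cleanup)
    signal.signal(signal.SIGTERM, on_sigterm)
```

      Calling install_cleanup() once at startup covers both a normal exit and
      a kill via SIGTERM; SIGKILL still cannot be caught.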
      9f4edbba
  28. 28 Jul, 2004 2 commits
    • Fix merge error in last revision. · 55575967
      Leigh B. Stoller authored
      55575967
    • Fix rather serious indexing bug that was causing experiment indices · 95ad01c1
      Leigh B. Stoller authored
      to be reused if the DB is dropped and recreated, since when that
      happens, auto_increment history is lost and it will go back to using
      the current highest index in the table. Usually not a problem, but
      since we cross index three other tables using the experiment index,
      this causes quite a bit of grief.
      
      So, my solution is to do my own auto_increment using the
      experiment_stats table (locked of course), which we never delete
      entries from without deleting all entries from the other cross
      referenced tables.
      
          DBQueryFatal("select MAX(exptidx) from experiment_stats");
      
      I also added a sanity check to make sure the new index is not
      currently in use in any of the tables. I also cleaned up the
      error path when something goes wrong.
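
      The allocation scheme above can be sketched as follows. This is an
      illustration under stated assumptions, not the testbed code: it uses
      Python with SQLite as a stand-in for the real database, BEGIN IMMEDIATE
      stands in for the table lock, the list of cross-referenced tables is
      supplied by the caller because their names are not given here, and
      reserving the index by inserting a row is my assumption.

```python
import sqlite3
from typing import Sequence


def next_exptidx(db: sqlite3.Connection, cross_tables: Sequence[str]) -> int:
    """Allocate the next experiment index from experiment_stats."""
    # Take the write lock up front so two allocators cannot race.
    db.execute("BEGIN IMMEDIATE")
    try:
        (highest,) = db.execute(
            "SELECT MAX(exptidx) FROM experiment_stats").fetchone()
        idx = (highest or 0) + 1
        # Sanity check: the new index must not be in use in any of the
        # cross-referenced tables (names are trusted constants, not input).
        for table in cross_tables:
            row = db.execute(
                f"SELECT 1 FROM {table} WHERE exptidx = ?", (idx,)).fetchone()
            if row is not None:
                raise RuntimeError(f"exptidx {idx} already used in {table}")
        # Reserve the index (an assumption of this sketch).
        db.execute("INSERT INTO experiment_stats (exptidx) VALUES (?)", (idx,))
        db.execute("COMMIT")
        return idx
    except Exception:
        db.execute("ROLLBACK")
        raise
```

      The sketch expects a connection opened with isolation_level=None so
      that the explicit BEGIN and COMMIT statements are the only transaction
      control in play.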
      95ad01c1
  29. 15 Jul, 2004 1 commit
  30. 29 Jun, 2004 1 commit
  31. 21 May, 2004 1 commit
  32. 17 May, 2004 1 commit
  33. 29 Apr, 2004 1 commit
    • Add prelim support for using linktest. Because of problems, this is · 6cdccbd2
      Leigh B. Stoller authored
      currently available to only people with stud=1 status in the DB.
      
      * www/tbauth.php3: Add a STUDLY() function to check that bit.
      
      * www/linktest.php3: New page to run linktest on the fly. The level
        defaults to the current level in the experiments table, but you can
        override that via the form on the page.
      
      * www/showexp.php3: Add link to aforementioned page. STUDLY() only.
      
      * www/beginexp_form.php3: Add an option (selection) to set the linktest
        level for create/swapin. Defaults to 0 (no linktest). STUDLY() only.
      
      * www/editexp.php3: Add an option to edit the default linktest level
        for an experiment. STUDLY() only.
      
      * tbsetup/batchexp.in and tbsetup/swapexp.in: Add code to optionally run
        the linktest, sending email if it fails (exits with non-zero status).
        Failure does not affect the swapin.
      6cdccbd2
  34. 07 Apr, 2004 1 commit
  35. 15 Mar, 2004 1 commit
  36. 09 Mar, 2004 1 commit
    • Clean up of the web to batchexp interface: · b6a9b9c2
      Leigh B. Stoller authored
      * Add proper check_slot() calls to all of the user input that is going into
        the DB (already had taint checking), since batchexp is now available for
        interactive use from ops.
      
      * Remove separate DB insertions of noswap/noidleswap reasons from web
        script, and pass on the command line from web to batchexp. Now inserted
        in the backend script so that they can be provided on the command line
        when batchexp is used interactively.
      
      * Change defaults in backend script; experiments now default to swappable
        and idleswap; previously defaulted to not swappable and no idleswap.
      
      * Remove [-s] (swappable) and add [-S <reason>] option. -S sets experiment to
        not swappable, with supplied reason (text string).
      
      * Add [-L <reason>] option. -L sets experiment to no idleswap, with
        supplied reason (text string).
      
      * Add several missing table_regex entries for experiments table.
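
      The new defaults and the -S/-L options from the items above can be
      summarized in a small sketch. It is written in Python with argparse
      purely for illustration (batchexp is not Python), the function and key
      names are hypothetical, and all other batchexp options are ignored.

```python
import argparse
from typing import Dict, List, Optional, Union


def parse_swap_options(argv: List[str]) -> Dict[str, Union[bool, Optional[str]]]:
    """Map -S/-L onto swappable/idleswap flags plus their reasons."""
    parser = argparse.ArgumentParser(prog="batchexp", add_help=False)
    parser.add_argument("-S", dest="noswap_reason", metavar="reason")
    parser.add_argument("-L", dest="noidleswap_reason", metavar="reason")
    # Ignore every other option; only -S and -L matter for this sketch.
    opts, _rest = parser.parse_known_args(argv)
    return {
        # Experiments default to swappable and idleswap unless told otherwise.
        "swappable": opts.noswap_reason is None,
        "noswap_reason": opts.noswap_reason,
        "idleswap": opts.noidleswap_reason is None,
        "noidleswap_reason": opts.noidleswap_reason,
    }
```

      Supplying a reason is what flips the flag, which mirrors the rule that
      an experiment can only be made non-swappable with a stated reason.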
      b6a9b9c2
  37. 20 Feb, 2004 1 commit
    • Hmm, looks to me like I got distracted while merging startexp into · 7d9be6de
      Leigh B. Stoller authored
      batchexp, and forgot to finish the changes! The result was a fairly
      broken batch system, which is now hopefully fixed!
      
      Took the opportunity to remove the -x (expires) and -l (priority)
      options which are no longer referenced anywhere.
      
      Fix up email message so that idle/auto swap times are in hours not
      minutes.
      
      Provide a proper usage() function that describes the morass of
      options (for interactive use from ops).
      7d9be6de
  38. 12 Feb, 2004 1 commit
    • * Removed startexp, and merged its contents into batchexp. There has been · aef08532
      Leigh B. Stoller authored
        no reason for the separation for a long time, and it made maintenance more
        difficult cause of duplication between batchexp and startexp (batch was
        the sole user of startexp). Cleaner solution.
      
      * Check argument processing for batchexp, swapexp, endexp to make sure the
        taint checks are correct. All three of these scripts will now be
        available from ops. I especially watched the filename processing, which was
        pretty loose before and could allow someone to grab a file on boss by trying
        to use it as an NS file (the scripts all run as the user, of course). The web
        interface generates filenames that are hard to guess, so rather than
        wrapping these scripts when invoked from ops, just allow the usual paths
        (/proj, /groups, /users) but also /tmp/$uid-XXXXXX.nsfile pattern, which
        should be hard enough to guess that users will not be able to get
        anything they are not supposed to.
      
      * Add -w (waitmode) options to all three scripts. In waitmode, the backend
        detaches, but the parent remains waiting for the child to finish so it
        can exit with the appropriate status (for scripting). The user can
        interrupt (^C), but it has no effect on the backend; it just kills the
        parent side that is waiting (backend is in a new session ID). Log output
        still goes to the file (available from web page) and is emailed.
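
      The waitmode behavior described in the last item can be sketched as a
      fork-and-wait pattern. This is an illustrative Python sketch for POSIX
      systems, not the actual implementation: the function name is
      hypothetical, the backend is any callable returning an exit status, and
      returning 130 after ^C is my own convention.

```python
import os
from typing import Callable


def run_in_waitmode(backend: Callable[[], int]) -> int:
    """Run `backend` detached; wait for it and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: move into a new session so a ^C aimed at the parent's
        # terminal does not reach the backend.
        status = 1
        try:
            os.setsid()
            result = backend()
            if isinstance(result, int):
                status = result & 0xFF
        finally:
            # Leave immediately without running the parent's cleanup.
            os._exit(status)
    try:
        _, wait_status = os.waitpid(pid, 0)
    except KeyboardInterrupt:
        # Only the waiting side stops; the backend keeps running.
        return 130
    if os.WIFEXITED(wait_status):
        return os.WEXITSTATUS(wait_status)
    return 128 + os.WTERMSIG(wait_status)
```

      A wrapper script can then exit with the returned value, which is what
      makes the mode usable for scripting.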
      aef08532
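
      The filename rule from the argument-checking item of aef08532 (the
      usual /proj, /groups and /users paths, plus the /tmp/$uid-XXXXXX.nsfile
      pattern) could be sketched as below. This is a Python illustration with
      hypothetical names. It assumes XXXXXX means six ASCII letters or
      digits, and it does not resolve symlinks, which a real check would also
      need to consider.

```python
import posixpath
import re

ALLOWED_ROOTS = ("/proj", "/groups", "/users")


def nsfile_allowed(path: str, uid: str) -> bool:
    """Return True if `path` is an acceptable NS file location for `uid`."""
    if not posixpath.isabs(path):
        return False
    # Collapse "." and ".." so /proj/../etc/passwd cannot slip through.
    norm = posixpath.normpath(path)
    if any(norm.startswith(root + "/") for root in ALLOWED_ROOTS):
        return True
    # Web-generated temporary file: /tmp/<uid>-XXXXXX.nsfile
    tmp_pattern = r"/tmp/" + re.escape(uid) + r"-[A-Za-z0-9]{6}\.nsfile"
    return re.fullmatch(tmp_pattern, norm) is not None
```

      Tying the /tmp pattern to the caller's own uid keeps one user from
      naming another user's temporary file even if the suffix is guessed.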
  39. 09 Feb, 2004 1 commit