1. 30 Nov, 2006 1 commit
      · 1253f479
      Kevin Atkinson authored
      IO::Handle::opened method doesn't work when a ref to STDOUT/ERR is stored in
      a variable since it is really a glob in perl 5.005.  In 5.8 it works for some
      reason.  To fix, use IO::Handle::opened($$this), not $$this->opened().
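      A minimal sketch of the function-style call, assuming the object is a
      blessed scalar ref that holds the handle.  The HandleBox class and its
      is_open method are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical wrapper class: the object is a scalar ref holding a handle,
# so $$this is the handle itself (here a ref to the STDOUT glob).
package HandleBox;

sub new {
    my ($class, $fh) = @_;
    return bless \$fh, $class;
}

# Call opened() as a plain function so that no method lookup on $$this
# is needed; this is the form the commit switches to.
sub is_open {
    my $this = shift;
    return IO::Handle::opened($$this) ? 1 : 0;
}

package main;

my $out = HandleBox->new(\*STDOUT);
print "STDOUT is open\n" if $out->is_open;
```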
  2. 06 Nov, 2006 1 commit
      libaudit related changes: · e89ee617
      Kevin Atkinson authored
        - Added "LIBAUDIT_FANCY" option to AuditStart.  When this option is
          used libaudit will send a different email than it normally sends,
          and on error call tblog_find_error() to determine the error.
      
        - Also add audit function AddAuditInfo which adds additional
          information for libaudit to use in SendAuditMail when AUDIT_FANCY
          is set.
      
        - Modify template_swapin, template_instantiate, and template_create
          to use the new audit functionality.
      
        - Suppress calling tblog_find_error and sending the error email
          when auditing in swapexp and batchexp.
      
      tblog changes:
      
        - Shorten the message sent to the user when the error is unknown.
          Remove all parts about lack of free nodes as it no longer really
          applies as tblog now correctly identifies those errors and handles
          them separately.  The message is now just "Please look at the log
          below to see what happened."
      
        - Improve the algorithm used to determine the other error when canceled.
          It now works by removing all errors related to the cancel request
          and then essentially rerunning tblog_find_error.  If the cause of
          the error is still canceled, repeat until the cause
          is something other than canceled or no errors are left.
      
        - Refactor tblog_find_error, which involves creating new internal
          functions: tblog_determine_single_error, tblog_store_error,
          tblog_dump_error
      
        - Add section on Primary vs Secondary Errors to the inline POD
          documentation.
      
        - Other minor enhancements and bug fixes.
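
      The cancel handling described in the tblog changes above (drop the
      errors tied to the cancel request, rerun the search, repeat) might be
      sketched as follows.  This is only an illustration: find_error is a
      stand-in for tblog_find_error, and the "request" key used to group
      errors is an assumption, not the real schema.

```perl
use strict;
use warnings;

# Stand-in for tblog_find_error (hypothetical): here the most recently
# logged error is simply taken as the cause.
sub find_error {
    my @errors = @_;
    return @errors ? $errors[-1] : undef;
}

# Keep discarding cancel-related errors until some other cause shows up
# or no errors are left.
sub find_other_error {
    my @errors = @_;
    while (@errors) {
        my $err = find_error(@errors);
        return $err if $err->{cause} ne 'canceled';
        # Remove every error tied to this cancel request, then look again.
        my $request = $err->{request};
        @errors = grep { $_->{request} ne $request } @errors;
    }
    return undef;
}
```

      Each pass removes at least the error that was just found, so the loop
      always terminates.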
  3. 26 Oct, 2006 1 commit
      Various tblog changes: · 95a3a6a7
      Kevin Atkinson authored
        Make an attempt to discover what the error was before a swap-* was
        canceled, if any.  Both the main error (canceled) and the other error
        are stored in the error table.  To support this, a new column, "rank",
        is added to the error table.  The primary error has rank 0 while the
        other error has rank 1.
      
        Make an attempt to determine when an error is a "me too" error or the
        real cause of the problem.  "Me too" errors are errors which are
        generally reported when the caller script determines that the callee
        script failed.  The callee script should have reported the error, but
        in some cases the error didn't make it into the database.  Thus if a
        "me too" is reported as the cause of a "swap-*" more info is needed to
        determine the true cause.  When a "me too" error is reported it is
        followed by a "..." on its own line.  It is also recorded in the
        errors table under the new column "need_more_info".
      
        Add inferred column to the errors table.  This is the same value as
        the inferred variable in tblog_find_error.
      
        Add revision column to errors table to make it easy to tell which
        algorithm was used to determine the error.
  4. 27 Sep, 2006 1 commit
      · 7293bbc0
      Kevin Atkinson authored
      Second attempt to fix the problem of duplicate log entries.  I am
      99.99% sure this will get 100% of the cases, and 99.999% sure it won't
      break anything.
      
      It basically detects when the DB handle is in a child process and if so sets
      "InactiveDestroy" before the database handle DESTROY method is called.
      Since the DB handle can be closed in several different places I created a
      new class to override the DB handle (the Mysql class) DESTROY method. The
      other alternative is to add special code anywhere the database handle
      could be destroyed, which is whenever a reconnect is done and when the
      module exits.  The latter would have involved putting code in the END block.
      I think the new class method is simpler for that reason.
      
      
      Also, add a note about patching Mysql.pm in doc/UPDATING.
  5. 31 Aug, 2006 1 commit
      · 964b8d11
      Kevin Atkinson authored
      Add patch to modify Mysql.pm to allow setting "InactiveDestroy" in
      the underlying DB handle.  Also avoid disconnecting the file handle
      explicitly on DESTROY as that will be taken care of in the DESTROY
      method for the DB handle.
      
      Override perl version of fork() to set InactiveDestroy in all open
      database handles in the child so that it won't send a disconnect when
      the handle is destroyed as this will also close the database handle
      for the parent.  It will also call tblog_new_child_process in the
      child process to properly inform tblog of the new process. This will
      be a NoOp if the libtblog module is not loaded.
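
      A rough sketch of the child-side handling, under stated assumptions:
      tb_fork and @DB_HANDLES are hypothetical names, and plain hash refs
      stand in for real DBI handles so the example runs without a database.
      A real DBI handle takes the same $dbh->{InactiveDestroy} = 1 assignment.

```perl
use strict;
use warnings;

# Hypothetical registry of open database handles.  Plain hash refs stand in
# for DBI handles here so that no database is needed to run the sketch.
our @DB_HANDLES;

# fork() wrapper: in the child, mark every inherited handle so that
# destroying it does not send a disconnect over the shared connection.
sub tb_fork {
    my $pid = fork();
    return undef unless defined $pid;    # fork failed
    if ($pid == 0) {
        $_->{InactiveDestroy} = 1 for @DB_HANDLES;
    }
    return $pid;
}
```

      The parent's copies of the handles are left untouched, so the parent
      still disconnects normally when it is done.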
  6. 25 Aug, 2006 1 commit
      · 312021d4
      Kevin Atkinson authored
      More tbreport changes from Mike Kasick <mkasick@andrew.cmu.edu>:
      
      - Added tblog support to nscheck.
      
      - Added ns_parse_failed error to nscheck.
      
      - Added invocation column to report_errors to differentiate between assign
        runs in infeasible resource assignments.
  7. 16 Aug, 2006 1 commit
      · 9c5d3308
      Kevin Atkinson authored
      - Added tbreport database schema (added three tables), storage for
        tbreport errors & context.
      
      - Modified fatal() in swapexp, batchexp, and tbprerun, and die_noretry()
        in os_setup to pass hash parameter to tblog functions.
      
      - Added tbreport error & context information for select errors in
        swapexp, tbswap, assign_wrapper2, snmpit_lib, snmpit, batchexp,
        assign_wrapper, os_setup, parse-ns, & tbprerun.
      
      - Added assign error parser in assign_wrapper2.
      
      - Added parse.tcl error parser in parse-ns.
      
      - Added severity constants for tbreport in libtblog_simple.
      
      - Added tbreport() function & context table mapping for reporting
        discrete error types to libtblog.
  8. 14 Aug, 2006 1 commit
      · 07dda0d8
      Kevin Atkinson authored
      Prep for Mike Kasick report code.  Updated database schema and
      installed hooks for his code.
      
      Cleaned up how errors were handled in tblog(...).
      
      Allow SENDMAIL to be called before the path is untainted in '-T' scripts.
      
      Other small changes.
  9. 20 Jul, 2006 1 commit
      · 5710c340
      Kevin Atkinson authored
      Various tblog changes:
      
      Added message about recovery action when a swap-modify failed to the
      top of the email.
      
      Fine-tuned os_setup summary error.  Added (possibly partial) list of
      nodes that failed; if a large number fail, only show as many as will
      fit on a single line.  Other tweaks.
      
      Flagged assign_wrapper errors of an Invalid OS as user errors.
  10. 05 Jul, 2006 1 commit
      · 183040de
      Kevin Atkinson authored
      Many changes to tblog code.  Database update needed:
      
      1) Added summary of failed nodes in os_setup.  The cause of the error is now
      classified as "user" if it is only user images that failed and the user
      image failed on every pc of a particular type.  Otherwise I leave the cause
      as "unknown" since it is really hard to tell what the real cause is.
      
      2) Raised the confidence threshold for most errors so that they will appear
      on the top.
      
      3) Added a special error when an experiment is canceled.  The cause is
      "canceled" and testbed-ops won't see these errors.
      
      4) Fixed a bug in assign_wrapper where it will incorrectly report "This
      experiment cannot be instantiated on this testbed..." when really the user
      canceled the swapin.
      
      5) Fixed a bug where os_setup errors were being incorrectly reported as
      assign errors.  This happens when os_setup fails for some reason and
      tbswap tries again, but the second time around there are not enough nodes.
      So the last error is coming from assign even though the true cause of the
      error is due to failed nodes.  The fix for this involved adding a new column
      to the log table, "attempt", which will be 1 for the first attempt and then
      incremented for each new attempt.  tblog_find_error will then simply ignore
      any errors with "attempt > 1".
      
      6) Also fixed a potential problem when there is an error during the cleanup
      phase by adding another column "cleanup".  tblog_find_error will
      also ignore any errors with the cleanup bit set.
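
      The filtering described in items 5 and 6 could look roughly like the
      sketch below.  The "attempt" and "cleanup" keys mirror the new log
      table columns; candidate_errors and the rest of the row layout are
      assumptions made for illustration.

```perl
use strict;
use warnings;

# Keep only errors from the first attempt that did not occur during the
# cleanup phase.  Rows are hash refs with "attempt" and "cleanup" keys.
sub candidate_errors {
    my @rows = @_;
    return grep { $_->{attempt} <= 1 && !$_->{cleanup} } @rows;
}
```

      With this filter, an assign error logged on the second attempt no
      longer hides the os_setup failure that caused the retry.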
  11. 29 May, 2006 1 commit
  12. 08 May, 2006 1 commit
      · 95f529d3
      Kevin Atkinson authored
      Refactor "log" table to move some stuff into a new table.
  13. 27 Mar, 2006 1 commit
      · d8625ddd
      Kevin Atkinson authored
      Change the email going to testbed-errors from:
      
        From: user
        To: user
        Cc: testbed-errors
      
      to
      
        From: testbed-ops
        To: user
        Bcc: testbed-errors
        X-NetBed-Cc: testbed-errors
      
      This should cause all replies to these messages to go to testbed-ops
      instead of testbed-errors.  The only thing is that you need to be
      careful when replying, since if you only reply to the sender then it
      will go to testbed-ops only and not to the user.  I was also thinking
      about changing the other swap* failure messages still going to
      testbed-ops in a similar way, but due to the reply issue for those on
      testbed-ops I will hold off on that for now unless someone else thinks
      it's a good idea.
      
      The addition of the X-NetBed-Cc header is to make filtering the email
      easier since the fact that it is going to testbed-errors will no
      longer be in the header.  This header is also present in the swap*
      failure messages still going to testbed-ops.
  14. 24 Mar, 2006 1 commit
      · bcbd18aa
      Kevin Atkinson authored
      Have swapexp/batchexp dump the error when the "-w" option is
      specified.  The error will look something like:
      
        ERROR:: <cause desc>
      
        <text of the error>
      
        Cause: <cause>
        Confidence: <confidence>
      
      This will be the last thing printed.  The "::" is there to make
      recognizing the error easy for scripts since they can just look for
      "ERROR::".
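
      A script consuming this output might pick out the error block along
      these lines.  This is a sketch only: parse_error_block is a
      hypothetical helper, and it assumes the block starts at column 0 and
      is laid out exactly as shown above.

```perl
use strict;
use warnings;

# Pull the trailing error block out of captured swapexp/batchexp output.
# Returns a hash ref, or undef when no "ERROR::" block is found.
sub parse_error_block {
    my ($output) = @_;
    my ($desc, $text, $cause, $confidence) = $output =~
        /^ERROR:: ([^\n]*)\n\n(.*?)\n\nCause: ([^\n]+)\nConfidence: ([^\n]+?)\s*\z/ms
        or return undef;
    return {
        desc       => $desc,
        text       => $text,
        cause      => $cause,
        confidence => $confidence,
    };
}
```

      Anchoring the pattern at the end of the output relies on the error
      being the last thing printed.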
  15. 23 Mar, 2006 1 commit
      · d3ca9c2d
      Kevin Atkinson authored
      Change @TBBASE@ to @TBDOCBASE@ when referring to KB entry.
  16. 21 Mar, 2006 2 commits
      · 1fa07472
      Kevin Atkinson authored
      Fix bug causing strange errors from snmpit due to an invalid assumption
      about __DIE__ handler in libtblog.pm.in.
      · d258dde6
      Kevin Atkinson authored
      Changed format of email sent to user on errors. The error will now
      appear instead of the generic message when I am confident it is
      accurate.  The subject line will also change to reflect the cause of
      an error.
      
      Avoid sending mail to testbed-ops during failed swap related events
      in some cases.  It will instead be sent to a new mailing list
      testbed-errors.
      
      Added a new row in the experiment info table "Last Error:" which
      states the cause of the error, and links to a new page displaying the
      error.
      
      Made some assign/assign_wrapper errors more informative.
      
      The error (as determined by tblog) is now stored in the database in a
      more structured fashion.  This includes adding a column for the session
      (in the log table) to testbed_stats to link each swap event with the
      logs and possibly the error.
      
      Other changes to the database, see sql/database-migrate.txt
  17. 13 Mar, 2006 1 commit
      · 6e488e77
      Kevin Atkinson authored
      Added inline POD documentation of libtblog.
      
      TODO: Automatically generate HTML page from it and have it installed
      with the other HTML docs.
  18. 23 Feb, 2006 1 commit
      · 10fd4a08
      Kevin Atkinson authored
      Fix tied STDIN and STDOUT with perl 5.8 in libtblog.
      
      Fix a prototype warning in os_load.in
  19. 21 Feb, 2006 1 commit
      Neuter the perltie stuff under perl 5.8 since it does not work properly. · 85488e5b
      Leigh B. Stoller authored
      I got close to getting it to work by adding this:
      
      	sub FILENO  { my $this = shift; fileno($$this) }
      	sub CLOSE   { my $this = shift; close($$this) }
      
      	sub OPEN {
      	    my $this = shift;
      
      	    close($$this) if defined(fileno($$this));
      	    @_ == 1 ? open($$this, $_[0]) : open($$this, $_[0], $_[1]);
      	}
      
      But subprocesses were not seeing the right stdout/stderr after doing
      something like:
      
      	open(STDERR, ">> $logname");
      	open(STDOUT, ">> $logname");
      
      So, I will let Kevin work on it; I've spent too much time on it
      already!
  20. 09 Feb, 2006 1 commit
  21. 26 Jan, 2006 1 commit
      · 05015359
      Kevin Atkinson authored
      Merged in changes from tblog-2-branch:
      
                Move parts of libtblog into libtblog_simple.  libtblog_simple
                provides the basic logging functions but doesn't touch anything.
                Moreover including libtblog_simple doesn't automatically start
                the logging subsystem.  It also doesn't have testbed dependencies
                which means 1) it can be used in the core testbed libraries (such
                as libdb, libtestbed) without introducing a circular dependency
                and 2) can be used independently.
      
                Reworked DBFatal and DBWarn to use tblog.  It will still email
                testbed-ops, however.
      
                Make use of the "cause" field to determine the cause of the bug.
                In particular tblog_find_error will look at the value of this
                field and report the "cause".  In the future different actions
                can be taken based on the ultimate "cause" of the bug, such as if
                testbed-ops should be notified.
      
                Change format of error messages reported by libtblog, as per the
                email "Format or Error Messages" to testbed-dev.
      
                Have libtblog use its own Database handle to avoid problems with
                locked tables.
      
                Also set DBCONN_MAXTRIES to 3 for most important queries.  For
                queries that are not important don't send mail on error.
  22. 19 Dec, 2005 1 commit
      · 45f997fd
      Kevin Atkinson authored
      Updates to Error Logging API Code.
      
      You should start seeing much better error messages coming from my
      system.  Errors coming from parse.proxy and assign (the two most
      frequent sources of errors) should now be concise and to the point.
      Errors coming from libosload/libreboot (the next most frequent source
      of errors) should now also be much better, but not perfect.  Getting
      perfect errors will likely require a rework of how errors are handled in
      libosload/libreboot; just adding tberror/tbwarn/tbnotice calls is not
      enough.  I can do this at a later date if necessary.
      
      A few minor database changes.
      
      Some changes to the API.  A few bug fixes.  Lots of tberror/tbwarn/tbnotice
      calls added to scripts.
      
      Since assign is a C program, and at this time my API is perl only, I wrote a
      second wrapper around assign, assign_wrapper2.  When assign fails errors are
      now parsed in assign_wrapper2, sent to stderr and logged.  This means that
      RunAssign() just returns when assign fails rather than echoing some of
      assign.log output and then quitting.  The output to the activity log remains
      unchanged.
      
      Since "parse.proxy" is run from ops I couldn't use my API in it, even though
      it is a perl program.  Instead I parse the errors coming from it in
      parse-ns.
  23. 04 Nov, 2005 1 commit