- 10 Nov, 2010 1 commit
-
-
Mike Hibler authored
stated POWER* triggers will now actually do something!
-
- 29 Sep, 2010 2 commits
-
-
Mike Hibler authored
-
Mike Hibler authored
Under load, nodes that have just entered reloading and have just rebooted might fail to get bootinfo. The default behavior in this case is for the node to boot from disk (dubious, but that is the topic for another day). This causes the node to fall off the RELOAD path, winding up in either TBFAILED or ISUP. Worse, if the node makes it to ISUP, its reload state is cleared and even if the reload_daemon reboots the node, it will still not go through the reloading process. The result is a bunch of nodes left in reloading. Now if a node makes an invalid transition to TBFAILED or ISUP while in the RELOAD state machine, it fires the new REBOOT trigger which does...well, you figure it out. Note that in the ISUP case, this trigger overrides the default that would otherwise clear the reload state--so reboot is sufficient to get the machine back on the RELOAD track.
-
- 05 Jan, 2010 2 commits
-
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
I added a sig handler to reopen the log file. See corresponding changes to doc/UPDATING and to install/boss-install.in
-
- 04 Aug, 2009 1 commit
-
-
Kevin Atkinson authored
at once with Frisbee (excludes the actual MFS changes).

os_load now takes a list of comma separated image names for the "-i" and "-m" options. The default OS is the OS for the last image specified in the list. I also changed the "-p" option of osload to search both the project specified and emulab-ops for the image rather than just the project specified in order to simplify specifying multiple images (and because I personally found that behavior annoying when using osload).

I modified the current_reloads table to be able to specify more than one image for a node by adding an "idx" column which controls the order of the reloads. I also added a "prepare" column to the table (explained below). I modified tmcd to basically loop over the entries in the table and create a multiline LOADINFO response, and modified rc.frisbee to handle the multiline response and load each image in turn.

I modified os_load to take a new option "-P" which will tell rc.frisbee to zap the superblocks even if a whole disk image is not specified. To do this I set the prepare entry for the first image in the current_reloads table to true. Tmcd then passes this on to rc.frisbee in the LOADINFO line. When rc.frisbee sees this it will make sure to zap the superblock before loading that image.

To support having multiple images as the default, "default_imageid" can now be a comma separated list. I implemented a hack to be able to set multiple imageids via editnodetype.php3. Basically the form splits default_imageid into default_imageid_0, default_imageid_1, etc. and then adds an empty default_imageid_# slot to allow adding an imageid. Multiple images can be added by adding one image, then submitting the form, and then adding another into the empty slot. Not the best, but I don't think this will be a very common operation. When the form is submitted it will then combine all default_imageid_# into a comma separated list ignoring any that are deleted or set to "No ImageID" (ie 0).

Everything will work fine with old MFSs as long as only one image is loaded. If multiple images are loaded with an old MFS, an email will be sent to testbed-ops. This works by having tmcd detect old MFSs by using the version number and setting the state to RELOADOLDMFS. Stated will pick up on this and send the email to testbed-ops via a trigger.
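
The split/combine step for default_imageid described above can be sketched roughly as follows. This is an illustration in Python, not the editnodetype.php3 code: the function names are hypothetical, and representing a deleted slot as an empty string is an assumption (the message only says that "No ImageID" is 0).

```python
PREFIX = "default_imageid_"
NO_IMAGEID = "0"  # "No ImageID" is 0 per the commit message


def split_default_imageid(value):
    """Split a comma separated default_imageid into numbered slots plus one empty slot."""
    ids = [v for v in value.split(",") if v]
    slots = {f"{PREFIX}{i}": v for i, v in enumerate(ids)}
    # Trailing empty slot so that one more imageid can be added per submit.
    slots[f"{PREFIX}{len(ids)}"] = NO_IMAGEID
    return slots


def combine_default_imageid(form):
    """Recombine default_imageid_# slots in index order, skipping unset ones."""
    numbered = sorted(
        (int(key[len(PREFIX):]), value)
        for key, value in form.items()
        if key.startswith(PREFIX) and key[len(PREFIX):].isdigit()
    )
    # Assumption: a deleted slot arrives as an empty string.
    return ",".join(v for _, v in numbered if v not in ("", NO_IMAGEID))
```

Adding a second image then amounts to filling the empty slot and recombining, which matches the one-image-per-submit workflow described in the message.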
-
- 06 Mar, 2007 1 commit
-
-
Leigh B. Stoller authored
indexed by exptidx. I also got the last of the pid and pid,gid tables.
-
- 30 Aug, 2006 1 commit
-
-
Leigh B. Stoller authored
mixed together, and port registrations are not made. The one case currently handled is when the syncserver node goes ISUP, but has not reported its port. In this case, it must be an old image and so we place a port registration in for it.
-
- 26 Apr, 2006 1 commit
-
-
Leigh B. Stoller authored
-
- 25 Apr, 2006 1 commit
-
-
Leigh B. Stoller authored
5.0 client side is very picky about this.
-
- 07 Feb, 2006 1 commit
-
-
Leigh B. Stoller authored
-
- 01 Dec, 2005 1 commit
-
-
Russ Fish authored
-
- 19 Aug, 2005 1 commit
-
-
Leigh B. Stoller authored
frisbee MFS, and stated will send an apod out to it. This updates the previous revision in which I was doing this for RELOADDONE state, so that we maintain backwards compatibility with older frisbee MFSs.
-
- 18 Aug, 2005 1 commit
-
-
Leigh B. Stoller authored
the "race" problem.
-
- 12 Jan, 2005 1 commit
-
-
Leigh B. Stoller authored
out of the reserved table. Mostly this happens in nfree and nalloc, but there are a couple of other moves, in libdb and in the reload daemon. The uid and experiment are stored, along with a timestamp.
-
- 19 Aug, 2004 1 commit
-
-
Leigh B. Stoller authored
-
- 18 Aug, 2004 1 commit
-
-
Leigh B. Stoller authored
"arbitrary" script as defined in the stated_triggers table. Currently using this to invoke the new opsreboot script whenever ISUP comes in from ops. The opsreboot script is currently a skeleton. All it does is send email. I'll add the rest later (which really won't be much at first; just getting the event schedulers started).
-
- 22 Jul, 2004 1 commit
-
-
Leigh B. Stoller authored
goes whacky.
-
- 11 Feb, 2004 1 commit
-
-
Robert Ricci authored
node to get ignored.
-
- 15 Jan, 2004 1 commit
-
-
Mac Newbold authored
- add functions to recursively dump hashes and arrays into a string suitable for printing as debugging output (great for data structures)
- add three new trigger strings
- add 'use strict', do corresponding cleanup

stated changes:
- move special-cased stuff in handleEvent for PXEBOOTING and BOOTING into triggers (PXEBOOTING, BOOTING, and CHECKGENISUP)
- clarify (via comments) the existing kinds of triggers and which ones run when, and add a new kind (global "any-mode" triggers). We already had per-node mode-specific, per-node any-mode, and global mode-specific triggers. Now you can have a trigger that is good for any mode in a given state, that can be overridden on a mode-specific basis. This is great for PXEBOOTING, BOOTING, and ISUP, since they each have a trigger list that should be run regardless of what mode you're in. Now they only require 3 entries instead of 3*N that have to be maintained per mode.

# A note about triggers:
#
# "per-node" triggers only affect their specific node in a
# particular mode/state, and are run first of all. "global"
# triggers are triggers for a given mode/state that affect all
# nodes, and are run after any per-node triggers. "Any-mode"
# triggers are tied to a state, and occur in that state in any
# mode. The any-mode triggers are over-ridden by global triggers,
# and if an "Any-mode" trigger for state XYZ exists as well as a
# global trigger for mode FOOBAR state XYZ, then when I arrive in
# XYZ any per-node triggers will be run. Then, if I'm in mode
# FOOBAR, only the global trigger will run. If I'm in any other
# mode, only the any-mode trigger will run.
# (our "*" is stored as $TBANYMODE)
# Per-node triggers have a specific node_id
# Global triggers have "*" as the node_id
# Any-mode triggers have "*" as the mode, and can be global or per-node

The updated table looks like this in the accompanying change to database-fill.sql:

+---------+----------+------------+-----------------------+
| node_id | op_mode  | state      | trigger               |
+---------+----------+------------+-----------------------+
| *       | *        | BOOTING    | BOOTING, CHECKGENISUP |
| *       | *        | ISUP       | RESET                 |
| *       | *        | PXEBOOTING | PXEBOOT               |
| *       | RELOAD   | RELOADDONE | RESET, RELOADDONE     |
| *       | ALWAYSUP | SHUTDOWN   | ISUP                  |
+---------+----------+------------+-----------------------+

- I also cleaned up the functions that add, get, and delete triggers. Before, the get function didn't include global triggers. Now it does, and has an option to just get the per-node triggers. Add and delete are still just per-node, of course.
- Also found and fixed some little bugs while I was in there. (global triggers not taking a list,

These changes are me getting ready to re-add all the changes I made months ago in order to do a before-and-after experiment for my thesis. Between now and the end of next week I'll be working on taking before numbers, patching stated with the changes, and getting after numbers. The problems I'm trying to replicate are the problems and slowdowns we used to get when os_{load,setup} would reboot a node, thinking it had timed out, when it really didn't know whether it was making progress or not. The fix includes making os_{load,setup} depend on stated to watch for progress and timeouts, and do any appropriate retries. Part of that is the StateWait stuff, that lets programs watch for events easily, and the node_reboot-with-events stuff that puts stated in control of nodes as they reboot.
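
The lookup order in the note about triggers can be sketched as below. This is a Python illustration, not the stated (Perl) code: the function name, the dictionary layout, and the "NORMAL" mode used in the usage lines are hypothetical. Applying the same "mode-specific overrides any-mode" rule to per-node entries is also an assumption, since the note only spells it out for global triggers.

```python
ANY = "*"  # stands in for $TBANYMODE and for the "*" node_id

# (node_id, op_mode, state) -> trigger list, mirroring the table above.
TRIGGERS = {
    (ANY, ANY, "BOOTING"): ["BOOTING", "CHECKGENISUP"],
    (ANY, ANY, "ISUP"): ["RESET"],
    (ANY, ANY, "PXEBOOTING"): ["PXEBOOT"],
    (ANY, "RELOAD", "RELOADDONE"): ["RESET", "RELOADDONE"],
    (ANY, "ALWAYSUP", "SHUTDOWN"): ["ISUP"],
}


def triggers_for(table, node_id, op_mode, state):
    """Return the triggers to run, per-node entries first, then global ones."""
    result = []
    for owner in (node_id, ANY):
        specific = table.get((owner, op_mode, state))
        if specific is not None:
            # A mode-specific entry overrides the any-mode entry for this state.
            result.extend(specific)
        else:
            result.extend(table.get((owner, ANY, state), []))
    return result


# Usage: a node in some ordinary mode reaching ISUP gets the any-mode trigger.
print(triggers_for(TRIGGERS, "pc5", "NORMAL", "ISUP"))
# A global mode-specific entry wins over the any-mode one.
print(triggers_for(TRIGGERS, "pc5", "RELOAD", "RELOADDONE"))
```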
-
- 12 Jan, 2004 1 commit
-
-
Leigh B. Stoller authored
handling PXEWAKUP timeouts, retrying 3 times and then forcing a power cycle. Changed BOOTING event action to auto switch in and out of the special PXEKERNEL state machine that all local nodes use since all local nodes boot the same pxeboot kernel and talk to bootinfo (as directed to by dhcp).
-
- 07 Jan, 2004 1 commit
-
-
Leigh B. Stoller authored
probably imperfect, but better than nothing.

New option, "-t tag" allows you to specify an arbitrary tag to match against the stated_tag of the nodes table. The stated invocation will only operate on nodes that match the tag, ignoring all events for other nodes. If unspecified, stated will operate on all nodes with a NULL tag. This is set up at the beginning of time (or during a reload) saving the per-node tag in the $nodes hash. Each time an event arrives, check the tag in the table, ignoring the event if not a match. On a signaled reload(), stated must also be careful to throw away timeouts from the queue (and be careful not to set up new timeouts for ignored nodes).

So, this allows you to set the tag for a node in the DB, and then HUP stated so that it reloads its tables. That node will now be ignored by that stated.

Also made some changes to debug mode. In debug mode, don't worry about the pidfile or the lockfile or checking for other running stated (which causes my debug version to exit right away!). Also, added a new -l option to turn off syslog output and just send it all to stdout with the debug output. -l can only be used with -d of course.

So what can I do with all this:

    update nodes set stated_tag='lbs' where node_id='pc5';
    sudo kill -HUP `cat /var/run/stated.pid`
    sudo stated -d -l -t lbs

Which tells the main stated to ignore pc5. Then I run a debugging stated that operates only on pc5. Later when done:

    update nodes set stated_tag=NULL where node_id='pc5';
    sudo kill -HUP `cat /var/run/stated.pid`

Which tells the main stated to operate on pc5 again.
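
The tag matching described above might look roughly like this. It is a Python sketch under stated assumptions, not the stated code: the function names and the (node_id, stated_tag) row shape are hypothetical, and None stands in for a NULL stated_tag.

```python
def load_nodes(rows, tag=None):
    """Keep only the nodes whose stated_tag equals this instance's tag.

    rows is an iterable of (node_id, stated_tag) pairs as they might come
    from the nodes table. tag=None models a stated started without -t.
    """
    return {node_id: node_tag for node_id, node_tag in rows if node_tag == tag}


def should_handle(nodes, node_id):
    """An event is processed only if its node matched the tag at load time."""
    return node_id in nodes


def prune_timeouts(timeouts, nodes):
    """After a reload, drop queued (when, node_id) timeouts for ignored nodes."""
    return [(when, node_id) for when, node_id in timeouts if node_id in nodes]


# Usage mirroring the pc5 example: the main stated ignores pc5,
# the debugging stated started with "-t lbs" sees only pc5.
rows = [("pc4", None), ("pc5", "lbs")]
main_nodes = load_nodes(rows)
debug_nodes = load_nodes(rows, tag="lbs")
```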
-
- 15 Oct, 2003 1 commit
-
-
Mike Hibler authored
as defined in the defs-* file (e.g. "TBLOGFACIL=local2"). The default is "local5" which is what we are set up to use so you shouldn't need to mess with your defs- file! Perl scripts just get this value configured in when configure is run. C programs get the value in two ways. For programs that are intimate with the testbed infrastructure, and include "config.h", they just get it from that file. For programs that we sometimes use outside the Emulab build environment (e.g., frisbee, capture) and that don't include config.h, the value is set via a "-DLOG_TESTBED=..." in the GNUmakefile build line. If the value isn't set, it defaults to what it used to be (usually LOG_USER). Still to do: healthd, hmcd (whose build doesn't seem to be completely integrated) and plabdaemon.in (since it's icky python :-)
-
- 13 Oct, 2003 1 commit
-
-
Mac Newbold authored
-
- 10 Oct, 2003 1 commit
-
-
Mac Newbold authored
model of waiting for state changes. Before we were watching the database (which means we can only watch for terminal/stable/long-lived states, and have to poll the db). Now things that are waiting for states to change become event listeners, and watch the stream of events flow by, and don't have to do any polling. They can now watch for any state, and even sequences of states (ie a Shutdown followed by an Isup).

To do this, there is now a cool StateWait.pm library that encapsulates the functionality needed. To use it, you call initStateWait before you start the chain of events (ie before you call node reboot). Then do your stuff, and call waitForState() when you're ready to wait. It can be told to return periodically with the results so far, and you can cancel waiting for things. An example program called waitForState is in testbed/event/stated/ , and can also be used nicely as a command line tool that wraps up the library functionality.

This also required the introduction of a TBFAILED event that can be sent when a node isn't going to make it to the state that someone may be waiting for. Ie if it gets wedged coming up, and stated retries, but eventually gives up on it, it sends this to let things know that the node is hozed and won't ever come up.

Another thing that is part of this is that node_reboot moves (back) to the fully-event-driven model, where users call node reboot, and it does some checks and sends some events. Then stated calls node_reboot in "real mode" to actually do the work, and handles doing the appropriate retries until the node either comes up or is deemed "failed" and stated gives up on it. This means stated is also the gatekeeper of when you can and cannot reboot a node. (See mail archives for extensive discussions of the details.)

A big part of the motivation for this was to get uninformed timeouts and retries out of os_load/os_setup and put them in stated where we can make a wiser choice. So os_load and os_setup now use this new stuff and don't have to worry about timing out on nodes and rebooting. Stated makes sure that they either come up, get retried, or fail to boot. tbrestart also underwent a similar change.
-
- 21 Aug, 2003 1 commit
-
-
Mac Newbold authored
-
- 19 Jun, 2003 1 commit
-
-
Mac Newbold authored
really-reboot-nodes-that-timeout stuff. NOTE: Until the timeout/retry stuff is gone from os_load/os_setup, it is disabled in stated. It will still only send email. But all the stuff is there and has been tested. NOTE: Until other things don't depend on the old behavior of node_reboot (when it returns, all nodes are in SHUTDOWN), the event stuff is disabled. Real mode is the default, and can be run by anyone. In short, this commit is new versions of stated and node_reboot that act almost exactly like the old ones. But I wanted to commit them before I go on making a bunch more changes, to have a checkpoint that I know works.
-
- 09 Jun, 2003 1 commit
-
-
Mac Newbold authored
-
- 06 Jun, 2003 1 commit
-
-
Mac Newbold authored
is supported:
- stated listens for TBCOMMAND events, and currently handles REBOOT, POWEROFF, POWERON, and POWERCYCLE events. It does everything except make the actual calls to node_reboot and power. And it accepts batches of nodes instead of just single ones.
- Timeouts were added to the db for these commands, with no timeout for the power ones (since the node can't hang during those), and a 15 second timeout from reboot until the SHUTDOWN state.
- If a reboot times out, it tries it again, up to 3 times. If it gets to three times without working, it sends mail to tbops and turns the machine off instead of continuing to reboot it. Right now I haven't made it do node_reboot -f or power cycle on retries, but it easily could.
- Stuff to be done before they work: make node_reboot send an event instead of doing the work, and make a new script that has node_reboot's old guts. Note that this requires authentication in our events for these commands, and a way to make sure that the command that came in as an event was properly authenticated.
- For future growth and expansion, it is set up so it should be relatively easy to add other commands that do different things, even if they take arbitrary params that aren't nodes or lists of nodes.
-
- 23 May, 2003 1 commit
-
-
Mac Newbold authored
1. timeouts for nodes weren't getting reset when they had a mode transition, so they were timing out in shutdown after changing modes.
2. It was still going back into a blocking wait, even though a signal had been received, and not quitting back up to the main loop to handle it.
-
- 22 May, 2003 2 commits
-
-
Mac Newbold authored
code when taking a signal.
-
Mac Newbold authored
memory leak in one of the timeout queue data structures, more or less.
-
- 20 May, 2003 2 commits
-
-
Mac Newbold authored
didn't see in testing. Specifically, why it pegs at 100% CPU after a while, and why it gets timeouts after it has removed the timeout from the queue.
-
Mac Newbold authored
1. Change from inefficient timeout search algo that ran once per second to a highly efficient priority queue method of managing timeouts. Now instead of checking every node's timestamps, we just look at the head of the queue, and it is often much less frequent than once a second, since we know how long we have until the next timeout.
2. Start using a blocking poll for events, so I can sleep for long periods of time instead of having to wake up at least once a second to check for timeouts and events. Will set the block timeout for the shortest of: the time to send out the next batch of queued emails, the next time a timeout may occur, or when there are no mails waiting and no timeouts possible, 10 minutes. Comes back as soon as an event comes in.
3. Given the above two items, we no longer need a sleep(1) in our main loop.

One small glitch is in the process of being fixed. When using blocking polls, things hang when trying to unregister from the event system. Not a big deal, just ^C twice to kill it. (May cause it to need two SIGUSR1's to get it to restart, too.)

In the next update, look for:
- Really take action on timeouts.
- keep track of how many times we've retried, and notify if something may be wrong with the node.
- Find out policy on taking action with timeouts.
  - Do it if the expt is in transition or the node is free
  - Probably don't touch if the expt is established.
  - Maybe? in active expt, send (good) email to expt owner on timeouts

Related "coming soon" items:
os_load/os_setup etc.:
- Add the waitforstate stuff we've talked about
- make os_load/os_setup use it
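
A minimal sketch of items 1 and 2, written in Python with the standard heapq module rather than taken from stated itself. The class and function names are hypothetical, and handling cancelled or replaced timeouts by lazily discarding stale heap entries is a design choice made for this sketch, not something the commit message describes.

```python
import heapq

MAX_BLOCK = 600  # seconds: the 10 minute fallback when nothing is pending


class TimeoutQueue:
    """Priority queue of (deadline, node_id); only the head is ever inspected."""

    def __init__(self):
        self._heap = []
        self._current = {}  # node_id -> the deadline that is still live

    def set(self, node_id, deadline):
        """Set or replace the timeout for a node."""
        self._current[node_id] = deadline
        heapq.heappush(self._heap, (deadline, node_id))

    def cancel(self, node_id):
        self._current.pop(node_id, None)

    def _drop_stale(self):
        # Discard head entries that were cancelled or replaced.
        while self._heap and self._current.get(self._heap[0][1]) != self._heap[0][0]:
            heapq.heappop(self._heap)

    def next_deadline(self):
        self._drop_stale()
        return self._heap[0][0] if self._heap else None

    def pop_expired(self, now):
        """Return the nodes whose deadline has passed, earliest first."""
        expired = []
        while True:
            self._drop_stale()
            if not self._heap or self._heap[0][0] > now:
                return expired
            _, node_id = heapq.heappop(self._heap)
            del self._current[node_id]
            expired.append(node_id)


def block_timeout(now, next_deadline, next_mail_time):
    """Shortest wait until the next timeout or mail batch, else MAX_BLOCK."""
    waits = [t - now for t in (next_deadline, next_mail_time) if t is not None]
    return max(0, min(waits)) if waits else MAX_BLOCK
```

The main loop would then block on the event system for block_timeout(...) seconds and call pop_expired() when it wakes up, instead of scanning every node once a second.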
-
- 30 Apr, 2003 1 commit
-
-
Mac Newbold authored
allows only SHUTDOWN and ISUP states, and whenever stated sees a SHUTDOWN for it, there's a trigger that sends an ISUP right afterwards. Thus the name, ALWAYSUP.
-
- 28 Apr, 2003 1 commit
-
-
Leigh B. Stoller authored
get an updated copyrights message.
-
- 17 Apr, 2003 1 commit
-
-
Mac Newbold authored
on any node on any state, in any specific mode, or without any mode restriction. The immediate use of this is the FREENODE trigger. Now RELOADDONE adds a FREENODE trigger on the ISUP state, if the node is in the reloading expt. Then next time the node hits ISUP, it gets freed from the reloading expt. This fix solves the race where recently freed (and still rebooting) nodes get grabbed by an expt and get rebooted in a way that may hoze their FS's. Also fixed a problem that was making it load the db twice on startup.
-
- 19 Mar, 2003 1 commit
-
-
Mac Newbold authored
state. Before, this caused "Running os Foo but in the wrong opmode!" messages. Now, if I get to BOOTING and the state I'm coming from was ISUP, and there's a next_op_mode waiting for me, force the mode transition without sending any mail.
-
- 12 Mar, 2003 1 commit
-
-
Mac Newbold authored
-
- 08 Mar, 2003 1 commit
-
-
Mac Newbold authored
make sure it got run as root. Update that to allow running as non root if you're in a devel tree and you're an admin (in your own copy of the db). This will let flest run it simply by using 'withadminprivs' with the startup of stated.
-