- 29 Jan, 2003 1 commit
-
-
Mac Newbold authored
-
- 07 Jan, 2003 2 commits
-
-
Mac Newbold authored
-
Mac Newbold authored
as the op_mode it is currently running. If not, force it and send mail. This fixes the "stuck-in-reloading" phenomenon we occasionally see when the states get messed up. Now anytime it loads PXEFRISBEE it should force the mode to RELOAD, and it will stop reloading the first time it hits RELOADDONE.
-
- 20 Dec, 2002 1 commit
-
-
Mac Newbold authored
-
- 16 Dec, 2002 1 commit
-
-
Mac Newbold authored
events. This may delay handling of other stuff that happens in my main loop, but not by too much. To prevent skew, everything (including reload frequency) is done strictly by seconds elapsed, not by iterations or anything. I found that even polling for multiple events without sleeping, I could only handle a little over 1 per second when I was calling inuse/statetime for additional info on every event. Even though this only happens in the worst case (every event is wrong), it won't do. So I took that out. I'll probably end up adding a faster lookup of the info I need (mostly reservation, and what osid it thinks it is running). That change took it up to at least 4 per second (as fast as I could send them manually), more than 4x our previous performance. So we should be able to keep up now. Also, add the support for "announcements" to testbed ops when I die and such. (Been in a few days, but this is the first commit of it)
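The "strictly by seconds elapsed, not by iterations" rule above can be sketched as follows. This is a minimal illustration in Python, not the daemon's actual code; the `PeriodicTask` class, its `poll` method, and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time


class PeriodicTask:
    """Run a callback once at least `interval` seconds have elapsed.

    Hypothetical helper: the schedule depends only on the clock, so a
    main loop that spins faster or slower does not skew it.
    """

    def __init__(self, interval, callback, clock=time.monotonic):
        self.interval = interval
        self.callback = callback
        self.clock = clock
        self.last_run = clock()

    def poll(self):
        """Call from the main loop; returns True if the callback ran."""
        now = self.clock()
        if now - self.last_run >= self.interval:
            self.last_run = now
            self.callback()
            return True
        return False
```

Calling `poll()` a thousand times in one second has the same effect as calling it once, which is the property the commit message asks for.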
-
- 09 Dec, 2002 1 commit
-
-
Mac Newbold authored
-
- 03 Dec, 2002 1 commit
-
-
Mac Newbold authored
-
- 22 Nov, 2002 1 commit
-
-
Mac Newbold authored
Set mailgap to 15 seconds, and wait a second on startup before sending mail so that the timeouts all come in one message. And don't clean up my stated.pid file if I'm a forked child of the real daemon.
-
- 14 Nov, 2002 1 commit
-
-
Mac Newbold authored
First, fix up the isup generation code. When a node/OS doesn't send its own isups, but is pingable, we need to fork and ping it, and send ISUP when it pings. The code was there, but was broken. This fixes it. The one time that it may cause errant messages is in modes other than MINIMAL. When we get BOOTING, we check if it needs isup generated. If we have to ping it, when it pings we send ISUP. This means that if we are really in NORMAL mode, we might send ISUP before the node sends REBOOTED (or TBSETUP in NORMALv1), and it would look funny. But that case will be really rare, since everything that sends REBOOTED or TBSETUP has no reason not to send ISUP itself. Second, after mailbombing myself a couple of times, Kirk and I decided I'd better put some throttling in the notification code that stated uses. So now it throttles itself and digests the messages if they're sent too close together. The first message it gets will get sent immediately. If the next one is long enough after that, it sends it immediately too. If a message comes too soon after sending one, we queue it up, and send it later after enough time has passed. Currently it is set to wait 5 seconds between messages, so it will send up to 12 per minute, and wait no more than 5 seconds before sending a message that is queued up. (Something similar to this may be a nice thing in the rest of our stuff, but it was made a lot easier by the fact that stated already had a polling loop in it. Without that, you'd have to use alarms or some other weird thing, which would be painful.)
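The throttle-and-digest behavior described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the notification code stated uses: the `MailThrottle` class, its `notify` and `poll` methods, the `send` callback, and the blank-line digest separator are all hypothetical.

```python
import time


class MailThrottle:
    """Send the first message immediately, digest the ones that follow too soon.

    Hypothetical sketch: `send` is any callable that delivers one mail body,
    and `gap` is the minimum number of seconds between two deliveries.
    """

    def __init__(self, send, gap=5.0, clock=time.monotonic):
        self.send = send
        self.gap = gap
        self.clock = clock
        self.last_sent = None  # time of the last delivery, None if never
        self.queue = []        # messages waiting for the gap to expire

    def notify(self, message):
        """Queue a message and deliver at once if the gap allows it."""
        self.queue.append(message)
        self.poll()

    def poll(self):
        """Call from the polling loop; flushes the queue as one digest."""
        if not self.queue:
            return False
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.gap:
            return False  # too soon after the last mail, keep queueing
        self.send("\n\n".join(self.queue))
        self.queue.clear()
        self.last_sent = now
        return True
```

Because `poll()` is driven by the existing polling loop, no alarms or timers are needed, which is the convenience the commit message points out.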
-
- 05 Nov, 2002 1 commit
-
-
Mac Newbold authored
was sending incorrect params, and os_select had a bad regexp that was causing failures prematurely.
-
- 04 Nov, 2002 1 commit
-
-
Mac Newbold authored
  - Better pidfile handling, do proper locking, etc.
  - Change die() to fatal(), so it sends mail and goes to syslog instead of to /dev/null
  - Fix RESET to not reset pxe_boot_path for Mike.
  - Fix sendmail call to have proper to and from addrs
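One common way to combine a pidfile with locking is an advisory `flock` on the pidfile itself. The sketch below is in Python and is only an illustration: the commit says "proper locking" without naming the mechanism, so the use of `flock` is an assumption, and `acquire_pidfile` and `release_pidfile` are hypothetical names.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Lock `path` and write our pid into it.

    Returns the open file descriptor (keep it open for the daemon's
    lifetime), or None if another process already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: fails at once if already held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd


def release_pidfile(fd, path):
    """Remove the pidfile, then drop the lock by closing the descriptor."""
    os.unlink(path)
    os.close(fd)
```

The lock disappears automatically if the process dies, so a stale pidfile left by a crash does not block the next start.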
-
- 01 Nov, 2002 1 commit
-
-
Mac Newbold authored
Make stated put a pidfile in /var/run/stated.pid, when it's the real version. Devel versions use a file in /var/run based on their prefix. Also add more info to the invalid transition message.
-
- 31 Oct, 2002 1 commit
-
-
Mac Newbold authored
Make stated use syslog, add a restart feature triggered by SIGUSR1. Sometimes it still misses events between going down and coming up, but we'll see how it goes.
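A restart triggered by SIGUSR1 is often built as a signal handler that only sets a flag, with the main loop doing the actual re-exec. The Python sketch below shows that pattern under assumptions: the `Daemon` class, the `handle_events` callback, and the re-exec of the current interpreter are hypothetical and are not taken from stated. It is POSIX-only, since SIGUSR1 does not exist on Windows.

```python
import os
import signal
import sys


class Daemon:
    """Main loop that restarts itself when it receives SIGUSR1 (sketch)."""

    def __init__(self, handle_events, reexec=None):
        self.handle_events = handle_events
        self.reexec = reexec or self._default_reexec
        self.restart_requested = False

    def _on_sigusr1(self, signum, frame):
        # Do as little as possible in the handler: just record the request.
        self.restart_requested = True

    @staticmethod
    def _default_reexec():
        # Replace this process with a fresh copy of the same program.
        os.execv(sys.executable, [sys.executable] + sys.argv)

    def run(self):
        signal.signal(signal.SIGUSR1, self._on_sigusr1)
        while not self.restart_requested:
            self.handle_events()
        self.reexec()
```

Events that arrive between the old process stopping and the new one subscribing are not seen by either, which matches the caveat in the commit message about missed events.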
-
- 22 Oct, 2002 2 commits
-
-
Mac Newbold authored
Add proper subscription to only stuff I care about and proper exiting when a signal is received (instead of a 'natural' death like exit or die).
-
Mac Newbold authored
Add email aliases for testbed-stated and testbed-testsuite, and update all the defs files to use the same (or similar) addr for those lists as for testbed-logs. Make stated use the new alias too.
-
- 18 Oct, 2002 1 commit
-
-
Mac Newbold authored
Changes to watch out for:
  - db calls that change boot info in nodes table are now calls to os_select
  - whenever you want to change a node's pxe boot info, or def or next boot osids or paths, use os_select.
  - when you need to wait for a node to reach some point in the boot process (like ISUP), check the state in the database using the lib calls
  - Proxydhcp now sends a BOOTING state for each node that it talks to.
  - OSs that don't send ISUP will have one generated for them by stated either when they ping (if they support ping) or immediately after they get to BOOTING.
  - States now have timeouts. Actions aren't currently carried out, but they will be soon. If you notice problems here, let me know... we're still tuning it. (Before, all timeouts were set to "none" in the db)
One temporary change:
  - While I make our new free node manager daemon (freed), all nodes are forced into reloading when they're nfreed and the calls to reset the os are disabled (that will move into freed).
-
- 20 Sep, 2002 2 commits
-
-
Mac Newbold authored
-
Mac Newbold authored
-
- 19 Sep, 2002 1 commit
-
-
Robert Ricci authored
1) Checks database redirects for nodes, and ignores events that aren't directed to its database.
2) Doesn't insist on being run as root (doesn't need to be right now, anyway.)
3) '-f' option that prevents it from forking into the background, for easier killing.
-
- 10 Jul, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 12 Jun, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 10 Jun, 2002 1 commit
-
-
Robert Ricci authored
node we don't know about. This will prevent the case where a new node is referenced within an hour of it first being added to the nodes table.
-
- 20 May, 2002 1 commit
-
-
Robert Ricci authored
compatibility.
-
- 25 Apr, 2002 1 commit
-
-
Robert Ricci authored
this was the only script still using the old log location.
-
- 16 Apr, 2002 1 commit
-
-
Robert Ricci authored
-
- 10 Apr, 2002 1 commit
-
-
Robert Ricci authored
Operational mode (op_mode in the database) affects the state diagram and timeouts for a node. Modes planned so far are:
  NORMAL - Normal operation
  DELAYING - Acting as a delay node
  UNKNOWNOS - Running an OS that does not report its state (OSKit kernels, etc.)
  RELOADING - Disk reloading
stated now responds to TBNODEOPMODE events, and sets database state accordingly. The set of state timeouts and valid state transitions is affected by a node's operational mode. The nodes table now stores information about operational modes, and the state_transitions and state_timeouts tables include the operational mode in addition to states. Next step will be to get the appropriate programs to send TBNODEOPMODE events.
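Keying transitions and timeouts on the pair (op_mode, state) can be sketched as below. This Python illustration is not the real schema or data: the table contents, the timeout values, and the helper names `is_valid_transition` and `has_timed_out` are all made up for the example; only the mode names come from the message above.

```python
# Hypothetical contents standing in for the state_transitions table:
# (op_mode, current state) -> set of states that may follow.
VALID_TRANSITIONS = {
    ("NORMAL", "SHUTDOWN"): {"BOOTING"},
    ("NORMAL", "BOOTING"): {"ISUP", "SHUTDOWN"},
    ("RELOADING", "BOOTING"): {"RELOADING"},
    ("RELOADING", "RELOADING"): {"RELOADDONE"},
}

# Hypothetical contents standing in for the state_timeouts table:
# (op_mode, state) -> seconds a node may stay there; missing means no limit.
STATE_TIMEOUTS = {
    ("NORMAL", "BOOTING"): 180,
    ("RELOADING", "RELOADING"): 600,
}


def is_valid_transition(op_mode, old_state, new_state):
    """True if `new_state` may follow `old_state` in this op_mode."""
    return new_state in VALID_TRANSITIONS.get((op_mode, old_state), set())


def has_timed_out(op_mode, state, entered_at, now):
    """True if the node has sat in `state` longer than its mode allows."""
    timeout = STATE_TIMEOUTS.get((op_mode, state))
    return timeout is not None and now - entered_at > timeout
```

The same state name can therefore be legal, illegal, or differently timed depending on the node's current mode.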
-
- 02 Apr, 2002 1 commit
-
-
Robert Ricci authored
find a node that we already knew about, and it hasn't changed state or timestamp, we just use the old entry. This allows us to still notice new nodes, or nodes that have had their state changed externally (say, by hand), but not forget about nodes we've already sent mail about.
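The reuse of old entries described above can be sketched as follows. This is a hypothetical Python illustration: the `NodeEntry` fields, the `notified` flag standing in for "already sent mail about", and the `refresh` function are assumptions, not the daemon's data structures.

```python
from dataclasses import dataclass


@dataclass
class NodeEntry:
    state: str
    timestamp: int
    notified: bool = False  # set once mail has been sent about this node


def refresh(cache, rows):
    """Rebuild the node cache from database rows.

    `rows` is an iterable of (node_id, state, timestamp). An entry whose
    state and timestamp are unchanged is carried over as-is, so its
    `notified` flag survives; new or externally changed nodes get a
    fresh entry.
    """
    fresh = {}
    for node_id, state, timestamp in rows:
        old = cache.get(node_id)
        if old is not None and old.state == state and old.timestamp == timestamp:
            fresh[node_id] = old
        else:
            fresh[node_id] = NodeEntry(state, timestamp)
    return fresh
```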
-
- 01 Apr, 2002 1 commit
-
-
Robert Ricci authored
-
- 29 Mar, 2002 1 commit
-
-
Robert Ricci authored
broken. Also, it made me slightly uneasy that there was no way to prevent swig from putting one of its generated files in the source directory. So, I've just checked in the two major files that get generated by SWIG, so that the make rule that runs it never gets invoked. One of the reasons for doing this is that swig generates slightly broken code when the -exportall (which does perl module exports correctly) argument is given. A very minor amount of manual tweaking of the generated .pm file can fix this problem. So, the checked in copy of event.pm has these tweaks applied. As a result of all of this, exports work correctly in the event perl module, so the hacky practice of putting your program in the event namespace is no longer necessary.
-
- 28 Mar, 2002 1 commit
-
-
Robert Ricci authored
Watches for events sent by TMCD regarding the state of nodes. Records this information in the database. Also watches for nodes that undergo invalid state transitions, or stay in the same state for too long. Right now, the only action it takes is to send email, but in the future it will take action to 'unstick' nodes. Not yet installed by default.
-