- 16 Oct, 2003 4 commits
Leigh B. Stoller authored
number of connections to tmcd, and the resulting number of DB queries. Currently that's about 24 per node when it boots, and each vnode adds another 24 or so. The new approach is to use the "fullconfig" command, which dumps the entire config in one shot, saving about 20 of those connections. We still need to do the status/state commands for real, of course. When a node boots, it requests the fullconfig; the client side takes this fullconfig and dumps the individual sections to /var/emulab/boot/tmcc/section_name. Subsequent requests first look for the section locally in those files, falling back to real tmcc if none exists. The update command also refreshes the cache. Tested for jails and plab node vservers as well.
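A minimal sketch of the client-side cache described above, assuming a hypothetical fullconfig layout where each section starts with a `*** section_name` line (the real wire format is not given here). `split_fullconfig`, `write_cache`, `tmcc`, and the `fetch_remote` callback that stands in for a real tmcd round trip are all illustrative names, not the actual client code.

```python
import os
import tempfile

# Hypothetical section header; the real fullconfig format is not described
# in this log, so "*** name" is an assumption for illustration only.
SECTION_MARKER = "*** "

# Per the log, status/state still have to go to tmcd every time.
UNCACHED = {"status", "state"}


def split_fullconfig(text):
    """Split a fullconfig dump into {section_name: body} (assumed format)."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith(SECTION_MARKER):
            current = line[len(SECTION_MARKER):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def write_cache(fullconfig, cache_dir):
    """Dump each section to cache_dir/section_name, dropping stale files.

    Assumes cache_dir holds only section files (no subdirectories)."""
    os.makedirs(cache_dir, exist_ok=True)
    for name in os.listdir(cache_dir):
        os.remove(os.path.join(cache_dir, name))
    for name, body in split_fullconfig(fullconfig).items():
        with open(os.path.join(cache_dir, name), "w") as f:
            f.write(body)


def tmcc(section, cache_dir, fetch_remote):
    """Answer from the local cache if possible, else fall back to tmcd."""
    path = os.path.join(cache_dir, section)
    if section not in UNCACHED and os.path.isfile(path):
        with open(path) as f:
            return f.read()
    return fetch_remote(section)


if __name__ == "__main__":
    cache = tempfile.mkdtemp()  # stands in for /var/emulab/boot/tmcc
    write_cache("*** ifconfig\nIFACE=eth0\n*** accounts\nUSER=alice", cache)
    print(tmcc("ifconfig", cache, lambda s: "from tmcd: " + s))
    print(tmcc("status", cache, lambda s: "from tmcd: " + s))
```

In this sketch the "update" command corresponds to calling `write_cache` again with a freshly fetched fullconfig, which also removes sections that no longer exist so that they fall through to tmcd.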
-
Leigh B. Stoller authored
ordering of events is not obvious to anyone except me (on a good day).
-
Leigh B. Stoller authored
swapped out (non-recoverable) by tbswap. swapexp was leaving the experiment in the running state instead of paused. We need to check this after tbswap since we do not get reasonable error codes back. Also some cleanup with respect to how aborted modifies are handled. I think I understand what Chad did ... A general comment; we need to be better about returning meaningful error codes!
-
Leigh B. Stoller authored
-
- 15 Oct, 2003 13 commits
Shashi Guruprasad authored
install file to reflect this.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Mike Hibler authored
All to try to avoid loopback lockups.
-
Mike Hibler authored
from /var/log/messages. Fixed up the list of logfiles on the boss node.
-
Kirk Webb authored
-
Mike Hibler authored
as defined in the defs-* file (e.g. "TBLOGFACIL=local2"). The default is "local5", which is what we are set up to use, so you shouldn't need to mess with your defs- file! Perl scripts just get this value configured in when configure is run. C programs get the value in two ways. Programs that are intimate with the testbed infrastructure and include "config.h" just get it from that file. For programs that we sometimes use outside the Emulab build environment (e.g., frisbee, capture) and that don't include config.h, the value is set via a "-DLOG_TESTBED=..." in the GNUmakefile build line. If the value isn't set, it defaults to what it used to be (usually LOG_USER). Still to do: healthd, hmcd (whose build doesn't seem to be completely integrated) and plabdaemon.in (since it's icky Python :-)
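The facility resolution above can be sketched with the standard library's `SysLogHandler.facility_names` table. This is an illustrative model only, assuming a simple `KEY=VALUE` defs syntax; `parse_defs`, `configured_facility`, and `standalone_facility` are hypothetical names, and the real work happens in configure and the C preprocessor, not in Python.

```python
from logging.handlers import SysLogHandler

# Default named in the log; a defs-* file may override it with TBLOGFACIL.
DEFAULT_FACILITY = "local5"


def parse_defs(text):
    """Parse simple KEY=VALUE lines (assumed defs-* syntax; quotes stripped)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def configured_facility(defs_text):
    """What configure would bake in: TBLOGFACIL if set, else local5."""
    name = parse_defs(defs_text).get("TBLOGFACIL", DEFAULT_FACILITY).lower()
    if name not in SysLogHandler.facility_names:
        raise ValueError(f"unknown syslog facility: {name!r}")
    return SysLogHandler.facility_names[name]


def standalone_facility(log_testbed=None):
    """Programs built without config.h: use -DLOG_TESTBED if given,
    otherwise fall back to the old default, LOG_USER."""
    return SysLogHandler.LOG_USER if log_testbed is None else log_testbed
```

The two functions mirror the two C paths: `configured_facility` stands for the value that reaches programs through config.h, while `standalone_facility` stands for the `#ifdef`-style fallback in programs built outside Emulab.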
-
Leigh B. Stoller authored
-
Kirk Webb authored
This is much more robust than the old method.
-
Mike Hibler authored
-
Leigh B. Stoller authored
slightly different things for physnodes than for vnodes.
-
Mike Hibler authored
We have seen cases where dmesg has info from multiple boots and sometimes even garbage.
-
Leigh B. Stoller authored
-
- 14 Oct, 2003 6 commits
Kirk Webb authored
Update to libplab.plab.renew:
* Make renewal robust against various kinds of failures.

These changes will augment my larger set of libplab and plab* updates/fixes coming soon to an Emulab near you.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
that slot, and domounts checks that slot now. The older USESFS=1 is still supported for now, but will be chucked eventually. More work on supporting client-side caching of the full configuration.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
- 13 Oct, 2003 10 commits
Leigh B. Stoller authored
users.
-
Leigh B. Stoller authored
via the Mod User Info page.
-
David Anderson authored
-
David Anderson authored
-
David Anderson authored
also includes updated tb_compat.tcl include file and ns patch.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
I have implemented the suggestion Jay made a couple of weeks ago about allowing partial allocation in assign_wrapper, and retrying with a modified set of "fixed" nodes. My basic approach was to change nalloc to optionally allow partial allocations, returning the number of nodes that could not be allocated. In assign_wrapper, I determine which nodes we were able to get (in each loop), set their allocstate to INIT_DIRTY, augment the fixed_node set, and recreate the top file. Then I try again, up to the current number of maxtries. If assign fails with an unretryable error, or if we could not nalloc a user-directed fixed node, then I stop right away, since the experiment is not going to map (in the near term) if the fixed node list cannot be allocated. I am confident that this works okay, although testing is a little difficult. The main problem is how this interacts with experiment modify. Chad's implementation is that a modify can be reverted (recovered from) only as long as the DB is not modified by assign_wrapper. Well, a partial allocation followed by failure obviously modifies the DB, and so is deemed not recoverable. I am still trying to figure out the effects of this, and whether I can relax this requirement, but in the meantime let's install it and see what happens (won't affect many people).
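The retry loop described above can be modeled roughly as follows. This is a simplified sketch, not the actual assign_wrapper: the in-memory `free_pool` set stands in for the DB, `run_assign` is a hypothetical callback standing in for regenerating the top file and running assign, and the exception classes are invented for illustration. It assumes one vnode per pnode.

```python
class UnretryableError(Exception):
    """assign failed in a way retrying cannot fix (hypothetical)."""


class MappingError(Exception):
    """The experiment cannot be mapped in the near term (hypothetical)."""


def nalloc(wanted, free_pool, reserved):
    """Partial allocation: reserve whatever is free and return how many
    of `wanted` could NOT be allocated (the new return value)."""
    missing = 0
    for pnode in wanted:
        if pnode in free_pool:
            free_pool.remove(pnode)
            reserved.add(pnode)
        else:
            missing += 1
    return missing


def assign_wrapper(user_fixed, run_assign, free_pool, maxtries=3):
    """Retry assign, pinning whatever we managed to allocate each round.

    Assumes one vnode per pnode; reserved nodes stay reserved even on
    failure, which is why a failed partial allocation is not recoverable."""
    fixed = dict(user_fixed)   # vnode -> pnode, augmented each round
    reserved = set()           # pnodes we hold across rounds
    allocstate = {}
    for _ in range(maxtries):
        mapping = run_assign(fixed)  # UnretryableError propagates: stop now
        wanted = [p for p in mapping.values() if p not in reserved]
        missing = nalloc(wanted, free_pool, reserved)
        for vnode, pnode in mapping.items():
            if pnode in reserved:
                allocstate[pnode] = "INIT_DIRTY"
                fixed[vnode] = pnode   # pin what we got for the next try
        if missing == 0:
            return mapping, allocstate
        # A user-directed fixed node we could not get will never map.
        if any(p not in reserved for p in user_fixed.values()):
            raise MappingError("could not allocate a user-fixed node")
    raise MappingError("gave up after maxtries")
```

Stopping immediately on an unallocatable user-fixed node matches the reasoning in the log: every later round would carry the same impossible constraint, so retrying only burns attempts.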
-
Leigh B. Stoller authored
but it's nice to have it in the DB too so that we do not have to read that file!
-
Mac Newbold authored
-
- 10 Oct, 2003 7 commits
Mac Newbold authored
-
Mac Newbold authored
-
Robert Ricci authored
they mean.
-
Leigh B. Stoller authored
www tree.
-
Mike Hibler authored
-
Mac Newbold authored
-
Mac Newbold authored
model of waiting for state changes. Before, we were watching the database (which meant we could only watch for terminal/stable/long-lived states, and had to poll the DB). Now things that are waiting for states to change become event listeners, and watch the stream of events flow by without doing any polling. They can now watch for any state, and even sequences of states (e.g., a Shutdown followed by an Isup).

To do this, there is now a cool StateWait.pm library that encapsulates the functionality needed. To use it, you call initStateWait before you start the chain of events (i.e., before you call node_reboot). Then do your stuff, and call waitForState() when you're ready to wait. It can be told to return periodically with the results so far, and you can cancel waiting for things. An example program called waitForState is in testbed/event/stated/, and can also be used nicely as a command line tool that wraps up the library functionality.

This also required the introduction of a TBFAILED event that can be sent when a node isn't going to make it to the state that someone may be waiting for. For example, if a node gets wedged coming up, and stated retries but eventually gives up on it, stated sends this to let things know that the node is hosed and won't ever come up.

Another part of this is that node_reboot moves (back) to the fully event-driven model: users call node_reboot, and it does some checks and sends some events. Then stated calls node_reboot in "real mode" to actually do the work, and handles the appropriate retries until the node either comes up or is deemed "failed" and stated gives up on it. This means stated is also the gatekeeper of when you can and cannot reboot a node. (See mail archives for extensive discussions of the details.)

A big part of the motivation for this was to get uninformed timeouts and retries out of os_load/os_setup and put them in stated, where we can make a wiser choice. So os_load and os_setup now use this new stuff and don't have to worry about timing out on nodes and rebooting. Stated makes sure that they either come up, get retried, or fail to boot. tbrestart also underwent a similar change.
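To illustrate waiting for a sequence of states from an event stream rather than polling the DB, here is a small sketch. It is not StateWait.pm's API: the event stream is modeled as an iterable of `(node, state)` tuples, `wait_for_states` is a hypothetical name, and the state strings `SHUTDOWN`/`ISUP` are illustrative. Only the TBFAILED event name comes from the log.

```python
TBFAILED = "TBFAILED"


def wait_for_states(events, nodes, sequence):
    """Consume (node, state) events until every node in `nodes` has passed
    through `sequence` in order, or has reported TBFAILED.

    Unrelated states in between are ignored, so ["SHUTDOWN", "ISUP"] matches
    SHUTDOWN ... BOOTING ... ISUP. If the stream runs dry first, the partial
    results so far are returned. Returns (done, failed) as sets."""
    progress = {n: 0 for n in nodes}   # index of the next expected state
    done, failed = set(), set()
    for node, state in events:
        if node not in progress or node in done or node in failed:
            continue
        if state == TBFAILED:
            failed.add(node)
        elif state == sequence[progress[node]]:
            progress[node] += 1
            if progress[node] == len(sequence):
                done.add(node)
        if len(done) + len(failed) == len(nodes):
            break
    return done, failed


if __name__ == "__main__":
    stream = [("pc1", "ISUP"), ("pc1", "SHUTDOWN"), ("pc2", "SHUTDOWN"),
              ("pc1", "BOOTING"), ("pc1", "ISUP"), ("pc2", TBFAILED)]
    print(wait_for_states(iter(stream), ["pc1", "pc2"], ["SHUTDOWN", "ISUP"]))
```

Matching a sequence is what makes the stale leading ISUP for pc1 harmless: it arrives before the SHUTDOWN and is skipped. It also shows why initStateWait must come before the reboot: a listener that subscribes late could miss the SHUTDOWN and never see the sequence complete.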
-