- 15 Oct, 2003 2 commits
-
-
Mike Hibler authored
We have seen cases where dmesg has info from multiple boots and sometimes even garbage.
-
Leigh B. Stoller authored
-
- 14 Oct, 2003 6 commits
-
-
Kirk Webb authored
Update to libplab.plab.renew: * Make renewal robust against various kinds of failures. These changes will augment my larger set of libplab and plab* updates/fixes coming soon to an Emulab near you.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
that slot, and domounts checks that slot now. The older USESFS=1 is still supported for now, but will be chucked eventually. More work on supporting client side caching of the full configuration.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
- 13 Oct, 2003 10 commits
-
-
Leigh B. Stoller authored
users.
-
Leigh B. Stoller authored
via the Mod User Info page.
-
David Anderson authored
-
David Anderson authored
-
David Anderson authored
also includes updated tb_compat.tcl include file and ns patch.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
I have implemented the suggestion Jay made a couple of weeks ago about allowing partial allocation in assign_wrapper, and retrying with a modified set of "fixed" nodes. My basic approach was to change nalloc to optionally allow partial allocations, returning the number of nodes that could not be allocated as its return value. In assign_wrapper, I determine which nodes we were able to get (in each loop), set their allocstate to INIT_DIRTY, augment the fixed_node set, and recreate the top file. Then I try again, up to the current number of maxtries. If assign fails with an unretryable error, or if we could not nalloc a user directed fixed node, then I stop right away, since the experiment is not going to map (in the near term) if the fixed node list cannot be allocated. I am confident that this works okay, although testing is a little difficult. The main problem is how this interacts with experiment modify. Chad's implementation is that a modify can be reverted (recovered from) only as long as the DB is not modified by assign_wrapper. Well, a partial allocation, followed by failure, obviously modifies the DB, and so is deemed not recoverable. I am still trying to figure out the effects of this, and whether I can relax this requirement, but in the meantime let's install it and see what happens (won't affect many people).
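The retry loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual assign_wrapper code: `run_assign` and `nalloc_partial` are hypothetical callables standing in for the assign and nalloc steps, and the DB bookkeeping (allocstate, top file) is left out.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


def map_with_partial_alloc(
    run_assign: Callable[[Set[str]], Optional[Set[str]]],
    nalloc_partial: Callable[[Iterable[str]], Set[str]],
    user_fixed: Set[str],
    maxtries: int,
) -> Tuple[bool, Set[str]]:
    """Retry mapping, keeping the nodes that each attempt managed to reserve.

    run_assign(fixed) is assumed to return the chosen node set (including
    the fixed nodes), or None on an unretryable error. nalloc_partial(nodes)
    is assumed to return the subset it actually reserved.
    """
    fixed = set(user_fixed)
    held: Set[str] = set()
    for _ in range(maxtries):
        wanted = run_assign(fixed)
        if wanted is None:
            return False, held          # unretryable assign failure
        to_alloc = wanted - held        # nodes we do not hold yet
        got = nalloc_partial(to_alloc)
        held |= got
        missing = to_alloc - got
        if not missing:
            return True, held           # everything reserved
        if missing & user_fixed:
            return False, held          # a user-directed fixed node is gone
        fixed |= got                    # pin what we got, then try again
    return False, held
```

On failure the caller still holds the partially allocated nodes in `held`, which mirrors the concern raised above: the partial allocation has already changed state by the time the attempt fails.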
-
Leigh B. Stoller authored
but it's nice to have it in the DB too so that we do not have to read that file!
-
Mac Newbold authored
-
- 10 Oct, 2003 7 commits
-
-
Mac Newbold authored
-
Mac Newbold authored
-
Robert Ricci authored
they mean.
-
Leigh B. Stoller authored
www tree.
-
Mike Hibler authored
-
Mac Newbold authored
-
Mac Newbold authored
model of waiting for state changes. Before, we were watching the database (which meant we could only watch for terminal/stable/long-lived states, and had to poll the db). Now things that are waiting for states to change become event listeners, and watch the stream of events flow by, and don't have to do any polling. They can now watch for any state, and even sequences of states (e.g. a Shutdown followed by an Isup). To do this, there is now a cool StateWait.pm library that encapsulates the functionality needed. To use it, you call initStateWait before you start the chain of events (i.e. before you call node reboot). Then do your stuff, and call waitForState() when you're ready to wait. It can be told to return periodically with the results so far, and you can cancel waiting for things. An example program called waitForState is in testbed/event/stated/, and can also be used nicely as a command line tool that wraps up the library functionality. This also required the introduction of a TBFAILED event that can be sent when a node isn't going to make it to the state that someone may be waiting for. For example, if it gets wedged coming up, and stated retries, but eventually gives up on it, it sends this to let things know that the node is hosed and won't ever come up. Another thing that is part of this is that node_reboot moves (back) to the fully-event-driven model, where users call node reboot, and it does some checks and sends some events. Then stated calls node_reboot in "real mode" to actually do the work, and handles doing the appropriate retries until the node either comes up or is deemed "failed" and stated gives up on it. This means stated is also the gatekeeper of when you can and cannot reboot a node. (See mail archives for extensive discussions of the details.) A big part of the motivation for this was to get uninformed timeouts and retries out of os_load/os_setup and put them in stated where we can make a wiser choice.
So os_load and os_setup now use this new stuff and don't have to worry about timing out on nodes and rebooting. Stated makes sure that they either come up, get retried, or fail to boot. tbrestart also underwent a similar change.
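The idea of waiting on a sequence of states by listening to events, rather than polling the database, can be sketched like this. It is a minimal Python illustration and not the StateWait.pm interface: the class and method names are hypothetical, and only the TBFAILED event name and the Shutdown/Isup example come from the message above.

```python
from typing import Dict, Iterable, List, Set

FAILED_EVENT = "TBFAILED"  # event name taken from the commit message


class StateSequenceWaiter:
    """Tracks, per node, progress through an expected sequence of states."""

    def __init__(self, nodes: Iterable[str], sequence: Iterable[str]) -> None:
        self.sequence: List[str] = list(sequence)
        if not self.sequence:
            raise ValueError("sequence must contain at least one state")
        # Index of the next state each node is expected to reach.
        self.position: Dict[str, int] = {node: 0 for node in nodes}
        self.failed: Set[str] = set()

    def handle_event(self, node: str, state: str) -> None:
        """Feed one event from the stream; unrelated events are ignored."""
        pos = self.position.get(node)
        if pos is None or node in self.failed or pos == len(self.sequence):
            return
        if state == FAILED_EVENT:
            self.failed.add(node)            # node will never come up
        elif state == self.sequence[pos]:
            self.position[node] = pos + 1    # advance to the next state

    def finished(self) -> Set[str]:
        done = len(self.sequence)
        return {n for n, p in self.position.items() if p == done}

    def pending(self) -> Set[str]:
        return set(self.position) - self.finished() - self.failed
```

The waiter has to be created before the reboot is triggered, for the same reason the message gives for calling initStateWait first: otherwise early events in the sequence could pass by unseen.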
-
- 09 Oct, 2003 8 commits
-
-
Mike Hibler authored
We have a few more source changes than they do, so we cannot just use it.
-
Mike Hibler authored
the closest match.
-
Mike Hibler authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
* install-rpm, install-tarfile, spewrpmtar.php3, spewrpmtar.in: Pumped up even more! The db file we store in /var/db now records both the timestamp (of the file, or if remote the install time) and the MD5 of the file that was installed. Locally, we can get this info when accessing the file via NFS (copymode on or off). Remote, we use wget to get the file, and so pass the timestamp along in the URL request, and let spewrpmtar.in determine if the file has changed. If the timestamp it gets is >= the timestamp of the file, a status code of 304 (Not Modified) is returned. Otherwise the file is returned. If the timestamps are different (remote, server sends back an actual file), the MD5 of the file is compared against the value stored. If they are equal, update the timestamp in the db file to avoid repeated MD5s (or server downloads) in the future. If the MD5 is different, then reinstall the tarball or rpm, and update the db file with the new timestamp and MD5. Presto, we have auto update capability! Caveat: I pass along the old MD5 in the URL, but it is currently ignored. I do not know if doing the MD5 on the server is a good idea, but obviously it is easy to add later. At the moment it happens on the node, which means wasted bandwidth when the timestamp has changed, but the file has not (probably not something that will happen in typical usage). Caveat: The timestamp used on remote nodes is the time the tarfile is installed (GM time of course). We could arrange to return the timestamp of the local file back to the node, but that would mean complicating the protocol (or using an http header) and I was not in the mood for that. In typical usage, I do not think that people will be changing tarfiles and rpms so rapidly that this will make a difference, but if it does, we can change it. * node_update.in, client side watchdog, and various web pages: Deflated node_update, removing all of the older ssh code.
We now assume that all nodes will auto update on a periodic basis, via the watchdog that runs on all client nodes, including plab nodes. Changed the permission check to look for new UPDATE permission (used to be UPDATEACCOUNT). As before, it requires local_root or better. The reason for this is that node_update now implies more than just updating the accounts/mounts. The web pages have been changed to explain that in addition to mounts/accounts, rpms and tarfiles will also be updated. At the moment, this is still tied to a single variable (update_accounts) in the nodes table, but as Kirk requested at the meeting, it will probably be nice to split these out in the future. Added the ability to node_update a single node in an experiment (in addition to all nodes option on the showexp page). This has been added to the shownode webpage menu options. Changed locking code to use the newer wrapper states, and to move the experiment to RUNNING_LOCKED until the update completes. This is to prevent mayhem in the rest of the system (which could be dealt with, but is not worth the trouble; people have to wait until their initiated update is complete, before they can swap out the experiment). Added "short" mode to shownode routine, equivalent to the recently added short mode for showexp. I use this on the confirmation page for updating a single node, giving the user a couple of pertinent (feel good) facts before they confirm.
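The timestamp/MD5 decision described for install-rpm, install-tarfile and spewrpmtar.in above can be sketched as two small functions. This is a minimal Python illustration under stated assumptions: the record layout and all function names are hypothetical, and the real scripts are not written this way.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InstallRecord:
    """What the node remembers about an installed tarball or rpm."""
    timestamp: int  # file mtime, or install time for remote nodes
    md5: str        # hex digest of the file that was installed


def server_should_send(client_timestamp: int, file_mtime: int) -> bool:
    """Server side: False corresponds to answering 304 (Not Modified)."""
    return client_timestamp < file_mtime


def client_decide(record: InstallRecord, new_timestamp: int,
                  data: bytes) -> Tuple[str, InstallRecord]:
    """Client side, once the timestamps differ and the file is in hand."""
    digest = hashlib.md5(data).hexdigest()
    if digest == record.md5:
        # Same content: only refresh the timestamp to avoid repeated MD5s.
        return "touch", InstallRecord(new_timestamp, record.md5)
    # Content changed: reinstall and remember the new timestamp and MD5.
    return "reinstall", InstallRecord(new_timestamp, digest)
```

Computing the digest on the client matches the first caveat above: the file is downloaded even when only the timestamp changed, which costs bandwidth but keeps the server side simple.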
-
Mac Newbold authored
-
Mac Newbold authored
-
- 08 Oct, 2003 1 commit
-
-
Leigh B. Stoller authored
this page is open to the world.
-
- 07 Oct, 2003 6 commits
-
-
Robert Ricci authored
second argument so that you can pass ($pid,$gid) when showing a group's experiments. But, it has a default value, so you don't have to go around passing a superfluous second argument for showing user or project experiments.
-
Robert Ricci authored
-
Mac Newbold authored
-
Leigh B. Stoller authored
-
Robert Ricci authored
directory, symlink in /proj/<pid> .
-
Leigh B. Stoller authored
nodes it can). Change exit value; return -1 on fatal error, otherwise return the number of nodes that could not be allocated. Combined with the -p switch, assign_wrapper can easily determine that nalloc was able to reserve a subset of the nodes. Also fix up getopts() call, which had its arguments backwards! Good thing we hardly pass switches to nalloc.
-