- 25 Jun, 2003 1 commit
-
-
Leigh B. Stoller authored
120 + (30 * number_of_vnodes).
-
- 14 Apr, 2003 2 commits
-
-
Chad Barb authored
Fixed error message. "There were 0 nodes." --> "There were 0 failed nodes."
-
Leigh B. Stoller authored
whatever assign_wrapper did. This is different than local nodes where we allocate the underlying phys node and set its osid. Confusing.
-
- 07 Apr, 2003 1 commit
-
-
Chad Barb authored
Modify os_setup return codes to enable "intelligent" retry. os_setup now returns:
  0 on success
  1 on one or more retry-friendly errors
  -1 on no-retry errors
tbswap.in checks os_setup's return code, and will only retry on 1.
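A minimal sketch of the retry rule described above, written in Python for illustration rather than the Perl the real scripts use. The function name `run_with_retry`, the `max_retries` limit, and the callable standing in for os_setup are assumptions; only the three return codes come from the commit message.

```python
from typing import Callable

# Return codes as listed in the commit message.
OS_SETUP_OK = 0
OS_SETUP_RETRY = 1
OS_SETUP_FATAL = -1


def run_with_retry(os_setup: Callable[[], int], max_retries: int = 1) -> bool:
    """Call os_setup, retrying only when it reports a retry-friendly error.

    `os_setup` is a hypothetical stand-in that returns one of the codes
    above. `max_retries` is an assumed limit, not taken from the source.
    """
    retries = 0
    while True:
        code = os_setup()
        if code == OS_SETUP_OK:
            return True
        if code == OS_SETUP_RETRY and retries < max_retries:
            retries += 1
            continue
        # No-retry error, or retries exhausted: give up.
        return False
```

A caller would pass a function that runs the real setup step and reports its code; a no-retry error ends the loop on the first attempt.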
-
- 18 Mar, 2003 1 commit
-
-
Chad Barb authored
Here it is; reswap.
  nfree - modified to put node in FREE_DIRTY when it is freed
  assign_wrapper - '-u' update switch added.
  os_setup - doesn't reboot node which is already in RES_READY
  tbswap - calls all this stuff appropriately
-
- 17 Mar, 2003 1 commit
-
-
Leigh B. Stoller authored
to a specific one, for the purposes of mapping things like FBSD-STD to FBSD47-STD (the current OSID to use). This is technically more correct than what os_setup used to do, which was map FBSD-STD to whatever FreeBSD OSID was currently on the disk. Now it maps to a specific one, and if that is not loaded, it sets up a reload.
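The mapping behaviour described here can be sketched as follows, in Python for illustration. The table contents beyond FBSD-STD to FBSD47-STD, the function name `resolve_osid`, and the set of loaded OSIDs are assumptions made up for the example.

```python
# Generic OSID -> the specific OSID currently in use.
# Only the FBSD-STD entry comes from the commit message.
GENERIC_TO_SPECIFIC = {"FBSD-STD": "FBSD47-STD"}


def resolve_osid(requested: str, loaded_osids: set) -> tuple:
    """Map a generic OSID to its specific one and say whether a reload is needed.

    Returns (specific_osid, needs_reload). `loaded_osids` is a hypothetical
    view of what is already on the node's disk.
    """
    specific = GENERIC_TO_SPECIFIC.get(requested, requested)
    needs_reload = specific not in loaded_osids
    return specific, needs_reload
```

The point of the design is that the result no longer depends on whatever FreeBSD OSID happens to be on the disk: the target is fixed, and a mismatch triggers a reload.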
-
- 31 Jan, 2003 2 commits
-
-
Robert Ricci authored
-
Robert Ricci authored
-
- 29 Jan, 2003 1 commit
-
-
Robert Ricci authored
no longer waits that long.
-
- 07 Jan, 2003 1 commit
-
-
Leigh B. Stoller authored
real nodes get. Also, run a proper os_select on jailed nodes, *after* the os for the physical node is set up, since otherwise stated will not be happy.

Fixes for dealing with failed os_load. Previously, if os_load would fail, os_setup would wait for those nodes anyway since it had no idea what nodes had failed (and we do not want to just quit from os_setup since that might cause a lot of extra power cycles). Now, for each node that got an os_load, check its eventstate; it should be in ISUP immediately after os_load exits (since that's what os_load waited for), and if it's not, then mark that node as failed. Note though that failed loads no longer result in the node going into hwdown, since 99 percent of the time it's a busted user image, not a hardware problem. I figure we will catch real hw errors via the reload daemon, when it sends email about nodes not finishing.

Do not bother with doing the vnode setup if any of the phys nodes failed to set up. Leads to cascading errors and prolongs the agony by another few minutes. Might revisit this later.

Remove local WaitTillAlive() function, and switch to using the version I put into libdb a couple of weeks ago.

Fix up a bunch of print statements to be nicer.
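A rough sketch of the post-os_load check described above, in Python for illustration. The function name `split_by_load_result` and the `get_eventstate` callback are hypothetical; in the real script the state comes from the database.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_load_result(
    nodes: Iterable[str], get_eventstate: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Partition nodes that just went through os_load into (ok, failed).

    A node is expected to be in ISUP right after os_load exits, so any
    other state is treated as a failed load. `get_eventstate` is an
    assumed lookup helper, not a real library call.
    """
    ok, failed = [], []
    for node in nodes:
        if get_eventstate(node) == "ISUP":
            ok.append(node)
        else:
            failed.append(node)
    return ok, failed
```

With this split in hand, the caller can stop waiting on the failed nodes and skip the vnode setup whenever the failed list is non-empty.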
-
- 31 Oct, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 18 Oct, 2002 1 commit
-
-
Mac Newbold authored
Changes to watch out for:
  - db calls that change boot info in nodes table are now calls to os_select
  - whenever you want to change a node's pxe boot info, or def or next boot osids or paths, use os_select.
  - when you need to wait for a node to reach some point in the boot process (like ISUP), check the state in the database using the lib calls
  - Proxydhcp now sends a BOOTING state for each node that it talks to.
  - OSs that don't send ISUP will have one generated for them by stated either when they ping (if they support ping) or immediately after they get to BOOTING.
  - States now have timeouts. Actions aren't currently carried out, but they will be soon. If you notice problems here, let me know... we're still tuning it. (Before all timeouts were set to "none" in the db)
One temporary change:
  - While I make our new free node manager daemon (freed), all nodes are forced into reloading when they're nfreed and the calls to reset the os are disabled (that will move into freed).
-
- 26 Sep, 2002 1 commit
-
-
Mac Newbold authored
Fix small problem that was causing a failure in the test suite: If there are no nodes in the expt, don't die(), just exit().
-
- 05 Aug, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 07 Jul, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 03 Jul, 2002 1 commit
-
-
Robert Ricci authored
-
- 02 Jun, 2002 1 commit
-
-
Leigh B. Stoller authored
state to REBOOTING, and then wait for the ISUP state to be set. This change is reflected in the client-side startup scripts on remote nodes, which now issue a REBOOTED event, and then an ISUP event after everything is set up properly.
-
- 13 May, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 10 May, 2002 2 commits
-
-
Robert Ricci authored
too much CPU
-
Robert Ricci authored
of just pinging it, poll its eventstate in the database, and wait until it's reported ISUP. This way, we don't report that a node is ready before it really is. Also bumped up the timeout by a couple of minutes to account for the extra time it takes for the OS to boot. Right now, if the OS is considered pingable, we assume that it will report ISUP. In the future, however, when the node state is more formalized, we will have better ways of determining this.
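The polling approach described here might look roughly like the following, sketched in Python for illustration. The name `wait_for_isup`, the `get_eventstate` callback, and the default poll interval are assumptions; the commit message does not give the actual timeout values.

```python
import time
from typing import Callable


def wait_for_isup(
    node: str,
    get_eventstate: Callable[[str], str],
    timeout_s: float,
    poll_interval_s: float = 5.0,  # assumed value, not from the source
) -> bool:
    """Poll a node's eventstate until it reports ISUP or the timeout expires.

    `get_eventstate` is a hypothetical helper standing in for the
    database lookup. Returns True if ISUP was seen in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_eventstate(node) == "ISUP":
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between polls rather than spinning on the database.
        time.sleep(poll_interval_s)
```

Unlike a ping check, this only reports the node as ready once the node itself has said so.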
-
- 09 May, 2002 1 commit
-
-
Robert Ricci authored
better picture of where time is spent during experiment setup.
-
- 08 May, 2002 1 commit
-
-
Leigh B. Stoller authored
remote, be local someday). BIG CHANGE: Start using the last_reservation table to auto reload nodes that are reallocated before they are reloaded by the reload daemon.
-
- 22 Apr, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 16 Apr, 2002 1 commit
-
-
Leigh B. Stoller authored
second chance by rebooting and adding it back onto the end of the list of nodes to wait for. This is a temporary measure (until stated handles this).
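One way to picture the "second chance" behaviour is a queue in which a node that is not up gets rebooted once and is re-appended at the end. This is a sketch in Python under assumptions: `is_up` and `reboot` are hypothetical callbacks, and the one-retry limit is inferred from the wording rather than confirmed by the source.

```python
from collections import deque
from typing import Callable, Iterable, List


def wait_for_nodes(
    nodes: Iterable[str],
    is_up: Callable[[str], bool],
    reboot: Callable[[str], None],
) -> List[str]:
    """Wait on each node in turn; give a node that is not up one more try.

    A node that fails its first check is rebooted and pushed onto the end
    of the list. A node that fails again is reported as failed.
    `is_up` and `reboot` are assumed helpers for illustration only.
    """
    pending = deque((node, False) for node in nodes)  # (node, already_retried)
    failed = []
    while pending:
        node, retried = pending.popleft()
        if is_up(node):
            continue
        if retried:
            failed.append(node)
        else:
            reboot(node)
            pending.append((node, True))
    return failed
```

Re-queuing at the end means the retried node is checked again only after the remaining nodes have had their turn, which gives the reboot some time to take effect.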
-
- 05 Mar, 2002 1 commit
-
-
Leigh B. Stoller authored
assign_wrapper.in: Hack in a change that ensures a delay node is created for any link on which an event is posted (up,down,modify), no matter what its initial parameters are. ie: If a link is created with no delay, but there is an event that adds a delay later, then we must drop in a delay node. Same for up/down on a link. We do this in the delay node. I am reasonably confident that this change is fine for duplex links, but I am less sure of the effect on lans!

eventsys_control.in: Checkpoint latest changes. Add "replay" option, which right now just stops and starts the event scheduler so that it reloads the entire event list. Add check for existing experiment, and that the experiment is either active or swapping (do not want to start a scheduler for a swapped out experiment!). Add check to see if there are any events, and skip startup if there are no events in the DB. Lastly, get very serious about preventing more than one scheduler from being started, either by accident or intentionally. My protocol is to lock the table, grab and set the pid to -pid, test the pid for a positive value, and if positive, send the scheduler a kill(TERM) so that it can clean up, clear the pid to zero in the DB, and exit. This approach ensures that we do not try to send a kill to a pid that is no longer active or owned by the user (this last part is not really necessary because of how pids are reused, but it was easy to add so why not).

exports_setup.in: Trivial change to make it easier to turn this on temporarily in devel trees.

named_setup.in: Ditto.

node_reboot.in: Add call to TBdbfork() in child because of apparent DB connection problems across forks. In the child, set the eventstatus for the node to REBOOT if successful (note this event status stuff is temporary, will be recast in next set of revisions).

GNUmakefile: Add new controlling program, eventsys_control.

power.in: Ditto previous comment about REBOOT.

os_setup.in: Non event system cleanups.

tbend.in: Add DB cleanup of the new virt_trafgens and eventlist tables.

tbprerun.in: Ditto.

tbreport.in: Print out the event list in a pretty print format.

tbswapin.in: Add call to start the event system. Also a big fix; move the named script up above the os_setup so that the named tables have been updated by the time the first node reboots. I noticed that nodes were failing on gethostbyname().

tbswapout.in: Add call to stop the event system.
-
- 12 Feb, 2002 1 commit
-
-
Leigh B. Stoller authored
line in all email from the system. Remove all of the TESTBED: tags and modify the email function in the web server and perl library to prepend @DOMAIN@: to the message.
-
- 08 Feb, 2002 1 commit
-
-
Leigh B. Stoller authored
supporting autocreating and autoloading images. The imageid form now sports a field to specify a nodeid to create the image from; if set, the backend create_image script is invoked. That's the easy part.

Slightly harder is autoloading images based on the osid specified in the NS file. To support this, I have added a new DB table called osidtoimageid, which holds the mapping from osid/pctype to imageid. When users create images, they must specify what node types that image is good for. Obviously, the mappings have to be unique or it would be impossible to figure it out! Anyway, once that image mapping is in place and the image created, the user can specify that ID in the NS file. I've changed os_setup to look for IDs that are not loaded, and to try and find one in the osidtoimageid. If found, it invokes os_load. To keep things running in parallel as much as possible, os_setup issues all the loads/reboots (could be more than a single set of loads if multiple IDs are in the NS file) at once, and waits for all the children to exit. I've hacked up os_load a bit to try and be more robust in the face of PXE failures, which still happen and are rather troublesome. Need an event system!

Contained in this revision are unrelated changes to make the OS and Image IDs per-project unique instead of globally unique, since that's a pain for the users. This turns out to be very messy, since underneath we do not want to pass around pid/ID in all the various places it's used. Rather, I create a globally unique name and extended the OS and Image tables to include pid/name/ID. The user selects pid/name, and I create the globally unique ID. For the most part this is invisible throughout the system, except where we interface with the user, say in the web pages; the user should see his chosen name where possible, and they should invoke scripts (os_load, create_image, etc) using his/her name not the internal ID. Also, in the front end the NS file should use the user name not the ID.

All in all, this accounted for a number of annoying changes and some special cases that are unavoidable.
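The osid/pctype to imageid lookup and the grouping of loads described above can be sketched like this, in Python for illustration. The table contents, the node tuple layout, and the name `plan_loads` are assumptions; only the idea of a unique (osid, pctype) to imageid mapping, consulted when the OSID is not already loaded, comes from the commit message.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical node record: (node_id, pctype, wanted_osid, loaded_osids).
Node = Tuple[str, str, str, Set[str]]


def plan_loads(
    nodes: Iterable[Node], osidtoimageid: Dict[Tuple[str, str], str]
) -> Dict[str, List[str]]:
    """Group nodes that need a load by the image they should receive.

    Nodes that already have the wanted OSID loaded are skipped. Raises
    LookupError when no image is mapped for a node's osid/pctype pair.
    Each resulting group corresponds to one set of loads that could be
    started in parallel with the others.
    """
    loads: Dict[str, List[str]] = {}
    for node_id, pctype, osid, loaded in nodes:
        if osid in loaded:
            continue
        imageid = osidtoimageid.get((osid, pctype))
        if imageid is None:
            raise LookupError(f"no image for osid {osid!r} on {pctype!r}")
        loads.setdefault(imageid, []).append(node_id)
    return loads
```

Because the key is the pair (osid, pctype), the same OSID can resolve to different images on different node types, which is why the mapping has to be unique per pair.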
-
- 20 Dec, 2001 1 commit
-
-
Leigh B. Stoller authored
-
- 17 Dec, 2001 1 commit
-
-
Leigh B. Stoller authored
"goto" was messing up PERL?
-
- 11 Oct, 2001 1 commit
-
-
Leigh B. Stoller authored
reboot.
-
- 06 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
have been added as OSIDs so that the parser accepts them. os_setup maps them into whatever equiv OSID is loaded on the target node, according to the OS slot of the osid table entry. If no mapping can be made (no equiv OS loaded, as defined by the partitions table) os_setup fails. I've also changed the web page node control form so that the only OSIDs you can set for a node are the ones loaded (partitions table) or OSKit kernels (osid table entry has a path).
-
- 24 Aug, 2001 3 commits
-
-
Mac Newbold authored
Change occurrences of "@TESTMODE@" back to @TESTMODE@ like they were supposed to be in the first place...
-
Mac Newbold authored
-
Leigh B. Stoller authored
The problem is that "@TESTMODE@" is wrong! This becomes a string "0", instead of the integer 0, and so "if ($TESTMODE)" breaks down.
-
- 23 Aug, 2001 1 commit
-
-
Mac Newbold authored
Lots of small changes for turning our 'require lib*' lines into 'use lib*' lines. Proper modules declare themselves as a package, and use Exporter to export the names of the subroutines that should be visible from the outside world. Many of ours didn't do that, it was just a file with a bunch of subs in it. So now I've fixed many of them to be proper, and removed the requires and 'push(@INC,...)' hacks and changed it to the proper 'use lib @prefix@/lib/;' and use lib*.
-
- 23 Jul, 2001 1 commit
-
-
Leigh B. Stoller authored
-
- 21 Jul, 2001 1 commit
-
-
Mac Newbold authored
Many changes and updates for handling new types. The db now has types like 'pc600', 'pc850', and 'dnard', and each type has a class like 'pc' or 'shark'. This updates scripts that use types to use classes where appropriate, and to handle the new types where there were hardcoded things that couldn't be eliminated right now.
-
- 17 Jul, 2001 1 commit
-
-
Leigh B. Stoller authored
a bootstatus field to the nodes table. os_setup sets this to one of okay, failed, unknown. This is to be used with the still-to-be-defined method of specifying certain nodes that can fail reboot on experiment creation. Right now sharks are wired to this, and this information is presented in the web page. It's also essential for the batch system, which needs to consider nodes that failed to reboot, or else batch experiments would never end. Might still need a way for an experiment to tell the batch system it's done though.
-
- 16 Jul, 2001 1 commit
-
-
Leigh B. Stoller authored
it's hardwired to the shark type, but needs to come from the DB somehow later. Mail is sent to the user (CC'ed to testbed-ops) when a node fails but the experiment continues setup.
-
- 13 Jul, 2001 1 commit
-
-
Leigh B. Stoller authored
Minor cleanups in os_setup, and move some code to libdb.
-