- 18 Dec, 2002 1 commit
-
-
Leigh B. Stoller authored
before using it!
-
- 16 Dec, 2002 1 commit
-
-
Mac Newbold authored
help nodes in reload_pending get sucked into reloading faster. If it doesn't do enough, we'll need to do more batching of stuff, so we get some parallelism in os_load instead of forcing it to serialize by calling os_load one node at a time. I was tempted to nuke all the stuff that was in there from the netdisk reload type, but decided not to. It won't be too long (relatively speaking) before we have freed, the new "free node manager" that will replace/supersede our current reload_daemon anyway.
-
- 11 Dec, 2002 1 commit
-
-
Leigh B. Stoller authored
a temporary class for testing new images.
-
- 04 Nov, 2002 1 commit
-
-
Mac Newbold authored
-
- 01 Nov, 2002 1 commit
-
-
Mac Newbold authored
Update reload_daemon with corresponding changes to those in os_load. Quick cleanup to deactivate the bitrotted netdisk stuff.
-
- 18 Oct, 2002 1 commit
-
-
Mac Newbold authored
Changes to watch out for: - db calls that change boot info in nodes table are now calls to os_select - whenever you want to change a node's pxe boot info, or def or next boot osids or paths, use os_select. - when you need to wait for a node to reach some point in the boot process (like ISUP), check the state in the database using the lib calls - Proxydhcp now sends a BOOTING state for each node that it talks to. - OSs that don't send ISUP will have one generated for them by stated either when they ping (if they support ping) or immediately after they get to BOOTING. - States now have timeouts. Actions aren't currently carried out, but they will be soon. If you notice problems here, let me know... we're still tuning it. (Before all timeouts were set to "none" in the db) One temporary change: - While I make our new free node manager daemon (freed), all nodes are forced into reloading when they're nfreed and the calls to reset the os are disabled (that will move into freed).
-
- 07 Jul, 2002 1 commit
-
-
Leigh B. Stoller authored
-
- 13 May, 2002 1 commit
-
-
Robert Ricci authored
-
- 12 Feb, 2002 1 commit
-
-
Leigh B. Stoller authored
line in all email from the system. Remove all of the TESTBED: tags and modify the email function in the web server and perl library to prepend @DOMAIN@: to the message.
-
- 08 Feb, 2002 1 commit
-
-
Leigh B. Stoller authored
supporting autocreating and autoloading images. The imageid form now sports a field to specify a nodeid to create the image from; if set, the backend create_image script is invoked. That's the easy part. Slightly harder is autoloading images based on the osid specified in the NS file. To support this, I have added a new DB table called osidtoimageid, which holds the mapping from osid/pctype to imageid. When users create images, they must specify what node types that image is good for. Obviously, the mappings have to be unique or it would be impossible to figure it out! Anyway, once that image mapping is in place and the image created, the user can specify that ID in the NS file. I've changed os_setup to look for IDs that are not loaded, and to try to find one in the osidtoimageid table. If found, it invokes os_load. To keep things running in parallel as much as possible, os_setup issues all the loads/reboots (could be more than a single set of loads if multiple IDs are in the NS file) at once, and waits for all the children to exit. I've hacked up os_load a bit to try to be more robust in the face of PXE failures, which still happen and are rather troublesome. Need an event system! Contained in this revision are unrelated changes to make the OS and Image IDs per-project unique instead of globally unique, since that's a pain for the users. This turns out to be very messy, since underneath we do not want to pass around pid/ID in all the various places it's used. Rather, I create a globally unique name and extended the OS and Image tables to include pid/name/ID. The user selects pid/name, and I create the globally unique ID. For the most part this is invisible throughout the system, except where we interface with the user, say in the web pages; the user should see his chosen name where possible, and should invoke scripts (os_load, create_image, etc) using his/her name, not the internal ID. Also, in the front end the NS file should use the user name, not the ID.
All in all, this accounted for a number of annoying changes and some special cases that are unavoidable.
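The lookup-then-load flow described above can be sketched roughly as follows. This is only an illustration in Python, not the actual os_setup code: the dictionary stands in for the osidtoimageid table, the IDs and node names are invented, `plan_loads` and `run_loads` are hypothetical helpers, and threads stand in for the child processes the message mentions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the osidtoimageid table: (osid, pctype) -> imageid.
# Each (osid, pctype) pair maps to exactly one image, as the message requires.
OSID_TO_IMAGEID = {
    ("testproj-FBSD45", "pc600"): "testproj-FBSD45-IMG",
    ("testproj-FBSD45", "pc850"): "testproj-FBSD45-IMG",
    ("testproj-RHL71", "pc600"): "testproj-RHL71-IMG",
}


def plan_loads(nodes, loaded):
    """Group the nodes that need a load by the image they should get.

    nodes:  node_id -> (osid, pctype) requested in the NS file
    loaded: node_id -> osid currently on the node's disk
    """
    plan = {}
    for node_id, (osid, pctype) in sorted(nodes.items()):
        if loaded.get(node_id) == osid:
            continue  # already loaded, nothing to do
        imageid = OSID_TO_IMAGEID.get((osid, pctype))
        if imageid is None:
            raise LookupError(f"no image for {osid} on {pctype}")
        plan.setdefault(imageid, []).append(node_id)
    return plan


def run_loads(plan, load_fn):
    """Start one load per image at once, then wait for all of them."""
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = {
            imageid: pool.submit(load_fn, imageid, node_ids)
            for imageid, node_ids in plan.items()
        }
        return {imageid: fut.result() for imageid, fut in futures.items()}
```

Here `load_fn` is whatever actually performs the load for one image and a list of nodes; grouping by image first is what allows several sets of loads to be issued together.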
-
- 07 Feb, 2002 1 commit
-
-
Robert Ricci authored
retry (or warn) about nodes that get stuck more than once.
-
- 14 Jan, 2002 1 commit
-
-
Leigh B. Stoller authored
* Add appropriate goo to os/GNUMakefile so that Frisbee daemon is built and installed.
* Rework the frisbee launcher slightly. Aside from little changes (send email to tbops when frisbeed dies, new cmdline syntax to frisbeed), allow for frisbeed to exit gracefully after a period of inactivity (no client requests for 30 minutes, at present). In order to prevent a race condition with a new client being added (and rebooted) and frisbeed terminating before the client gets started, add a load_busy indicator to the images table (next to load_address slot) and set that to one each time frisbeelauncher is invoked. When frisbeed exits, test and clear that bit atomically (lock tables) and go around another time (restart frisbeed for another 30 minute period).
* Rework waitmode in os_load. Wait for all of the nodes to finish at once, and track which nodes never finish. Retry those nodes again by rebooting. The number of retries is configurable in the script, and is currently set to one. This should take care of some PXE boot related problems, although obviously not all.
* Got rid of -w option to os_load and made waitmode the default. The -s option can be used to start a reload, but not to wait for it to complete.
* Minor changes to sched_reload and reload_daemon; pass in -s option to os_load.
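The load_busy handshake can be pictured with a small Python sketch. It is an illustration under assumptions, not the launcher's real code: `ImageRow` is a hypothetical stand-in for one row of the images table, a `threading.Lock` stands in for the table lock, and `run_daemon_once` stands in for one frisbeed run that returns after its idle period.

```python
import threading


class ImageRow:
    """Hypothetical stand-in for one row of the images table."""

    def __init__(self):
        self.load_busy = 0
        self._lock = threading.Lock()  # plays the role of locking the table

    def mark_busy(self):
        # Done every time the launcher is invoked for this image.
        with self._lock:
            self.load_busy = 1

    def test_and_clear(self):
        # Read the flag and reset it to zero in one atomic step.
        with self._lock:
            was_busy = bool(self.load_busy)
            self.load_busy = 0
            return was_busy


def supervise(row, run_daemon_once):
    """Rerun the daemon while a launcher request arrived since the last check.

    Returns how many times the daemon was run.
    """
    runs = 0
    while True:
        run_daemon_once()  # returns when the daemon exits after being idle
        runs += 1
        if not row.test_and_clear():
            return runs
```

Because the flag is read and cleared in a single locked step, a client that registers just as the daemon is exiting leaves the flag set, and the supervisor goes around for another period rather than shutting down underneath it.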
-
- 04 Dec, 2001 1 commit
-
-
Robert Ricci authored
reloading experiment. Not particularly elegant, but a better solution should come with the event system.
-
- 27 Nov, 2001 1 commit
-
-
Robert Ricci authored
-
- 07 Nov, 2001 1 commit
-
-
Robert Ricci authored
just add seconds to it - It's in 'YYYYMMDDHHMMSS' format rather than a UNIX-style second count. So, the solution I came up with (not sure if it's the best one) is: (CURRENT_TIMESTAMP - INTERVAL $warn_time MINUTE) > rsrv_time
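To see why plain arithmetic on a 'YYYYMMDDHHMMSS' value goes wrong, and what the interval comparison buys, here is a small Python sketch. The function names and sample timestamps are made up for illustration; only the timestamp format and the shape of the comparison come from the message above.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # the 'YYYYMMDDHHMMSS' layout


def naive_too_long(rsrv_time, now, warn_minutes):
    """Wrong on purpose: treats the timestamps as plain integers."""
    # Subtracting minutes * 100 only lines up with the MM digits when no
    # hour or day boundary is crossed.
    return int(now) - warn_minutes * 100 > int(rsrv_time)


def reserved_too_long(rsrv_time, now, warn_minutes):
    """Mirror of (now - INTERVAL warn_minutes MINUTE) > rsrv_time."""
    reserved_at = datetime.strptime(rsrv_time, TIMESTAMP_FORMAT)
    current = datetime.strptime(now, TIMESTAMP_FORMAT)
    return current - timedelta(minutes=warn_minutes) > reserved_at


# A node reserved at 23:50 and checked at 00:10 the next day has been
# held for 20 minutes, which is under a 30 minute threshold.
RESERVED = "20011107235000"
CHECKED = "20011108001000"
print(naive_too_long(RESERVED, CHECKED, 30))     # integer math misfires
print(reserved_too_long(RESERVED, CHECKED, 30))  # calendar math does not
```

The integer version reports the node as overdue because the subtraction borrows across the day digits, while the calendar-aware comparison gives the expected answer.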
-
- 06 Nov, 2001 1 commit
-
-
Robert Ricci authored
than trying to infer how long a node has been reserved by polling the contents of the reserved table, use the timestamp in that table, which I didn't notice the first time through. Makes the code much simpler and more correct.
-
- 05 Nov, 2001 1 commit
-
-
Robert Ricci authored
currently defined as 30 minutes, to keep false positives to a minimum. Sends mail to testbed-ops if/when it finds any. The timing is not precise, as it only polls in between loading machines, but this is fine for our purposes.
-
- 23 Oct, 2001 1 commit
-
-
Robert Ricci authored
have entries in scheduled_reloads. Also changed hard-coded reload types to use the constants in libdb for flexibility.
-
- 17 Oct, 2001 3 commits
-
-
Robert Ricci authored
-
Robert Ricci authored
-
Robert Ricci authored
reloads to finish.
-
- 30 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
-
- 28 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
node is in the reloadpending EID. Typically means a bad image ID, but maybe some other problem.
-
- 26 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
-
- 18 Sep, 2001 1 commit
-
-
Robert Ricci authored
purpose, and avoid confusion with the current_reloads table.
-
- 17 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
scripts and move to libdb (hardwired in one place of many!).
-
- 06 Sep, 2001 1 commit
-
-
Leigh B. Stoller authored
firing off an os_load, just move the node from its current reservation to emulab-ops/reloadpending. This moves the operation out of band from the user's perspective (he gets more immediate response when an experiment ends, and besides we cannot handle mass reloads anyway, and so this approach is unusable until Frisbee). Change the reload_daemon to look for free nodes that need a reload (as before) *and* nodes in emulab-ops/reloadpending (as put there by nfree). In this case, the imageid comes from the reloads table instead of the node-types table. I also updated the reload_daemon to use libdb routines. Also change testbed/reloading to emulab-ops/reloading. Maybe someday I'll remove these hardwired strings.
-
- 23 Aug, 2001 1 commit
-
-
Mac Newbold authored
Lots of small changes for turning our 'require lib*' lines into 'use lib*' lines. Proper modules declare themselves as a package, and use Exporter to export the names of the subroutines that should be visible from the outside world. Many of ours didn't do that; each was just a file with a bunch of subs in it. So now I've fixed many of them to be proper, removed the requires and 'push(@INC,...)' hacks, and changed them to the proper 'use lib @prefix@/lib/;' and use lib*.
-
- 01 Aug, 2001 1 commit
-
-
Leigh B. Stoller authored
-
- 21 Jul, 2001 1 commit
-
-
Mac Newbold authored
Many changes and updates for handling new types. The db now has types like 'pc600', 'pc850', and 'dnard', and each type has a class like 'pc' or 'shark'. This updates scripts that use types to use classes where appropriate, and to handle the new types where there were hardcoded things that couldn't be eliminated right now.
-
- 29 Jun, 2001 1 commit
-
-
Leigh B. Stoller authored
will catch most problems.
-
- 10 May, 2001 1 commit
-
-
Leigh B. Stoller authored
proper headers. Split out some of the mail into testbed-logs, testbed-ops, and testbed-approval. Added a library for including from our perl scripts. Contains a couple of mail helper functions, but will hopefully contain more as time goes by. Fixed a bug in the web interface that was causing breakage for people with multiple accounts. Mac and Jay have noticed this, when logging out and trying to join or create a project under a new or different name.
-
- 03 May, 2001 1 commit
-
-
Leigh B. Stoller authored
replaced by the "images" table. New os_info table is added. New web pages to add and delete OSIDs to/from the os_info table, for use in the NS file. tb-create-os is gone. handle_os no longer operates on the tbcmds file, and no longer writes anything into the ir file. Moved the setting up of os state (nodes table) from os_setup to handle_os, where it should be. os_load and sched_reload now take a single argument, the name of the imageid from the images table.
-
- 30 Mar, 2001 1 commit
-
-
Leigh B. Stoller authored
since the last reservation (as determined by last_reservation table). Picks one (randomly) from that set of nodes, and calls sched_reload on it. Then waits until the node has finished reloading, as determined by the reserved table, which gets cleared by the tmcd when the node first reboots after a scheduled reload. Sleeps 30 seconds, and then goes around again. So at most one node is tied up in a reload at a time, which seems like a good balance between trying to keep the machines in a pristine state, and having nodes available for use. The advantage of this approach is that instead of calling sched_reload on 40 nodes (after generating a new image) and watching the network melt down, we can let the nodes reload at a slower pace. We could call sched_reload on allocated nodes so that they will load when freed, but we run into the problem of big experiments ending and causing a meltdown. The downside is that this approach is a little too aggressive. Nodes will end up reloading after just a single experiment. Need finer-grained control over when to reload, but I will leave that as an exercise for later.
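The one-node-at-a-time loop described above might look roughly like this in Python. It is a sketch under assumptions, not the daemon itself: how candidates are selected is left to a caller-supplied `find_candidates` function, and `sched_reload` and `is_reloading` are hypothetical callables standing in for the real script and the check against the reserved table.

```python
import random
import time


def reload_one(candidates, sched_reload, is_reloading,
               poll=time.sleep, poll_seconds=5, rng=random):
    """Pick one candidate at random, schedule it, and wait until it is done."""
    if not candidates:
        return None
    node = rng.choice(sorted(candidates))
    sched_reload(node)
    while is_reloading(node):
        poll(poll_seconds)  # the polling interval is an assumption
    return node


def daemon_loop(find_candidates, sched_reload, is_reloading, passes,
                sleep=time.sleep, pause_seconds=30):
    """Run a fixed number of passes; the real daemon would loop forever."""
    done = []
    for _ in range(passes):
        node = reload_one(find_candidates(), sched_reload, is_reloading,
                          poll=sleep)
        if node is not None:
            done.append(node)
        sleep(pause_seconds)  # the 30 second pause between passes
    return done
```

Since each pass blocks until its node finishes, at most one node is ever mid-reload, which is the throttling property the message is after.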
-