- 08 Dec, 2010 1 commit
-
-
Leigh B Stoller authored
-
- 10 May, 2010 1 commit
-
-
Leigh B Stoller authored
-
- 14 Apr, 2010 1 commit
-
-
Mike Hibler authored
Boss/ops/fs: reboot them together after setup rather than serially. Nodes: leave them in PXEWAIT throughout the setup, until after boss has been rebooted. At that point we send them the new bootinfo RESTART command telling pxeboot to re-DHCP and use the new info obtained (next-server) to contact a potentially new boss node. This is a quick way to switch a node in PXEWAIT from talking to the outer boss to talking to the inner one. A significant number of rinky-dink changes were needed to do this, primarily adding a new state, PXELIMBO, where nodes can be sent to sit until they are restarted. It turns out that just putting them in an existing state such as PXEWAKEUP or SHUTDOWN wouldn't work, as they tend to time out or otherwise reboot.
-
- 08 Apr, 2010 1 commit
-
-
Leigh B Stoller authored
resolution. Less confusing now. Ongoing changes to make better use of the node objects and methods.
-
- 22 Mar, 2010 1 commit
-
-
Leigh B Stoller authored
deleted, they still remain in the user table with a status of "archived", but since all the queries in the system now use uid_idx instead of uid, it is safe to reuse a uid since they are no longer ambiguous. The reason for not deleting users from the users table is so that the stats records can refer to the original record (who was that person named "mike"). This is very handy and worth the additional effort it has taken. There is no way to resurrect a user, but it would not be hard to add.
-
- 11 Mar, 2010 1 commit
-
-
Leigh B Stoller authored
-
- 21 Dec, 2009 1 commit
-
-
Leigh B. Stoller authored
land in hwdown. Currently, if a node fails to boot in os_setup and the node is running a system image, it is moved into hwdown. 99% of the time this is wasted work; the node did not fail for hardware reasons, but for some other reason that is transient. The new approach is to move the node into another holding experiment, emulab-ops/hwcheckup. The daemon watches that experiment, and nodes that land in it are freshly reloaded with the default image and rebooted. If the node reboots okay after reload, it is released back into the free pool. If it fails any part of the reload/reboot, it is officially moved into hwdown. Another possible use: if you have a suspect node, you go wiggle some hardware, and instead of releasing it into the free pool, you move it into hwcheckup to see if it reloads/reboots. If not, it lands in hwdown again. Then you break out the hammer. Most of the changes in Node.pm, libdb.pm, and os_setup are organizational changes to make the code cleaner.
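The hwcheckup decision described above can be sketched as follows. This is a minimal illustration in Python, not the actual daemon or Node.pm code; the function name and the string labels are hypothetical.

```python
def hwcheckup_outcome(reload_ok, reboot_ok):
    """Decide where a node sitting in hwcheckup goes next.

    Hypothetical helper: the real logic lives in the testbed's Perl code.
    A node returns to the free pool only if both the reload with the
    default image and the following reboot succeeded.
    """
    if reload_ok and reboot_ok:
        return "free"
    # Any failure during reload/reboot is treated as a real hardware problem.
    return "hwdown"


if __name__ == "__main__":
    print(hwcheckup_outcome(True, True))
    print(hwcheckup_outcome(True, False))
```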
-
- 16 Oct, 2009 1 commit
-
-
David Johnson authored
because we (vnode_setup) need to go out to the nodes and run vnodesetup to trigger the reload, but os_setup needs to set up the reload. So for now, os_setup sets up the reload but neither waits for nor reboots the vnode; vnode_setup does that as usual. There will probably be timeout problems, but it's good enough for my needs right now.
-
- 12 Oct, 2009 1 commit
-
-
David Johnson authored
the tb-set-node-os command with a second optional argument; if that is present, the first arg is the child OS and the second is the parent OS. We add some new features in ptopgen (OS-parentOSname-childOSname) based off a new table that maps which child OSes can run on which parents, and the right desires get added to match. We set up the reloads in os_setup along with the parents. Also needed a new opmode, RELOAD-PCVM, to handle all this. For now, users only have to specify that their images can run on pcvms, a special hack for which type the images can run on. This makes sense in general since there is no point conditionalizing childOS loading on hardware type at the moment, but rather on parentOS. Hopefully this stuff will mostly work on shared nodes too, although we'll have to be more aggressive on the client side about garbage collecting old frisbee'd images for long-lived shared hosts. I only made these changes in libvtop, so assign_wrapper folks are left in the dark. Currently, the client side supports frisbee. Only in openvz for now, and this probably breaks libvnode_xen.pm. Also in here are some openvz improvements, like the ability to sniff out which network is the public control net and which is the fake virtual control net.
-
- 24 Sep, 2009 1 commit
-
-
Leigh B. Stoller authored
-
- 08 May, 2009 1 commit
-
-
Mike Hibler authored
-
- 13 Feb, 2009 1 commit
-
-
Kevin Atkinson authored
node or virthost.
-
- 12 Feb, 2009 1 commit
-
-
Kevin Atkinson authored
Enough information is logged so that, at any point in time, it is possible to tell what images are being used. After collecting some stats for a while I hope to use this data to evaluate various strategies for preloading disks with images other than the default. Although not its primary purpose, enough information is collected to be able to get a snapshot of node usage at any point in time. This includes what nodes are being used and by whom, as in which experiments and thus which projects.

NOTE: For a while you might see a few of these warnings,

    *** WARNING: os_setup:
    *** could not find previous state (rsrcidx=484084) in image_history
    *** table, won't be able to determine newly allocated nodes

if someone does a swapmod to an experiment that was swapped in before this commit was installed. This is because os_setup uses previous information in the table to determine newly allocated nodes. This warning can safely be ignored in this case, and should go away over time.
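The idea of using previous state to find newly allocated nodes can be illustrated with a set difference. This is a sketch in Python under assumed data shapes (plain lists of node names); the real image_history schema and the os_setup code are not shown in this log, and the function name is hypothetical.

```python
def newly_allocated(previous_nodes, current_nodes):
    """Return nodes in the current allocation that were not in the previous one.

    `previous_nodes` is None when no previous state was recorded, which is
    the situation the warning above describes; in that case the answer
    cannot be determined and None is returned.
    """
    if previous_nodes is None:
        return None
    return sorted(set(current_nodes) - set(previous_nodes))


if __name__ == "__main__":
    print(newly_allocated(["pc1", "pc2"], ["pc2", "pc3"]))
```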
-
- 10 Sep, 2008 1 commit
-
-
Kevin Atkinson authored
Currently nodereboot in libreboot essentially ignores the waittime arg because it forks and calls node_reboot to do the real work, but doesn't pass on the waittime to it. Fix this by adding a "-W" option to node_reboot in order to specify the waittime. Use this to extend the waittime for a PLC node to come up from 6 minutes to 10.
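The fix amounts to forwarding the wait time on the child command line instead of dropping it. A rough Python sketch of that pattern follows; only the "-W" option comes from the commit message, while the function name, the argument order, and the unit of the value are assumptions made for illustration.

```python
def node_reboot_argv(nodes, waittime=None):
    """Build an argument list for node_reboot, forwarding the wait time.

    Hypothetical helper: option placement and units are assumptions.
    Without the "-W" forwarding, the caller's waittime would be silently
    ignored by the forked command.
    """
    argv = ["node_reboot"]
    if waittime is not None:
        argv += ["-W", str(waittime)]
    return argv + list(nodes)


if __name__ == "__main__":
    print(node_reboot_argv(["pc1", "pc2"], waittime=600))
```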
-
- 02 May, 2008 1 commit
-
-
Kevin Atkinson authored
rather than "USER@boss.emulab.net". The reason they were from USER@boss.emulab.net is that the script was being run as USER. I just added the "From:" header to the email.
-
- 25 Oct, 2007 1 commit
-
-
David Johnson authored
-
- 17 Sep, 2007 1 commit
-
-
David Johnson authored
after reboot in first swapmod, so it would be dealloc'd in the second swapmod).
-
- 16 Aug, 2007 1 commit
-
-
Leigh B. Stoller authored
plabslice, but still sorta behave like one. Mostly fixing up some special cases and using a different waittime calculation.
-
- 02 Aug, 2007 1 commit
-
-
Leigh B. Stoller authored
thankless job but someone has to do it. I'm expecting to finish by the time Bush 43 leaves office.
-
- 25 Apr, 2007 1 commit
-
-
Leigh B. Stoller authored
statements.
-
- 05 Apr, 2007 1 commit
-
-
Leigh B. Stoller authored
-
- 08 Sep, 2006 1 commit
-
-
Kirk Webb authored
Parallelize the setup of plab vnodes alongside the loading of local physical nodes. We fork vnode_setup to operate on the plab vnodes just before firing off local reload/reboot/reconfig operations. The status of the plab vnode setup is checked just before firing off vnode_setup for any local vnodes. The ISUP wait for plab vnodes continues to fall within the same stage as waiting for local vnodes. New arguments have been added to vnode_setup to tell it to only operate on specific vnode types: '-j' for local jail nodes, and '-p' for plab nodes. If neither is specified, the default is to operate on all types.
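The '-j' / '-p' selection rule can be sketched as below. This is an illustrative Python function, not vnode_setup itself; the type labels "jail" and "plab" and the assumption that these are the only vnode types are hypothetical.

```python
ALL_VNODE_TYPES = ("jail", "plab")  # assumption: the full set of types


def selected_vnode_types(opt_j=False, opt_p=False):
    """Map the -j and -p flags to the vnode types to operate on.

    If neither flag is given, every type is selected, matching the
    default described in the commit message.
    """
    chosen = []
    if opt_j:
        chosen.append("jail")
    if opt_p:
        chosen.append("plab")
    return chosen or list(ALL_VNODE_TYPES)


if __name__ == "__main__":
    print(selected_vnode_types(opt_p=True))
    print(selected_vnode_types())
```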
-
- 21 Aug, 2006 1 commit
-
-
Kevin Atkinson authored
Avoid counting planetlab vnodes twice.
-
- 16 Aug, 2006 1 commit
-
-
Kevin Atkinson authored
tbreport errors & context.
- Modified fatal() in swapexp, batchexp, and tbprerun, and die_noretry() in os_setup to pass hash parameter to tblog functions.
- Added tbreport error & context information for select errors in swapexp, tbswap, assign_wrapper2, snmpit_lib, snmpit, batchexp, assign_wrapper, os_setup, parse-ns, & tbprerun.
- Added assign error parser in assign_wrapper2.
- Added parse.tcl error parser in parse-ns.
- Added severity constants for tbreport in libtblog_simple.
- Added tbreport() function & context table mapping for reporting discrete error types to libtblog.
-
- 27 Jul, 2006 1 commit
-
-
Kevin Atkinson authored
Small bug fixes in cleanup in os_setup summary code.
-
- 26 Jul, 2006 2 commits
-
-
Kevin Atkinson authored
Fix syntax error.
-
Kevin Atkinson authored
swapexp: The previous commit, which added a message about the recovery action when a swap-modify failed to the top of the email, did not catch all of the possible cases. Added the case when the experiment is not swapped in.

os_setup: Refactored/rewrote os_setup error summary code. Distinguish the case when nodes fail to properly load the OS from the case when they don't boot after loading the OS.
-
- 21 Jul, 2006 1 commit
-
-
Kevin Atkinson authored
Don't use "no warnings 'uninitialized'" since that is a perl 5.6+ feature and some are still using an ancient version of perl.
-
- 20 Jul, 2006 3 commits
-
-
Kevin Atkinson authored
length => $length in os_setup!
-
Kevin Atkinson authored
Fixed bug in summary of failed nodes when there are more than can fit on a line.
-
Kevin Atkinson authored
Various tblog changes: Added message about recovery action when a swap-modify failed to the top of the email. Fine tuned os_setup summary error. Added (possibly partial) list of nodes that fail; if a large number fail, show only as many as will fit on a single line. Other tweaks. Flagged assign_wrapper errors of an Invalid OS as user errors.
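One way to produce a possibly partial node list that fits on a single line is sketched below in Python. The line width, the "..." marker, and the function name are assumptions for illustration; the actual os_setup summary code is Perl and may format things differently.

```python
def summarize_nodes(nodes, width=70):
    """Join node names with spaces, truncating with '...' to fit `width`.

    While more nodes remain, four characters are reserved for " ..." so
    the truncated line never exceeds the width.
    """
    shown = []
    length = 0
    for i, node in enumerate(nodes):
        extra = len(node) + (1 if shown else 0)   # name plus separator
        remaining = len(nodes) - i - 1
        reserve = 4 if remaining else 0           # room for " ..."
        if length + extra + reserve > width:
            return " ".join(shown + ["..."])
        shown.append(node)
        length += extra
    return " ".join(shown)


if __name__ == "__main__":
    print(summarize_nodes(["pc1", "pc2", "pc3"], width=8))
```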
-
- 18 Jul, 2006 1 commit
-
-
Leigh B. Stoller authored
table, into a new table called node_type_attributes, which is intended to be a more extensible way of describing nodes. The only things left in the node_types table will be type, class and the various isXXX boolean flags, since we use those in numerous joins all over the system (i.e., when discriminating amongst nodes). For the most part, all of that other stuff is rarely used, or used in contexts where the information is needed, but not for type discrimination. Still, it made for a lot of queries to change!

Along the way I added a NodeType library module that represents the type info as a perl object. I also beefed up the existing Node module, and started using it in more places. I also added an Interfaces module, but I have not done much with that yet. I have not yet removed all the slots from the node_types table; I plan to run the new code for a few days and then remove the slots.

Example using the new NodeType object:

    use NodeType;

    my $typeinfo = NodeType->Lookup($type);
    if ($typeinfo->control_interface(\$control_iface) ||
        !$control_iface) {
        warn "No control interface for $type is defined in the DB!\n";
    }

or using the Node:

    use Node;

    my $nodeobject = Node->Lookup($node_id);
    my $imageable = $nodeobject->NodeTypeInfo()->imageable();

or

    my $rebootable = $nodeobject->isrebootable();

or

    $nodeobject->NodeTypeAttribute("control_interface", \$control_iface);

Lots of ways to accomplish the same thing, but the main point is that the Node is able to override the NodeType (if it wants to), which I think is necessary for flexibly describing one/two of a kind things like switches, etc.
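The override relationship between a node and its type can be pictured with a small lookup chain. The sketch below is in Python purely for illustration and does not mirror the Perl Node or NodeType modules; the class names, method names, and attribute values are all hypothetical.

```python
class TypeInfo:
    """Stand-in for per-type attributes (hypothetical, not NodeType.pm)."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)

    def attribute(self, key):
        return self.attributes.get(key)


class NodeInfo:
    """Stand-in for a node that may override its type's attributes."""

    def __init__(self, type_info, overrides=None):
        self.type_info = type_info
        self.overrides = dict(overrides or {})

    def attribute(self, key):
        # A per-node value wins; otherwise fall back to the type's value.
        if key in self.overrides:
            return self.overrides[key]
        return self.type_info.attribute(key)


if __name__ == "__main__":
    pc = TypeInfo({"control_interface": "eth0", "imageable": True})
    oddball = NodeInfo(pc, {"control_interface": "eth4"})
    print(oddball.attribute("control_interface"), oddball.attribute("imageable"))
```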
-
- 10 Jul, 2006 1 commit
-
-
Mike Hibler authored
-
- 08 Jul, 2006 1 commit
-
-
Kevin Atkinson authored
were no failed nodes for that pc type, it will now display "2 xxx's ... successfully ...". Add "no warnings 'uninitialized'" inside loop since I want undefined to mean zero. Revert previous (incomplete) fix by checking if it is defined first.
-
- 07 Jul, 2006 1 commit
-
-
Russ Fish authored
-
- 05 Jul, 2006 2 commits
-
-
Kevin Atkinson authored
Fixed perl warning about Use of uninitialized value in numeric gt.
-
Kevin Atkinson authored
Many changes to tblog code. Database update needed:

1) Added summary of failed nodes in os_setup. The cause of the error is now classified as "user" if it is only user images that failed and the user image failed on every pc of a particular type. Otherwise I leave the cause as "unknown" since it is really hard to tell what the real cause is.

2) Raised the confidence threshold for most errors so that they will appear on the top.

3) Added a special error when an experiment is canceled. The cause is "canceled" and testbed-ops won't see these errors.

4) Fixed a bug in assign_wrapper where it would incorrectly report "This experiment cannot be instantiated on this testbed..." when really the user canceled the swapin.

5) Fixed a bug where os_setup errors were being incorrectly reported as assign errors. This happens when os_setup fails for some reason and tbswap tries again, but the second time around there are not enough nodes. So the last error is coming from assign even though the true cause of the error is failed nodes. The fix for this involved adding a new column to the log table, "attempt", which will be 1 for the first attempt and then incremented for each new attempt. tblog_find_error will then simply ignore any errors with "attempt > 1".

6) Also fixed a potential problem when there is an error during the cleanup phase by adding another column "cleanup". tblog_find_error will also ignore any errors with the cleanup bit set.
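The filtering rule in items 5 and 6 can be sketched as follows. This Python fragment only illustrates the rule as stated (skip rows with attempt > 1 or with the cleanup bit set); the row layout as dictionaries and the function name are assumptions, and the real tblog_find_error works against the log table.

```python
def candidate_errors(log_rows):
    """Keep only errors that should be considered as the root cause.

    Rows from retry attempts (attempt > 1) and rows logged during the
    cleanup phase are ignored, so a later assign failure cannot mask the
    original os_setup failure.
    """
    return [
        row for row in log_rows
        if row["attempt"] <= 1 and not row["cleanup"]
    ]


if __name__ == "__main__":
    rows = [
        {"msg": "os_setup: nodes failed", "attempt": 1, "cleanup": 0},
        {"msg": "assign: not enough nodes", "attempt": 2, "cleanup": 0},
        {"msg": "error while tearing down", "attempt": 1, "cleanup": 1},
    ]
    print([row["msg"] for row in candidate_errors(rows)])
```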
-
- 14 Jun, 2006 1 commit
-
-
Leigh B. Stoller authored
Each template has a datastore, which is really just a subdirectory that can be populated with files, and committed to the subversion archive. Note, the datastore is specific to the template itself. The Template Archive link on the Show Template page takes you to the subdirectory, which by convention I am calling "datastore". The directory actually lives in /proj/pid/exp/eid/TGUID-VERS ... but that path is printed out for you on the archive page. Anyway, put stuff in the datastore directory, and then commit the template archive so there is a tag associated with it. When an instance is created, a checkout of the datastore is placed in the experiment directory (/proj/pid/eid/exp/template_datastore). The current tag (from above) is stored with the instance so that we can later recreate the environment for the instance, say for rerun. Tarfiles and rpms in the datastore can be referenced as xxx://foo.rpm (in your NS file). tarfiles_setup transforms those when the instance is swapped in, sorta like it does other URLs, only it does not actually fetch them; it just needs to rewrite the paths so they reference the datastore. The program agent gets another environment variable so you can refer to the datastore without hardwiring paths ($DATASTORE). Eventually I want to move the checkout someplace else, but it was easy to drop it into the experiment directory for now.
-
- 15 May, 2006 1 commit
-
-
Mike Hibler authored
    tb-set-node-plab-role $plc plc

to make it the PLC node. Then any number of other nodes are declared as:

    tb-set-node-plab-role $plab1 node

to make them inner plab nodes. Unlike elabinelab, there is no magic "tb-plab-in-elab" command which implies the topology; you put all the plab nodes in a LAN or whatever yourself. This may or may not be a good idea.

Anyway, these NS commands set DB state in virt_nodes and reserved much like elabinelab. During swapin, the dhcpd.conf file is rewritten so that inner plab nodes have their "filename" set to "pxelinux.0" and their "next-server" set to the designated PLC node. The PLC node will then be loaded/booted before anything is done to the inner plab nodes. After it comes up, the inner plab nodes are rebooted and declared as up.

There is a new tmcd command "eplabconfig" (suggestions for a new name welcome!), which returns info like:

    NAME=plc ROLE=plc IP=155.98.36.3 MAC=00d0b713f57d
    NAME=plab1 ROLE=node IP=155.98.36.10 MAC=0002b3877a4f
    NAME=plab2 ROLE=node IP=155.98.36.34 MAC=00d0b7141057

to just the PLC node (returns nothing to any other node).

The implications of this setup are:

* The PLC node must act as a TFTP server as we have discussed in the past. The TMCC info above is hopefully enough to configure pxelinux; if not, we can change it.

* The PLC node is responsible for loading the disks of inner plab nodes. This is implied by the setup, where we change the dhcpd.conf file before doing anything to the inner nodes. Thus, once the inner nodes are rebooted, they will be talking pxelinux with PLC, and not to boss. This step is dubious, as we could no doubt load the disks faster than whatever plab uses can. But it simplified the setup (and is more realistic!). The alternative, which is something that might be useful anyway, is to introduce a "state" after which nodes have been reloaded but before they are rebooted. With that, we can reload the plab nodes and then change the dhcpd.conf file so when they reboot they start talking to the PLC.
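A consumer on the PLC node could turn the eplabconfig reply into records along these lines. This is a hedged Python sketch: it assumes one node per line with space-separated KEY=VALUE fields, as in the sample above, and the function name is hypothetical. The real client-side code is not part of this log.

```python
def parse_eplabconfig(text):
    """Parse KEY=VALUE lines from an eplabconfig-style reply into dicts.

    Assumes one node per line and no spaces inside values.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        records.append(dict(field.split("=", 1) for field in line.split()))
    return records


if __name__ == "__main__":
    reply = (
        "NAME=plc ROLE=plc IP=155.98.36.3 MAC=00d0b713f57d\n"
        "NAME=plab1 ROLE=node IP=155.98.36.10 MAC=0002b3877a4f\n"
    )
    for record in parse_eplabconfig(reply):
        print(record["NAME"], record["ROLE"], record["IP"])
```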
-
- 16 Feb, 2006 1 commit
-
-
Leigh B. Stoller authored
-