- 12 Jan, 2005 3 commits
-
-
Robert Ricci authored
Each switch has a 'primary' stack that it belongs to if it's specified with the '-i' parameter. Otherwise, it can be considered to be a part of any of the stacks of which it's a member. The main point of this is so that we can have switches that are on both the control and experimental networks. Note: Having a VLAN with the same name on two overlapping stacks is like crossing the streams: that would be bad. Not "all life as you know it stopping instantaneously" bad, but snmpit might get confused.
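The warning about same-named VLANs on overlapping stacks can be illustrated with a small check. This is only a sketch, not how snmpit works internally: the function name, the dictionary shapes, and the switch and VLAN names are all made up for the example.

```python
from itertools import combinations

def overlapping_vlan_names(stacks, vlans):
    """Find VLAN names that appear on two stacks sharing at least one switch.

    stacks: stack name -> set of switch names   (hypothetical data shape)
    vlans:  stack name -> set of VLAN names     (hypothetical data shape)
    Returns a list of (stack_a, stack_b, vlan_name) tuples.
    """
    conflicts = []
    for a, b in combinations(sorted(stacks), 2):
        # Two stacks overlap when at least one switch is a member of both.
        if stacks[a] & stacks[b]:
            shared_names = vlans.get(a, set()) & vlans.get(b, set())
            for name in sorted(shared_names):
                conflicts.append((a, b, name))
    return conflicts

# Example: one switch sits on both the control and experimental stacks.
example_stacks = {
    "Control": {"switch1", "switch2"},
    "Experiment": {"switch2", "switch3"},
}
example_vlans = {
    "Control": {"ctrl", "shared"},
    "Experiment": {"exp1", "shared"},
}
example_conflicts = overlapping_vlan_names(example_stacks, example_vlans)
```

Here "shared" would be flagged, because it exists on both stacks and the stacks overlap at switch2.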
-
Leigh B. Stoller authored
table that will prevent an experiment from being swapped/modified. The toggle is on the showexp page, and the toggle is *not* admin over-ridable; you must turn the toggle off (and of course, you must be an admin to do that).
-
Leigh B. Stoller authored
out of the reserved table. Mostly this happens in nfree and nalloc, but there are a couple of other moves, in libdb and in the reload daemon. The uid and experiment are stored, along with a timestamp.
-
- 10 Jan, 2005 2 commits
-
-
Leigh B. Stoller authored
to override from the NS file. In your NS file:

    namespace eval TBCOMPAT {
        set elabinelab_fixnodes("boss") pc171
        set elabinelab_hardware("boss") pc2000
        set elabinelab_hardware("ops") pc2000
    }
-
Leigh B. Stoller authored
boss node, which were coming from my home dir. Take them from the source tree instead in /proj.
-
- 06 Jan, 2005 2 commits
-
-
Robert Ricci authored
all ports that have been specified.
-
Leigh B. Stoller authored
* Add boot_errno to the nodes table so that nodes can report in a subcode to indicate what went wrong. At present, we do not report any real error codes; that is going to take some time to work out since it will require a bunch of changes to the boot scripts.
* Add new table node_bootlogs to store logs provided by the nodes. Not a full console log, but a log of the tmcd client side part. We can make it a full log if we want though; just means mucking about with the boot phase a bit.
* Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED" is a new state that is sent (after TBSETUP) if a node fails somewhere in the tmcd client side.
* Change TBNodeStateWait() to take a list of states (instead of single state) and an optional pass by reference parameter to return the actual state that the node landed in. Change all calls to TBNodeStateWait() of course.
* Change os_setup (and libreboot in wait mode) to look for both TBFAILED and ISUP. If a TBFAILED event is seen, we can terminate the wait early and not retry os_setup on physical nodes (although still retry virtual nodes). The nice thing about this is that the wait should terminate much earlier (rather than waiting for timeout), especially for virtual nodes which can take a really long time when there are a couple of hundred.
* Add new routines dobooterrno() and dobootlog() to tmcd. Bump version number and increase the buffer size to allow for the larger packets that a console log will generate (added MAXTMCDPACKET variable, set to 0x4000).
* Add new -f option to tmcc to specify a datafile to send along as the last argument to tmcd. This is more pleasing than trying to send a console log in on the command line. For example: "tmcc -f /tmp/log BOOTLOG" will send a BOOTLOG command along with the contents of /tmp/log. Also close the write side of the pipe so that server sees EOF on read. See aside comment below.
* Changes to rc.bootsetup:
  1. Use perl tricks to capture all output, duping to the console and to a log file in /var/emulab/logs.
  2. On any error, send a status code (boot_errno) and the bootlog to tmcd.
  3. Generate a TBFAILED state transition.
* Changes to rc.injail:
  1. Same as rc.bootsetup, but do not send log files; that would pummel boss. Leave them on the physical node.
* Change vnodesetup (which calls mkjail) to watch for any error and send a TBFAILED state transition. This should catch almost all errors, and dramatically reduce waiting when something fails.
* Changes to rc.cdboot are essentially the same as rc.bootsetup, although a bootlog is sent all the time (success or failure), and I do not generate a boot_errno yet. Also, instead of TBFAILED, generate a PXEFAILED state since the CDROM is actually operating within the PXEFBSD opmode. I have yet to work this into the rest of the system though; waiting to get a new CD built and actually experiment with it.
* Add new menu option and web page to display the node bootlog. We store only the latest bootlog, but maybe someday store more than one. Display boot_errno on node page.

Aside: I made a big mistake in the tmcd protocol; I did not envision passing more than a small amount of data (one fragment) and so I do not include a record terminator (i.e., close of the write side on the client sends EOF) or a size field at the beginning. No big deal since small requests are sent in one fragment and the server sees the entire thing. Well, with a large console log, that will end up as multiple fragments, and the server will often not get the entire thing on the first read, and there are no subsequent reads (with no EOF or known size, it would block forever). Well, fixing this in a backwards compatible manner (for old images) was way too much pain. Instead, tmcc now closes the write side, and the server does subsequent reads *only* in the new dobootlog() routine. Note that it *is* possible to fix this in a backwards compatible manner, but I did not want to go down that path just yet.
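The "close the write side so the server sees EOF" idea from the aside can be sketched generically. This is a minimal illustration in Python, not tmcd or tmcc code (those are written in C); the function names and the newline-separated framing are assumptions made for the example.

```python
import socket
import threading

def send_with_half_close(sock, command, payload):
    """Send a command plus a large payload, then close only the write side.

    The newline between command and payload is an assumed framing for this
    sketch, not the real tmcd wire format.
    """
    sock.sendall(command.encode() + b"\n" + payload)
    # Half-close: we can still read a reply, but the peer now sees EOF.
    sock.shutdown(socket.SHUT_WR)

def read_until_eof(sock, bufsize=4096):
    """Keep reading fragments until recv() returns b"" (EOF from the peer)."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

def demo(payload_size=100_000):
    """Push a payload far larger than one read through a local socket pair."""
    client, server = socket.socketpair()
    received = {}

    def serve():
        received["data"] = read_until_eof(server)
        server.close()

    worker = threading.Thread(target=serve)
    worker.start()
    send_with_half_close(client, "BOOTLOG", b"x" * payload_size)
    worker.join()
    client.close()
    return received["data"]
```

Without the shutdown() call, the reader loop would have no way to know the sender was finished and would block on the next recv(), which is the situation the aside describes.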
-
- 22 Dec, 2004 2 commits
-
-
Leigh B. Stoller authored
-
Mike Hibler authored
-
- 21 Dec, 2004 3 commits
-
-
Robert Ricci authored
to assume that the leader of a stack is the switch after which it was named - we can now name stacks things like 'Control' or 'Experiment'.
-
Leigh B. Stoller authored
rules back on. The ipfw silently fails, but if I do it a second time, it works fine. This is bogus of course ...
-
Robert Ricci authored
-
- 16 Dec, 2004 6 commits
-
-
Leigh B. Stoller authored
XMLRPC, but for now avoid the warnings.
-
Leigh B. Stoller authored
Do not die when turning firewall rules back on fails. This is a transient error I do not understand yet. When firewalled and paniced, skip clean shutdown of inner nodes since they are going to be powered off anyway later, and besides, the control network is shut off, so no way to talk to inner boss anyway.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
Robert Ricci authored
the web interface.
-
Leigh B. Stoller authored
* tbsetup/panic.in: New backend script to implement the panic button feature. When used, it will sever the connection to the firewall node by using snmpit to disable the port. Sets the panic bit (and date) in the experiments table, and changes the state of the experiment from "active" to "paniced" to ensure that the experiment cannot be messed with (swapped out or modified). Sends email to tbops when the panic button is pressed. Used with -r option, reverses the above. State is set back to active, the panic bit is cleared, and the port is re-enabled with snmpit.
* tbsetup/tbswap.in: During swapout, a firewalled experiment that has been paniced will get a cleaning: the nodes are powered off, then the osids for all the nodes are reset (with os_select) so that they will boot the MFS, and then the nodes are powered on. Then the control network is turned back on, and then I wait for the nodes to reboot (this is simply cause we do not record in the DB that a node is turned off, and if I do not wait, the reload daemon will end up hitting the power button again if they do not reboot in time). We can fix this later. I am not planning to apply this to general firewalled experiments yet as the power cycling is going to be hard on the nodes, so would rather that we at least have a 1/2 baked plan before we do that.
* www/showexp.php3: If experiment is firewalled, show the Panic Button, linked to the panic button web script. If the experiment has already had the panic button pressed, show a big warning message and explain that user must talk to tbops to swap the experiment out. Also fiddle with menu options so that the terminate link is gone, and the swap link is visible only in admin mode. In other words, only an admin person can swap an experiment once it is paniced. And of course, an admin person can run the backend panic script above with the -r option, but that's not something to be done lightly.
* db/libdb.pm.in: Add "paniced" as an experiment state (EXPTSTATE_PANICED). Add utility functions: TBExptSetPanicBit(), TBExptGetPanicBit(), and TBExptClearPanicBit().
* tbsetup/swapexp.in: Minor state fiddling so that an experiment can be swapped while in paniced state, but only when in admin mode. Also clear the panic bit when experiment is swapped out.
* www/dbdefs.php3.in: Add "paniced" as an experiment state. Add a utility function TBExptFirewall() to see if experiment is firewalled.
* www/panicbutton.php3: New web script to invoke the backend panic script mentioned above, after the usual confirm song and dance.
* www/panicbutton.gif: New gif of a red panic button that I stole off the net. If anyone sees/has a better one, feel free to replace this one.
* utils/node_statewait.in: Add -s option so that I can pass in the state I want to wait for (used from tbswap above to wait for nodes to reach ISUP after power on).
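The state rules in this commit (active to paniced, -r to reverse, swap only in admin mode) can be summarized as a tiny model. This is a hedged sketch only: the real logic lives in the Perl and PHP scripts named above, and the class, function names, and the "swapped" state label here are assumptions for illustration. The snmpit port change and the email to tbops are left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Experiment:
    """Hypothetical stand-in for a row in the experiments table."""
    state: str = "active"
    panic_bit: bool = False
    panic_date: Optional[datetime] = None

def panic(expt, now=None):
    """Press the panic button: set the bit and date, move active -> paniced."""
    if expt.state != "active":
        raise ValueError("panic button only applies to an active experiment")
    expt.panic_bit = True
    expt.panic_date = now or datetime.now()
    expt.state = "paniced"

def panic_reverse(expt):
    """Model of the -r option: clear the bit and return to active.

    The log does not say whether the date is reset, so it is left untouched.
    """
    if expt.state != "paniced":
        raise ValueError("experiment is not paniced")
    expt.panic_bit = False
    expt.state = "active"

def can_swap(expt, is_admin):
    """A paniced experiment may be swapped only in admin mode."""
    if expt.state == "paniced":
        return is_admin
    return expt.state == "active"

def swap_out(expt, is_admin):
    """Swap out, clearing the panic bit as the swapexp change describes."""
    if not can_swap(expt, is_admin):
        raise PermissionError("swap not allowed in this state")
    expt.panic_bit = False
    expt.state = "swapped"  # assumed label for the swapped-out state
```

For example, after panic(expt) an ordinary user fails can_swap(expt, False), while an admin can call swap_out(expt, True), which also clears the panic bit.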
-
- 14 Dec, 2004 1 commit
-
-
Robert Ricci authored
'power' command.
-
- 13 Dec, 2004 1 commit
-
-
Mike Hibler authored
-
- 12 Dec, 2004 1 commit
-
-
Leigh B. Stoller authored
-
- 11 Dec, 2004 2 commits
-
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
assume it always has its default OSID (from node_types) loaded and ready to go (add this as the OS- feature). This is so that assign will agree to the allocation (assign_wrapper adds a desire that says it has to be running the OSID the user has selected, or the default OSID from the node_types table). Watch out for problems ...
-
- 10 Dec, 2004 4 commits
-
-
Robert Ricci authored
subnodes - this was only a good idea in the IXP case. Instead, skip reloading if they do not have their imageable bit set.
-
Robert Ricci authored
'rebootable' flag set to 0.
-
Leigh B. Stoller authored
-
Leigh B. Stoller authored
-
- 09 Dec, 2004 5 commits
-
-
Leigh B. Stoller authored
    source tb_compat.tcl
    set ns [new Simulator]
    tb-elab-in-elab 1
    tb-set-inner-elab-eid two-simple
    tb-set-security-level Red
    $ns run

tbsetup/ns2ir/elabinelab.ns has all the goo, which is sourced from the NS run subroutine, using "uplevel 1" so that the context is correct. You can of course include your own goo, in which case the default goo will be skipped.
-
Timothy Stack authored
Make the dots move on the robot map web page:
* configure, configure.in: Add robots/emc/loclistener.
* event/lib/event.h, event/lib/event.c: Add some helper functions for sending events and parsing args.
* event/lib/tbevent.py.tail, event/lib/tbevent.py: Add support for clients that register using keyfiles.
* robots/emc/GNUmakefile.in: Install loclistener on boss.
* robots/emc/emcd.h, robots/emc/emcd.c: Send update events every two seconds with the node's location. Fill out a little more of the event callback, not sure what to do with the requested destination though. Add some code to the vmc callback to store position updates. Changed the config file format to also include the vname of the robot.
* robots/emc/loclistener.in: Listen for NODE MODIFY events with coordinates and update the database accordingly. Kinda sucks, but it works.
* robots/emc/test_emcd.config: Add vnames to the robots to reflect change in the config file format.
* tbsetup/ns2ir/node.tcl: Add nodes to the virt_agents table.
-
Leigh B. Stoller authored
ElabinElab experiments that wrap another experiment, either firewalled or not. This instead of my security level stuff, that I decided was too much of a pain for the user, at least for now.

New NS syntax:

    tb-set-inner-elab-eid two-simple

In the ElabinElab file, sets the name of an existing experiment in the same project. Experiment is parsed, and after the parse we notice in tbprerun that we have an inner eid, so we reparse the NS file, only this time we pass in the maximum number of nodes needed by the inner eid (tbprerun now computes min/max nodes at prerun time, instead of later as first part of swapin). This number is used to allocate the appropriate number of inner experimental nodes. Why do it this way? Cause the NS parser is the only tool we have for generating the virt topology, and I do not want to go down the path of inventing a new frontend.

Anyway, after the reparse, we now have the proper number of nodes in the wrapper experiment. Now it's simply a matter of copying over the type and fixnode info from the inner experiment to the outer experiment. Why? So that when the outer experiment is swapped in, it gets the nodes (of the right type/fixnode) that the inner experiment is going to want later, when it is swapped in by the inner emulab!

Another approach would be to make elabinelab and elabinelab_eid options to batchexp (and thus the web form and XMLRPC interface) so that we can avoid the double parse. I suspect people do not want more crap on the web form, so I did not do it this way.
-
Leigh B. Stoller authored
firewall node and disable the rules during the inner elab setup, and then turn them back on after the inner boss has rebooted. In the case that an experiment is to be launched inside, launch the experiment async and then turn rules back on. Technically, this should be proxied through the firewall instead of directly, but this is okay for now. As for experiment teardown, I am not doing anything yet since the closed firewall lets ssh through, and that's all I need to tear down the inner elab. Also during teardown, if DHCPD cannot be killed on inner boss, then skip rest of the steps and return okay so that the rest of experiment teardown proceeds (if need be, inner nodes will be power cycled). Not being able to kill DHCPD can happen for lots of reasons (like, experiment never set up in the first place).
-
Leigh B. Stoller authored
-
- 08 Dec, 2004 1 commit
-
-
Leigh B. Stoller authored
so unhappy with my current approach that I decided to drop that idea for now and just specify the eid of the experiment to run. Obviously, it has to be an existing experiment in the same project, whose nsfile is grabbed from the DB and shipped over to the inner boss.
-
- 07 Dec, 2004 3 commits
-
-
Leigh B. Stoller authored
* Always run assign_wrapper using -t mode. This just runs the top file stuff, and writes the min/max nodes into the DB.
* Then look at the security level for the experiment, and if orange or red, create a parallel elabinelab experiment to run it in. This is a completely new experiment in addition to the original. The two experiments are linked with some DB state so we know what experiment to fire off inside the inner elab. I am using a template NS file and passing in the number of nodes computed in the previous step above. The template includes the firewall rules. This is quite hokey. It should be more invisible to the user. I have not dealt with yellow (just a firewall).
* I added some stats code so that we update the experiment_stats record with the elabinelab status and security level.
* Cleanup how errors were handled and get rid of silly duplicated code.
-
Leigh B. Stoller authored
utility script to wait for them to reboot and reach PXEWAIT. This indicates inner emulab is really ready.
* When an inner experiment is defined (elabinelab_eid in experiments table) fire that experiment off by doing an ssh into inner boss. I am currently doing this with -w (wait mode) but eventually will need to do it async for experiments in which the control net is turned off. Also, not actually swapping experiment in yet since multicast and frisbee are still broken inside.
* Add -k mode for cleaning up. The intent of this is to avoid power cycling all the nodes cause outer elab cannot reboot or ipod them. Goes like this:
  * Clear the inner_elab_role for experiment's nodes from the reserved table.
  * Clear def_boot_osid,next_boot_osid,temp_boot_osid for nodes. This is bogus cause os_select whines about doing this, but the point is to make sure that all nodes will go into PXEWAIT when they reboot. We could have them go into MFS, but that's bound to cause problems if inner elab has a lot of nodes (remember, cannot trust what is on disk). This needs more thought.
  * Regen and restart outer dhcpd. Nodes will become part of outer emulab on next boot cycle.
  * SSH into inner boss and kill inner DHCPD so that there will not be any DHCPD responses on inner control network.
  * SSH into inner boss and have it reboot all inner nodes.
  * Wait for node to reach PXEWAIT.
  The above needs more thought wrt firewalled experiments and isolated control network.
* Kill off some old MFS copy code since we now get those direct from website.
-
Mike Hibler authored
that the firewall rules are preventing essential communication and causing the failure, so don't retry. We should probably only do this if the user has specified additional firewall rules. But right now, I may screw up the default rules too!
-
- 06 Dec, 2004 4 commits
-
-
Leigh B. Stoller authored
    tb-set-security-level Green|Yellow|Orange|Red

Also add a template elabinelab.ns file.
-
Leigh B. Stoller authored
option. New code there coordinates (I hope) the reboot, dhcpd config regen, and experiment teardown, in the hopes of avoiding numerous power cycles.
-
Leigh B. Stoller authored
cleanup that the code copies various bits to the reserved table.
-
Leigh B. Stoller authored
(real reboot, not bootinfo requery) nodes. I'm using this to move nodes in PXEWAIT back to outer emulab, thereby avoiding power cycle of all inner testnodes.
-