1. 04 Apr, 2005 1 commit
    • Timothy Stack's avatar
      · 463ee6b1
      Timothy Stack authored
      Mote and robot related stuff.  The main thing is the addition of relay
      capabilities to capture and related things.
      
      	* GNUmakefile.in: Add the capture and tip subdirectories to the
      	client and client-install targets.
      
      	* configure, configure.in, config.h.in: Detect srandomdev() for
      	capture and add "mote/newmote" script.
      
      	* capture/GNUmakefile.in, capture/capture.c: Add "relay"
      	capabilities to capture.
      
      	* capture/capquery.c: Query the capserver for the relay receiver's
      	port number.
      
      	* capture/capserver.c: Small hack to return the port number
              for a node.
      
      	* db/libdb.pm.in, db/xmlconvert.in: Add virt_tiptunnels table.
      
      	* event/program-agent/program-agent.c: Change log file names to
      	something a little more user-friendly.  Add a "MODIFY" event
      	handler that lets the user set agent attributes (command, tag,
      	timeout) without having to run a program.
      
      	* event/sched/GNUmakefile.in, event/sched/console-agent.cc,
      	event/sched/console-agent.h, event/sched/event-sched.c: Add
      	console agents that can be used to snapshot a section of the
      	capture log file.
      
      	* event/sched/node-agent.cc: Some minor cleanup.
      
      	* event/sched/simulator-agent.cc, event/sched/simulator-agent.h:
      	Add the config data to the report mail.  Add a "RESET" event
      	handler that runs "loghole clean".  Save the report mail in a file
      	so it gets archived with the rest of the logs.
      
      	* lib/libtb/tbdefs.h: Add CONSOLE object type.
      
      	* mote/GNUmakefile.in, mote/newmote: Add newmote script, just a
      	quick hack to add motes to the DB.
      
      	* mote/tbuisp.in: Add another backend for loading motes through
      	their relay capture server.
      
      	* robots/mtp/mtp_dump.c: Dump the min/max values for x and y,
      	handy for figuring out the bounds of the camera.
      
      	* sql/database-fill.sql: Change the RELOAD-MOTE/SHUTDOWN ->
      	ALWAYSUP/SHUTDOWN mode transition to ALWAYSUP/ISUP since stated
      	doesn't seem to run triggers after a state change by a mode
      	transition.
      
      	* tbsetup/tbreport.in: Change the ordering of the eventlist so it
      	displays event-sequences appropriately.
      
      	* tbsetup/ns2ir/GNUmakefile.in, tbsetup/ns2ir/console.tcl,
      	tbsetup/ns2ir/node.tcl, tbsetup/ns2ir/parse.tcl.in,
      	tbsetup/ns2ir/sim.tcl.in: Add a "console" agent that represents
      	the serial console for a node.
      
      	* tbsetup/ns2ir/sequence.tcl: Add an "append" method so it is
      	easier to build sequences dynamically.
      
      	* tbsetup/ns2ir/topography.tcl: Make checkdest available to
      	regular users.
      
      	* tip/GNUmakefile.in, tip/tiptunnel.c: Add support for uploading a
      	file to a relay version of capture and exporting the end
      	connection as a pty.
      
      	* tmcd/decls.h, tmcd/common/libsetup.pm: Bump version number since
      	the dosubnodelist change is not backwards compatible.
      
      	* tmcd/tmcd.c: Make dosubnodelist and dosubconfig callable even
      	when a node isn't allocated.  Add dotiptunnels command that
      	returns which serial consoles are to be mounted on a node.  Add
      	mote version of subconfig that returns information needed to
      	startup the relay version of capture.
      
      	* tmcd/common/bootsubnodes: For motes, startup the relay version
      	of capture (XXX stargate specific).
      
      	* tmcd/common/libsetup.pm, tmcd/common/libtmcc.pm,
      	tmcd/common/config/rc.config, tmcd/common/config/rc.tiptunnels:
      	Client side changes for mounting another nodes serial line.
      
      	* tmcd/common/rc.bootsetup: Always boot the subnodes, even when
      	free.  This is used for motes since their capture needs to be up
      	for reloading at the time.
      
      	* tmcd/linux/ixpboot: Shuffle some code around so the script
      	doesn't fail if the ixp isn't allocated.
      
      	* utils/loghole.in: Add "digest.out" and "report.mail" as global
      	logs to be saved in archives and display the "report.mail" file
      	when showing a loghole archive.
      
      	* xmlrpc/emulabserver.py.in: Scrub more of the return values to
      	get rid of "None"s.
      463ee6b1
  2. 22 Mar, 2005 1 commit
  3. 07 Mar, 2005 1 commit
    • Timothy Stack's avatar
      · 898cf9a2
      Timothy Stack authored
      Checkin some changes related to experiment automation and vnode feedback:
      
      	* configure, configure.in: Add sensors/canaryd/feedbacklogs
      	template.
      
      	* db/libdb.pm.in, db/xmlconvert.in: Add "virt_user_environment"
      	table that holds environment variable names and values.
      
      	* event/lib/event.c: Allocate memory of the right size for
      	event_notifications.
      
      	* event/program-agent/GNUmakefile.in: Add version.c file and
      	add install targets for the man page.
      
      	* event/program-agent/program-agent.8: Man page describing the
      	program-agent daemon.
      
      	* event/program-agent/program-agent.c: Add a bunch of convenience
      	features: let the user specify the working directory for commands;
      	save output to separate files on every invocation of an agent; let
      	the user specify a timeout for a command; make the set of
      	environment variables sane and add vars given in the NS file in
      	the opt array; a "status" file containing process information is
      	written out when children are collected.  Internal changes: child
      	processes are collected immediately, instead of waiting for the
      	next START event, so we can send back COMPLETE events; the daemon
      	now runs with a real-time priority, to increase the chances of
      	receiving events.
      
      	* event/proxy/evproxy.c: Made it bidirectional so the
      	program-agent's COMPLETE events make it back to the scheduler.
      
      	* event/sched/error-record.c: Change the default log directory.
      
      	* event/sched/event-sched.h, event/sched/event-sched.c: Setup an
      	environment similar to a program-agent to run the user's log
      	digester.
      
      	* event/sched/node-agent.cc: Add a handler for the SNAPSHOT event
      	that runs create_image for the node.
      
      	* event/sched/simulator-agent.h, event/sched/simulator-agent.cc:
      	Let the user specify a "DIGESTER" script that digests the log
      	files into a summary of the results.  Add event handler for
      	remapping a vnode experiment.
      
      	* event/sched/timeline-agent.c: Accept the RUN event as well as
      	the START event.
      
      	* os/GNUmakefile.in: Install the install-tarfile.1 man page.
      
      	* os/install-tarfile: Automatically chown/chgrp any files that do
      	not have valid user or group IDs, the new owner will be the user
      	that swapped in the experiment.  Include the install directory in
      	the DB file.  Add a "list" mode that just dumps what files have
      	been installed and where.  Add a "force" option so the user can
      	forcefully install the file, even though the DB says its already
      	there.
      
      	* os/install-tarfile.1: Man page describing the install-tarfile
      	tool.
      
      	* os/syncd/GNUmakefile.in: Install man pages on ops.
      
      	* sensors/canaryd/GNUmakefile.in: Link canaryd statically and
      	install "feedbacklogs" tool.
      
      	* sensors/canaryd/canaryd.c: Dump dummynet pipe data.
      
      	* sensors/canaryd/canarydEvents.c: Log errors.
      
      	* sensors/canaryd/feedbacklogs.in: Tool used to generate feedback
      	data from canaryd log files.
      
      	* sensors/slothd/GNUmakefile.in: Install digest-slothd on ops.
      
      	* sensors/slothd/digest-slothd: Fix some bugs and write out an
      	"alert" file with all the nodes/links that were overloaded.
      
      	* tbsetup/os_load.in, tbsetup/libosload.pm.in: Add "waitmode"
      	argument that lets you specify that you want to wait for the disk
      	to finish loading and/or wait for the node to come back up in the
      	new OS.
      
      	* tbsetup/power.in: Remove debugging printf.
      
      	* tbsetup/ns2ir/node.tcl, tbsetup/ns2ir/program.tcl,
      	tbsetup/ns2ir/sequence.tcl, tbsetup/ns2ir/sim.tcl.in: Fix some
      	quoting problems with event-sequences.  Add -expected-exit-code
      	and -tag options to the "$program run" event.  Add -digester to
      	the "$ns report" event that lets the user specify a program to run
      	to digest the log files.
      
      	* tbsetup/ns2ir/tb_compat.tcl.in: Change the initial scaling
      	factor for feedback nodes to 1%, instead of 100%.
      
      	* tmcd/tmcd.c, tmcd/common/libtmcc.pm: Add "userenv" command that
      	returns the values in "virt_user_environment".  Return new program
      	agent fields: dir, timeout, and expected_exit_code.
      
      	* tmcd/common/GNUmakefile.in: Install rc.canaryd.
      
      	* tmcd/common/bootvnodes: Add hack to boost the program-agents to
      	a real-time priority, they can't do it from inside the jail.
      
      	* tmcd/common/rc.canaryd: Rc script for canaryd.
      
      	* tmcd/common/watchdog: Don't fail outright if there is a bad line
      	in the battery.log
      
      	* tmcd/common/rc.progagent: Append "userenv" data to the
      	program-agent config file.
      
      	* utils/GNUmakefile.in: Install loghole and its man page on ops.
      
      	* utils/loghole.1: Document "clean" command and the change in
      	loghole directories.
      
      	* utils/loghole.in: Add "clean" command and parallelization.
      
      	* xmlrpc/emulabserver.py.in: Add "virt_user_environment" table.
      	Order the eventlist by "idx" and time, needed for sequences.  And
      	removed unnecessary nologin checks.
      898cf9a2
  4. 22 Feb, 2005 1 commit
    • Leigh B. Stoller's avatar
      Okay, first attempt to deal with os_setup waittimes on a per node_type · facc7acd
      Leigh B. Stoller authored
      and per OSID basis.
      
      * Added bios_waittime to node_types table and reboot_waittime to
        os_info table. Initialized them as follows:
      
              update node_types set bios_waittime=60 where class='pc';
              update os_info set reboot_waittime=150 where OS='Linux' or
      	  OS='FreeBSD' or OS='NetBSD';
              update os_info set reboot_waittime=180 where OS=Windows';
      
      * The bios waittime can be edited via the web interface.
      
      * The reboot waittime can be set only by admin people right now; this
        is another case of something that maybe the user should not see
        cause its too much stuff? Instead, default values are established in
        www/osiddefs.php3.
      
      * os_setup computes its per-node waitime as:
      
      	(bios_waittime + reboot_waittime) * 2
      
        as per Mike's suggestion. If either value is not defined in the DB,
        it defaults the original 7 minute value.
      facc7acd
  5. 08 Feb, 2005 1 commit
  6. 27 Jan, 2005 1 commit
  7. 26 Jan, 2005 1 commit
    • Leigh B. Stoller's avatar
      The Robot Lab Monitor Daemon. A very silly script that looks at some · 4963660a
      Leigh B. Stoller authored
      sitevars to determine if the Robot Lab is open or closed. The sitevars:
      
      * 'robotlab/override' - Override other settings and forcibly turn the lab
        "on" or "off" (open or close). When the lab is turned off, new
        experiments cannot swap in and the current experiment is immediately
        swapped out.
      
      * 'robotlab/exclusive' - The robot lab is exclusive use. Best to not mess
        with this sitevar :-)
      
      * 'robotlab/opentime' - The time that the robot lab opens in the
        morning. The default is 07:00, but feel free to change this as you like.
      
      * 'robotlab/closetime' - The time that the robot lab closes in the
        evening. The default is 18:00, but feel free to change this as you like.
      
      * 'robotlab/open' - The robot lab is open or closed. DO NOT MESS WITH THIS!
        It is updated by the robomonitord script and intended to be used by
        admission control (not done yet).
      
      The robomonitord script runs and periodically (every 2 minutes) wakes up
      and looks at the various sitevars above. The lab is open during the day,
      Monday through Friday, and closed on weekends. It is also supposed to be
      closed on holidays, but I have not added that yet.
      
      15 minutes before the lab is to be closed, a warning message is sent to the
      swapper of the experiment running on the robot testbed, that their
      experiment is going to be swapped soon. When the Robot lab is closed
      (either cause the close time was reached, or because the lab was forcibly
      closed with the override), the current experiment is immediately swapped
      out.
      
      I know, this is hopelessly bogus, but it will do until we feel like adding
      a "Lab" datatype to the system.
      4963660a
  8. 18 Jan, 2005 1 commit
    • Leigh B. Stoller's avatar
      Here is a checkpoint of the admission control stuff I have been working on. · 54f55585
      Leigh B. Stoller authored
      The last part is the stuff to hook it in from assign_wrapper, and some
      additional support in assign that Rob is adding for me. This comment is
      from the top of new file db/libadminctrl.pm.in and describes everything in
      detail.
      
      # Admission control policies. These are the ones I could think of, although
      # not all of these are implemented.
      #
      #  * Number of experiments per type/class (only one expt using robots).
      #
      #  * Number of experiments per project
      #  * Number of experiments per subgroup
      #  * Number of experiments per user
      #
      #  * Number of nodes per project      (nodes really means pc testnodes)
      #  * Number of nodes per subgroup
      #  * Number of nodes per user
      #
      #  * Number of nodes of a class per project
      #  * Number of nodes of a class per group
      #  * Number of nodes of a class per user
      #
      #  * Number of nodes of a type per project
      #  * Number of nodes of a type per group
      #  * Number of nodes of a type per user
      #
      #  * Number of nodes with attribute(s) per project
      #  * Number of nodes with attribute(s) per group
      #  * Number of nodes with attribute(s) per user
      #
      # So we have group (pid/gid) policies and user policies. These are stored
      # into two different tables, group_policies and user_policies, indexed in
      # the obvious manner. Each row of the table defines a count (experiments,
      # nodes, etc) and a type of thing being counted (experiments, nodes, types,
      # classes, etc). When we test for admission, we look for each matching row
      # and test each condition. All conditions must pass. No conditions means a
      # pass. There is also some "auxdata" which holds extra information needed
      # for the policy (say, the type of node being restricted).
      #
      #      uid:     a uid
      #   policy:     'experiments', 'nodes', 'type', 'class', 'attribute'
      #    count:     a number
      #  auxdata:     a string (optional)
      #
      # Example: A user policy of ('mike', 'nodes', 10) says that poor mike is
      # not allowed to have more 10 nodes at a time, while ('mike', 'type',
      # '10', 'pc850') says that mike cannot allocate more than 10 pc850s.
      #
      # The group_policies table:
      #
      #      pid:     a pid
      #      gid:     a gid
      #   policy:     'experiments', 'nodes', 'type', 'class', 'attribute'
      #    count:     a number
      #  auxdata:     a string (optional)
      #
      # Example: A project policy of ('testbed', 'testbed', 'experiments', 10)
      # says that the testbed project may not have more then 10 experiments
      # swapped in at a time, while ('testbed', 'TG1', 'nodes', 10) says that the
      # TG1 subgroup of the testbed project may not use more than 10 nodes at
      # time.
      #
      # In addition to group and user policies (which are policies that apply to
      # specific users/projects/subgroups), we also need policies that apply to
      # all users/projects/subgroups (ie: do not want to specify a particular
      # restriction for every user!). To indicate such a policy, we use a special
      # tag in the tables (for the user or pid/gid):
      #
      #      '+'  -  The policy applies to all users (or project/groups).
      #
      # Example: ('+','experiments',10) says that no user may have more then 10
      # experiments swapped in at a time. The rule overrides anything more
      # specific (say a particular user is restricted to 20 experiments; the above
      # rule overrides that and the user (all users) is restricted to 10.
      #
      # Sometimes, you want one of these special rules to apply to everyone, but
      # *allow* it to be overridden by a more specific rule. For that we use:
      #
      #      '-'  -  The policy applies to all users (or project/groups),
      #              but can be overridden by a more specific rule.
      #
      # Example: The rules:
      #
      #	('-','type',0, 'garcia')
      #       ('testbed', 'testbed', 'type', 10, 'garcia')
      #
      # says that no one is allowed to allocate garcias, unless there is specific
      # rule that allows it; in this case the testbed project can allocate them.
      #
      # There are other global policies we would like to enforce. For example,
      # "only one experiment can be using the robot testbed." Encoding this kind
      # of policy is harder, and leads down a path that can get arbitrarily
      # complex. Tha path leads to ruination, and so we want to avoid it at
      # all costs.
      #
      # Instead we define a simple global policies table that applies to all
      # experiments currently active on the testbed:
      #
      #   policy:     'nodes', 'type', 'class', 'attribute'
      #     test:     'max', others I cannot think of right now ...
      #    count:     a number
      #  auxdata:     a string
      #
      # Example: A global policy of ('nodes', 'max', 10, '') say that the maximum
      # number of nodes that may be allocated across the testbed is 10. Thats not
      # a very realistic policy of course, but ('type', 'max', 1, 'garcia') says
      # that a max of one garcia can be allocated across the testbed, which
      # effectively means only one experiment will be able to use them at once.
      # This is of course very weak, but I want to step back and give it some
      # more thought before I redo this part.
      #
      # Is that clear? Hope so, cause it gets more complicated. Some admission
      # control tests can be done early in the swap phase, before we really do
      # anything (before assign_wrapper). Others (type and class) tests cannot
      # be done here; only assign can figure out how an experiment is going to map
      # to physical nodes (remember virtual types too), and in that case we need
      # to tell assign what the "constraints" are and let it figure out what is
      # possible.
      #
      # So, in addition to the simple checks we can do, we also generate an array
      # to return to assign_wrapper with the maximum counts of each node type and
      # class that is limited by the policies. assign_wrapper will dump those
      # values into the ptop file so that assign can enforce those maximum values
      # regardless of what hardware is actually available to use. As per discussion
      # with Rob, that will look like:
      #
      #	set-type-limit <type> <limit>
      #
      # and assign will spit out a new type of violation that assign_wrapper will
      # parse.
      #
      # NOTES:
      #
      #  1) Admission control is skipped in admin mode; returns okay.
      #  2) Admission control is skipped when the pid is emulab-ops; returns okay.
      #  3) When calculating current usage, nodes reserved to emulab-ops are
      #     ignored.
      #  4) The sitevar "swap/use_admission_control" controls the use of admission
      #     control; defaults to 1 (on).
      #  5) The current policies can be viewed in the web interface. See
      #     https://www.emulab.net/showpolicies.php3
      #  6) The global policy stuff is weak. I plan to step back and think about it
      #     some more before redoing it, but it will tide us over for now.
      #
      54f55585
  9. 17 Jan, 2005 1 commit
    • Timothy Stack's avatar
      · bf489797
      Timothy Stack authored
      More robot integration and some event system updates.
      
      	* configure, configure.in: Detect rsync for loghole and add
      	utils/loghole to the list of template files.
      
      	* db/libdb.pm.in, db/xmlconvert.in: Add virt_node_startloc to the
      	list of virtual tables.
      
      	* event/lib/event.h, event/lib/event.c, event/lib/tbevent.py.tail:
      	Add event_stop_main function to break out of the event_main()
      	loop.  Add timeline to the address tuple.
      
      	* event/sched/GNUmakefile.in, event/sched/error-record.h,
      	event/sched/error-record.c, event/sched/event-sched.8,
      	event/sched/event-sched.h, event/sched/event-sched.c,
      	event/sched/group-agent.h, event/sched/group-agent.c,
      	event/sched/listNode.h, event/sched/listNode.c,
      	event/sched/local-agent.h, event/sched/local-agent.c,
      	event/sched/node-agent.h, event/sched/node-agent.cc,
      	event/sched/queue.c, event/sched/rpc.h, event/sched/rpc.cc,
      	event/sched/simulator-agent.h, event/sched/simulator-agent.c,
      	event/sched/timeline-agent.h, event/sched/timeline-agent.c:
      	Updated event scheduler, not completely finished, but well enough
      	along for the robots.
      
      	* lib/libtb/GNUmakefile.in, lib/libtb/popenf.h,
      	lib/libtb/popenf.c, lib/libtb/systemf.h, lib/libtb/systemf.c: Add
      	some handy versions of system/popen that take format arguments.
      
      	* lib/libtb/tbdefs.h, lib/libtb/tbdefs.c: Add some more event and
      	object types.
      
      	* tbsetup/assign_wrapper.in: Add the virt_node_startloc building
      	to desires string for a node.
      
      	* tbsetup/ptopgen.in: Add a node's location to the feature list.
      
      	* tbsetup/tbreport.in: Display the timeline/sequence an event is a
      	part of.
      
      	* tbsetup/ns2ir/GNUmakefile.in: Add timeline, sequence, and
      	topography files.
      
      	* tbsetup/ns2ir/node.tcl: Add initial position for nodes and allow
      	them to be attached to "topographys".
      
      	* tbsetup/ns2ir/parse-ns.in: Make a hwtype_class array with a
      	node_type's class.  Make an 'areas' array that holds the
      	buildings where nodes are located.  Make an 'obstacles' table
      	with any obstacles in the building.
      
      	* tbsetup/ns2ir/parse.tcl.in: Move named-args function from
      	tb_compat.tcl to here.  Add reltime-to-secs function that converts
      	time given in a format like "10h2m1s" to a seconds value, used in
      	"$ns at" so its easier to write time values.  Add "K", "Kb", and
      	"Kbps" as possible units for bandwidth (only the lowercase
      	versions were available before).
      
      	* tbsetup/ns2ir/program.tcl: Add "dir" and "timeout" attributes,
      	although they don't go anywhere at the moment.
      
      	* tbsetup/ns2ir/sequence.tcl, tbsetup/ns2ir/timeline.tcl,
      	tbsetup/ns2ir/topography.tcl: Initial versions.
      
      	* tbsetup/ns2ir/sim.tcl.in: Add support for timelines and
      	sequences.  Add 'node-config' method to change the default
      	configuration for nodes produced by the Simulator object.  Send an
      	initial MODIFY event to any trafgen objects so their configuration
      	gets through, even when there are no start/stop events.  Move
      	event parsing to the 'make_event' method.
      
      	* utils/loghole.1, utils/loghole.in: Loghole utility, used for
      	retrieving logs from experimental nodes and creating archives of
      	the logs.
      
      	* xmlrpc/emulabclient.py.in: Escape any strange characters in the
      	output field.
      
      	* xmlrpc/emulabserver.py.in: Add virt_node_startloc to the list of
      	virtual_tables.  Add emulab.vision_config and
      	emulab.obstacle_config methods for getting information pertaining
      	to the robots.  Change the OSID listing to include more fields.
      	Add a "physical" aspect to experiment.info to get information
      	about the physical nodes.  Add parent field to the events in the
      	array returned by eventlist.  Add sshdescription to get extra
      	information needed to log into a vnode.  Add node.statewait so you
      	can wait for nodes to come up.
      bf489797
  10. 12 Jan, 2005 1 commit
  11. 06 Jan, 2005 1 commit
    • Leigh B. Stoller's avatar
      A bunch of boot changes. Read carefully. · 94ccc3f4
      Leigh B. Stoller authored
      * Add boot_errno to the nodes table so that nodes can report in a
        subcode to indicate what went wrong. At present, we do not report any
        real error codes; that is going to take some time to work out since it
        will reqiure a bunch of changes to the boot scripts.
      
      * Add new table node_bootlogs to store logs provided by the nodes. Not
        a full console log, but a log of the tmcd client side part. We can
        make it a full log if we want though; just means mucking about with
        the boot phase a bit.
      
      * Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED"
        is a new state that is sent (after TBSETUP) if a node fails somewhere in
        the tmcd client side.
      
      * Change TBNodeStateWait() to take a list of states (instead of single
        state) and an optional pass by reference parameter to return the actual
        state that the node landed in. Change all calls to TBNodeStateWait() of
        course.
      
      * Change os_setup (and libreboot in wait mode) to look for both TBFAILED
        and ISUP. If a TBFAILED event is seen, we can terminate the wait early
        and not retry os_setup on physical nodes (although still retry virtual
        nodes). The nice thing about this is that the wait should terminate much
        earlier (rather then waiting for timeout), especially for virtual nodes
        which can take a really long time when there are a couple of hundred.
      
      * Add new routines dobooterrno() and dobootlog() to tmcd. Bump version
        number and increase the buffer size to allow for the larger packets that
        a console log wikk generate (added MAXTMCDPACKET variable, set to 0x4000).
      
      * Add new -f option to tmcc to specify a datafile to send along as the last
        argument to tmcd. This is more pleasing then trying to send a console log
        in on the command line. For example: "tmcc -f /tmp/log BOOTLOG" will send
        a BOOTLOG command along with the contents of /tmp/log.
      
        Also close the write side of the pipe so that server sees EOF on
        read. See aside comment below.
      
      * Changes to rc.bootsetup:
           1. Use perl tricks to capture all output, duping to the console and to
              a log file in /var/emulab/logs.
           2. On any error, send a status code (boot_errno) and the bootlog to
              tmcd.
           3. Generate a TBFAILED state transition.
      
      * Changes to rc.injail:
           1. Same as rc.bootsetup, but do not send log files; that would pummel
              boss. Leave them on the physical node.
      
      * Change vnodesetup (which calls mkjail) to watch for any error and send a
        TBFAILED state transition. This should catch almost all errors, and
        dramatically reduce waiting when something fails.
      
      * Changes to rc.cdboot are essentially the same as rc.bootsetup, although a
        bootlog is sent all the time (success or failure), and I do not generate
        a boot_errno yet. Also, instead of TBFAILED, generate a PXEFAILED state
        since the CDROM is actually operating within the PXEFBSD opmode. I have
        yet to work this into the rest of the system though; waiting to get a new
        CD built and actually experiment with it.
      
      * Add new menu option and web page to display the node bootlog. We store
        only the lastest bootlog, but maybe someday store more then one. Display
        boot_errno on node page.
      
      Aside: I made a big mistake in the tmcd protocol; I did not envision
      passing more then a small amount of data (one fragment) and so I do not
      include a record terminator (ie: close of the write side on the client
      sends EOF) or a size field at the beginning. No big deal since small
      requests are sent in one fragment and the server sees the entire
      thing. Well, with a large console log, that will end up as multiple
      fragments, and the server will often not get the entire thing on the first
      read, and there are no subsequent reads (with no EOF or known size, it
      would block forever). Well, fixing this in a backwards compatable manner
      (for old images) was way too much pain. Instead, tmcc now closes the write
      side, and the server does subsequent reads *only* in the new dobbootlog()
      routine. Note that it *is* possible to fix this in a backwards compatable
      manner, but I did not want to go down that path just yet.
      94ccc3f4
  12. 03 Jan, 2005 1 commit
  13. 16 Dec, 2004 1 commit
    • Leigh B. Stoller's avatar
      The panic button ... · 87dd2e60
      Leigh B. Stoller authored
      * tbsetup/panic.in: New backend script to implement the panic button
        feature. When used, it will cut the severe the connection to the
        firewall node by using snmpit to disable the port. Sets the panic
        bit (and date) in the experiments table, and changes the state of
        the experiment from "active" to "paniced" to ensure that the
        experiment cannot be messed with (swapped out or modified). Sends
        email to tbops when the panic button is pressed.
      
        Used with -r option, reverses the above. State is set back to
        active, the panic bit is cleared, and the port is renabled with
        snmpit.
      
      * tbsetup/tbswap.in: During swapout, a firewalled experiment that has
        been paniced will get a cleaning; The nodes are powered off, then
        the osids for all the nodes are reset (with os_select) so that they
        will boot the MFS, and then the nodes are powered on. Then the
        control network is turned back on, and then I wait for the nodes to
        reboot (this is simply cause we do not record in the DB that a node
        is turned off, and if I do not wait, the reload daemon will end
        hitting the power button again if they do not reboot in time. We can
        fix this later.
      
        I am not planning to apply this to general firewalled experiments
        yet as the power cycling is going to be hard on the nodes, so would
        rather that we at least have a 1/2 baked plan before we do that.
      
      * www/showexp.php3: If experiment is firewalled, show the Panic
        Button, linked to the panic button web script. If the experiment has
        already had the panic button pressed, show a big warning message and
        explain that user must talk to tbops to swap the experiment out.
        Also fiddle with menu options so that the terminate link is gone,
        and the swap link is visible only in admin mode. In other words, only
        an admin person can swap an experiment once it is paniced. And of
        course, an admin person can the backend panic script above with the
        -r option, but thats not something to be done lightly.
      
      * db/libdb.pm.in: Add "paniced" as an experiment state (EXPTSTATE_PANICED).
        Add utility functions: TBExptSetPanicBit(), TBExptGetPanicBit(), and
        TBExptClearPanicBit().
      
      * tbsetup/swapexp.in: Minor state fiddling so that an experiment can
        be swapped while in paniced state, but only when in admin mode. Also
        clear the panic bit when experiment is swapped out.
      
      * www/dbdefs.php3.in: Add "paniced" as an experiment state. Add a
        utility function TBExptFirewall() to see if experiment is firewalled.
      
      * www/panicbutton.php3: New web script to invoke the backend panic
        script mentioned above, after the usual confirm song and dance.
      
      * www/panicbutton.gif: New gif of a red panic button that I stole off
        the net. If anyone has sees/has a better one, feel free to replace
        this one.
      
      * utils/node_statewait.in: Add -s option so that I can pass in the
        state I want to wait for (used from tbswap above to wait for nodes
        to reach ISUP after power on).
      87dd2e60
  14. 09 Dec, 2004 1 commit
  15. 07 Dec, 2004 1 commit
  16. 03 Dec, 2004 1 commit
  17. 01 Dec, 2004 2 commits
  18. 11 Nov, 2004 1 commit
  19. 01 Nov, 2004 1 commit
  20. 29 Oct, 2004 1 commit
  21. 25 Oct, 2004 2 commits
  22. 11 Oct, 2004 3 commits
  23. 08 Sep, 2004 1 commit
    • Mike Hibler's avatar
      1.275: Add timed-based mapping table for generic OSIDs. This augments the · bb56a192
      Mike Hibler authored
             nextosid mechinism of 1.114 making it possible to map a generic *-STD
             OSID based on the time in which an experiment is created.  This
             provides backward compatibility for old experiments when the standard
             images are changed.
      
             The osid_map table lookup is triggered when the value of the nextosid
             field is set to 'MAP:osid_map'.  The nextosid also continues to behave
             as before: if it contains a valid osid, that OSID value is used to map
             independent of the experiment creation time.  The two styles can also
             be mixed, for example FBSD-JAIL has a nextosid of FBSD-STD which in
             turn is looked up and redirects to the osid_map and selects one of
             FBSD47-STD or FBSD410-STD depending on the time.
      
      	CREATE TABLE osid_map (
      	  osid varchar(35) NOT NULL default '',
      	  btime datetime NOT NULL default '1000-01-01 00:00:00',
      	  etime datetime NOT NULL default '9999-12-31 23:59:59',
      	  nextosid varchar(35) default NULL,
      	  PRIMARY KEY  (osid,btime,etime)
      	) TYPE=MyISAM;
      
             Yeah, yeah, I'm using another magic date as a sentinel value.
             Tell ya what, in 7995 years, find out where I'm buried, dig me up,
             and kick my ass for being so short-sighted...
      
             The following commands are not strictly needed, they just give
             an example, default population of the table.  They cause the standard
             images to be revectored through the table and then remapped, based on
             two time ranges, to the exact same image.  Obviously, the second set
             would normally be mapped to a different set of images (say RHL90 and
             FBSD410):
      
      	INSERT INTO osid_map (osid,etime,nextosid) VALUES \
      	  ('RHL-STD','2004-09-08 08:59:59','emulab-ops-RHL73-STD');
      	INSERT INTO osid_map (osid,etime,nextosid) VALUES \
      	  ('FBSD-STD','2004-09-08 08:59:59','emulab-ops-FBSD47-STD');
      
      	INSERT INTO osid_map (osid,btime,nextosid) VALUES \
      	  ('RHL-STD','2004-09-08 09:00:00','emulab-ops-RHL73-STD');
      	INSERT INTO osid_map (osid,btime,nextosid) VALUES \
      	  ('FBSD-STD','2004-09-08 09:00:00','emulab-ops-FBSD47-STD');
      
      	UPDATE os_info SET nextosid='MAP:osid_map' \
      	  WHERE osname IN ('RHL-STD','FBSD-STD');
      bb56a192
  24. 27 Aug, 2004 1 commit
  25. 25 Aug, 2004 1 commit
  26. 18 Aug, 2004 2 commits
    • Christopher Alfeld's avatar
      Fix for ALWAYSUP nodes and fix for switches with interface entries. · a7b4249d
      Christopher Alfeld authored
      In detail:
      
      1. Added TBDB_NODESTATE_ALWAYSUP to libdb.pm for representing the ALWAYSUP
      eventstate.
      
      2. Modified free node calculation in ptopgen to include ALWAYSUP nodes.
      
      3. Added code to ptopgen to correctly handle the case of a NULL iface
      column, which happens when switches have interface (as they do in
      Wisconsin), but assign_wrapper expects (null) for their iface rather than
      "".
      a7b4249d
    • Robert Ricci's avatar
      New script, deletenode. Does what it sounds like. Scrubs tables · 6c685f91
      Robert Ricci authored
      of all references to a node. Mainly intended for when you have a
      mishap with the newnode stuff and need to clean it up.
      
      Added a big list of which tables contain information about physical
      nodes to libdb, so that this and other scripts can find it all.
      6c685f91
  27. 16 Aug, 2004 1 commit
  28. 11 Aug, 2004 1 commit
    • Leigh B. Stoller's avatar
      Add new per-lan table, which currently is just for Mike: · d09d9696
      Leigh B. Stoller authored
      1.269: Add new table to generate a per virt_lan index for use with
             veth vlan tags. This would be so much easier if the virt_lans
             table had been split into virt_lans and virt_lan_members.
             Anyway, this table might someday become the per-lan table, with a
             table of member settings. This would reduce the incredible amount of
             duplicate info in virt_lans!
      
      	CREATE TABLE virt_lan_lans (
      	  pid varchar(12) NOT NULL default '',
      	  eid varchar(32) NOT NULL default '',
      	  idx int(11) NOT NULL auto_increment,
      	  vname varchar(32) NOT NULL default '',
      	  PRIMARY KEY  (pid,eid,idx),
      	  UNIQUE KEY vname (pid,eid,vname)
      	) TYPE=MyISAM;
      
             This arrangement will provide a unique index per virt_lan, within
             each pid,eid. That is, it starts from 1 for each pid,eid. That is
             necessary since the limit is 16 bits, so a global index would
             quickly overflow. The above table is populated with:
      
      	insert into virt_lan_lans (pid, eid, vname)
                  select distinct pid,eid,vname from virt_lans;
      d09d9696
  29. 29 Jul, 2004 4 commits
    • Leigh B. Stoller's avatar
    • Leigh B. Stoller's avatar
      Rework TBGetSiteVar() slightly. Add optional second parameter $rptr to · 03403a55
      Leigh B. Stoller authored
      store the result in. When called this new way, the value goes into
      $rptr, and exit status is returned to caller instead. In addition,
      when called this way, all errors are non-fatal; it is up to the caller
      to decide what to do.
      03403a55
    • Leigh B. Stoller's avatar
      Two unrelated bug fixes (with some related cleanups and tweaks) · 9f4edbba
      Leigh B. Stoller authored
      * The first involves swapmod. When a swapmod on an active experiment fails,
        tbswap will reswap the experiment back to the original configuration. The
        problem is that it is reswapping it with the *new* virtual state of the
        experiment in the DB. It is not until later when control returns to
        swapexp that the virtual state is restored. This is plainly wrong, and in
        fact was causing the event scheduler grief cause it was starting up,
        reading the the virtual topo, which was different, wrong, and about to be
        blown away.
      
        I reorganized the modify section of swapexp so that virtual state is
        restored only when its a swapmod on a swapped experiment. On an active
        experiment, I moved that code down into tbswap, which will now does all
        of the virtual and physical state retore before it does the reswap back
        to the original experiment. Just for kicks, its also done if tbswap
        decides to swap the experiment cause of a fatal error.
      
        Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
        deal with !$NoRecover. I know, two knots make a wright for most people.
      
        Renderer: I was annoyed by the fact that we rerun the renderer on a
        failed swapmod. The original reason is that the renderer runs in the
        background and so vis_nodes cannot be saved with the rest of the virtual
        state tables cause the renderer might still be running when the user
        fires off the swapmod. Well, the hell with that. We lock the vis_nodes
        table anyway in the renderer during update, so we are certain to get a
        consistent snapshot. We store the renderer pid in the experiments table,
        so if the renderer was running, just fire off another one; mostly this is
        not going to happen. In addition, tbprerun no longer starts a new
        renderer when doing the swapmod; I start the new renderer later after
        swapmod succeeds. I might end up tweaking this a bit depending on what
        people notice as being different.
      
      * Termination changes to batchexp and swapexp: I've rearranged the
        termination code using an END block so that any uncontrolled exit from
        either batchexp or swapexp will go through the cleanup code, and
        hopefully insert a stats record, as well as not leave the experiment in
        some inbetween state. I've set the max DB retry count to zero in both
        cases, which means infinite retry. I've also added SIGTERM handlers to
        both so that again, we can kill a hung batch/swap and have it clean up
        things more or less. Note that END blocks are not caught when a signal
        causes the program to die; you have to catch it and then die() so that
        the END block is executed.
      
        Eventually, we need to clean up the various libraries so that we do not
        use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
        Ditto for event system interface.
      9f4edbba
    • Leigh B. Stoller's avatar
  30. 15 Jul, 2004 2 commits
    • Leigh B. Stoller's avatar
      Couple of minor tweaks to make sure that experiment state events · d1a35ea9
      Leigh B. Stoller authored
      get sent properly; need to call TBdbfork(), and add a couple more
      event sends in libdb.
      d1a35ea9
    • Leigh B. Stoller's avatar
      Overview: Add Event Groups: · ed964507
      Leigh B. Stoller authored
      	set g1 [new EventGroup $ns]
      	$g1 add  $link0 $link1
      	$ns at 60.0 "$g1 down"
      
      See the new advanced tutorial section on event groups for a better
      example.
      
      Changed tbreport to dump the event groups table when in summary mode.
      At the same time, I changed tbreport to use the recently added
      virt_lans:vnode and ip slots, decprecating virt_nodes:ips in one more
      place. I also changed the web interface to always dump the event and
      event group summaries.
      
      The parser gets a new file (event.tcl), and the "at" method deals with
      event group events by expanding them inline into individual events
      sent to each member. For some agents, this is unavoidable; traffic
      generators get the initial params in the event, so it is not possible
      to send a single event to all members of the group. Same goes for
      program objects, although program objects do default to the initial
      command now, at least on new images.
      
      Changed the event scheduler to load the event groups table. The
      current operation is that the scheduler expands events sent to a
      group, into a set of distinct events sent to each member of the
      group. At some point we proably want to optimize this by telling the
      agents (running on the nodes) what groups they are members of.
      
      Other News: Added a "mustdelay" slot to the virt_lans table so the
      parser can tell assign_wrapper that a link needs to be delayed, say if
      there are events or if the link is red/gred. Previously,
      assign_wrapper tried to figure this out by looking at the event list,
      etc. I have removed that code; see database-migrate for instructions
      on how to initialize this slot in existing experiments. assign_wrapper
      is free to ignore or insert delays anyway, but having the parser do
      this makes more sense.
      
      I also made some "rename" changes to the parser wrt queues and lans
      and links. Not really necessary, but I got sidetracked (for several
      hours!) trying to understand that rename stuff a little better, and
      now I do.
      ed964507
  31. 12 Jul, 2004 1 commit