1. 22 Mar, 2005 1 commit
  2. 07 Mar, 2005 1 commit
      · 898cf9a2
      Timothy Stack authored
      Check in some changes related to experiment automation and vnode feedback:
      
      	* configure, configure.in: Add sensors/canaryd/feedbacklogs
      	template.
      
      	* db/libdb.pm.in, db/xmlconvert.in: Add "virt_user_environment"
      	table that holds environment variable names and values.
      
      	* event/lib/event.c: Allocate memory of the right size for
      	event_notifications.
      
      	* event/program-agent/GNUmakefile.in: Add version.c file and
      	add install targets for the man page.
      
      	* event/program-agent/program-agent.8: Man page describing the
      	program-agent daemon.
      
      	* event/program-agent/program-agent.c: Add a bunch of convenience
      	features: let the user specify the working directory for commands;
      	save output to separate files on every invocation of an agent; let
      	the user specify a timeout for a command; make the set of
      	environment variables sane and add vars given in the NS file in
      	the opt array; a "status" file containing process information is
      	written out when children are collected.  Internal changes: child
      	processes are c...
      898cf9a2
  3. 22 Feb, 2005 1 commit
      Okay, first attempt to deal with os_setup waittimes on a per node_type · facc7acd
      Leigh B. Stoller authored
      and per OSID basis.
      
      * Added bios_waittime to node_types table and reboot_waittime to
        os_info table. Initialized them as follows:
      
              update node_types set bios_waittime=60 where class='pc';
              update os_info set reboot_waittime=150 where OS='Linux' or
      	  OS='FreeBSD' or OS='NetBSD';
              update os_info set reboot_waittime=180 where OS='Windows';
      
      * The bios waittime can be edited via the web interface.
      
      * The reboot waittime can be set only by admin people right now; this
        is another case of something that maybe the user should not see
        cause it's too much stuff? Instead, default values are established in
        www/osiddefs.php3.
      
      * os_setup computes its per-node waittime as:
      
      	(bios_waittime + reboot_waittime) * 2
      
        as per Mike's suggestion. If either value is not defined in the DB,
        it defaults to the original 7 minute value.
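The waittime rule above can be sketched in a few lines. This is only an illustration, not the os_setup code: the function name is made up, and it assumes both DB columns are in seconds (which fits the initial values, since (60 + 150) * 2 is 420 seconds, the same as the old 7 minute value).

```python
# Hypothetical sketch of the per-node waittime rule; not the real os_setup code.
# Assumption: bios_waittime and reboot_waittime are stored in seconds.

DEFAULT_WAITTIME = 7 * 60  # the original 7 minute value


def os_setup_waittime(bios_waittime, reboot_waittime):
    """Return the per-node wait in seconds.

    Falls back to the old default when either value is not defined
    (None stands in for a NULL column here).
    """
    if bios_waittime is None or reboot_waittime is None:
        return DEFAULT_WAITTIME
    return (bios_waittime + reboot_waittime) * 2
```

With the initial values from the updates above, a pc running Windows would get (60 + 180) * 2 = 480 seconds under this reading.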
      facc7acd
  4. 08 Feb, 2005 1 commit
  5. 27 Jan, 2005 1 commit
  6. 26 Jan, 2005 1 commit
      The Robot Lab Monitor Daemon. A very silly script that looks at some · 4963660a
      Leigh B. Stoller authored
      sitevars to determine if the Robot Lab is open or closed. The sitevars:
      
      * 'robotlab/override' - Override other settings and forcibly turn the lab
        "on" or "off" (open or close). When the lab is turned off, new
        experiments cannot swap in and the current experiment is immediately
        swapped out.
      
      * 'robotlab/exclusive' - The robot lab is exclusive use. Best to not mess
        with this sitevar :-)
      
      * 'robotlab/opentime' - The time that the robot lab opens in the
        morning. The default is 07:00, but feel free to change this as you like.
      
      * 'robotlab/closetime' - The time that the robot lab closes in the
        evening. The default is 18:00, but feel free to change this as you like.
      
      * 'robotlab/open' - The robot lab is open or closed. DO NOT MESS WITH THIS!
        It is updated by the robomonitord script and intended to be used by
        admission control (not done yet).
      
      The robomonitord script runs and periodically (every 2 minutes) wakes up
      and looks at the various sitevars above. The lab is open during the day,
      Monday through Friday, and closed on weekends. It is also supposed to be
      closed on holidays, but I have not added that yet.
      
      15 minutes before the lab is to be closed, a warning message is sent to the
      swapper of the experiment running on the robot testbed, that their
      experiment is going to be swapped soon. When the Robot lab is closed
      (either cause the close time was reached, or because the lab was forcibly
      closed with the override), the current experiment is immediately swapped
      out.
      
      I know, this is hopelessly bogus, but it will do until we feel like adding
      a "Lab" datatype to the system.
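The open/closed decision described in this message can be sketched roughly as follows. This is not the robomonitord script: the function names and argument shapes are invented for illustration, the override is assumed to be the literal strings "on" and "off", and holidays are left out just as they are in the description.

```python
# Hypothetical sketch of the robot lab open/closed decision; not robomonitord.
from datetime import datetime, time, timedelta


def lab_is_open(now, opentime=time(7, 0), closetime=time(18, 0), override=None):
    """Decide whether the lab is open at the datetime `now`."""
    if override == "on":       # robotlab/override forces the lab open
        return True
    if override == "off":      # ... or forces it closed
        return False
    if now.weekday() >= 5:     # Saturday (5) and Sunday (6): closed
        return False
    return opentime <= now.time() < closetime


def should_warn(now, closetime=time(18, 0), warn_minutes=15):
    """True when the lab closes within the next `warn_minutes` minutes."""
    remaining = datetime.combine(now.date(), closetime) - now
    return timedelta(0) < remaining <= timedelta(minutes=warn_minutes)
```

A monitor loop would call both of these every couple of minutes with the current sitevar values and act on the result.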
      4963660a
  7. 18 Jan, 2005 1 commit
      Here is a checkpoint of the admission control stuff I have been working on. · 54f55585
      Leigh B. Stoller authored
      The last part is the stuff to hook it in from assign_wrapper, and some
      additional support in assign that Rob is adding for me. This comment is
      from the top of new file db/libadminctrl.pm.in and describes everything in
      detail.
      
      # Admission control policies. These are the ones I could think of, although
      # not all of these are implemented.
      #
      #  * Number of experiments per type/class (only one expt using robots).
      #
      #  * Number of experiments per project
      #  * Number of experiments per subgroup
      #  * Number of experiments per user
      #
      #  * Number of nodes per project      (nodes really means pc testnodes)
      #  * Number of nodes per subgroup
      #  * Number of nodes per user
      #
      #  * Number of nodes of a class per project
      #  * Number of nodes of a class per group
      #  * Number of nodes of a class per user
      #
      #  * Number of nodes of a type per project
      #  * Number of nodes of a type per group
      #  * Number of nodes of a type per user
      #
      #  * Number of nodes with attribute(s) per project
      #  * Number of nodes with attribute(s) per group
      #  * Number of nodes with attribute(s) per user
      #
      # So we have group (pid/gid) policies and user policies. These are stored
      # into two different tables, group_policies and user_policies, indexed in
      # the obvious manner. Each row of the table defines a count (experiments,
      # nodes, etc) and a type of thing being counted (experiments, nodes, types,
      # classes, etc). When we test for admission, we look for each matching row
      # and test each condition. All conditions must pass. No conditions means a
      # pass. There is also some "auxdata" which holds extra information needed
      # for the policy (say, the type of node being restricted).
      #
      #      uid:     a uid
      #   policy:     'experiments', 'nodes', 'type', 'class', 'attribute'
      #    count:     a number
      #  auxdata:     a string (optional)
      #
      # Example: A user policy of ('mike', 'nodes', 10) says that poor mike is
      # not allowed to have more than 10 nodes at a time, while ('mike', 'type',
      # '10', 'pc850') says that mike cannot allocate more than 10 pc850s.
      #
      # The group_policies table:
      #
      #      pid:     a pid
      #      gid:     a gid
      #   policy:     'experiments', 'nodes', 'type', 'class', 'attribute'
      #    count:     a number
      #  auxdata:     a string (optional)
      #
      # Example: A project policy of ('testbed', 'testbed', 'experiments', 10)
      # says that the testbed project may not have more than 10 experiments
      # swapped in at a time, while ('testbed', 'TG1', 'nodes', 10) says that the
      # TG1 subgroup of the testbed project may not use more than 10 nodes at
      # a time.
      #
      # In addition to group and user policies (which are policies that apply to
      # specific users/projects/subgroups), we also need policies that apply to
      # all users/projects/subgroups (ie: do not want to specify a particular
      # restriction for every user!). To indicate such a policy, we use a special
      # tag in the tables (for the user or pid/gid):
      #
      #      '+'  -  The policy applies to all users (or project/groups).
      #
      # Example: ('+','experiments',10) says that no user may have more than 10
      # experiments swapped in at a time. The rule overrides anything more
      # specific (say a particular user is restricted to 20 experiments; the above
      # rule overrides that and the user (all users) is restricted to 10).
      #
      # Sometimes, you want one of these special rules to apply to everyone, but
      # *allow* it to be overridden by a more specific rule. For that we use:
      #
      #      '-'  -  The policy applies to all users (or project/groups),
      #              but can be overridden by a more specific rule.
      #
      # Example: The rules:
      #
      #       ('-','type',0, 'garcia')
      #       ('testbed', 'testbed', 'type', 10, 'garcia')
      #
      # say that no one is allowed to allocate garcias, unless there is a specific
      # rule that allows it; in this case the testbed project can allocate them.
      #
      # There are other global policies we would like to enforce. For example,
      # "only one experiment can be using the robot testbed." Encoding this kind
      # of policy is harder, and leads down a path that can get arbitrarily
      # complex. That path leads to ruination, and so we want to avoid it at
      # all costs.
      #
      # Instead we define a simple global policies table that applies to all
      # experiments currently active on the testbed:
      #
      #   policy:     'nodes', 'type', 'class', 'attribute'
      #     test:     'max', others I cannot think of right now ...
      #    count:     a number
      #  auxdata:     a string
      #
      # Example: A global policy of ('nodes', 'max', 10, '') says that the maximum
      # number of nodes that may be allocated across the testbed is 10. That's not
      # a very realistic policy of course, but ('type', 'max', 1, 'garcia') says
      # that a max of one garcia can be allocated across the testbed, which
      # effectively means only one experiment will be able to use them at once.
      # This is of course very weak, but I want to step back and give it some
      # more thought before I redo this part.
      #
      # Is that clear? Hope so, cause it gets more complicated. Some admission
      # control tests can be done early in the swap phase, before we really do
      # anything (before assign_wrapper). Others (type and class) tests cannot
      # be done here; only assign can figure out how an experiment is going to map
      # to physical nodes (remember virtual types too), and in that case we need
      # to tell assign what the "constraints" are and let it figure out what is
      # possible.
      #
      # So, in addition to the simple checks we can do, we also generate an array
      # to return to assign_wrapper with the maximum counts of each node type and
      # class that is limited by the policies. assign_wrapper will dump those
      # values into the ptop file so that assign can enforce those maximum values
      # regardless of what hardware is actually available to use. As per discussion
      # with Rob, that will look like:
      #
      #	set-type-limit <type> <limit>
      #
      # and assign will spit out a new type of violation that assign_wrapper will
      # parse.
      #
      # NOTES:
      #
      #  1) Admission control is skipped in admin mode; returns okay.
      #  2) Admission control is skipped when the pid is emulab-ops; returns okay.
      #  3) When calculating current usage, nodes reserved to emulab-ops are
      #     ignored.
      #  4) The sitevar "swap/use_admission_control" controls the use of admission
      #     control; defaults to 1 (on).
      #  5) The current policies can be viewed in the web interface. See
      #     https://www.emulab.net/showpolicies.php3
      #  6) The global policy stuff is weak. I plan to step back and think about it
      #     some more before redoing it, but it will tide us over for now.
      #
      54f55585
  8. 17 Jan, 2005 1 commit
      · bf489797
      Timothy Stack authored
      More robot integration and some event system updates.
      
      	* configure, configure.in: Detect rsync for loghole and add
      	utils/loghole to the list of template files.
      
      	* db/libdb.pm.in, db/xmlconvert.in: Add virt_node_startloc to the
      	list of virtual tables.
      
      	* event/lib/event.h, event/lib/event.c, event/lib/tbevent.py.tail:
      	Add event_stop_main function to break out of the event_main()
      	loop.  Add timeline to the address tuple.
      
      	* event/sched/GNUmakefile.in, event/sched/error-record.h,
      	event/sched/error-record.c, event/sched/event-sched.8,
      	event/sched/event-sched.h, event/sched/event-sched.c,
      	event/sched/group-agent.h, event/sched/group-agent.c,
      	event/sched/listNode.h, event/sched/listNode.c,
      	event/sched/local-agent.h, event/sched/local-agent.c,
      	event/sched/node-agent.h, event/sched/node-agent.cc,
      	event/sched/queue.c, event/sched/rpc.h, event/sched/rpc.cc,
      	event/sched/simulator-agent.h, event/sched/simulator-agent.c,
      	event/sched/timeline-agent.h, event/sched/timeline-agent.c:
      	Updated event schedu...
      bf489797
  9. 12 Jan, 2005 1 commit
  10. 06 Jan, 2005 1 commit
      A bunch of boot changes. Read carefully. · 94ccc3f4
      Leigh B. Stoller authored
      * Add boot_errno to the nodes table so that nodes can report in a
        subcode to indicate what went wrong. At present, we do not report any
        real error codes; that is going to take some time to work out since it
        will require a bunch of changes to the boot scripts.
      
      * Add new table node_bootlogs to store logs provided by the nodes. Not
        a full console log, but a log of the tmcd client side part. We can
        make it a full log if we want though; just means mucking about with
        the boot phase a bit.
      
      * Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED"
        is a new state that is sent (after TBSETUP) if a node fails somewhere in
        the tmcd client side.
      
      * Change TBNodeStateWait() to take a list of states (instead of single
        state) and an optional pass by reference parameter to return the actual
        state that the node landed in. Change all calls to TBNodeStateWait() of
        course.
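The new calling convention (wait for any of several states and report which one the node landed in) can be sketched like this. It is not the Perl TBNodeStateWait(): the Python name, the polling approach, and the tuple return (standing in for the pass by reference parameter) are all assumptions, and the "ISUP" state name in the usage note is used only as an example.

```python
# Hypothetical sketch of waiting for any of a list of node states.
# `get_state` is an assumed callable that returns the node's current state.
import time


def node_state_wait(get_state, states, timeout=5.0, interval=0.01):
    """Poll until the node reaches one of `states` or the timeout expires.

    Returns (ok, actual_state); the second element plays the role of the
    optional pass by reference parameter described above.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_state()
        if current in states:
            return True, current
        if time.monotonic() >= deadline:
            return False, current
        time.sleep(interval)
```

A caller that waits on both a success state and TBFAILED, for example `node_state_wait(get_state, ["ISUP", "TBFAILED"])`, can then tell a failed boot apart from a timeout by looking at the returned state.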
      
      * Change os_setup (and libreboot in wait mode) to look for both TBFAILED...
      94ccc3f4
  11. 03 Jan, 2005 1 commit
  12. 16 Dec, 2004 1 commit
      The panic button ... · 87dd2e60
      Leigh B. Stoller authored
      * tbsetup/panic.in: New backend script to implement the panic button
        feature. When used, it will sever the connection to the
        firewall node by using snmpit to disable the port. Sets the panic
        bit (and date) in the experiments table, and changes the state of
        the experiment from "active" to "paniced" to ensure that the
        experiment cannot be messed with (swapped out or modified). Sends
        email to tbops when the panic button is pressed.
      
        Used with -r option, reverses the above. State is set back to
        active, the panic bit is cleared, and the port is reenabled with
        snmpit.
      
      * tbsetup/tbswap.in: During swapout, a firewalled experiment that has
        been paniced will get a cleaning; The nodes are powered off, then
        the osids for all the nodes are reset (with os_select) so that they
        will boot the MFS, and then the nodes are powered on. Then the
        control network is turned back on, and then I wait for the nodes to
        reboot (this is simply cause we do not record in the DB th...
      87dd2e60
  13. 09 Dec, 2004 1 commit
  14. 07 Dec, 2004 1 commit
  15. 03 Dec, 2004 1 commit
  16. 01 Dec, 2004 2 commits
  17. 11 Nov, 2004 1 commit
  18. 01 Nov, 2004 1 commit
  19. 29 Oct, 2004 1 commit
  20. 25 Oct, 2004 2 commits
  21. 11 Oct, 2004 3 commits
  22. 08 Sep, 2004 1 commit
      1.275: Add time-based mapping table for generic OSIDs. This augments the · bb56a192
      Mike Hibler authored
             nextosid mechanism of 1.114 making it possible to map a generic *-STD
             OSID based on the time in which an experiment is created.  This
             provides backward compatibility for old experiments when the standard
             images are changed.
      
             The osid_map table lookup is triggered when the value of the nextosid
             field is set to 'MAP:osid_map'.  The nextosid also continues to behave
             as before: if it contains a valid osid, that OSID value is used to map
             independent of the experiment creation time.  The two styles can also
             be mixed, for example FBSD-JAIL has a nextosid of FBSD-STD which in
             turn is looked up and redirects to the osid_map and selects one of
             FBSD47-STD or FBSD410-STD depending on the time.
      
      	CREATE TABLE osid_map (
      	  osid varchar(35) NOT NULL default '',
      	  btime datetime NOT NULL default '1000-01-01 00:00:00',
      	  etime datetime NOT NULL default '9999-12-31 23:59:59',
      	  nextosid varchar(35) default NULL,
      	  PRIMARY KEY  (osid,btime,etime)
      	) TYPE=MyISAM;
      
             Yeah, yeah, I'm using another magic date as a sentinel value.
             Tell ya what, in 7995 years, find out where I'm buried, dig me up,
             and kick my ass for being so short-sighted...
      
             The following commands are not strictly needed, they just give
             an example, default population of the table.  They cause the standard
             images to be revectored through the table and then remapped, based on
             two time ranges, to the exact same image.  Obviously, the second set
             would normally be mapped to a different set of images (say RHL90 and
             FBSD410):
      
      	INSERT INTO osid_map (osid,etime,nextosid) VALUES \
      	  ('RHL-STD','2004-09-08 08:59:59','emulab-ops-RHL73-STD');
      	INSERT INTO osid_map (osid,etime,nextosid) VALUES \
      	  ('FBSD-STD','2004-09-08 08:59:59','emulab-ops-FBSD47-STD');
      
      	INSERT INTO osid_map (osid,btime,nextosid) VALUES \
      	  ('RHL-STD','2004-09-08 09:00:00','emulab-ops-RHL73-STD');
      	INSERT INTO osid_map (osid,btime,nextosid) VALUES \
      	  ('FBSD-STD','2004-09-08 09:00:00','emulab-ops-FBSD47-STD');
      
      	UPDATE os_info SET nextosid='MAP:osid_map' \
      	  WHERE osname IN ('RHL-STD','FBSD-STD');
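The lookup described above (follow nextosid, and consult osid_map when the field holds 'MAP:osid_map') might look roughly like this. It is a sketch rather than the actual resolution code: the in-memory dict and list stand in for the os_info and osid_map tables, the function name and the hop limit are invented, and it assumes the btime and etime bounds are inclusive.

```python
# Hypothetical sketch of generic OSID resolution; not the real lookup code.
from datetime import datetime

MAP_TAG = "MAP:osid_map"
BTIME_MIN = datetime(1000, 1, 1, 0, 0, 0)       # table default for btime
ETIME_MAX = datetime(9999, 12, 31, 23, 59, 59)  # table default for etime


def resolve_osid(osid, created, nextosid, osid_map, max_hops=10):
    """Follow nextosid links until an OSID with no nextosid is reached.

    nextosid: dict mapping osid -> nextosid value (absent means NULL).
    osid_map: list of (osid, btime, etime, nextosid) rows.
    created:  experiment creation time, used only for MAP lookups.
    """
    for _ in range(max_hops):
        nxt = nextosid.get(osid)
        if nxt is None:
            return osid
        if nxt == MAP_TAG:
            # Pick the row whose time range covers the creation time.
            for m_osid, btime, etime, m_next in osid_map:
                if m_osid == osid and btime <= created <= etime:
                    osid = m_next
                    break
            else:
                raise LookupError("no osid_map row for %s at %s" % (osid, created))
        else:
            osid = nxt
    raise RuntimeError("nextosid chain too long")
```

With FBSD-JAIL pointing at FBSD-STD and FBSD-STD set to 'MAP:osid_map', an experiment created before the cutover would resolve to the older image and one created after it to the newer image, which is the mixed case described in the message.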
      bb56a192
  23. 27 Aug, 2004 1 commit
  24. 25 Aug, 2004 1 commit
  25. 18 Aug, 2004 2 commits
      Fix for ALWAYSUP nodes and fix for switches with interface entries. · a7b4249d
      Christopher Alfeld authored
      In detail:
      
      1. Added TBDB_NODESTATE_ALWAYSUP to libdb.pm for representing the ALWAYSUP
      eventstate.
      
      2. Modified free node calculation in ptopgen to include ALWAYSUP nodes.
      
      3. Added code to ptopgen to correctly handle the case of a NULL iface
      column, which happens when switches have interfaces (as they do in
      Wisconsin), but assign_wrapper expects (null) for their iface rather than
      "".
      a7b4249d
      New script, deletenode. Does what it sounds like. Scrubs tables · 6c685f91
      Robert Ricci authored
      of all references to a node. Mainly intended for when you have a
      mishap with the newnode stuff and need to clean it up.
      
      Added a big list of which tables contain information about physical
      nodes to libdb, so that this and other scripts can find it all.
      6c685f91
  26. 16 Aug, 2004 1 commit
  27. 11 Aug, 2004 1 commit
      Add new per-lan table, which currently is just for Mike: · d09d9696
      Leigh B. Stoller authored
      1.269: Add new table to generate a per virt_lan index for use with
             veth vlan tags. This would be so much easier if the virt_lans
             table had been split into virt_lans and virt_lan_members.
             Anyway, this table might someday become the per-lan table, with a
             table of member settings. This would reduce the incredible amount of
             duplicate info in virt_lans!
      
      	CREATE TABLE virt_lan_lans (
      	  pid varchar(12) NOT NULL default '',
      	  eid varchar(32) NOT NULL default '',
      	  idx int(11) NOT NULL auto_increment,
      	  vname varchar(32) NOT NULL default '',
      	  PRIMARY KEY  (pid,eid,idx),
      	  UNIQUE KEY vname (pid,eid,vname)
      	) TYPE=MyISAM;
      
             This arrangement will provide a unique index per virt_lan, within
             each pid,eid. That is, it starts from 1 for each pid,eid. That is
             necessary since the limit is 16 bits, so a global index would
             quickly overflow. The above table is populated with:
      
      	insert into virt_lan_lans (pid, eid, vname)
                  select distinct pid,eid,vname from virt_lans;
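The effect of that arrangement, an index that restarts at 1 within each pid,eid, can be imitated in a few lines. This is only an in-memory illustration of the numbering, not how the database does it: the class name is invented, and it assumes rows are never deleted, so the next index is simply the current count plus one.

```python
# Hypothetical sketch of per-experiment lan indexes; not the DB mechanism.
from collections import defaultdict

MAX_INDEX = 0xFFFF  # the 16 bit limit mentioned above


class LanIndex:
    """Hands out a unique index per vname within each (pid, eid)."""

    def __init__(self):
        self._lans = defaultdict(dict)  # (pid, eid) -> {vname: idx}

    def assign(self, pid, eid, vname):
        lans = self._lans[(pid, eid)]
        if vname not in lans:           # mirrors the UNIQUE KEY on vname
            idx = len(lans) + 1         # numbering starts from 1 per pid,eid
            if idx > MAX_INDEX:
                raise OverflowError("too many lans in %s/%s" % (pid, eid))
            lans[vname] = idx
        return lans[vname]
```

Because the count is kept per experiment, two experiments can both have a lan with index 1 without colliding, and a single experiment would need more than 65535 lans before running out.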
      d09d9696
  28. 29 Jul, 2004 4 commits
      Rework TBGetSiteVar() slightly. Add optional second parameter $rptr to · 03403a55
      Leigh B. Stoller authored
      store the result in. When called this new way, the value goes into
      $rptr, and exit status is returned to caller instead. In addition,
      when called this way, all errors are non-fatal; it is up to the caller
      to decide what to do.
      03403a55
      Two unrelated bug fixes (with some related cleanups and tweaks) · 9f4edbba
      Leigh B. Stoller authored
      * The first involves swapmod. When a swapmod on an active experiment fails,
        tbswap will reswap the experiment back to the original configuration. The
        problem is that it is reswapping it with the *new* virtual state of the
        experiment in the DB. It is not until later when control returns to
        swapexp that the virtual state is restored. This is plainly wrong, and in
        fact was causing the event scheduler grief cause it was starting up,
        reading the virtual topo, which was different, wrong, and about to be
        blown away.
      
        I reorganized the modify section of swapexp so that virtual state is
        restored only when it's a swapmod on a swapped experiment. On an active
        experiment, I moved that code down into tbswap, which now does all
        of the virtual and physical state restore before it does the reswap back
        to the original experiment. Just for kicks, it's also done if tbswap
        decides to swap the experiment cause of a fatal error.
      
        Cleanups: I changed $NoRecover to $CanRecover. My feeble brain cannot
        deal with !$NoRecover. I know, two knots make a wright for most people.
      
        Renderer: I was annoyed by the fact that we rerun the renderer on a
        failed swapmod. The original reason is that the renderer runs in the
        background and so vis_nodes cannot be saved with the rest of the virtual
        state tables cause the renderer might still be running when the user
        fires off the swapmod. Well, the hell with that. We lock the vis_nodes
        table anyway in the renderer during update, so we are certain to get a
        consistent snapshot. We store the renderer pid in the experiments table,
        so if the renderer was running, just fire off another one; mostly this is
        not going to happen. In addition, tbprerun no longer starts a new
        renderer when doing the swapmod; I start the new renderer later after
        swapmod succeeds. I might end up tweaking this a bit depending on what
        people notice as being different.
      
      * Termination changes to batchexp and swapexp: I've rearranged the
        termination code using an END block so that any uncontrolled exit from
        either batchexp or swapexp will go through the cleanup code, and
        hopefully insert a stats record, as well as not leave the experiment in
        some inbetween state. I've set the max DB retry count to zero in both
        cases, which means infinite retry. I've also added SIGTERM handlers to
        both so that again, we can kill a hung batch/swap and have it clean up
        things more or less. Note that END blocks are not run when a signal
        causes the program to die; you have to catch the signal and then die() so
        that the END block is executed.
      
        Eventually, we need to clean up the various libraries so that we do not
        use DBQueryFatal(), but rather use DBQueryWarn(), and look for failure.
        Ditto for event system interface.
      9f4edbba
  29. 15 Jul, 2004 2 commits
      Couple of minor tweaks to make sure that experiment state events · d1a35ea9
      Leigh B. Stoller authored
      get sent properly; need to call TBdbfork(), and add a couple more
      event sends in libdb.
      d1a35ea9
      Overview: Add Event Groups: · ed964507
      Leigh B. Stoller authored
      	set g1 [new EventGroup $ns]
      	$g1 add  $link0 $link1
      	$ns at 60.0 "$g1 down"
      
      See the new advanced tutorial section on event groups for a better
      example.
      
      Changed tbreport to dump the event groups table when in summary mode.
      At the same time, I changed tbreport to use the recently added
      virt_lans:vnode and ip slots, deprecating virt_nodes:ips in one more
      place. I also changed the web interface to always dump the event and
      event group summaries.
      
      The parser gets a new file (event.tcl), and the "at" method deals with
      event group events by expanding them inline into individual events
      sent to each member. For some agents, this is unavoidable; traffic
      generators get the initial params in the event, so it is not possible
      to send a single event to all members of the group. Same goes for
      program objects, although program objects do default to the initial
      command now, at least on new images.
      
      Changed the event scheduler to load the event groups table. The
      current operation is that the scheduler expands events sent to a
      group, into a set of distinct events sent to each member of the
      group. At some point we probably want to optimize this by telling the
      agents (running on the nodes) what groups they are members of.
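The expansion step described here is simple enough to sketch. This is not the event scheduler code: the function name and the (time, target, action) tuple layout are assumptions made for the example, and the group table is just a dict from group name to member names.

```python
# Hypothetical sketch of expanding a group event into per-member events.


def expand_event(event, groups):
    """Return the list of events actually sent for `event`.

    event:  (time, target, action) tuple
    groups: dict mapping group name -> list of member agent names
    An event aimed at a non-group target is passed through unchanged.
    """
    when, target, action = event
    members = groups.get(target)
    if members is None:
        return [event]
    return [(when, member, action) for member in members]
```

For the NS fragment at the top of this message, a group g1 holding link0 and link1 with a "down" event at time 60.0 would turn into two events, one for each link, both at time 60.0.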
      
      Other News: Added a "mustdelay" slot to the virt_lans table so the
      parser can tell assign_wrapper that a link needs to be delayed, say if
      there are events or if the link is red/gred. Previously,
      assign_wrapper tried to figure this out by looking at the event list,
      etc. I have removed that code; see database-migrate for instructions
      on how to initialize this slot in existing experiments. assign_wrapper
      is free to ignore or insert delays anyway, but having the parser do
      this makes more sense.
      
      I also made some "rename" changes to the parser wrt queues and lans
      and links. Not really necessary, but I got sidetracked (for several
      hours!) trying to understand that rename stuff a little better, and
      now I do.
      ed964507
  30. 12 Jul, 2004 2 commits