1. 12 Jan, 2005 3 commits
    • Robert Ricci's avatar
      Added support for a switch being in more than one stack at a time. · e189be0a
      Robert Ricci authored
      Each switch has a 'primary' stack that it belongs to if it's specified
      with the '-i' parameter. Otherwise, it can be considered to be a part
      of any of the stacks of which it's a member.
      
      The main point of this is so that we can have switches that are on
      both the control and experimental networks.
      
      Note: Having a VLAN with the same name on two overlapping stacks is
      like crossing the streams: that would be bad. Not "all life as you
      know it stopping instantaneously" bad, but snmpit might get confused.
      e189be0a
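The VLAN-name caveat above can be checked mechanically. What follows is a minimal sketch, not snmpit code: the `stacks` and `vlans` dictionaries and the function name `overlapping_vlan_names` are hypothetical, and two stacks are assumed to "overlap" when they share at least one switch.

```python
from itertools import combinations

def overlapping_vlan_names(stacks, vlans):
    """Return VLAN names that appear on two stacks sharing at least one switch.

    stacks: dict of stack name -> set of switch names (hypothetical layout)
    vlans:  dict of stack name -> set of VLAN names defined on that stack
    """
    conflicts = set()
    for a, b in combinations(sorted(stacks), 2):
        if stacks[a] & stacks[b]:  # the two stacks share a switch
            conflicts |= vlans.get(a, set()) & vlans.get(b, set())
    return conflicts

# sw2 sits on both the control and experimental stacks.
stacks = {"control": {"sw1", "sw2"}, "experimental": {"sw2", "sw3"}}
vlans = {"control": {"mgmt", "shared"}, "experimental": {"exp1", "shared"}}
conflicts = overlapping_vlan_names(stacks, vlans)
print(sorted(conflicts))
```

A check like this could run before any VLAN is created, so the confusing case is rejected up front rather than discovered later.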
    • Leigh B. Stoller's avatar
      Another little hack for Mike; Add a "lockdown" bit to the experiments · d8b17f2c
      Leigh B. Stoller authored
      table that will prevent an experiment from being swapped/modified. The
      toggle is on the showexp page, and the toggle is *not* admin
      over-ridable; you must turn the toggle off (and of course, you must be
      an admin to do that).
      d8b17f2c
    • Leigh B. Stoller's avatar
      A hack for Mike. Add a node_history table to store all moves in and · 26b318a2
      Leigh B. Stoller authored
      out of the reserved table. Mostly this happens in nfree and nalloc,
      but there are a couple of other moves, in libdb and in the reload daemon.
      The uid and experiment are stored, along with a timestamp.
      26b318a2
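To illustrate the idea of logging every move in and out of the reserved table, here is a minimal sketch using an in-memory SQLite database. The column names (`node_id`, `op`, `uid`, `experiment`, `stamp`) and the helper `record_move` are assumptions for illustration; the commit message does not give the real schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per move in or out of the reserved table.
conn.execute(
    "CREATE TABLE node_history ("
    "node_id TEXT, op TEXT, uid TEXT, experiment TEXT, stamp INTEGER)"
)

def record_move(conn, node_id, op, uid, experiment, stamp=None):
    """Append one history row; op is 'alloc' or 'free' in this sketch."""
    when = int(time.time()) if stamp is None else int(stamp)
    conn.execute(
        "INSERT INTO node_history VALUES (?, ?, ?, ?, ?)",
        (node_id, op, uid, experiment, when),
    )

record_move(conn, "pc1", "alloc", "mike", "testbed/demo", stamp=1000)
record_move(conn, "pc1", "free", "mike", "testbed/demo", stamp=2000)

rows = conn.execute(
    "SELECT op, stamp FROM node_history WHERE node_id = ? ORDER BY stamp",
    ("pc1",),
).fetchall()
print(rows)
```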
  2. 10 Jan, 2005 2 commits
  3. 06 Jan, 2005 2 commits
    • Robert Ricci's avatar
      Add a sanity check to make sure that we have appropriate stacks for · ff753403
      Robert Ricci authored
      all ports that have been specified.
      ff753403
    • Leigh B. Stoller's avatar
      A bunch of boot changes. Read carefully. · 94ccc3f4
      Leigh B. Stoller authored
      * Add boot_errno to the nodes table so that nodes can report in a
        subcode to indicate what went wrong. At present, we do not report any
        real error codes; that is going to take some time to work out since it
        will require a bunch of changes to the boot scripts.
      
      * Add new table node_bootlogs to store logs provided by the nodes. Not
        a full console log, but a log of the tmcd client side part. We can
        make it a full log if we want though; just means mucking about with
        the boot phase a bit.
      
      * Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED"
        is a new state that is sent (after TBSETUP) if a node fails somewhere in
        the tmcd client side.
      
      * Change TBNodeStateWait() to take a list of states (instead of single
        state) and an optional pass by reference parameter to return the actual
        state that the node landed in. Change all calls to TBNodeStateWait() of
        course.
      
      * Change os_setup (and libreboot in wait mode) to look for both TBFAILED
        and ISUP. If a TBFAILED event is seen, we can terminate the wait early
        and not retry os_setup on physical nodes (although still retry virtual
        nodes). The nice thing about this is that the wait should terminate much
        earlier (rather than waiting for timeout), especially for virtual nodes
        which can take a really long time when there are a couple of hundred.
      
      * Add new routines dobooterrno() and dobootlog() to tmcd. Bump version
        number and increase the buffer size to allow for the larger packets that
        a console log will generate (added MAXTMCDPACKET variable, set to 0x4000).
      
      * Add new -f option to tmcc to specify a datafile to send along as the last
        argument to tmcd. This is more pleasing than trying to send a console log
        in on the command line. For example: "tmcc -f /tmp/log BOOTLOG" will send
        a BOOTLOG command along with the contents of /tmp/log.
      
        Also close the write side of the pipe so that server sees EOF on
        read. See aside comment below.
      
      * Changes to rc.bootsetup:
           1. Use perl tricks to capture all output, duping to the console and to
              a log file in /var/emulab/logs.
           2. On any error, send a status code (boot_errno) and the bootlog to
              tmcd.
           3. Generate a TBFAILED state transition.
      
      * Changes to rc.injail:
           1. Same as rc.bootsetup, but do not send log files; that would pummel
              boss. Leave them on the physical node.
      
      * Change vnodesetup (which calls mkjail) to watch for any error and send a
        TBFAILED state transition. This should catch almost all errors, and
        dramatically reduce waiting when something fails.
      
      * Changes to rc.cdboot are essentially the same as rc.bootsetup, although a
        bootlog is sent all the time (success or failure), and I do not generate
        a boot_errno yet. Also, instead of TBFAILED, generate a PXEFAILED state
        since the CDROM is actually operating within the PXEFBSD opmode. I have
        yet to work this into the rest of the system though; waiting to get a new
        CD built and actually experiment with it.
      
      * Add new menu option and web page to display the node bootlog. We store
        only the latest bootlog, but maybe someday store more than one. Display
        boot_errno on node page.
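The "wait for any of several states, then decide whether to retry" behaviour described in the list above can be sketched as follows. This is not the Perl TBNodeStateWait(); `wait_for_states` and `should_retry` are hypothetical names, the polling callback stands in for the database lookup, and treating a timeout as retryable is an assumption.

```python
import time

def wait_for_states(get_state, states, timeout=5.0, interval=0.01):
    """Poll get_state() until it returns one of `states` or the timeout expires.

    Returns the state actually reached, or None on timeout.
    """
    wanted = set(states)
    deadline = time.monotonic() + timeout
    while True:
        current = get_state()
        if current in wanted:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

def should_retry(final_state, is_virtual):
    """TBFAILED ends retries on physical nodes; virtual nodes still retry."""
    if final_state == "ISUP":
        return False
    if final_state == "TBFAILED":
        return is_virtual
    return True  # timeout: assumed retryable in this sketch

# Simulated node that fails in the tmcd client side after TBSETUP.
# The states before TBSETUP are illustrative names only.
seq = iter(["SHUTDOWN", "BOOTING", "TBSETUP", "TBFAILED"])
result = wait_for_states(lambda: next(seq), ["ISUP", "TBFAILED"])
print(result, should_retry(result, is_virtual=False))
```

Returning the state that was reached, rather than a bare success flag, is what lets the caller stop waiting early on TBFAILED instead of sitting out the full timeout.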
      
      Aside: I made a big mistake in the tmcd protocol; I did not envision
      passing more than a small amount of data (one fragment) and so I do not
      include a record terminator (ie: close of the write side on the client
      sends EOF) or a size field at the beginning. No big deal since small
      requests are sent in one fragment and the server sees the entire
      thing. Well, with a large console log, that will end up as multiple
      fragments, and the server will often not get the entire thing on the first
      read, and there are no subsequent reads (with no EOF or known size, it
      would block forever). Well, fixing this in a backwards compatible manner
      (for old images) was way too much pain. Instead, tmcc now closes the write
      side, and the server does subsequent reads *only* in the new dobootlog()
      routine. Note that it *is* possible to fix this in a backwards compatible
      manner, but I did not want to go down that path just yet.
      94ccc3f4
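The framing problem in the aside, and the "close the write side" fix, can be demonstrated with a local socket pair. This is a sketch of the general technique only, not the tmcd wire protocol: the `"BOOTLOG "` prefix, the payload size, and the function names are illustrative assumptions.

```python
import socket
import threading

def read_until_eof(conn, chunk=4096):
    """Server side: keep reading until the peer closes its write side."""
    parts = []
    while True:
        data = conn.recv(chunk)
        if not data:  # an empty read means EOF
            break
        parts.append(data)
    return b"".join(parts)

def send_with_eof(conn, payload):
    """Client side: send everything, then half-close so the server sees EOF."""
    conn.sendall(payload)
    conn.shutdown(socket.SHUT_WR)

client, server = socket.socketpair()
payload = b"BOOTLOG " + b"x" * 100_000  # far larger than a single recv() chunk

sender = threading.Thread(target=send_with_eof, args=(client, payload))
sender.start()
received = read_until_eof(server)
sender.join()
client.close()
server.close()
print(len(received))
```

A single `recv()` here would return only the first chunk; looping is safe only because the half-close guarantees the loop ends. A length prefix would be the other common way to delimit the request.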
  4. 22 Dec, 2004 2 commits
  5. 21 Dec, 2004 3 commits
  6. 16 Dec, 2004 6 commits
    • Leigh B. Stoller's avatar
      Neuter portstats inside an elabinelab. Eventually pass it out with · de2e409f
      Leigh B. Stoller authored
      XMLRPC, but for now avoid the warnings.
      de2e409f
    • Leigh B. Stoller's avatar
      Fully launch inner experiment (was just preloading cause of frisbee problems). · fb726630
      Leigh B. Stoller authored
      Do not die when turning firewall rules back on fails. This is a transient
      error I do not understand yet.
      When firewalled and paniced, skip clean shutdown of inner nodes since
      they are going to be powered off anyway later, and besides, the control
      network is shut off, so no way to talk to inner boss anyway.
      fb726630
    • Leigh B. Stoller's avatar
      Nothing special. · 760bf11b
      Leigh B. Stoller authored
      760bf11b
    • Robert Ricci's avatar
      Add support (admins only for now) for restarting the event system via · 1d13cde6
      Robert Ricci authored
      the web interface.
      1d13cde6
    • Leigh B. Stoller's avatar
      The panic button ... · 87dd2e60
      Leigh B. Stoller authored
      * tbsetup/panic.in: New backend script to implement the panic button
        feature. When used, it will sever the connection to the
        firewall node by using snmpit to disable the port. Sets the panic
        bit (and date) in the experiments table, and changes the state of
        the experiment from "active" to "paniced" to ensure that the
        experiment cannot be messed with (swapped out or modified). Sends
        email to tbops when the panic button is pressed.
      
        Used with -r option, reverses the above. State is set back to
        active, the panic bit is cleared, and the port is re-enabled with
        snmpit.
      
      * tbsetup/tbswap.in: During swapout, a firewalled experiment that has
        been paniced will get a cleaning; The nodes are powered off, then
        the osids for all the nodes are reset (with os_select) so that they
        will boot the MFS, and then the nodes are powered on. Then the
        control network is turned back on, and then I wait for the nodes to
        reboot (this is simply cause we do not record in the DB that a node
        is turned off, and if I do not wait, the reload daemon will end up
        hitting the power button again if they do not reboot in time). We can
        fix this later.
      
        I am not planning to apply this to general firewalled experiments
        yet as the power cycling is going to be hard on the nodes, so would
        rather that we at least have a 1/2 baked plan before we do that.
      
      * www/showexp.php3: If experiment is firewalled, show the Panic
        Button, linked to the panic button web script. If the experiment has
        already had the panic button pressed, show a big warning message and
        explain that user must talk to tbops to swap the experiment out.
        Also fiddle with menu options so that the terminate link is gone,
        and the swap link is visible only in admin mode. In other words, only
        an admin person can swap an experiment once it is paniced. And of
        course, an admin person can run the backend panic script above with the
        -r option, but that's not something to be done lightly.
      
      * db/libdb.pm.in: Add "paniced" as an experiment state (EXPTSTATE_PANICED).
        Add utility functions: TBExptSetPanicBit(), TBExptGetPanicBit(), and
        TBExptClearPanicBit().
      
      * tbsetup/swapexp.in: Minor state fiddling so that an experiment can
        be swapped while in paniced state, but only when in admin mode. Also
        clear the panic bit when experiment is swapped out.
      
      * www/dbdefs.php3.in: Add "paniced" as an experiment state. Add a
        utility function TBExptFirewall() to see if experiment is firewalled.
      
      * www/panicbutton.php3: New web script to invoke the backend panic
        script mentioned above, after the usual confirm song and dance.
      
      * www/panicbutton.gif: New gif of a red panic button that I stole off
        the net. If anyone sees/has a better one, feel free to replace
        this one.
      
      * utils/node_statewait.in: Add -s option so that I can pass in the
        state I want to wait for (used from tbswap above to wait for nodes
        to reach ISUP after power on).
      87dd2e60
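The panic-button state handling described in this commit can be summarised in a small model. This is a sketch under stated assumptions, not the panic.in script: the `Experiment` class, the `port_enabled` flag (standing in for the snmpit port change), and the guard that only an "active" experiment can be paniced are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Experiment:
    state: str = "active"
    panic_bit: bool = False
    panic_date: Optional[datetime] = None
    port_enabled: bool = True  # stands in for the firewall node's switch port

def panic(exp, now=None):
    """Cut off the firewall port and lock the experiment in 'paniced'."""
    if exp.state != "active":  # assumed precondition
        raise ValueError("can only panic an active experiment")
    exp.port_enabled = False
    exp.panic_bit = True
    exp.panic_date = now or datetime.now()
    exp.state = "paniced"

def unpanic(exp):
    """Reverse of panic(), as with the -r option."""
    if exp.state != "paniced":
        raise ValueError("experiment is not paniced")
    exp.state = "active"
    exp.panic_bit = False  # the recorded date is left alone in this sketch
    exp.port_enabled = True

def can_swap(exp, is_admin):
    """A paniced experiment may be swapped only in admin mode."""
    return exp.state != "paniced" or is_admin

exp = Experiment()
panic(exp, now=datetime(2004, 12, 16))
print(exp.state, can_swap(exp, is_admin=False), can_swap(exp, is_admin=True))
```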
  7. 14 Dec, 2004 1 commit
  8. 13 Dec, 2004 1 commit
  9. 12 Dec, 2004 1 commit
  10. 11 Dec, 2004 2 commits
  11. 10 Dec, 2004 4 commits
  12. 09 Dec, 2004 5 commits
    • Leigh B. Stoller's avatar
      By the perversion of TCL, I have hidden the details: · 1c0efe2c
      Leigh B. Stoller authored
      	source tb_compat.tcl
      	set ns [new Simulator]
      
      	tb-elab-in-elab 1
      	tb-set-inner-elab-eid two-simple
      	tb-set-security-level Red
      
      	$ns run
      
      tbsetup/ns2ir/elabinelab.ns has all the goo, which is sourced from the
      NS run subroutine, using "uplevel 1" so that the context is correct.
      You can of course include your own goo, in which case the default goo
      will be skipped.
      1c0efe2c
    • Timothy Stack's avatar
      · 1f16a276
      Timothy Stack authored
      Make the dots move on the robot map web page:
      
      	* configure, configure.in: Add robots/emc/loclistener.
      
      	* event/lib/event.h, event/lib/event.c: Add some helper functions
      	for sending events and parsing args.
      
      	* event/lib/tbevent.py.tail, event/lib/tbevent.py: Add support for
      	clients that register using keyfiles.
      
      	* robots/emc/GNUmakefile.in: Install loclistener on boss.
      
      	* robots/emc/emcd.h, robots/emc/emcd.c: Send update events every
      	two seconds with the node's location.  Fill out a little more of
      	the event callback, not sure what to do with the requested
      	destination though.  Add some code to the vmc callback to store
      	position updates.  Changed the config file format to also include
      	the vname of the robot.
      
      	* robots/emc/loclistener.in: Listen for NODE MODIFY events with
      	coordinates and update the database accordingly.  Kinda sucks, but
      	it works.
      
      	* robots/emc/test_emcd.config: Add vnames to the robots to reflect
      	change in the config file format.
      
      	* tbsetup/ns2ir/node.tcl: Add nodes to the virt_agents table.
      1f16a276
    • Leigh B. Stoller's avatar
      Okay, here is the current development approach for dealing with · 5a4e9df8
      Leigh B. Stoller authored
      ElabinElab experiments that wrap another experiment, either firewalled
      or not. This instead of my security level stuff, that I decided was
      too much of a pain for the user, at least for now. New NS syntax:
      
      	tb-set-inner-elab-eid two-simple
      
      In the ElabinElab file, sets the name of an existing experiment in the
      same project. Experiment is parsed, and after the parse we notice in
      tbprerun that we have an inner eid, so we reparse the NS file, only
      this time we pass in the maximum number of nodes needed by the inner
      eid (tbprerun now computes min/max nodes at prerun time, instead of
      later as first part of swapin). This number is used to allocate the
      appropriate number of inner experimental nodes. Why do it this way?
      Cause the NS parser is the only tool we have for generating the virt
      topology, and I do not want to go down the path of inventing a new
      frontend.
      
      Anyway, after the reparse, we now have the proper number of nodes in
      the wrapper experiment. Now it's simply a matter of copying over the
      type and fixnode info from the inner experiment to the outer
      experiment.  Why? So that when the outer experiment is swapped in, it
      gets the nodes (of the right type/fixnode) that the inner experiment
      is going to want later, when it is swapped in by the inner emulab!
      
      Another approach would be to make elabinelab and elabinelab_eid
      options to batchexp (and thus the web form and XMLRPC interface) so
      that we can avoid the double parse. I suspect people do not want more
      crap on the web form, so I did not do it this way.
      5a4e9df8
    • Leigh B. Stoller's avatar
      When the elabinelab experiment is also firewalled, ssh into the · 12c44d00
      Leigh B. Stoller authored
      firewall node and disable the rules during the inner elab setup, and
      then turn them back on after the inner boss has rebooted. In the case
      that an experiment is to be launched inside, launch the experiment
      async and then turn rules back on. Technically, this should be proxied
      through the firewall instead of directly, but this is okay for now.
      
      As for experiment teardown, I am not doing anything yet since the
      closed firewall lets ssh through, and that's all I need to tear down the
      inner elab.
      
      Also during teardown, if DHCPD cannot be killed on inner boss, then
      skip rest of the steps and return okay so that the rest of experiment
      teardown proceeds (if need be, inner nodes will be power cycled). Not
      being able to kill DHCPD can happen for lots of reasons (like,
      experiment never set up in the first place).
      12c44d00
    • Leigh B. Stoller's avatar
      Trivial print statement change. · e1a4917a
      Leigh B. Stoller authored
      e1a4917a
  13. 08 Dec, 2004 1 commit
  14. 07 Dec, 2004 3 commits
    • Leigh B. Stoller's avatar
      A number of changes: · 261b35fe
      Leigh B. Stoller authored
      * Always run assign_wrapper using -t mode. This just runs the top file
        stuff, and writes the min/max nodes into the DB.
      
      * Then look at the security level for the experiment, and if orange or
        red, create a parallel elabinelab experiment to run it in. This is a
        completely new experiment in addition to the original. The two
        experiments are linked with some DB state so we know what experiment
        to fire off inside the inner elab. I am using a template NS file and
        passing in the number of nodes computed in the previous step above.
        The template includes the firewall rules.
      
        This is quite hokey. It should be more invisible to the user.
      
        I have not dealt with yellow (just a firewall).
      
      * I added some stats code so that we update the experiment_stats
        record with the elabinelab status and security level.
      
      * Cleanup how errors were handled and get rid of silly duplicated
        code.
      261b35fe
    • Leigh B. Stoller's avatar
      * After rebooting the inner nodes, ssh into the inner boss and run · dd3b8989
      Leigh B. Stoller authored
        utility script to wait for them to reboot and reach PXEWAIT. This
        indicates inner emulab is really ready.
      
      * When an inner experiment is defined (elabinelab_eid in experiments
        table) fire that experiment off by doing an ssh into inner boss. I
        am currently doing this with -w (wait mode) but eventually will need
        to do it async for experiments in which the control net is turned
        off. Also, not actually swapping experiment in yet since multicast
        and frisbee are still broken inside.
      
      * Add -k mode for cleaning up. The intent of this is to avoid power
        cycling all the nodes cause outer elab cannot reboot or ipod them.
        Goes like this:
      
        * Clear the inner_elab_role for experiment's nodes from the reserved
          table.
      
        * Clear def_boot_osid,next_boot_osid,temp_boot_osid for nodes. This
          is bogus cause os_select whines about doing this, but the point is
          to make sure that all nodes will go into PXEWAIT when they reboot.
          We could have them go into MFS, but that's bound to cause problems
          if inner elab has a lot of nodes (remember, cannot trust what is
          on disk). This needs more thought.
      
        * Regen and restart outer dhcpd. Nodes will become part of outer
          emulab on next boot cycle.
      
        * SSH into inner boss and kill inner DHCPD so that there will not be
          any DHCPD responses on inner control network.
      
        * SSH into inner boss and have it reboot all inner nodes.
      
        * Wait for node to reach PXEWAIT.
      
        The above needs more thought wrt firewalled experiments and isolated
        control network.
      
      * Kill off some old MFS copy code since we now get those direct from
        website.
      dd3b8989
    • Mike Hibler's avatar
      If osload part of swapin fails and there is a firewall involved, it is likely · f336fe42
      Mike Hibler authored
      that the firewall rules are preventing essential communication and causing the
      failure, so don't retry.
      
      We should probably only do this if the user has specified additional
      firewall rules.  But right now, I may screw up the default rules too!
      f336fe42
  15. 06 Dec, 2004 4 commits