1. 26 May, 2020 1 commit
    • What started as a small project to support B/LAGGs on stitching links,
      but developed into a giant BAGG of problems: · 02a1c736
      Leigh B Stoller authored
      
      For the first part, the complication stems from representing stitching
      links as a fake node with an interface. In general this is fine since up
      to now, every stitching link has been a plain wire, and so snmpit does
      the right thing because the port is in tagged mode. But this breaks down
      when the link is actually an aggregate (LAG/BAGG) since snmpit needs to
      take a different path for that, setVlanOnTrunks2(), which operates on all
      of the trunk links between switches, and knows how to deal with link
      aggregation. Trying to convince snmpit to handle fake switches and
      trunks with only one end seemed like a bad idea. So I opted for adding
      a LAG field to the interfaces table so we can mark those interfaces. And
      I changed snmpit_stack to look for those ports, and redirect them onto
      the trunk path.
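
A minimal sketch of the redirect described above, assuming each port is a dict with a hypothetical `lag` flag mirroring the new interfaces-table field. The real logic lives in the Perl module snmpit_stack; the function and key names here are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple


def split_ports_by_lag(ports: Iterable[Dict]) -> Tuple[List[str], List[str]]:
    """Separate ordinary ports from ports whose interface row is marked as a LAG.

    The "name" and "lag" keys are assumptions made for this sketch.
    """
    plain: List[str] = []
    lag: List[str] = []
    for port in ports:
        # Marked ports would be handed to the trunk path instead of the
        # per-port path; everything else is handled as before.
        (lag if port.get("lag") else plain).append(port["name"])
    return plain, lag
```

The first list would go down the normal per-port path and the second would be handed to the trunk code.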
      
      This worked great on our Dell switches, but not on scidmz (an HP), which
      was strange because it was failing in an identical situation that seemed
      to work fine on our Moonshot HP (bighp1).
      
      Many hours later ... we determined that Version 7 firmware has a
      different ifindex mapping for BAGGs, and that we have been lucky not to
      have hit that problem on the Moonshot cluster. Many more hours later,
      Kirk discovered that the very recent firmware update to scidmz resulted
      in snmp no longer being able to change the membership of BAGGs. Holy Bat
      BAGG!
      
      The best alternative was to use the libNetconf module that is already
      used to speak CLI to the switch for OpenFlow configuration. Just
      manipulate the BAGG via the CLI; the commands are pretty simple.
      
      Well, that didn't quite work because scidmz does not allow password-based
      authentication (at some point it might have), and the ssh key that
      scidmz does accept is in /root/.ssh/, and snmpit runs as the user. No
      problem, let's just add another key pair and stick that in /usr/testbed/etc
      where the user can access it. But ssh will not allow a 644 private
      key file to be used. So ... copy that file to /tmp (so that the
      user owns it), chmod it to 600, and pass that filename down into the
      libNetconf module, which has been changed to optionally use an ssh key
      file.
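
The copy-and-chmod workaround can be sketched as follows. This is an illustration in Python rather than the Perl the testbed code uses, and the function name and temp-file prefix are made up for the example.

```python
import os
import shutil
import tempfile


def private_copy_of_key(key_path: str) -> str:
    """Copy a world-readable key to a temp file owned by the caller, mode 0600.

    Returns the path of the copy, which can then be given to ssh.
    """
    # mkstemp creates the file readable and writable only by the creating user.
    fd, tmp_path = tempfile.mkstemp(prefix="sshkey-")
    with os.fdopen(fd, "wb") as dst, open(key_path, "rb") as src:
        shutil.copyfileobj(src, dst)
    # Set the mode explicitly so the intent is visible to the reader.
    os.chmod(tmp_path, 0o600)
    return tmp_path
```

The caller is responsible for deleting the copy once the ssh session is done.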
      
      So to sum up, there are two new node_attributes set on the stitching
      aggregate for scidmz:
      
      * snmpit_badBAGG: which says the firmware no longer allows snmpit to
        manipulate BAGGs.
      
      * snmpit_sshkey: which is the path to the ssh key for libNetConf,
        instead of password based authentication.
      
      The bottom line is ... do not upgrade our other HP switches.
  2. 20 Apr, 2020 1 commit
  3. 19 Apr, 2020 1 commit
  4. 15 Apr, 2020 2 commits
  5. 07 Apr, 2020 1 commit
    • The other part of what Mike did in commit bab624d5 (PREPARE=2): · d854028a
      Leigh B Stoller authored
      If a node (or nodetype) has the delayreloadtillalloc attribute set, then
      in nfree the partitions are cleared and the node is powered off. The
      reload is delayed until the next time the node is allocated to an
      experiment.
      
      Add -Z option to os_load, which is like -P, but applies the clear
      metadata option to all disks.
      
      os_setup and os_load have been changed accordingly.
  6. 26 Mar, 2020 1 commit
  7. 24 Mar, 2020 1 commit
  8. 18 Mar, 2020 1 commit
    • Fixed the triple problem. · 732f9a52
      Leigh B Stoller authored
      This was really messing up the battery monitor. What was really
      happening was that we were leaving unconsumed output from the Arduino,
      so the sync code was never really syncing; it was just spitting the
      command over and over and reading only one result.
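
As an illustration of the failure mode (not the actual fix, which is in the Perl monitor code), the sketch below drains stale output before sending a single sync command. `FakeDevice` and the `SYNC` command string are stand-ins invented for the example.

```python
from collections import deque


class FakeDevice:
    """Stand-in for a serial device; queued lines are pending, unread output."""

    def __init__(self, pending):
        self.pending = deque(pending)
        self.sent = []

    def write(self, command):
        self.sent.append(command)
        # The fake device answers every command exactly once.
        self.pending.append("OK " + command)

    def readline(self):
        # Return the next pending line, or None when nothing is waiting.
        return self.pending.popleft() if self.pending else None


def sync(device, command="SYNC"):
    """Discard leftover output, then send one command and read its one reply."""
    while device.readline() is not None:
        pass
    device.write(command)
    return device.readline()
```

Without the drain loop, the first `readline` after the write would return a stale line rather than the reply to the sync command.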
      
      Also added "use strict" and fixed all the warnings/errors.
  9. 16 Mar, 2020 2 commits
  10. 11 Mar, 2020 2 commits
  11. 05 Mar, 2020 1 commit
  12. 29 Feb, 2020 1 commit
    • Don't report DHCP on the mgmt interface as a booting event. · 8c797bc1
      Mike Hibler authored
      On machines that DHCP on their management interfaces (e.g., at UMass),
      this can cause seemingly random reboots of machines that are free. They
      (actually stated) get an unexpected BOOTING event which starts a
      timeout waiting for further state transitions that never happen, and
      eventually stated power cycles the nodes.
  13. 11 Feb, 2020 1 commit
  14. 06 Feb, 2020 1 commit
  15. 09 Jan, 2020 1 commit
  16. 07 Jan, 2020 1 commit
  17. 24 Dec, 2019 1 commit
  18. 12 Dec, 2019 2 commits
  19. 03 Dec, 2019 1 commit
  20. 30 Oct, 2019 1 commit
  21. 23 Oct, 2019 1 commit
    • Add a table for the S4048. · 177a3799
      Mike Hibler authored
      Same as S3048, so we could just make 4048 switches be force10-3048 in the DB.
      But I will forget that 15 minutes from now, so let's make an explicit type.
  22. 14 Oct, 2019 1 commit
    • Finish off the "copystatus" blockstore server command. · 2c0cdccf
      Mike Hibler authored
      A command to get the status of an ongoing copy. Will also let you
      know if a copy has aborted for some reason. However, lots more work
      will be required before we can gracefully recover (i.e., continue)
      one of those, as any failure on the server side causes the boss scripts
      either to tear down all the DB state or to leave the blockstore in a
      "failed" state that requires a great deal of manual effort for resurrection.
  23. 30 Sep, 2019 1 commit
    • Implement an "on server" strategy for copying persistent datasets. · fce7c7c7
      Mike Hibler authored
      This is implemented as a variant of createdataset. If you do:
      
          createdataset -F pid/old pid/new
      
      It will create a new dataset, initializing it with the contents of old.
      The new dataset will of course have the same size, type, and filesystem type
      (if any). Right now the old and new both have to be in the same project, and
      new gets placed in the same pool on the same server (i.e., this is a local
      "zfs send | zfs recv" pipeline).
      
      Implementing copy as a variant of create will hopefully make it easy for
      Leigh in the portal interface, as he doesn't have to treat it any differently
      from a normal create: fire it off in the background and wait until the lease
      state becomes "valid".
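
The "fire it off and wait" pattern amounts to a polling loop, sketched here under stated assumptions: `get_state` is a hypothetical callable returning the current lease state, and treating "failed" as a terminal state is a guess, not something the commit specifies.

```python
import time


def wait_for_valid(get_state, poll_seconds=30.0, timeout_seconds=None, sleep=time.sleep):
    """Poll get_state() until it returns "valid".

    Returns True on "valid", False on the assumed terminal state "failed"
    or once timeout_seconds of waiting has accumulated.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == "valid":
            return True
        if state == "failed":  # assumption: a terminal failure state exists
            return False
        if timeout_seconds is not None and waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Because a copy may run for hours or days, a caller would likely leave `timeout_seconds` unset or make it very large.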
      
      Since a copy could take hours or even days, there are plenty of opportunities
      for failure that I have not considered too much yet, e.g., the storage server
      rebooting in the middle or boss rebooting in the middle. These things could
      happen already, but we have just made the window of opportunity much larger.
      
      Anyway, this mechanism can serve as the basis for creating persistent datasets
      from clones or other ephemeral datasets.
  24. 19 Sep, 2019 2 commits
  25. 16 Sep, 2019 1 commit
  26. 10 Sep, 2019 1 commit
  27. 29 Aug, 2019 1 commit
  28. 19 Aug, 2019 1 commit
  29. 12 Aug, 2019 1 commit
  30. 06 Aug, 2019 1 commit
  31. 01 Aug, 2019 1 commit
    • Changes to the reservation system to support reserving specific nodes: · f21a3123
      Leigh B Stoller authored
      A new flag in the nodes table marks a node as being independently
      reservable by the reservation system. In general, the reservation system
      treats the node type as an opaque string, so why not make it a node_id?
      The nodes table flag is used in various queries to distinguish between
      nodes that are reserved as a type and nodes that are individually reserved.
      Everything else pretty much falls into place.
      
      Minor changes to mapper admission control to look for the use of a
      specific node that is reserved to someone else. Also minor changes in
      ptopgen to remove reserved nodes from the ptop file when they are reserved
      to a different project.
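
A rough sketch of the ptopgen filtering rule, assuming a hypothetical mapping from node_id to the project holding an individual reservation. The real code works from DB queries; the names here are illustrative.

```python
def visible_nodes(nodes, reservations, project):
    """Drop nodes that are individually reserved to a different project.

    nodes:        iterable of node_id strings
    reservations: dict mapping node_id -> reserving project (hypothetical shape)
    project:      the project the ptop file is being generated for
    """
    return [n for n in nodes if reservations.get(n) in (None, project)]
```

Unreserved nodes and nodes reserved to the requesting project stay in the list; everything else is removed.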
  32. 26 Jun, 2019 1 commit
    • add power support for IBM BladeCenter chassis · f66517f5
      chuck cranor authored
      Add power support for the IBM BladeCenter chassis (power_ibmbc.pm).
      This is the chassis used in the old Roadrunner cluster at LANL.  Each
      chassis has 14 blades in it.  The management IP API is accessed from
      boss via ssh.  An ssh keypair should be set up to allow for passwordless
      ssh access.  We assume the admin has installed the keypair on boss
      (in /usr/testbed/etc/{ibmbc,ibmbc.pub}) and on each chassis for
      the standard "USERID" account.  The key files should be owned by
      an account like "operator" to avoid ssh complaining about key file
      permissions in some cases.
      
      The module will end up running commands like:
      
            ssh USERID@chassis-mm power -on -T 'blade[1]'
            ssh USERID@chassis-mm power -off -T 'blade[1]'
            ssh USERID@chassis-mm power -cycle -T 'blade[1]'
      
      (we'll add "-i /usr/testbed/etc/ibmbc" to "ssh" if the key file is present)
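
The command construction can be sketched like this. It is a Python illustration of the behavior described above, not a port of power_ibmbc.pm, and the function name is invented for the example.

```python
import os

KEY_FILE = "/usr/testbed/etc/ibmbc"  # key path named in the commit message


def power_command(chassis, blade, action, key_file=KEY_FILE):
    """Build the ssh argv for a blade power operation.

    action is one of "on", "off", or "cycle"; "-i key_file" is added only
    when the key file exists.
    """
    if action not in ("on", "off", "cycle"):
        raise ValueError("unsupported power action: %r" % (action,))
    argv = ["ssh"]
    if os.path.exists(key_file):
        argv += ["-i", key_file]
    # Passing an argv list means no local shell is involved, so the
    # blade[N] argument needs no quoting here.
    argv += ["USERID@" + chassis, "power", "-" + action, "-T", "blade[%d]" % blade]
    return argv
```

The resulting list can be handed to `subprocess.run` as is.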
      
      Using this requires the following "mysql tbdb" commands on boss:
        one-time operation:
           insert into node_types (class,type) values ('power', 'ibmbc');
      
        per-chassis operations:
           # assumes that "rr1" is blade1 of chassis "bch1"
           insert into nodes (node_id,type,phys_nodeid,role,priority,status,
                              eventstate,op_mode,allocstate)
           values ('bch1', 'ibmbc', 'bch1', 'powerctrl', 10001, 'down',
                              'ISUP', 'NONE', 'FREE_DIRTY');
      
           # adds IP of the chassis management module
           insert into interfaces (node_id,IP,mask,interface_type,iface,role) values
                ('bch1', '172.19.148.61', '255.255.240.0','','eth0','other');
      
        per-blade operation:
           insert into outlets (node_id,power_id,outlet)
                  values ('rr1', 'bch1', 1);   # outlet 1==blade1, etc.
  33. 19 Jun, 2019 2 commits
    • Leigh B Stoller's avatar
      dadc32bc
    • Further tweaks to jumbo frames code. · 571b4a14
      Mike Hibler authored
      Now use a sitevar, general/allowjumboframes, rather than MAINSITE
      to determine whether we should even attempt any jumbo frames magic.
      
      Use a per-link/lan setting rather than the hacky per-experiment
      setting to let the user decide if they want to use jumbos. In NS
      world, we already had a link/lan method (set-settings) to specify
      virt_lan_settings which is where it winds up now.
      
      Client-side fixes to make jumbos work with vnodes.
  34. 13 Jun, 2019 1 commit