    Scaling work on blockstore setup and teardown along with a couple of bug fixes. (commit eb0ff3b8)
    Mike Hibler authored
    Previously we were failing on an experiment with 40 blockstores. Now we can
    handle at least 75. We still fail at 100 due to client-side timeouts, but that
    would be addressed with a longer timeout (which involves making new images).
    
    Attacked this on a number of fronts.
    
    On the infrastructure side:
    
    - Batch destruction calls. We were making one ssh call per blockstore
      to tear these down, leaving a lot of dead time. Now we batch them up
      in groups of 10 per call, just like creation (see the batching sketch
      after this list). They still serialize on the One True Lock, but
      switching is much faster.
    
    - Don't create new snapshots after destroying a clone. This was a bug.
      If a persistent blockstore was used read-write, we forced a new
      snapshot even when it was an RW clone. This resulted in two calls from
      boss to the storage server and two API calls: one to destroy the old
      snapshot and one to create the new one.
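
    The batching change in the first item amounts to chunking the
    per-blockstore teardown work and issuing one ssh call per chunk of 10.
    Below is a minimal Python sketch of that idea only; the helper name and
    the "destroy-blockstore" remote command are hypothetical stand-ins, not
    the actual Emulab code or command.

        import subprocess

        BATCH_SIZE = 10  # same group size already used for creation

        def teardown_blockstores(server: str, blockstore_ids: list[str]) -> None:
            # One ssh call per batch of up to BATCH_SIZE blockstores,
            # instead of one ssh call per blockstore.
            for i in range(0, len(blockstore_ids), BATCH_SIZE):
                batch = blockstore_ids[i:i + BATCH_SIZE]
                # "destroy-blockstore" is a placeholder for the real remote command.
                subprocess.run(
                    ["ssh", server, "destroy-blockstore " + " ".join(batch)],
                    check=True)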
    
    Client-side:
    
    - Increase the timeout on first attach to iSCSI. One True Lock in
      action again. In the case where the storage server has a lot of
      blockstores to create, they would serialize, with each blockstore
      taking 8-10 seconds to create. Meanwhile the node attaching to the
      blockstore would time out after two minutes in the "login" call.
      Normally we would not hit this, as the server would probably only
      be setting up 1-3 blockstores and the nodes would most likely first
      need to load an image and do a bunch of other boot-time operations
      before attempting the login. There is now a loop around the iSCSI
      login operation that will try up to five times (10 minutes total)
      before giving up (see the sketch after this list). This is completely
      arbitrary, but making it much longer would just trigger the node
      reboot timeout anyway.
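
    Conceptually, the retry added above is a bounded loop around the login
    call, which already has its own roughly two-minute timeout, so five
    attempts gives the ~10 minute ceiling. This is a minimal Python sketch
    of the pattern, not the actual client code; do_login is a hypothetical
    stand-in for the real iSCSI login invocation.

        from typing import Callable

        def login_with_retries(do_login: Callable[[], bool],
                               max_attempts: int = 5) -> bool:
            # do_login() stands in for the real iSCSI login; each attempt is
            # expected to block for up to its own ~2 minute timeout, so five
            # attempts gives roughly the 10 minutes described above.
            for _ in range(max_attempts):
                if do_login():
                    return True
            return False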
    
    Server-side:
    
    - Cache the results of the libfreenas freenasVolumeList call.
      The call can take a second or more, as it can make up to three API
      calls plus a ZFS CLI call. On blockstore VM creation, we were calling
      this twice through different paths. Now the second call will use
      the cached results. The cache is invalidated whenever we drop the
      global lock or make a POST-style API call (one that might change the
      returned values). See the caching sketch after this list.
    
    - Get rid of gratuitous synchronization. There was a stub vnode function
      on the creation path that was grabbing the lock, doing nothing, and then
      releasing the lock. This caused all the vnodes to pile up waiting on the
      lock, be released, and then pile up again.
    
    - Properly identify all the clones of a snapshot so that they all get
      torn down correctly. The ZFS get command we were using to read the
      "clones" property of a snapshot will return at most 1024 bytes of
      property value. When the property is a comma-separated list of ZFS
      names, you hit that limit at about 50-60 clones (given our naming
      conventions). Now we have to do a get on every volume and look at the
      "origin" property, which identifies the snapshot the volume was cloned
      from (see the sketch after this list).
    
    - Properly synchronize overlapping destruction/setup. We call snmpit to
      remove switch VLANs before we start tearing down nodes. This potentially
      allows the VLAN tags to become free for reuse by other blockstore
      experiments before we have torn down the old vnodes (and their VLAN
      devices) on the storage server. This was creating chaos on the server.
      Now we identify this situation and stall any new creations until the
      previous VLAN devices go away (see the sketch after this list). Again,
      this is an arbitrary wait/timeout (10 minutes now) and can still fail.
      But it only comes into play if a new blockstore allocation comes
      immediately on the heels of a large deallocation.
    
    - Wait longer to get the One True Lock during teardown. Failure to get
      the lock at the beginning of the teardown process would result in all
      the iSCSI and ZFS state getting left behind, but all the vnode state
      being removed. Hence, a great deal of manual cleanup on the server
      was required. The solution? You guessed it, another arbitrary timeout,
      longer than before.
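
    The freenasVolumeList caching above is essentially memoization with
    explicit invalidation. The sketch below is a Python approximation of the
    pattern only, not the actual libfreenas code; fetch is a hypothetical
    stand-in for the expensive API/CLI work.

        # Cache the expensive volume-list results; drop the cache whenever
        # the global lock is released or a state-changing request is made.
        _volume_cache = None

        def freenas_volume_list(fetch):
            """Return cached results if present, else call fetch() and cache."""
            global _volume_cache
            if _volume_cache is None:
                _volume_cache = fetch()   # up to 3 API calls + a ZFS CLI call
            return _volume_cache

        def invalidate_volume_cache():
            """Called when the global lock is dropped, or after a POST-style
            API call that might change what the list would return."""
            global _volume_cache
            _volume_cache = None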
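
    For the clone-enumeration fix, the approach is to read the "origin"
    property of every volume and collect those whose origin matches the
    snapshot being torn down, instead of reading the snapshot's truncated
    "clones" property. A Python sketch of that approach using the ZFS CLI;
    the pool name is a placeholder and the direct zfs invocation is an
    assumption, not necessarily how the server-side code does it.

        import subprocess

        def clones_of(snapshot: str, pool: str = "tank") -> list[str]:
            # List every volume's "origin" property; a clone's origin names
            # the snapshot it was created from, e.g. "tank/foo@snap-1234".
            out = subprocess.run(
                ["zfs", "get", "-H", "-r", "-t", "volume",
                 "-o", "name,value", "origin", pool],
                check=True, capture_output=True, text=True).stdout
            clones = []
            for line in out.splitlines():
                name, origin = line.split("\t")
                if origin == snapshot:
                    clones.append(name)
            return clones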
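
    The stall on overlapping destruction/setup is a bounded poll: before
    setting up a new blockstore, wait (up to the arbitrary 10 minutes) for
    any leftover VLAN device with the same tag to disappear. A Python sketch
    of that wait; device_exists is a hypothetical stand-in for however the
    storage server checks whether the old device is still present.

        import time
        from typing import Callable

        def wait_for_vlan_teardown(device_exists: Callable[[], bool],
                                   timeout: int = 600,
                                   poll: int = 10) -> bool:
            # Poll until the leftover VLAN device is gone, or give up after
            # `timeout` seconds (10 minutes, matching the commit text).
            deadline = time.time() + timeout
            while device_exists():
                if time.time() >= deadline:
                    return False   # still there; the new setup will fail
                time.sleep(poll)
            return True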