1. 08 Jan, 2015 12 commits
  2. 06 Jan, 2015 22 commits
  3. 05 Jan, 2015 6 commits
    • David S. Miller's avatar
      Merge branch 'rt_cong_ctrl' · a918eb9f
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      net: allow setting congctl via routing table
      
      This is the second part of our work and allows for setting the congestion
      control algorithm via routing table. For details, please see individual
      patches.
      
      Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then
      merging net into net-next, for example, and following up with the remaining
      feature patches wrt dependencies.
      
      Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
      
      Patch for iproute2 is available under [1], but will be reposted with along
      with the man-page update when this set hits net-next.
      
        [1] http://patchwork.ozlabs.org/patch/418149/
      
      
      
      Thanks!
      
      v2 -> v3:
       - Added module auto-loading as suggested by David Miller, thanks!
        - Added patch 2 for handling possible sleeps in fib6
        - While working on this, we discovered a bug, hence fix in patch 1
        - Added auto-loading to patch 4
       - Rebased, retested, rest the same.
      v1 -> v2:
       - Very sorry, I noticed I had decnet disabled during testing.
         Added missing header include in decnet, rest as is.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a918eb9f
    • Daniel Borkmann's avatar
      net: tcp: add per route congestion control · 81164413
      Daniel Borkmann authored
      
      
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given its actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      to enforce the given algorithm so that applications in user space
      would not be allowed to overwrite that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      81164413
    • Daniel Borkmann's avatar
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann authored
      
      
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ea697639
    • Daniel Borkmann's avatar
      net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann authored
      
      
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      For a per route congestion control possibility, our aim is to store
      a unique u32 key identifier into dst metrics, which can then be
      mapped into a tcp_congestion_ops struct. We argue that having a
      RTAX key entry is the most simple, generic and easy way to manage,
      and also keeps the memory footprint of dst entries lower on 64 bit
      than with storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping in RCU. While doing so, we stumbled upon
      the issue that due to the nature of dynamic key distribution, it
      just so happens, arguably in very rare occasions, that excessive
      module loads and unloads can lead to a possible reuse of previously
      used key space. Thus, previously stale keys in the dst metric are
      now being reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk FIBs on the actually rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that's just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded
      worst case perhaps 15, so a key collision possibility is extremely
      low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded, the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload time it will switch back to the expected algorithm
      transparently.
      
      Joint work with Florian Westphal.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5c6a8ab
    • Daniel Borkmann's avatar
      net: tcp: refactor reinitialization of congestion control · 29ba4fff
      Daniel Borkmann authored
      
      
      We can just move this to an extra function and make the code
      a bit more readable, no functional change.
      
      Joint work with Florian Westphal.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29ba4fff
    • Florian Westphal's avatar
      net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal authored
      
      
      Do the nla validation earlier, outside the write lock.
      
      This is needed by followup patch which needs to be able to call
      request_module (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e715b6d3