1. 06 Jan, 2015 6 commits
  2. 05 Jan, 2015 22 commits
    • Merge branch 'rt_cong_ctrl' · a918eb9f
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      net: allow setting congctl via routing table
      
      This is the second part of our work and allows for setting the congestion
      control algorithm via routing table. For details, please see individual
      patches.
      
      Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then
      merging net into net-next, for example, and following up with the remaining
      feature patches wrt dependencies.
      
      Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
      
      Patch for iproute2 is available under [1], but will be reposted along
      with the man-page update when this set hits net-next.
      
        [1] http://patchwork.ozlabs.org/patch/418149/
      
      
      
      Thanks!
      
      v2 -> v3:
       - Added module auto-loading as suggested by David Miller, thanks!
        - Added patch 2 for handling possible sleeps in fib6
        - While working on this, we discovered a bug, hence fix in patch 1
        - Added auto-loading to patch 4
       - Rebased, retested, rest the same.
      v1 -> v2:
       - Very sorry, I noticed I had decnet disabled during testing.
         Added missing header include in decnet, rest as is.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add per route congestion control · 81164413
      Daniel Borkmann authored
      
      
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP: for example, applications can easily serve internal
      traffic/dsts with DCTCP and external ones with CUBIC. Other scenarios
      would also allow for utilizing e.g. long-lived, low priority
      background flows for certain destinations/routes while still
      allowing normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given it's actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter
      enforces the given algorithm, so that applications in user space
      are not allowed to override that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
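
      For context, the per-socket choice that a locked route is described as
      overriding is normally made with the TCP_CONGESTION socket option. The
      following is a minimal user space sketch, assuming Linux and glibc; the
      helper names demo_set_congctl and demo_get_congctl are hypothetical and
      not part of this patch.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: request the named congestion control algorithm
 * for one TCP socket. Returns 0 on success, -1 on error (errno is set). */
int demo_set_congctl(int fd, const char *name)
{
    assert(name != NULL);
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
                      (socklen_t)strlen(name));
}

/* Hypothetical helper: read back the algorithm currently attached to
 * the socket. The result is always NUL-terminated on success. */
int demo_get_congctl(int fd, char *buf, size_t buflen)
{
    socklen_t len = (socklen_t)buflen;

    assert(buf != NULL && buflen > 0);
    memset(buf, 0, buflen);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) < 0)
        return -1;
    buf[buflen - 1] = '\0';
    return 0;
}
```

      An application could call demo_get_congctl() after connect() or
      accept() to see which algorithm ended up on the socket for a given
      destination.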
      
      Joint work with Florian Westphal.
      
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann authored
      
      
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided exposing this implementation detail to user space.
      Instead, we chose the netlink attribute that is exchanged with user
      space to be the actual congestion control algorithm name, similar to
      the setsockopt(2) API, in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot,
      as it should have been stored in RTAX_FEATURES instead. We first
      thought about reusing it for the congestion control key, but that
      brings more complications and/or confusion than it is worth.
      
      Joint work with Florian Westphal.
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann authored
      
      
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      For a per route congestion control possibility, our aim is to store
      a unique u32 key identifier into dst metrics, which can then be
      mapped into a tcp_congestion_ops struct. We argue that having a
      RTAX key entry is the most simple, generic and easy way to manage,
      and also keeps the memory footprint of dst entries lower on 64 bit
      than with storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping in RCU. While doing so, we stumbled upon
      the issue that, due to the nature of dynamic key distribution, it
      just so happens, arguably on very rare occasions, that excessive
      module loads and unloads can lead to a possible reuse of previously
      used key space. Thus, previously stale keys in the dst metric are
      now being reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk FIBs on the actually rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that's just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded,
      worst case perhaps 15, so a key collision possibility is extremely
      low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded, the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload time it will switch back to the expected algorithm
      transparently.
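
      The name-to-key scheme described above can be sketched in plain C as
      follows. This is an illustration under stated assumptions, not the
      kernel code: a 32-bit FNV-1a hash stands in for jhash, the registry is
      a small fixed array, and every demo_* name is hypothetical. Rejecting a
      duplicate key at registration is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CA_MAX 16

struct demo_ca_ops {
    const char *name;
    uint32_t key;    /* precalculated once, at registration time */
};

static struct demo_ca_ops *demo_ca_list[DEMO_CA_MAX];
static size_t demo_ca_count;

/* Stand-in for jhash: 32-bit FNV-1a over the algorithm name. */
uint32_t demo_ca_name_to_key(const char *name)
{
    uint32_t h = 2166136261u;

    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h;
}

/* Map a key stored in a route metric back to an algorithm. Unknown keys
 * (for example after a module unload) fall back to the given default. */
const struct demo_ca_ops *demo_ca_find_key(uint32_t key,
                                           const struct demo_ca_ops *dflt)
{
    size_t i;

    for (i = 0; i < demo_ca_count; i++)
        if (demo_ca_list[i]->key == key)
            return demo_ca_list[i];
    return dflt;
}

/* Register an algorithm; the key depends only on the name, so it is the
 * same again after an unload and reload. */
int demo_ca_register(struct demo_ca_ops *ca)
{
    assert(ca != NULL && ca->name != NULL);
    ca->key = demo_ca_name_to_key(ca->name);
    if (demo_ca_count == DEMO_CA_MAX ||
        demo_ca_find_key(ca->key, NULL) != NULL)
        return -1;
    demo_ca_list[demo_ca_count++] = ca;
    return 0;
}

void demo_ca_unregister(const struct demo_ca_ops *ca)
{
    size_t i;

    for (i = 0; i < demo_ca_count; i++) {
        if (demo_ca_list[i] == ca) {
            demo_ca_list[i] = demo_ca_list[--demo_ca_count];
            return;
        }
    }
}
```

      Because the key is derived from the name alone, a key left behind in a
      route metric keeps its meaning across unload and reload, which is the
      property the dynamic IDR assignment could not guarantee.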
      
      Joint work with Florian Westphal.
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: refactor reinitialization of congestion control · 29ba4fff
      Daniel Borkmann authored
      
      
      We can just move this to an extra function and make the code
      a bit more readable, no functional change.
      
      Joint work with Florian Westphal.
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal authored
      
      
      Do the nla validation earlier, outside the write lock.
      
      This is needed by a follow-up patch, which must be able to call
      request_module (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5
      Daniel Borkmann authored
      When IPv6 host routes with metrics attached are being added, we fetch
      the metrics store from the dst via COW through dst_metrics_write_ptr(),
      added through commit e5fd387a.
      
      One remaining problem here is that we actually call into inet_getpeer()
      and may end up allocating/creating a new peer from the kmemcache, which
      may fail.
      
      Example trace from perf probe (inet_getpeer:41) where create is 1:
      
      ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
        85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
        85e578 inet_getpeer
        8eb333 ipv6_cow_metrics
        8f10ff fib6_commit_metrics
      
      Therefore, a check for NULL on the return of dst_metrics_write_ptr()
      is necessary here.
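
      The shape of such a check can be sketched in isolation. The types and
      functions below are hypothetical stand-ins (demo_metrics_write_ptr
      models dst_metrics_write_ptr() returning NULL on allocation failure),
      and returning -ENOMEM in that case is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_METRICS_MAX 16

struct demo_dst {
    uint32_t *metrics;   /* NULL models a failed copy-on-write allocation */
};

/* Hypothetical stand-in for dst_metrics_write_ptr(): hands back the
 * writable metrics store, or NULL if none could be obtained. */
uint32_t *demo_metrics_write_ptr(struct demo_dst *dst)
{
    return dst->metrics;
}

/* Store one metric, checking the write pointer before dereferencing it. */
int demo_commit_metric(struct demo_dst *dst, unsigned int idx, uint32_t val)
{
    uint32_t *mp;

    assert(dst != NULL);
    if (idx >= DEMO_METRICS_MAX)
        return -EINVAL;

    mp = demo_metrics_write_ptr(dst);
    if (mp == NULL)
        return -ENOMEM;   /* without this check, mp[idx] would crash */

    mp[idx] = val;
    return 0;
}
```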
      
      Joint work with Florian Westphal.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined · 6cb69742
      Hubert Sokolowski authored
      
      
      Add a check for whether the call to ndo_dflt_fdb_dump is needed.
      Some drivers (e.g. qlcnic or macvlan) that define their own
      ndo_fdb_dump do not expect ndo_dflt_fdb_dump to be called
      unconditionally. Other drivers define their own ndo_fdb_dump
      and don't want ndo_dflt_fdb_dump to be called at all.
      At the same time it is desirable to call the default dump
      function on a bridge device.
      Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
      Add extra checking in br_fdb_dump to avoid duplicate entries
      as now filter_dev can be NULL.
      
      The following filtering tests were performed before the change
      and after the patch was applied, to make sure the results are
      the same and that the patch doesn't break the filtering algorithm.
      
      [root@localhost ~]# cd /root/iproute2-3.18.0/bridge
      [root@localhost bridge]# modprobe dummy
      [root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
      [root@localhost bridge]# brctl addbr br0
      [root@localhost bridge]# brctl addif  br0 dummy0
      [root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
      [root@localhost bridge]# # show all
      [root@localhost bridge]# ./bridge fdb show
      33:33:00:00:00:01 dev p2p1 self permanent
      01:00:5e:00:00:01 dev p2p1 self permanent
      33:33:ff:ac:ce:32 dev p2p1 self permanent
      33:33:00:00:02:02 dev p2p1 self permanent
      01:00:5e:00:00:fb dev p2p1 self permanent
      33:33:00:00:00:01 dev p7p1 self permanent
      01:00:5e:00:00:01 dev p7p1 self permanent
      33:33:ff:79:50:53 dev p7p1 self permanent
      33:33:00:00:02:02 dev p7p1 self permanent
      01:00:5e:00:00:fb dev p7p1 self permanent
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by bridge
      [root@localhost bridge]# ./bridge fdb show br br0
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by port
      [root@localhost bridge]# ./bridge fdb show brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]# # filter by port + bridge
      [root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]#
      
      Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ip_cmsg_csum' · d4253c62
      David S. Miller authored
      
      
      Tom Herbert says:
      
      ====================
      ip: Support checksum returned in cmsg
      
      This patch set allows the packet checksum for a datagram socket
      to be returned in csum data in recvmsg. This allows userspace
      to implement its own checksum over the data; for instance, if an
      IP tunnel were implemented in user space, the inner checksum
      could be validated.
      
      Changes in this patch set:
        - Move checksum conversion to inet_sock from udp_sock. This
          generalizes checksum conversion for use with other protocols.
        - Move IP cmsg constants to a header file and make processing
          of the flags more efficient in ip_cmsg_recv
        - Return checksum value in cmsg. This is specifically the unfolded
          32 bit checksum of the full packet starting from the first byte
          returned in recvmsg
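
      Since the value is described as an unfolded 32 bit checksum, user
      space would typically fold it to 16 bits before comparing it with a
      conventional Internet checksum. A minimal sketch of ones' complement
      folding and addition follows; the demo_* function names are
      hypothetical, and how the value is read from the cmsg is not shown
      here.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit ones' complement sum into 16 bits by adding the upper
 * half into the lower half twice (the second pass absorbs the carry).
 * The result is not inverted. */
uint16_t demo_csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    assert(sum <= 0xffffu);
    return (uint16_t)sum;
}

/* Add two unfolded sums with end-around carry, so partial sums over
 * adjacent regions can be combined before folding. */
uint32_t demo_csum_add32(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;

    return s + (s < b);   /* s < b only if the addition wrapped */
}
```

      For example, folding 0x12345678 gives 0x68ac, and a carry out of the
      lower half is wrapped around rather than lost.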
      
      Tested: Wrote a little server to get checksums in cmsg for UDP and
              verified correct checksum is returned.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert authored
      
      
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP when providing the
      checksum to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert authored
      
      
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP when providing the
      checksum to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: IP cmsg cleanup · c44d13d6
      Tom Herbert authored
      
      
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv so that it no longer goes through the flags
      using a shift; instead, check for each flag with a bitwise 'and'. This
      eliminates both the shift and a conditional per flag check.
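
      One plausible reading of the two dispatch styles is sketched below.
      The DEMO_* flag names and the handler are hypothetical placeholders
      (the real constants are the IP_CMSG_* values mentioned above), and the
      handler only records which flag it was called for.

```c
#include <assert.h>

#define DEMO_PKTINFO (1u << 0)
#define DEMO_TTL     (1u << 1)
#define DEMO_TOS     (1u << 2)
#define DEMO_ALL     (DEMO_PKTINFO | DEMO_TTL | DEMO_TOS)

/* Placeholder handler: records that the given flag was processed. */
static void demo_handle(unsigned int *handled, unsigned int which)
{
    *handled |= which;
}

/* Shift style: test bit 0, shift, and check for "nothing left" after
 * every flag, whether or not that flag was set. */
unsigned int demo_recv_by_shift(unsigned int flags)
{
    unsigned int handled = 0;

    assert((flags & ~DEMO_ALL) == 0);
    if (flags & 1)
        demo_handle(&handled, DEMO_PKTINFO);
    if ((flags >>= 1) == 0)
        return handled;
    if (flags & 1)
        demo_handle(&handled, DEMO_TTL);
    if ((flags >>= 1) == 0)
        return handled;
    if (flags & 1)
        demo_handle(&handled, DEMO_TOS);
    return handled;
}

/* Mask style: test each flag with 'and'; the early-exit check is only
 * reached when the flag was actually set and has been cleared. */
unsigned int demo_recv_by_mask(unsigned int flags)
{
    unsigned int handled = 0;

    assert((flags & ~DEMO_ALL) == 0);
    if (flags & DEMO_PKTINFO) {
        demo_handle(&handled, DEMO_PKTINFO);
        flags &= ~DEMO_PKTINFO;
        if (!flags)
            return handled;
    }
    if (flags & DEMO_TTL) {
        demo_handle(&handled, DEMO_TTL);
        flags &= ~DEMO_TTL;
        if (!flags)
            return handled;
    }
    if (flags & DEMO_TOS)
        demo_handle(&handled, DEMO_TOS);
    return handled;
}
```

      Both functions process the same set of flags for every input; the mask
      version simply does less work per unset flag.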
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert authored
      
      
      Move convert_csum from udp_sock to inet_sock. This allows the
      possibility that we can use convert checksum for different types
      of sockets and also allows convert checksum to be enabled from
      inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel() · 149118d8
      Thomas Graf authored
      
      
      Calling nla_nest_cancel() in a different order than the nesting was
      built up can lead to negative offsets being calculated, which
      results in skb_trim() being called with an underflowed unsigned
      int. Warn if mark < skb->data as it's definitely a bug.
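
      The underflow can be reproduced with ordinary pointer arithmetic. In
      the sketch below, struct demo_skb and both functions are hypothetical
      simplifications; in particular, demo_cancel refuses to trim when the
      mark is invalid, which is a choice of this sketch rather than a
      statement about what the patch does beyond warning.

```c
#include <assert.h>
#include <stddef.h>

struct demo_skb {
    unsigned char *data;   /* start of the message payload */
    unsigned int len;      /* bytes currently in use after data */
};

/* The problematic arithmetic: if mark lies before data, the negative
 * difference turns into a huge unsigned length. */
unsigned int demo_trim_len_unchecked(const struct demo_skb *skb,
                                     const unsigned char *mark)
{
    return (unsigned int)(mark - skb->data);
}

/* Checked variant: detect the bogus mark before computing a length.
 * Like a trim, it only ever shrinks the buffer. */
int demo_cancel(struct demo_skb *skb, const unsigned char *mark)
{
    unsigned int new_len;

    assert(skb != NULL && mark != NULL);
    if (mark < skb->data)
        return -1;   /* this is the condition the patch warns about */

    new_len = (unsigned int)(mark - skb->data);
    if (new_len < skb->len)
        skb->len = new_len;
    return 0;
}
```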
      
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'cxgb4-next' · a515abd7
      David S. Miller authored
      
      
      Hariprasad Shenai says:
      
      ====================
      RDMA/cxgb4/cxgb4vf/csiostor: Cleanup register defines
      
      This series continues to clean up all the macros/register defines related to
      SGE, PCIE, MC, MA, TCAM, MAC, etc. that are defined in t4_regs.h and the
      affected files.
      
      We will post another 1 or 2 series to cover all the macros, so that
      they all follow the same style and are consistent.
      
      The patch series is created against the 'net-next' tree and includes
      patches for the cxgb4, cxgb4vf, iw_cxgb4 and csiostor drivers.
      
      We have included all the maintainers of the respective drivers. Kindly review
      the change and let us know in case of any review comments.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/cxgb4vf/csiostor: Cleanup PL, XGMAC, SF and MC related register defines · 0d804338
      Hariprasad Shenai authored
      
      
      This patch cleans up all PL, XGMAC and SF related macros/register defines
      that are defined in t4_regs.h and the affected files.
      
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/csiostor: Cleanup TP, MPS and TCAM related register defines · 837e4a42
      Hariprasad Shenai authored
      
      
      This patch cleans up all TP, MPS and TCAM related macros/register defines
      that are defined in t4_regs.h and the affected files.
      
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/cxg4vf/csiostor: Cleanup MC, MA and CIM related register defines · 89c3a86c
      Hariprasad Shenai authored
      
      
      This patch cleans up all MC, MA and CIM related macros/register defines that are
      defined in t4_regs.h and the affected files.
      
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/cxgb4vf/csiostor: Cleanup SGE and PCI related register defines · f061de42
      Hariprasad Shenai authored
      
      
      This patch cleans up the remaining SGE related macros/register defines and all PCI
      related ones that are defined in t4_regs.h and the affected files.
      
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDMA/cxgb4/cxgb4vf/csiostor: Cleanup SGE register defines · f612b815
      Hariprasad Shenai authored
      
      
      This patch cleans up all SGE related macros/register defines that are
      defined in t4_regs.h and the affected files.
      
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: support TX batching using skb->xmit_more flag · 5f07b3c5
      Sathya Perla authored
      
      
      This patch uses skb->xmit_more flag to batch TX requests.
      TX is flushed either when xmit_more is false or there is
      no more space in the TXQ.
      
      Skyhawk-R and BEx chips require an even number of wrbs to be posted.
      So, when a batch of TX requests is accumulated, the last header wrb
      may need to be fixed with an extra dummy wrb.
      
      This patch refactors be_xmit() routine as a sequence of be_xmit_enqueue()
      and be_xmit_flush() calls. The Tx completion code is also
      updated to be able to unmap/free a batch of skbs rather than a single
      skb.
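
      The enqueue/flush split and the even-wrb rule can be modelled with a
      few counters. This is a toy simulation under stated assumptions, not
      driver code: the ring size, the per-request wrb limit, the spare slot
      reserved for a dummy wrb, and all demo_* names are hypothetical, and
      completion handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TXQ_LEN  8   /* hypothetical ring size, in wrbs */
#define DEMO_MAX_WRBS 3   /* hypothetical max wrbs per TX request */

struct demo_txq {
    unsigned int used;       /* wrbs occupying the ring */
    unsigned int pending;    /* wrbs enqueued but not yet flushed */
    unsigned int doorbells;  /* number of times hardware was notified */
    unsigned int dummies;    /* dummy wrbs added to keep batches even */
};

/* Room check that keeps one slot spare for a possible dummy wrb. */
static bool demo_txq_has_room(const struct demo_txq *q, unsigned int wrbs)
{
    return q->used + wrbs + 1 <= DEMO_TXQ_LEN;
}

/* Queue the wrbs of one request without notifying the hardware. */
static int demo_xmit_enqueue(struct demo_txq *q, unsigned int wrbs)
{
    if (!demo_txq_has_room(q, wrbs))
        return -1;
    q->used += wrbs;
    q->pending += wrbs;
    return 0;
}

/* Notify the hardware once for the whole batch, padding odd batches. */
void demo_xmit_flush(struct demo_txq *q)
{
    if (q->pending == 0)
        return;
    if (q->pending & 1) {
        q->used++;
        q->dummies++;
    }
    q->doorbells++;
    q->pending = 0;
}

/* Enqueue, then flush when no more packets follow or the ring cannot
 * take another maximum-size request. */
int demo_xmit(struct demo_txq *q, unsigned int wrbs, bool xmit_more)
{
    assert(q != NULL && wrbs > 0 && wrbs <= DEMO_MAX_WRBS);
    if (demo_xmit_enqueue(q, wrbs) < 0) {
        demo_xmit_flush(q);
        return -1;
    }
    if (!xmit_more || !demo_txq_has_room(q, DEMO_MAX_WRBS))
        demo_xmit_flush(q);
    return 0;
}
```

      With xmit_more set, several requests share a single doorbell, which is
      the point of the batching.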
      
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • at86rf230: Constify struct regmap_config · 889ee2c7
      Krzysztof Kozlowski authored
      
      
      The regmap_config struct may be const because it is not modified by the
      driver and regmap_init() accepts a pointer to const.
      
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Jan, 2015 9 commits
  4. 03 Jan, 2015 3 commits
    • Merge branch 'rhashtable-next' · 7beceebf
      David S. Miller authored
      
      
      Thomas Graf says:
      
      ====================
      rhashtable: Per bucket locks & deferred table resizing
      
      Prepares for and introduces per bucket spinlocks and deferred table
      resizing. This allows for parallel table mutations in different hash
      buckets from atomic context. The resizing occurs in the background
      in a separate worker thread while lookups, inserts, and removals can
      continue.
      
      Also modified the chain linked list to be terminated with a special
      nulls marker to allow entries to move between multiple lists.
      
      Last but not least, reintroduces lockless netlink_lookup() with
      deferred Netlink socket destruction to avoid the side effect of
      increased netlink_release() runtime.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Lockless lookup with RCU grace period in socket release · 21e4902a
      Thomas Graf authored
      Defers the release of the socket reference using call_rcu() to
      allow using an RCU read-side protected call to rhashtable_lookup().
      
      This restores behaviour and performance gains as previously
      introduced by e341694e ("netlink: Convert netlink_lookup() to use
      RCU protected hash table") without the side effect of severely
      delayed socket destruction.
      
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: Supports for nulls marker · f89bd6f8
      Thomas Graf authored
      
      
      In order to allow for wider usage of rhashtable, use a special nulls
      marker to terminate each chain. The reason for not using the existing
      nulls_list is that the prev pointer usage would not be valid as entries
      can be linked in two different buckets at the same time.
      
      The 4 nulls base bits can be set through the rhashtable_params structure
      like this:
      
      struct rhashtable_params params = {
              [...]
              .nulls_base = (1U << RHT_BASE_SHIFT),
      };
      
      This reduces the hash length from 32 bits to 27 bits.
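
      The bit budget behind that number can be illustrated with a small
      encoder. The layout below is an assumption made for this sketch: one
      low bit marks the value as a nulls marker instead of a pointer, 4 bits
      hold the base and 27 bits hold the hash, which together fill 32 bits.
      All DEMO_* and demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BASE_BITS  4
#define DEMO_HASH_BITS  27
#define DEMO_BASE_SHIFT DEMO_HASH_BITS
#define DEMO_HASH_MASK  ((1u << DEMO_HASH_BITS) - 1)

/* Build a chain terminator: an odd value that can never be a valid,
 * aligned pointer, carrying the base and the bucket hash above bit 0. */
uintptr_t demo_nulls_marker(uint32_t nulls_base, uint32_t hash)
{
    uint32_t value;

    assert((nulls_base >> DEMO_BASE_SHIFT) < (1u << DEMO_BASE_BITS));
    value = nulls_base + (hash & DEMO_HASH_MASK);
    return ((uintptr_t)value << 1) | 1u;
}

/* A chain walker stops when the next "pointer" has its low bit set. */
bool demo_is_nulls(uintptr_t p)
{
    return (p & 1u) != 0;
}

/* Recover the hash stored in a marker, e.g. to detect that a lookup
 * ended on a different chain than the one it started on. */
uint32_t demo_nulls_hash(uintptr_t p)
{
    return (uint32_t)(p >> 1) & DEMO_HASH_MASK;
}

/* Recover the base bits stored in a marker. */
uint32_t demo_nulls_base(uintptr_t p)
{
    return (uint32_t)(p >> 1) >> DEMO_BASE_SHIFT;
}
```

      A marker built with a base of (1U << 27) and a 27-bit hash decodes back
      to base 1 and the same hash, while ordinary aligned pointers are never
      mistaken for a marker.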
      
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>