1. 27 Jan, 2011 3 commits
  2. 26 Jan, 2011 1 commit
    • net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      David S. Miller authored
      Routing metrics are now copy-on-write.
      
      Initially a route entry points its metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writeability state of the metrics is stored in the low bits of the
      metrics pointer.  We have two bits left to spare if we want to store
      more states.
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writeable cacheline.  This is OK since we always access the
      flags around the same moment that we modify the reference count.
      Signed-off-by: David S. Miller <davem@davemloft.net>
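The pointer tagging and copy-on-write scheme described in this commit can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct route`, `METRICS_MAX`, the flag values and every function name below are assumptions made for illustration, and `malloc` stands in for `kmalloc`. It only shows how a read-only flag can live in the low bits of an aligned pointer and how the first write triggers a private copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define METRICS_MAX       4                /* hypothetical; the kernel uses RTAX_MAX */
#define METRICS_READ_ONLY ((uintptr_t)0x1) /* low bit set: storage is shared */
#define METRICS_FLAGS     ((uintptr_t)0x3) /* two low bits reserved for state */

/* All-zero fallback, playing the role of 'dst_default_metrics'. */
static const uint32_t default_metrics[METRICS_MAX];

struct route {
	uintptr_t metrics; /* pointer and state bits packed into one word */
};

static void route_init(struct route *rt, const uint32_t *shared)
{
	if (!shared)
		shared = default_metrics;
	/* The tagging only works if the two low pointer bits are free. */
	assert(((uintptr_t)shared & METRICS_FLAGS) == 0);
	rt->metrics = (uintptr_t)shared | METRICS_READ_ONLY;
}

static const uint32_t *route_metrics(const struct route *rt)
{
	return (const uint32_t *)(rt->metrics & ~METRICS_FLAGS);
}

static int route_metrics_read_only(const struct route *rt)
{
	return (rt->metrics & METRICS_READ_ONLY) != 0;
}

/* Returns 0 on success, -1 on a bad index or a failed private copy. */
static int route_metric_set(struct route *rt, unsigned int idx, uint32_t val)
{
	if (idx >= METRICS_MAX)
		return -1;
	if (route_metrics_read_only(rt)) {
		uint32_t *copy = malloc(sizeof(default_metrics));

		if (!copy)
			return -1; /* the transient failure the message mentions */
		memcpy(copy, route_metrics(rt), sizeof(default_metrics));
		rt->metrics = (uintptr_t)copy; /* flag bits clear: writable */
	}
	((uint32_t *)(rt->metrics & ~METRICS_FLAGS))[idx] = val;
	return 0;
}

static void route_release(struct route *rt)
{
	/* Only a private copy is owned by the route and may be freed. */
	if (!route_metrics_read_only(rt))
		free((void *)(rt->metrics & ~METRICS_FLAGS));
	rt->metrics = 0;
}
```

Two routes initialised from the same shared array keep sharing it until one of them calls `route_metric_set()`; from then on only that route holds a private, writable copy.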
  3. 24 Jan, 2011 4 commits
  4. 23 Jan, 2011 5 commits
    • Remove MAYBE_BUILD_BUG_ON · 1765e3a4
      Rusty Russell authored
      Now BUILD_BUG_ON() can handle optimizable constants, we don't need
      MAYBE_BUILD_BUG_ON any more.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • BUILD_BUG_ON: make it handle more cases · 7ef88ad5
      Rusty Russell authored
      BUILD_BUG_ON used to use the optimizer to do code elimination or fail
      at link time; it was changed first to the size of a negative array (a
      nicer compile time error), then (in 8c87df45) to a bitfield.
      
      This forced us to change some non-constant cases to MAYBE_BUILD_BUG_ON();
      as Jan points out in that commit, it didn't work as intended anyway.
      
      bitfields: needs a literal constant at parse time, and can't be put under
      	"if (__builtin_constant_p(x))" for example.
      negative array: can handle anything, but if the compiler can't tell it's
      	a constant, silently has no effect.
      link time: breaks link if the compiler can't determine the value, but the
      	linker output is not usually as informative as a compiler error.
      
      If we use the negative-array-size method *and* the link time trick,
      we get the ability to use BUILD_BUG_ON() under __builtin_constant_p()
      branches, and maximal ability for the compiler to detect errors at
      build time.
      
      We also document it thoroughly.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jan Beulich <JBeulich@novell.com>
      Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
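The negative-array-size method mentioned in this commit can be sketched in a few lines of portable C. The macro name `SKETCH_BUILD_BUG_ON` and the `wire_header` struct are hypothetical, and the link-time half of the kernel's combined approach is not reproduced here. The sketch only shows why a true constant condition stops compilation, and that run-time values still need an ordinary check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical macro: if cond is a true compile-time constant, the array
 * size becomes -1 and the compiler rejects the translation unit.
 */
#define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct wire_header {
	uint32_t id;
	uint16_t len;   /* total length, header included */
	uint16_t flags;
};

static size_t wire_header_size(void)
{
	/* Layout assumptions are verified while compiling, at no run-time cost. */
	SKETCH_BUILD_BUG_ON(sizeof(struct wire_header) != 8);
	SKETCH_BUILD_BUG_ON(offsetof(struct wire_header, len) != 4);
	return sizeof(struct wire_header);
}

static size_t wire_payload_size(const struct wire_header *h)
{
	/* h->len is only known at run time, so a build-time check cannot help. */
	assert(h->len >= sizeof(*h));
	return h->len - sizeof(*h);
}
```

Changing one of the constants, for example to `!= 16`, makes the build fail at the macro's line instead of producing a binary with a wrong layout assumption.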
    • param: add null statement to compiled-in module params · b75be420
      Linus Walleij authored
      Add an unused struct declaration statement requiring a
      terminating semicolon to the compiled-in case to provoke an
      error if __MODULE_INFO() is used without the terminating
      semicolon. Previously MODULE_ALIAS("foo") (no semicolon)
      compiled fine if MODULE was not selected.
      
      Cc: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: fix linker error for MODULE_VERSION when !MODULE and CONFIG_SYSFS=n · 3b90a5b2
      Rusty Russell authored
      lib/built-in.o:(__modver+0x8): undefined reference to `__modver_version_show'
      lib/built-in.o:(__modver+0x2c): undefined reference to `__modver_version_show'
      
      Simplest to just not emit anything: if they've disabled SYSFS they probably
      want the smallest kernel possible.
      Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module: show version information for built-in modules in sysfs · e94965ed
      Dmitry Torokhov authored
      Currently only drivers that are built as modules have their versions
      shown in /sys/module/<module_name>/version, but this information might
      also be useful for built-in drivers. This is especially important
      for drivers that do not define any parameters: such drivers, if
      built-in, are completely invisible from userspace.
      
      This patch changes the MODULE_VERSION() macro so that when we are
      compiling a built-in module, version information is stored in a
      separate section. The kernel then uses this data to create the
      'version' sysfs attribute in the same fashion it creates attributes
      for module parameters.
      Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  5. 22 Jan, 2011 1 commit
    • genirq: Add IRQ affinity notifiers · cd7eab44
      Ben Hutchings authored
      When initiating I/O on a multiqueue and multi-IRQ device, we may want
      to select a queue for which the response will be handled on the same
      or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add a
      notification mechanism to support this.
      
      This is based closely on work by Thomas Gleixner <tglx@linutronix.de>.
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: linux-net-drivers@solarflare.com
      Cc: Tom Herbert <therbert@google.com>
      Cc: David Miller <davem@davemloft.net>
      LKML-Reference: <1295470904.11126.84.camel@bwh-desktop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
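The reverse-map idea behind this commit can be modelled in userspace as follows. This is a rough sketch, not the genirq API: the `affinity_notify` struct, `set_affinity_notifier()`, `set_affinity()` and the fixed-size tables are all hypothetical stand-ins, and CPU masks are plain 32-bit integers. It shows a driver keeping a CPU-to-queue table up to date from a notification callback.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 8
#define SKETCH_NR_IRQS 4

/* Hypothetical notifier: one callback per IRQ, invoked on affinity changes. */
struct affinity_notify {
	unsigned int irq;
	void (*notify)(struct affinity_notify *n, uint32_t cpu_mask);
};

static struct affinity_notify *notifiers[SKETCH_NR_IRQS];

static int set_affinity_notifier(unsigned int irq, struct affinity_notify *n)
{
	if (irq >= SKETCH_NR_IRQS)
		return -1;
	if (n)
		n->irq = irq;
	notifiers[irq] = n; /* passing NULL unregisters */
	return 0;
}

/* Stand-in for the IRQ core changing an interrupt's affinity. */
static void set_affinity(unsigned int irq, uint32_t cpu_mask)
{
	assert(irq < SKETCH_NR_IRQS);
	if (notifiers[irq])
		notifiers[irq]->notify(notifiers[irq], cpu_mask);
}

/* Driver side: the notifier is the first member, so a cast recovers the queue. */
struct queue {
	struct affinity_notify notify;
	int index;
};

/* Reverse map: which queue's IRQ is handled on each CPU, or -1 for none. */
static int cpu_to_queue[SKETCH_NR_CPUS];

static void reverse_map_init(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		cpu_to_queue[cpu] = -1;
}

static void queue_affinity_changed(struct affinity_notify *n, uint32_t cpu_mask)
{
	struct queue *q = (struct queue *)n;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (cpu_mask & (1u << cpu))
			cpu_to_queue[cpu] = q->index;  /* IRQ now runs here */
		else if (cpu_to_queue[cpu] == q->index)
			cpu_to_queue[cpu] = -1;        /* IRQ moved away */
	}
}
```

When initiating I/O, a driver built this way could look up `cpu_to_queue[current_cpu]` to pick a queue whose completion interrupt is handled on the same CPU.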
  6. 21 Jan, 2011 6 commits
  7. 20 Jan, 2011 14 commits
    • David S. Miller
      686a2955
    • ACPI: Introduce acpi_os_ioremap() · 2d6d9fd3
      Rafael J. Wysocki authored
      Commit ca9b600b ("ACPI / PM: Make suspend_nvs_save() use
      acpi_os_map_memory()") attempted to prevent the code in osl.c and nvs.c
      from using different ioremap() variants by making the latter use
      acpi_os_map_memory() for mapping the NVS pages.  However, that also
      requires acpi_os_unmap_memory() to be used for unmapping them, which
      causes synchronize_rcu() to be executed many times in a row
      unnecessarily and introduces substantial delays during resume on some
      systems.
      
      Instead of using acpi_os_map_memory() for mapping the NVS pages in nvs.c
      introduce acpi_os_ioremap() calling ioremap_cache() and make the code in
      both osl.c and nvs.c use it.
      Reported-by: Jeff Chua <jeff.chua.linux@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix USED bit handling at uncharge in THP · ca3e0214
      KAMEZAWA Hiroyuki authored
      Now, under THP:
      
      at charge:
        - PageCgroupUsed bit is set on every page_cgroup of a hugepage,
          i.e. on all 512 pages.
      at uncharge:
        - PageCgroupUsed bit is cleared only on the head page.
      
      So, some pages will remain with the "Used" bit set.
      
      This patch fixes that by setting the Used bit only on the head page.
      Used bits for tail pages will be set at splitting if necessary.
      
      This patch adds this lock order:
         compound_lock() -> page_cgroup_move_lock().
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dccp: clean up unused DCCP_STATE_MASK definition · d18046b3
      Shan Wei authored
      Remove unused DCCP_STATE_MASK macro.
      Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
      Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: RCU conversion of stab · a2da570d
      Eric Dumazet authored
      This patch converts stab qdisc management to RCU, so that we can perform
      the qdisc_calculate_pkt_len() call before taking the qdisc lock.
      
      This shortens the time the lock is held in __dev_xmit_skb().
      
      This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding a lot
      of cache misses and so reducing latencies.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jesper Dangaard Brouer <hawk@diku.dk>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: move TCQ_F_THROTTLED flag · fd245a4a
      Eric Dumazet authored
      In commit 37112105 (net: QDISC_STATE_RUNNING dont need atomic bit
      ops) I moved QDISC_STATE_RUNNING flag to __state container, located in
      the cache line containing qdisc lock and often dirtied fields.
      
      I now move the TCQ_F_THROTTLED bit too, so that the first cache line
      stays read-mostly and shared by all cpus. This should speed up HTB/CBQ,
      for example.
      
      Not using test_bit()/__clear_bit()/__test_and_set_bit() allows using an
      "unsigned int" for the __state container, reducing the Qdisc size by
      8 bytes.
      
      Introduce helpers to hide implementation details.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jesper Dangaard Brouer <hawk@diku.dk>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
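The helper approach described in this commit might look roughly like the sketch below. All names (`sketch_qdisc`, the `SKETCH_STATE_*` bits and the helper functions) are hypothetical and the bit values are arbitrary. The point is only that plain, non-atomic bit operations on an `unsigned int` are enough when the field is modified under a lock, and that callers never touch the bits directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen for the sketch only. */
#define SKETCH_STATE_RUNNING   (1u << 0)
#define SKETCH_STATE_THROTTLED (1u << 1)

struct sketch_qdisc {
	unsigned int flags; /* read-mostly configuration, shared by all CPUs */
	unsigned int state; /* frequently written; assumed protected by the qdisc lock */
};

static inline bool sketch_is_throttled(const struct sketch_qdisc *q)
{
	return (q->state & SKETCH_STATE_THROTTLED) != 0;
}

static inline void sketch_throttled(struct sketch_qdisc *q)
{
	q->state |= SKETCH_STATE_THROTTLED;
}

static inline void sketch_unthrottled(struct sketch_qdisc *q)
{
	q->state &= ~SKETCH_STATE_THROTTLED;
}

/* Returns true if the caller became the runner, false if one already exists. */
static inline bool sketch_run_begin(struct sketch_qdisc *q)
{
	if (q->state & SKETCH_STATE_RUNNING)
		return false;
	q->state |= SKETCH_STATE_RUNNING;
	return true;
}

static inline void sketch_run_end(struct sketch_qdisc *q)
{
	assert(q->state & SKETCH_STATE_RUNNING);
	q->state &= ~SKETCH_STATE_RUNNING;
}
```

Because the state lives in a 4-byte `unsigned int` rather than the `unsigned long` that the atomic bitops require, the container can shrink on 64-bit builds, which is the size saving the message refers to.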
    • netfilter: nf_conntrack: fix linker error with NF_CONNTRACK_TIMESTAMP=n · 2f1e3176
      Patrick McHardy authored
      net/built-in.o: In function `nf_conntrack_init_net':
      net/netfilter/nf_conntrack_core.c:1521:
      	undefined reference to `nf_conntrack_tstamp_init'
      net/netfilter/nf_conntrack_core.c:1531:
      	undefined reference to `nf_conntrack_tstamp_fini'
      
      Add dummy inline functions for the =n case to fix this.
      Reported-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
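The usual shape of such a fix is a pair of no-op inline stubs for the disabled configuration. The sketch below uses hypothetical names (`SKETCH_CONFIG_TIMESTAMP`, `sketch_tstamp_init()`, `sketch_net`) rather than the real conntrack symbols. It shows how the caller stays free of `#ifdef`s while the linker no longer needs the missing object file.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_net {
	int tstamp_ready;
};

#ifdef SKETCH_CONFIG_TIMESTAMP
/* Feature enabled: the real functions live in another object file. */
int sketch_tstamp_init(struct sketch_net *net);
void sketch_tstamp_fini(struct sketch_net *net);
#else
/* Feature disabled: inline stubs, so no undefined reference is emitted. */
static inline int sketch_tstamp_init(struct sketch_net *net)
{
	(void)net;
	return 0;
}

static inline void sketch_tstamp_fini(struct sketch_net *net)
{
	(void)net;
}
#endif

/* Caller code is identical in both configurations. */
static int sketch_net_init(struct sketch_net *net)
{
	assert(net != NULL);
	return sketch_tstamp_init(net);
}

static void sketch_net_exit(struct sketch_net *net)
{
	assert(net != NULL);
	sketch_tstamp_fini(net);
}
```

With the option undefined, the stubs compile away entirely, so the disabled feature costs neither code size nor a link dependency.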
    • netfilter: xtables: add missing header inclusions for headers_check · 06988b06
      Jan Engelhardt authored
      Resolve these warnings on `make headers_check`:
      
      usr/include/linux/netfilter/xt_CT.h:7: found __[us]{8,16,32,64} type
      without #include <linux/types.h>
      ...
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
    • netfilter: xtables: remove duplicate member · ba12b130
      Jan Engelhardt authored
      Accidentally missed removing the old out-of-union "inverse" member,
      which caused the struct size to change which then gives size mismatch
      warnings when using an old iptables.
      
      It is interesting to see that gcc did not warn about this before.
      (Filed http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47376 )
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
    • lockdep: Move early boot local IRQ enable/disable status to init/main.c · 2ce802f6
      Tejun Heo authored
      During early boot, local IRQ is disabled until IRQ subsystem is
      properly initialized.  During this time, no one should enable
      local IRQ and some operations which usually are not allowed with
      IRQ disabled, e.g. operations which might sleep or require
      communications with other processors, are allowed.
      
      lockdep tracked this with early_boot_irqs_off/on() callbacks.
      As other subsystems need this information too, move it to
      init/main.c and make it generally available.  While at it,
      toggle the boolean to early_boot_irqs_disabled instead of
      enabled so that it can be initialized with %false and %true
      indicates the exceptional condition.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • netfilter: xtables: remove extraneous header that slipped in · 5d844928
      Jan Engelhardt authored
      Commit 0b8ad876 (netfilter: xtables: add missing header files to export
      list) erroneously added this.
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      John Fastabend authored
      This implements a mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc, the number of tcs currently in use along
      with the range of queues allotted to each class can be configured. By
      default skbs are mapped to traffic classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters:
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing gives the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_setup_tc().
      
      One expected use case is that drivers will use ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
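The bounds checking described in this commit can be approximated by the following sketch. It mirrors the layout of `tc_mqprio_qopt` with fixed-width standard types, but the `sketch_` names, the constant values and the exact set of checks are assumptions and are not taken from the kernel source. The `hw` case, where the hardware picks count and offset, is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TC_MAX_QUEUE 16 /* assumed value */
#define SKETCH_TC_BITMASK   15 /* assumed value */

struct sketch_mqprio_qopt {
	uint8_t  num_tc;
	uint8_t  prio_tc_map[SKETCH_TC_BITMASK + 1];
	uint8_t  hw;
	uint16_t count[SKETCH_TC_MAX_QUEUE];
	uint16_t offset[SKETCH_TC_MAX_QUEUE];
};

/* Returns 0 if the software-configured layout is usable, -1 otherwise. */
static int sketch_mqprio_validate(const struct sketch_mqprio_qopt *q,
				  unsigned int num_tx_queues)
{
	unsigned int i, j;

	assert(q != NULL);
	if (q->num_tc == 0 || q->num_tc > SKETCH_TC_MAX_QUEUE)
		return -1;

	/* Every priority must map to a traffic class that exists. */
	for (i = 0; i <= SKETCH_TC_BITMASK; i++)
		if (q->prio_tc_map[i] >= q->num_tc)
			return -1;

	for (i = 0; i < q->num_tc; i++) {
		unsigned int first = q->offset[i];
		unsigned int last = first + q->count[i]; /* one past the end */

		/* A class needs at least one queue and must fit the device. */
		if (q->count[i] == 0 || last > num_tx_queues)
			return -1;

		/* Queue ranges [first, last) of two classes must not overlap. */
		for (j = i + 1; j < q->num_tc; j++) {
			unsigned int other_first = q->offset[j];
			unsigned int other_last = other_first + q->count[j];

			if (other_first < last && first < other_last)
				return -1;
		}
	}
	return 0;
}
```

For example, two classes with `count = {4, 4}` and `offset = {0, 4}` pass on an 8-queue device, while moving the second offset to 2 is rejected because queues 2 and 3 would belong to both classes.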
    • net: implement mechanism for HW based QOS · 4f57c087
      John Fastabend authored
      This patch provides a mechanism for lower layer devices to
      steer traffic using skb->priority to tx queues. This allows
      hardware based QOS schemes to use the default qdisc without
      incurring the penalties related to global state and the qdisc
      lock, while reliably receiving skbs on the correct tx ring
      to avoid head of line blocking resulting from shuffling in
      the LLD. Finally, all the goodness from txq caching and xps/rps
      can still be leveraged.
      
      Many drivers and hardware exist with the ability to implement
      QOS schemes in the hardware but currently these drivers tend
      to rely on firmware to reroute specific traffic, a driver
      specific select_queue or the queue_mapping action in the
      qdisc.
      
      By using select_queue for this, drivers need to be updated for
      each and every traffic type and we lose the goodness of much
      of the upstream work. Firmware solutions are inherently
      inflexible. And finally, if admins are expected to build a
      qdisc and filter rules to steer traffic, this requires knowledge
      of how the hardware is currently configured. The number of tx
      queues and the queue offsets may change depending on resources.
      Also this approach incurs all the overhead of a qdisc with filters.
      
      With the mechanism in this patch users can set skb priority using
      expected methods, i.e. setsockopt(), or the stack can set the priority
      directly. Then the skb will be steered to the correct tx queues
      aligned with hardware QOS traffic classes. In the normal case with
      a single traffic class and all queues in this class, everything
      works as is until the LLD enables multiple tcs.
      
      To steer the skb we mask the priority down to its lower 4 bits
      and allow the hardware to configure up to 15 distinct classes
      of traffic. This is expected to be sufficient for most applications;
      at any rate it is more than the 802.1Q spec designates and is
      equal to the number of prio bands currently implemented in
      the default qdisc.
      
      This, in conjunction with a userspace application such as
      lldpad, can be used to implement 802.1Q transmission selection
      algorithms, one of these being the extended transmission
      selection algorithm currently used for DCB.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
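The steering path described in this commit, from priority to traffic class to tx queue, can be sketched as below. The struct and function names are hypothetical, the constants are assumed values, and the flow hash is reduced with a plain modulo for simplicity, which is not necessarily how the kernel spreads flows within a class.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TC_MAX_QUEUE 16 /* assumed value */
#define SKETCH_TC_BITMASK   15 /* keeps the lower 4 bits of the priority */

struct sketch_tc_txq {
	uint16_t count;  /* number of queues owned by the class */
	uint16_t offset; /* first queue owned by the class */
};

struct sketch_dev {
	uint8_t num_tc; /* 0 means traffic classes are not enabled */
	uint8_t prio_tc_map[SKETCH_TC_BITMASK + 1];
	struct sketch_tc_txq tc_to_txq[SKETCH_TC_MAX_QUEUE];
	uint16_t num_tx_queues;
};

static uint8_t sketch_prio_to_tc(const struct sketch_dev *dev, uint32_t priority)
{
	/* Only the lower 4 bits of the priority select the table entry. */
	return dev->prio_tc_map[priority & SKETCH_TC_BITMASK];
}

static uint16_t sketch_pick_tx_queue(const struct sketch_dev *dev,
				     uint32_t priority, uint32_t flow_hash)
{
	uint16_t offset = 0;
	uint16_t count = dev->num_tx_queues; /* single class: all queues */

	if (dev->num_tc) {
		uint8_t tc = sketch_prio_to_tc(dev, priority);

		assert(tc < dev->num_tc);
		offset = dev->tc_to_txq[tc].offset;
		count = dev->tc_to_txq[tc].count;
	}
	assert(count > 0);
	/* Spread flows over the class's queues; modulo is a simplification. */
	return (uint16_t)(offset + flow_hash % count);
}
```

With `num_tc` left at zero the function behaves like an ordinary hash over all tx queues, matching the "everything works as is" case in the message; once classes are configured, a packet can only land in the queue range of its class.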
    • net_device: add support for network device groups · cbda10fa
      Vlad Dogaru authored
      Net devices can now be grouped, enabling simpler manipulation from
      userspace. This patch adds a group field to the net_device structure, as
      well as rtnetlink support to query and modify it.
      Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
      Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 Jan, 2011 6 commits