1. 10 Jul, 2013 1 commit
  2. 26 Jun, 2013 1 commit
    • Nicolas Schichan's avatar
      net: fix kernel deadlock with interface rename and netdev name retrieval. · 5dbe7c17
      Nicolas Schichan authored
      
      
      When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
      rename of a network interface, it can end up waiting for a workqueue
      to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
      SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
      the fact that read_secklock_begin() will spin forever waiting for the
      writer process (the one doing the interface rename) to update the
      devnet_rename_seq sequence.
      
      This patch fixes the problem by adding a helper (netdev_get_name())
      and using it in the code handling the SIOCGIFNAME ioctl and
      SO_BINDTODEVICE setsockopt.
      
      The netdev_get_name() helper uses raw_seqcount_begin() to avoid
      spinning forever, waiting for devnet_rename_seq->sequence to become
      even. cond_resched() is used in the contended case, before retrying
      the access to give the writer process a chance to finish.
      
      The use of raw_seqcount_begin() will incur some unneeded work in the
      reader process in the contended case, but this is better than
      deadlocking the system.
      Signed-off-by: default avatarNicolas Schichan <nschichan@freebox.fr>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5dbe7c17
  3. 25 Jun, 2013 1 commit
    • Eliezer Tamir's avatar
      net: poll/select low latency socket support · 2d48d67f
      Eliezer Tamir authored
      
      
      select/poll busy-poll support.
      
      Split sysctl value into two separate ones, one for read and one for poll.
      updated Documentation/sysctl/net.txt
      
      Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
      sk_poll_ll if possible. sock_poll sets this flag in its return value
      to indicate to select/poll when a socket that can busy poll is found.
      
      When poll/select have nothing to report, call the low-level
      sock_poll again until we are out of time or we find something.
      
      Once the system call finds something, it stops setting POLL_LL, so it can
      return the result to the user ASAP.
      Signed-off-by: default avatarEliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d48d67f
  4. 17 Jun, 2013 1 commit
  5. 10 Jun, 2013 1 commit
  6. 29 May, 2013 1 commit
  7. 11 May, 2013 1 commit
    • Eric Dumazet's avatar
      ipv6: do not clear pinet6 field · f77d6021
      Eric Dumazet authored
      We have seen multiple NULL dereferences in __inet6_lookup_established()
      
      After analysis, I found that inet6_sk() could be NULL while the
      check for sk_family == AF_INET6 was true.
      
      Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
      and TCP stacks.
      
      Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
      table, we no longer can clear pinet6 field.
      
      This patch extends logic used in commit fcbdf09d
      
      
      ("net: fix nulls list corruptions in sk_prot_alloc")
      
      TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
      to make sure we do not clear pinet6 field.
      
      At socket clone phase, we do not really care, as cloning the parent (non
      NULL) pinet6 is not adding a fatal race.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f77d6021
  8. 09 Apr, 2013 2 commits
  9. 31 Mar, 2013 1 commit
    • Keller, Jacob E's avatar
      net: add option to enable error queue packets waking select · 7d4c04fc
      Keller, Jacob E authored
      
      
      Currently, when a socket receives something on the error queue it only wakes up
      the socket on select if it is in the "read" list, that is the socket has
      something to read. It is useful also to wake the socket if it is in the error
      list, which would enable software to wait on error queue packets without waking
      up for regular data on the socket. The main use case is for receiving
      timestamped transmit packets which return the timestamp to the socket via the
      error queue. This enables an application to select on the socket for the error
      queue only instead of for the regular traffic.
      
      -v2-
      * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file
      * Modified every socket poll function that checks error queue
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Matthew Vick <matthew.vick@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7d4c04fc
  10. 21 Mar, 2013 1 commit
  11. 22 Feb, 2013 1 commit
  12. 18 Feb, 2013 2 commits
  13. 04 Feb, 2013 1 commit
  14. 23 Jan, 2013 1 commit
  15. 17 Jan, 2013 1 commit
    • Vincent Bernat's avatar
      sk-filter: Add ability to lock a socket filter program · d59577b6
      Vincent Bernat authored
      
      
      While a privileged program can open a raw socket, attach some
      restrictive filter and drop its privileges (or send the socket to an
      unprivileged program through some Unix socket), the filter can still
      be removed or modified by the unprivileged program. This commit adds a
      socket option to lock the filter (SO_LOCK_FILTER) preventing any
      modification of a socket filter program.
      
      This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even
      root is not allowed change/drop the filter.
      
      The state of the lock can be read with getsockopt(). No error is
      triggered if the state is not changed. -EPERM is returned when a user
      tries to remove the lock or to change/remove the filter while the lock
      is active. The check is done directly in sk_attach_filter() and
      sk_detach_filter() and does not affect only setsockopt() syscall.
      Signed-off-by: default avatarVincent Bernat <bernat@luffy.cx>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d59577b6
  16. 21 Dec, 2012 1 commit
  17. 26 Nov, 2012 1 commit
    • Brian Haley's avatar
      sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name · c91f6df2
      Brian Haley authored
      
      
      Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
      will then require another call like if_indextoname() to get the actual interface
      name, have it return the name directly.
      
      This also matches the existing man page description on socket(7) which mentions
      the argument being an interface name.
      
      If the value has not been set, zero is returned and optlen will be set to zero
      to indicate there is no interface name present.
      
      Added a seqlock to protect this code path, and dev_ifname(), from someone
      changing the device name via dev_change_name().
      
      v2: Added seqlock protection while copying device name.
      
      v3: Fixed word wrap in patch.
      Signed-off-by: default avatarBrian Haley <brian.haley@hp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c91f6df2
  18. 18 Nov, 2012 1 commit
    • Eric W. Biederman's avatar
      net: Allow userns root control of the core of the network stack. · 5e1fccc0
      Eric W. Biederman authored
      
      
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is hardware NIC
      that has been explicity moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow ethtool ioctls.
      
      Allow binding to network devices.
      Allow setting the socket mark.
      Allow setting the socket priority.
      
      Allow setting the network device alias via sysfs.
      Allow setting the mtu via sysfs.
      Allow changing the network device flags via sysfs.
      Allow setting the network device group via sysfs.
      
      Allow the following network device ioctls.
      SIOCGMIIPHY
      SIOCGMIIREG
      SIOCSIFNAME
      SIOCSIFFLAGS
      SIOCSIFMETRIC
      SIOCSIFMTU
      SIOCSIFHWADDR
      SIOCSIFSLAVE
      SIOCADDMULTI
      SIOCDELMULTI
      SIOCSIFHWBROADCAST
      SIOCSMIIREG
      SIOCBONDENSLAVE
      SIOCBONDRELEASE
      SIOCBONDSETHWADDR
      SIOCBONDCHANGEACTIVE
      SIOCBRADDIF
      SIOCBRDELIF
      SIOCSHWTSTAMP
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e1fccc0
  19. 01 Nov, 2012 1 commit
    • Pavel Emelyanov's avatar
      sk-filter: Add ability to get socket filter program (v2) · a8fc9277
      Pavel Emelyanov authored
      
      
      The SO_ATTACH_FILTER option is set only. I propose to add the get
      ability by using SO_ATTACH_FILTER in getsockopt. To be less
      irritating to eyes the SO_GET_FILTER alias to it is declared. This
      ability is required by checkpoint-restore project to be able to
      save full state of a socket.
      
      There are two issues with getting filter back.
      
      First, kernel modifies the sock_filter->code on filter load, thus in
      order to return the filter element back to user we have to decode it
      into user-visible constants. Fortunately the modification in question
      is interconvertible.
      
      Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
      speed up the run-time division by doing kernel_k = reciprocal(user_k).
      Bad news is that different user_k may result in same kernel_k, so we
      can't get the original user_k back. Good news is that we don't have
      to do it. What we need to is calculate a user2_k so, that
      
        reciprocal(user2_k) == reciprocal(user_k) == kernel_k
      
      i.e. if it's re-loaded back the compiled again value will be exactly
      the same as it was. That said, the user2_k can be calculated like this
      
        user2_k = reciprocal(kernel_k)
      
      with an exception, that if kernel_k == 0, then user2_k == 1.
      
      The optlen argument is treated like this -- when zero, kernel returns
      the amount of sock_fprog elements in filter, otherwise it should be
      large enough for the sock_fprog array.
      
      changes since v1:
      * Declared SO_GET_FILTER in all arch headers
      * Added decode of vlan-tag codes
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a8fc9277
  20. 26 Oct, 2012 2 commits
  21. 21 Oct, 2012 1 commit
    • Pavel Emelyanov's avatar
      sockopt: Make SO_BINDTODEVICE readable · f7b86bfe
      Pavel Emelyanov authored
      
      
      The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set, but
      cannot be get via sockopt API. The only way we can find the device id a
      socket is bound to is via sock-diag interface. But the diag works only on
      hashed sockets, while the opt in question can be set for yet unhashed one.
      
      That said, in order to know what device a socket is bound to (we do want
      to know this in checkpoint-restore project) I propose to make this option
      getsockopt-able and report the respective device index.
      
      Another solution to the problem might be to teach the sock-diag reporting
      info on unhashed sockets. Should I go this way instead?
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f7b86bfe
  22. 27 Sep, 2012 1 commit
  23. 24 Sep, 2012 2 commits
    • Eric Dumazet's avatar
      net: guard tcp_set_keepalive() to tcp sockets · 3e10986d
      Eric Dumazet authored
      
      
      Its possible to use RAW sockets to get a crash in
      tcp_set_keepalive() / sk_reset_timer()
      
      Fix is to make sure socket is a SOCK_STREAM one.
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3e10986d
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      
      
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5640f768
  24. 14 Sep, 2012 3 commits
    • Daniel Wagner's avatar
      cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Daniel Wagner authored
      WARNING: With this change it is impossible to load external built
      controllers anymore.
      
      In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
      set, corresponding subsys_id should also be a constant. Up to now,
      net_prio_subsys_id and net_cls_subsys_id would be of the type int and
      the value would be assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant value. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary on the RCU part which was introduces by
      following patch:
      
        commit f8451725
      
      
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        Tis code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      Without an explicit explaination why the RCU part is needed. (The
      rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
      in a later commit, but that is a minor detail.)
      
      So here is my pondering why it was introduced and why it safe to
      remove it now. Note that this code was copied over to net_prio the
      reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that only after module is loaded and the
      subsystem registered, the id is assigned.
      
      Before unregistering the module all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronized_rcu(). Any new readers wont call task_subsys_state()
      anymore and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer return by task_subsys_state() (remember the id is constant
      and therefore we allways have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading
      module. Eventually, all CPUs will get a valid pointer back from
      task_subsys_state() because rebind_subsystem() which is called after
      the module init() function will assigned subsys[net_cls_subsys_id] the
      newly loaded module subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystem() will
      called before the module exit() function. In this case,
      rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then it calls synchronize_rcu(). All old readers have left by then
      the critical section. Any new reader wont access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      synchronize_rcu() call is needed.
      Signed-off-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df
    • Daniel Wagner's avatar
      cgroup: net_prio: Do not define task_netpioidx() when not selected · 51e4e7fa
      Daniel Wagner authored
      
      
      task_netprioidx() should not be defined in case the configuration is
      CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
      net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
      When net_prio is not built at all any callee should only get an empty
      task_netprioidx() without any references to net_prio_subsys_id.
      Signed-off-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      51e4e7fa
    • Daniel Wagner's avatar
      cgroup: net_cls: Do not define task_cls_classid() when not selected · 8fb974c9
      Daniel Wagner authored
      
      
      task_cls_classid() should not be defined in case the configuration is
      CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
      net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
      When net_cls is not built at all a callee should only get an empty
      task_cls_classid() without any references to net_cls_subsys_id.
      Signed-off-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8fb974c9
  25. 10 Sep, 2012 1 commit
  26. 03 Sep, 2012 1 commit
  27. 24 Aug, 2012 1 commit
    • Neil Horman's avatar
      cls_cgroup: Allow classifier cgroups to have their classid reset to 0 · 3afa6d00
      Neil Horman authored
      
      
      The network classifier cgroup initalizes each cgroups instance classid value to
      0.  However, the sock_update_classid function only updates classid's in sockets
      if the tasks cgroup classid is not zero, and if it differs from the current
      classid.  The later check is to prevent cache line dirtying, but the former is
      detrimental, as it prevents resetting a classid for a cgroup to 0.  While this
      is not a common action, it has administrative usefulness (if the admin wants to
      disable classification of a certain group temporarily for instance).
      
      Easy fix, just remove the zero check.  Tested successfully by myself
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3afa6d00
  28. 14 Aug, 2012 2 commits
  29. 02 Aug, 2012 1 commit
  30. 31 Jul, 2012 4 commits
    • Mel Gorman's avatar
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman authored
      
      
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • Mel Gorman's avatar
      netvm: set PF_MEMALLOC as appropriate during SKB processing · b4b9e355
      Mel Gorman authored
      
      
      In order to make sure pfmemalloc packets receive all memory needed to
      proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
      This is limited to a subset of protocols that are expected to be used for
      writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
      expected to communicate with userspace processes which could be paged out.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [jslaby@suse.cz: Lock imbalance fix]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4b9e355
    • Mel Gorman's avatar
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman authored
      
      
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • Mel Gorman's avatar
      netvm: allow the use of __GFP_MEMALLOC by specific sockets · 7cb02404
      Mel Gorman authored
      
      
      Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
      for their allocations.  These sockets will be able to go below watermarks
      and allocate from the emergency reserve.  Such sockets are to be used to
      service the VM (iow.  to swap over).  They must be handled kernel side,
      exposing such a socket to user-space is a bug.
      
      There is a risk that the reserves be depleted so for now, the
      administrator is responsible for increasing min_free_kbytes as necessary
      to prevent deadlock for their workloads.
      
      [a.p.zijlstra@chello.nl: Original patches]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7cb02404