1. 05 Nov, 2015 5 commits
  2. 15 Apr, 2015 1 commit
  3. 10 Feb, 2015 1 commit
  4. 29 Jan, 2015 1 commit
    • Linus Torvalds's avatar
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds authored
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: default avatarTakashi Iwai <tiwai@suse.de>
      Tested-by: default avatarJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33692f27
  5. 09 Oct, 2014 1 commit
  6. 16 Jul, 2014 1 commit
    • NeilBrown's avatar
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown authored
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since first version of this patch (against 3.15) two new action
      functions appeared, on in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brownSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      74316201
  7. 23 Jun, 2014 1 commit
    • Hugh Dickins's avatar
      mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Hugh Dickins authored
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
  8. 04 Mar, 2014 1 commit
    • David Rientjes's avatar
      mm: close PageTail race · 668f9abb
      David Rientjes authored
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  9. 23 Jan, 2014 2 commits
    • Paul Gortmaker's avatar
      mm: audit/fix non-modular users of module_init in core code · a64fb3cd
      Paul Gortmaker authored
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       mm/ksm.c                       bool KSM
       mm/mmap.c                      bool MMU
       mm/huge_memory.c               bool TRANSPARENT_HUGEPAGE
       mm/mmu_notifier.c              bool MMU_NOTIFIER
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).
      
      However no observable impact of that difference has been observed during
      testing.
      
      One might think that core_initcall (l2) or postcore_initcall (l3) would
      be more appropriate for anything in mm/ but if we look at some actual
      init functions themselves, we see things like:
      
      mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
      mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
      mm/ksm.c         --> ksm_init          --> sysfs_create_group
      
      and hence the choice of subsys_initcall (l4) seems reasonable, and at
      the same time minimizes the risk of changing the priority too
      drastically all at once.  We can adjust further in the future.
      
      Also, several instances of missing ";" at EOL are fixed.
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a64fb3cd
    • Sasha Levin's avatar
      mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin authored
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      I've recently noticed based on the requests to add a small piece of code
      that dumps the page to various VM_BUG_ON sites that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      309381fe
  10. 21 Jan, 2014 5 commits
  11. 12 Nov, 2014 1 commit
  12. 12 Nov, 2013 1 commit
  13. 11 Sep, 2013 1 commit
  14. 08 Mar, 2013 1 commit
  15. 27 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  16. 23 Feb, 2013 15 commits
    • Hugh Dickins's avatar
      ksm: allocate roots when needed · ef53d16c
      Hugh Dickins authored
      It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
      allocated, particularly when very few users will ever actually tune
      merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
      deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
      
      Start off with 1+1 statically allocated, then if merge_across_nodes is
      ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
      free up the extra if it's tuned back, that would be a waste of effort.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef53d16c
    • Hugh Dickins's avatar
      mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Hugh Dickins authored
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself: unmerge_and_
      remove_all_rmap_items() failed to locate and replace all the KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5117b3b8
    • Hugh Dickins's avatar
      ksm: shrink 32-bit rmap_item back to 32 bytes · bc56620b
      Hugh Dickins authored
      Think of struct rmap_item as an extension of struct page (restricted to
      MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
      small, especially on 32-bit architectures of limited lowmem.
      
      Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
      making no change to its 64-byte struct rmap_item; but bloats the 32-bit
      struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
      rounds up to 40 bytes once allocated from slab.  We'd better avoid that.
      
      Hey, I only just remembered that the anon_vma pointer in struct
      rmap_item has no purpose until the rmap_item is hung from a stable tree
      node (which has its own nid field); and rmap_item's nid field no purpose
      than to say which tree root to tell rb_erase() when unlinking from an
      unstable tree.
      
      Double them up in a union.  There's just one place where we set anon_vma
      early (when we already hold mmap_sem): now we must remove tree_rmap_item
      from its unstable tree there, before overwriting nid.  No need to
      spatter BUG()s around: we'd be seeing oopses if this were wrong.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc56620b
    • Hugh Dickins's avatar
      ksm: treat unstable nid like in stable tree · b599cbdf
      Hugh Dickins authored
      An inconsistency emerged in reviewing the NUMA node changes to KSM: when
      meeting a page from the wrong NUMA node in a stable tree, we say that
      it's okay for comparisons, but not as a leaf for merging; whereas when
      meeting a page from the wrong NUMA node in an unstable tree, we bail out
      immediately.
      
      Now, it might be that a wrong NUMA node in an unstable tree is more
      likely to correlate with instablility (different content, with rbnode
      now misplaced) than page migration; but even so, we are accustomed to
      instablility in the unstable tree.
      
      Without strong evidence for which strategy is generally better, I'd
      rather be consistent with what's done in the stable tree: accept a page
      from the wrong NUMA node for comparison, but not as a leaf for merging.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b599cbdf
    • Hugh Dickins's avatar
      ksm: add some comments · 8fdb3dbf
      Hugh Dickins authored
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fdb3dbf
    • Hugh Dickins's avatar
      ksm: stop hotremove lockdep warning · ef4d43a8
      Hugh Dickins authored
      Complaints are rare, but lockdep still does not understand the way
      ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
      it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
      problem because notifier callbacks are made under down_read of
      blocking_notifier_head->rwsem (so first the mutex is taken while holding
      the rwsem, then later the rwsem is taken while still holding the mutex);
      but is not in fact a problem because mem_hotplug_mutex is held
      throughout the dance.
      
      There was an attempt to fix this with mutex_lock_nested(); but if that
      happened to fool lockdep two years ago, apparently it does so no longer.
      
      I had hoped to eradicate this issue in extending KSM page migration not
      to need the ksm_thread_mutex.  But then realized that although the page
      migration itself is safe, we do still need to lock out ksmd and other
      users of get_ksm_page() while offlining memory - at some point between
      MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
      vanish, and get_ksm_page()'s accesses to them become a violation.
      
      So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
      MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
      checks, to achieve the same lockout without being caught by lockdep.
      This is less elegant for KSM, but it's more important to keep lockdep
      useful to other users - and I apologize for how long it took to fix.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Tested-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef4d43a8
    • Hugh Dickins's avatar
      ksm: make !merge_across_nodes migration safe · 4146d2d6
      Hugh Dickins authored
      The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
      set to non-default 0: if a KSM page is migrated to a different NUMA node,
      how do we migrate its stable node to the right tree?  And what if that
      collides with an existing stable node?
      
      ksm_migrate_page() can do no more than it's already doing, updating
      stable_node->kpfn: the stable tree itself cannot be manipulated without
      holding ksm_thread_mutex.  So accept that a stable tree may temporarily
      indicate a page belonging to the wrong NUMA node, leave updating until the
      next pass of ksmd, just be careful not to merge other pages on to a
      misplaced page.  Note nid of holding tree in stable_node, and recognize
      that it will not always match nid of kpfn.
      
      A misplaced KSM page is discovered, either when ksm_do_scan() next comes
      around to one of its rmap_items (we now have to go to cmp_and_merge_page
      even on pages in a stable tree), or when stable_tree_search() arrives at a
      matching node for another page, and this node page is found misplaced.
      
      In each case, move the misplaced stable_node to a list of migrate_nodes
      (and use the address of migrate_nodes as magic by which to identify them):
      we don't need them in a tree.  If stable_tree_search() finds no match for
      a page, but it's currently exiled to this list, then slot its stable_node
      right there into the tree, bringing all of its mappings with it; otherwise
      they get migrated one by one to the original page of the colliding node.
      stable_tree_search() is now modelled more like stable_tree_insert(), in
      order to handle these insertions of migrated nodes.
      
      remove_node_from_stable_tree(), remove_all_stable_nodes() and
      ksm_check_stable_tree() have to handle the migrate_nodes list as well as
      the stable tree itself.  Less obviously, we do need to prune the list of
      stale entries from time to time (scan_get_next_rmap_item() does it once
      each full scan): whereas stale nodes in the stable tree get naturally
      pruned as searches try to brush past them, these migrate_nodes may get
      forgotten and accumulate.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4146d2d6
    • Hugh Dickins's avatar
      ksm: make KSM page migration possible · c8d6553b
      Hugh Dickins authored
      KSM page migration is already supported in the case of memory hotremove,
      which takes the ksm_thread_mutex across all its migrations to keep life
      simple.
      
      But the new KSM NUMA merge_across_nodes knob introduces a problem, when
      it's set to non-default 0: if a KSM page is migrated to a different NUMA
      node, how do we migrate its stable node to the right tree?  And what if
      that collides with an existing stable node?
      
      So far there's no provision for that, and this patch does not attempt to
      deal with it either.  But how will I test a solution, when I don't know
      how to hotremove memory?  The best answer is to enable KSM page migration
      in all cases now, and test more common cases.  With THP and compaction
      added since KSM came in, page migration is now mainstream, and it's a
      shame that a KSM page can frustrate freeing a page block.
      
      Without worrying about merge_across_nodes 0 for now, this patch gets KSM
      page migration working reliably for default merge_across_nodes 1 (but
      leave the patch enabling it until near the end of the series).
      
      It's much simpler than I'd originally imagined, and does not require an
      additional tier of locking: page migration relies on the page lock, KSM
      page reclaim relies on the page lock, the page lock is enough for KSM page
      migration too.
      
      Almost all the care has to be in get_ksm_page(): that's the function which
      worries about when a stable node is stale and should be freed, now it also
      has to worry about the KSM page being migrated.
      
      The only new overhead is an additional put/get/lock/unlock_page when
      stable_tree_search() arrives at a matching node: to make sure migration
      respects the raised page count, and so does not migrate the page while
      we're busy with it here.  That's probably avoidable, either by changing
      internal interfaces from using kpage to stable_node, or by moving the
      ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
      swapcache); but this works well, I've no urge to pull it apart now.
      
      (Descents of the stable tree may pass through nodes whose KSM pages are
      under migration: being unlocked, the raised page count does not prevent
      that, nor need it: it's safe to memcmp against either old or new page.)
      
      You might worry about mremap, and whether page migration's rmap_walk to
      remove migration entries will find all the KSM locations where it inserted
      earlier: that should already be handled, by the satisfyingly heavy hammer
      of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8d6553b
    • Hugh Dickins's avatar
      ksm: remove old stable nodes more thoroughly · cbf86cfe
      Hugh Dickins authored
      Switching merge_across_nodes after running KSM is liable to oops on stale
      nodes still left over from the previous stable tree.  It's not something
      that people will often want to do, but it would be lame to demand a reboot
      when they're trying to determine which merge_across_nodes setting is best.
      
      How can this happen?  We only permit switching merge_across_nodes when
      pages_shared is 0, and usually set run 2 to force that beforehand, which
      ought to unmerge everything: yet oopses still occur when you then run 1.
      
      Three causes:
      
      1. The old stable tree (built according to the inverse
         merge_across_nodes) has not been fully torn down.  A stable node
         lingers until get_ksm_page() notices that the page it references no
         longer references it: but the page is not necessarily freed as soon as
         expected, particularly when swapcache.
      
         Fix this with a pass through the old stable tree, applying
         get_ksm_page() to each of the remaining nodes (most found stale and
         removed immediately), with forced removal of any left over.  Unless the
         page is still mapped: I've not seen that case, it shouldn't occur, but
         better to WARN_ON_ONCE and EBUSY than BUG.
      
      2. __ksm_enter() has a nice little optimization, to insert the new mm
         just behind ksmd's cursor, so there's a full pass for it to stabilize
         (or be removed) before ksmd addresses it.  Nice when ksmd is running,
         but not so nice when we're trying to unmerge all mms: we were missing
         those mms forked and inserted behind the unmerge cursor.  Easily fixed
         by inserting at the end when KSM_RUN_UNMERGE.
      
      3.  It is possible for a KSM page to be faulted back from swapcache
         into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
         it.  Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
         private to ksm.c, so dissolve the distinction between
         ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
         the one call into ksm.c.
      
      A long outstanding, unrelated bugfix sneaks in with that third fix:
      ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
      error when read in from swap) to a page which it then marks Uptodate.  Fix
      this case by not copying, letting do_swap_page() discover the error.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cbf86cfe
    • Hugh Dickins's avatar
      ksm: get_ksm_page locked · 8aafa6a4
      Hugh Dickins authored
      In some places where get_ksm_page() is used, we need the page to be locked.
      
      When KSM migration is fully enabled, we shall want that to make sure that
      the page just acquired cannot be migrated beneath us (raised page count is
      only effective when there is serialization to make sure migration
      notices).  Whereas when navigating through the stable tree, we certainly
      do not want to lock each node (raised page count is enough to guarantee
      the memcmps, even if page is migrated to another node).
      
      Since we're about to add another use case, add the locked argument to
      get_ksm_page() now.
      
      Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
      really got the wrong end of the stick on that!  There's a configuration in
      which page_cache_get_speculative() can do something cheaper than
      get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
      disabled preemption for it.  There's no need for rcu_read_lock() around
      get_page_unless_zero() (and mapping checks) here.  Cut out that silliness
      before making this any harder to understand.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8aafa6a4
    • Hugh Dickins's avatar
      ksm: reorganize ksm_check_stable_tree · ee0ea59c
      Hugh Dickins authored
      Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
      (restarting whenever it finds a stale node to remove), but rearrange so
      that at least it does not needlessly restart from nid 0 each time.  And
      add a couple of comments: here is why we keep pfn instead of page.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee0ea59c
    • Hugh Dickins's avatar
      ksm: trivial tidyups · e850dcf5
      Hugh Dickins authored
      Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
      CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid
      when not NUMA).  Add comment, remove "unsigned" from rmap_item->nid, as
      "int nid" elsewhere.  Define ksm_merge_across_nodes 1U when #ifndef NUMA
      to help optimizing out.  Use ?: in get_kpfn_nid().  Adjust a few
      comments noticed in ongoing work.
      
      Leave stable_tree_insert()'s rb_linkage until after the node has been
      set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page
      lock make either way safe, but we're going to copy and I prefer this
      precedent.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e850dcf5
    • Petr Holasek's avatar
      ksm: allow trees per NUMA node · 90bd6fd3
      Petr Holasek authored
      Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
      Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
      we had with that, fully enabling KSM page migration on the way.
      
      (A different kind of KSM/NUMA issue which I've certainly not begun to
      address here: when KSM pages are unmerged, there's usually no sense in
      preferring to allocate the new pages local to the caller's node.)
      
      This patch:
      
      Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
      which control merging pages across different numa nodes.  When it is set
      to zero only pages from the same node are merged, otherwise pages from
      all nodes can be merged together (default behavior).
      
      Typical use-case could be a lot of KVM guests on NUMA machine and cpus
      from more distant nodes would have significant increase of access
      latency to the merged ksm page.  Sysfs knob was choosen for higher
      variability when some users still prefers higher amount of saved
      physical memory regardless of access latency.
      
      Every numa node has its own stable & unstable trees because of faster
      searching and inserting.  Changing of merge_across_nodes value is
      possible only when there are not any ksm shared pages in system.
      
      I've tested this patch on numa machines with 2, 4 and 8 nodes and
      measured speed of memory access inside of KVM guests with memory pinned
      to one of nodes with this benchmark:
      
        http://pholasek.fedorapeople.org/alloc_pg.c
      
      Population standard deviations of access times in percentage of average
      were following:
      
      merge_across_nodes=1
      2 nodes 1.4%
      4 nodes 1.6%
      8 nodes	1.7%
      
      merge_across_nodes=0
      2 nodes	1%
      4 nodes	0.32%
      8 nodes	0.018%
      
      RFC: https://lkml.org/lkml/2011/11/30/91
      v1: https://lkml.org/lkml/2012/1/23/46
      v2: https://lkml.org/lkml/2012/6/29/105
      v3: https://lkml.org/lkml/2012/9/14/550
      v4: https://lkml.org/lkml/2012/9/23/137
      v5: https://lkml.org/lkml/2012/12/10/540
      v6: https://lkml.org/lkml/2012/12/23/154
      v7: https://lkml.org/lkml/2012/12/27/225
      
      Hugh notes that this patch brings two problems, whose solution needs
      further support in mm/ksm.c, which follows in subsequent patches:
      
      1) switching merge_across_nodes after running KSM is liable to oops
         on stale nodes still left over from the previous stable tree;
      
      2) memory hotremove may migrate KSM pages, but there is no provision
         here for !merge_across_nodes to migrate nodes to the proper tree.
      Signed-off-by: default avatarPetr Holasek <pholasek@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90bd6fd3
    • Sasha Levin's avatar
      mm/ksm.c: use new hashtable implementation · 4ca3a69b
      Sasha Levin authored
      Switch ksm to use the new hashtable implementation.  This reduces the
      amount of generic unrelated code in the ksm module.
      Signed-off-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ca3a69b
    • Johannes Weiner's avatar
      mm: reduce rmap overhead for ex-KSM page copies created on swap faults · af34770e
      Johannes Weiner authored
      When ex-KSM pages are faulted from swap cache, the fault handler is not
      capable of re-establishing anon_vma-spanning KSM pages.  In this case, a
      copy of the page is created instead, just like during a COW break.
      
      These freshly made copies are known to be exclusive to the faulting VMA
      and there is no reason to go look for this page in parent and sibling
      processes during rmap operations.
      
      Use page_add_new_anon_rmap() for these copies.  This also puts them on
      the proper LRU lists and marks them SwapBacked, so we can get rid of
      doing this ad-hoc in the KSM copy code.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af34770e
  17. 20 Dec, 2012 1 commit