1. 13 Jan, 2011 4 commits
  2. 26 Oct, 2010 4 commits
    • Namhyung Kim's avatar
      mm: wrap get_locked_pte() using __cond_lock() · 25ca1d6c
      Namhyung Kim authored
      
      
      The get_locked_pte() conditionally grabs 'ptl' in case of returning
      non-NULL.  This leads sparse to complain about context imbalance.  Rename
      and wrap it using __cond_lock() to make sparse happy.
      Signed-off-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25ca1d6c
    • Michel Lespinasse's avatar
      mm: retry page fault when blocking on disk transfer · d065bd81
      Michel Lespinasse authored
      
      
      This change reduces mmap_sem hold times that are caused by waiting for
      disk transfers when accessing file mapped VMAs.
      
      It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
      site wants mmap_sem to be released if blocking on a pending disk transfer.
      In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
      do_page_fault() will then re-acquire mmap_sem and retry the page fault.
      
      It is expected that the retry will hit the same page which will now be
      cached, and thus it will complete with a low mmap_sem hold time.
      
      Tests:
      
      - microbenchmark: thread A mmaps a large file and does random read accesses
        to the mmaped area - achieves about 55 iterations/s. Thread B does
        mmap/munmap in a loop at a separate location - achieves 55 iterations/s
        before, 15000 iterations/s after.
      
      - We are seeing related effects in some applications in house, which show
        significant performance regressions when running without this change.
      
      [akpm@linux-foundation.org: fix warning & crash]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatar"H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d065bd81
    • Will Deacon's avatar
      mm: fix typo in mm.h when NODE_NOT_IN_PAGE_FLAGS · bce54bbf
      Will Deacon authored
      
      
      NODE_NOT_IN_PAGE_FLAGS is defined in mm.h when the node information is not
      stored in the page flags bitmap.
      
      Unfortunately, there's a typo in one of the checks for it.  This patch
      fixes it (s/NODE_NOT_IN_PAGEFLAGS/NODE_NOT_IN_PAGE_FLAGS/).  Since this
      has been around for ages, I doubt it's been causing any serious problems.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bce54bbf
    • Michael Rubin's avatar
      mm: add account_page_writeback() · f629d1c9
      Michael Rubin authored
      
      
      To help developers and applications gain visibility into writeback
      behaviour this patch adds two counters to /proc/vmstat.
      
        # grep nr_dirtied /proc/vmstat
        nr_dirtied 3747
        # grep nr_written /proc/vmstat
        nr_written 3618
      
      These entries allow user apps to understand writeback behaviour over time
      and learn how it is impacting their performance.  Currently there is no
      way to inspect dirty and writeback speed over time.  It's not possible for
      nr_dirty/nr_writeback.
      
      These entries are necessary to give visibility into writeback behaviour.
      We have /proc/diskstats which lets us understand the io in the block
      layer.  We have blktrace for more in depth understanding.  We have
      e2fsprogs and debugsfs to give insight into the file systems behaviour,
      but we don't offer our users the ability understand what writeback is
      doing.  There is no way to know how active it is over the whole system, if
      it's falling behind or to quantify it's efforts.  With these values
      exported users can easily see how much data applications are sending
      through writeback and also at what rates writeback is processing this
      data.  Comparing the rates of change between the two allow developers to
      see when writeback is not able to keep up with incoming traffic and the
      rate of dirty memory being sent to the IO back end.  This allows folks to
      understand their io workloads and track kernel issues.  Non kernel
      engineers at Google often use these counters to solve puzzling performance
      problems.
      
      Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written
      
      Patch #5 add writeback thresholds to /proc/vmstat
      
      Currently these values are in debugfs. But they should be promoted to
      /proc since they are useful for developers who are writing databases
      and file servers and are not debugging the kernel.
      
      The output is as below:
      
       # grep threshold /proc/vmstat
       nr_pages_dirty_threshold 409111
       nr_pages_dirty_background_threshold 818223
      
      This patch:
      
      This allows code outside of the mm core to safely manipulate page
      writeback state and not worry about the other accounting.  Not using these
      routines means that some code will lose track of the accounting and we get
      bugs.
      
      Modify nilfs2 to use interface.
      Signed-off-by: default avatarMichael Rubin <mrubin@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Jiro SEKIBA <jir@unicus.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f629d1c9
  3. 08 Oct, 2010 1 commit
    • Andi Kleen's avatar
      Encode huge page size for VM_FAULT_HWPOISON errors · aa50d3a7
      Andi Kleen authored
      
      
      This fixes a problem introduced with the hugetlb hwpoison handling
      
      The user space SIGBUS signalling wants to know the size of the hugepage
      that caused a HWPOISON fault.
      
      Unfortunately the architecture page fault handlers do not have easy
      access to the struct page.
      
      Pass the information out in the fault error code instead.
      
      I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode
      the hpage index in some free upper bits of the fault code. The small
      page hwpoison keeps stays with the VM_FAULT_HWPOISON name to minimize
      changes.
      
      Also add code to hugetlb.h to convert that index into a page shift.
      
      Will be used in a further patch.
      
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: fengguang.wu@intel.com
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      aa50d3a7
  4. 09 Sep, 2010 1 commit
  5. 28 Aug, 2010 1 commit
  6. 27 Aug, 2010 1 commit
  7. 24 Aug, 2010 1 commit
  8. 09 Aug, 2010 1 commit
    • Christoph Hellwig's avatar
      check ATTR_SIZE contraints in inode_change_ok · 2c27c65e
      Christoph Hellwig authored
      
      
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      2c27c65e
  9. 01 Aug, 2010 1 commit
    • Huang Ying's avatar
      KVM: Avoid killing userspace through guest SRAO MCE on unmapped pages · bf998156
      Huang Ying authored
      
      
      In common cases, guest SRAO MCE will cause corresponding poisoned page
      be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
      the MCE to guest OS.
      
      But it is reported that if the poisoned page is accessed in guest
      after unmapping and before MCE is relayed to guest OS, userspace will
      be killed.
      
      The reason is as follows. Because poisoned page has been un-mapped,
      guest access will cause guest exit and kvm_mmu_page_fault will be
      called. kvm_mmu_page_fault can not get the poisoned page for fault
      address, so kernel and user space MMIO processing is tried in turn. In
      user MMIO processing, poisoned page is accessed again, then userspace
      is killed by force_sig_info.
      
      To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
      and do not try kernel and user space MMIO processing for poisoned
      page.
      
      [xiao: fix warning introduced by avi]
      Reported-by: default avatarMax Asbock <masbock@linux.vnet.ibm.com>
      Signed-off-by: default avatarHuang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      bf998156
  10. 18 Jul, 2010 1 commit
    • Dave Chinner's avatar
      mm: add context argument to shrinker callback · 7f8275d0
      Dave Chinner authored
      
      
      The current shrinker implementation requires the registered callback
      to have global state to work from. This makes it difficult to shrink
      caches that are not global (e.g. per-filesystem caches). Pass the shrinker
      structure to the callback so that users can embed the shrinker structure
      in the context the shrinker needs to operate on and get back to it in the
      callback via container_of().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      7f8275d0
  11. 25 May, 2010 3 commits
    • Chris Metcalf's avatar
      mm: make lowmem_page_address() use PFN_PHYS() for improved portability · c6f6b596
      Chris Metcalf authored
      
      
      This ensures that platforms with lowmem PAs above 32 bits work correctly
      by avoiding truncating the PA during a left shift.
      Signed-off-by: default avatarChris Metcalf <cmetcalf@tilera.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6f6b596
    • Mel Gorman's avatar
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      
      
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • Mel Gorman's avatar
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during... · a8bef8ff
      Mel Gorman authored
      
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
      
      Page migration requires rmap to be able to find all ptes mapping a page
      at all times, otherwise the migration entry can be instantiated, but it
      is possible to leave one behind if the second rmap_walk fails to find
      the page.  If this page is later faulted, migration_entry_to_page() will
      call BUG because the page is locked indicating the page was migrated by
      the migration PTE not cleaned up. For example
      
        kernel BUG at include/linux/swapops.h:105!
        invalid opcode: 0000 [#1] PREEMPT SMP
        ...
        Call Trace:
         [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
         [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
         [<ffffffff813099b5>] page_fault+0x25/0x30
         [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
         [<ffffffff8111329b>] search_binary_handler+0x173/0x313
         [<ffffffff81114896>] do_execve+0x219/0x30a
         [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
         [<ffffffff8100320a>] stub_execve+0x6a/0xc0
        RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
      
      There is a race between shift_arg_pages and migration that triggers this
      bug.  A temporary stack is setup during exec and later moved.  If
      migration moves a page in the temporary stack and the VMA is then removed
      before migration completes, the migration PTE may not be found leading to
      a BUG when the stack is faulted.
      
      This patch causes pages within the temporary stack during exec to be
      skipped by migration.  It does this by marking the VMA covering the
      temporary stack with an otherwise impossible combination of VMA flags.
      These flags are cleared when the temporary stack is moved to its final
      location.
      
      [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8bef8ff
  12. 07 Apr, 2010 1 commit
    • Naoya Horiguchi's avatar
      pagemap: fix pfn calculation for hugepage · 116354d1
      Naoya Horiguchi authored
      
      
      When we look into pagemap using page-types with option -p, the value of
      pfn for hugepages looks wrong (see below.) This is because pte was
      evaluated only once for one vma although it should be updated for each
      hugepage.  This patch fixes it.
      
        $ page-types -p 3277 -Nl -b huge
        voffset   offset  len     flags
        7f21e8a00 11e400  1       ___U___________H_G________________
        7f21e8a01 11e401  1ff     ________________TG________________
                     ^^^
        7f21e8c00 11e400  1       ___U___________H_G________________
        7f21e8c01 11e401  1ff     ________________TG________________
                     ^^^
      
      One hugepage contains 1 head page and 511 tail pages in x86_64 and each
      two lines represent each hugepage.  Voffset and offset mean virtual
      address and physical address in the page unit, respectively.  The
      different hugepages should not have the same offset value.
      
      With this patch applied:
      
        $ page-types -p 3386 -Nl -b huge
        voffset   offset   len    flags
        7fec7a600 112c00   1      ___UD__________H_G________________
        7fec7a601 112c01   1ff    ________________TG________________
                     ^^^
        7fec7a800 113200   1      ___UD__________H_G________________
        7fec7a801 113201   1ff    ________________TG________________
                     ^^^
                     OK
      
      More info:
      
      - This patch modifies walk_page_range()'s hugepage walker.  But the
        change only affects pagemap_read(), which is the only caller of hugepage
        callback.
      
      - Without this patch, hugetlb_entry() callback is called per vma, that
        doesn't match the natural expectation from its name.
      
      - With this patch, hugetlb_entry() is called per hugepte entry and the
        callback can become much simpler.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      116354d1
  13. 26 Mar, 2010 1 commit
    • Peter Zijlstra's avatar
      x86, perf, bts, mm: Delete the never used BTS-ptrace code · faa4602e
      Peter Zijlstra authored
      
      
      Support for the PMU's BTS features has been upstreamed in
      v2.6.32, but we still have the old and disabled ptrace-BTS,
      as Linus noticed it not so long ago.
      
      It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
      regard for other uses (perf) and doesn't provide the flexibility
      needed for perf either.
      
      Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
      was never used and ptrace-block-step can be implemented using a
      much simpler approach.
      
      So axe all 3000 lines of it. That includes the *locked_memory*()
      APIs in mm/mlock.c as well.
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20100325135413.938004390@chello.nl>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      faa4602e
  14. 12 Mar, 2010 2 commits
  15. 06 Mar, 2010 4 commits
    • Rik van Riel's avatar
      mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel authored
      
      
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • Rik van Riel's avatar
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel authored
      
      
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • KAMEZAWA Hiroyuki's avatar
      mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki authored
      
      
      Considering the nature of per mm stats, it's the shared object among
      threads and can be a cache-miss point in the page fault path.
      
      This patch adds per-thread cache for mm_counter.  RSS value will be
      counted into a struct in task_struct and synchronized with mm's one at
      events.
      
      Now, in this patch, the event is the number of calls to handle_mm_fault.
      Per-thread value is added to mm at each 64 calls.
      
       rough estimation with small benchmark on parallel thread (2threads) shows
       [before]
           4.5 cache-miss/faults
       [after]
           4.0 cache-miss/faults
       Anyway, the most contended object is mmap_sem if the number of threads grows.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34e55232
    • KAMEZAWA Hiroyuki's avatar
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki authored
      
      
      Presently, per-mm statistics counter is defined by macro in sched.h
      
      This patch modifies it to
        - defined in mm.h as inlinf functions
        - use array instead of macro's name creation.
      
      This patch is for reducing patch size in future patch to modify
      implementation of per-mm counter.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d559db08
  16. 12 Feb, 2010 2 commits
    • Yinghai Lu's avatar
      sparsemem: Put mem map for one node together. · 9bdac914
      Yinghai Lu authored
      
      
      Add vmemmap_alloc_block_buf for mem map only.
      
      It will fallback to the old way if it cannot get a block that big.
      
      Before this patch, when a node have 128g ram installed, memmap are
      split into two parts or more.
      [    0.000000]  [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
      [    0.000000]  [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
      [    0.000000]  [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
      [    0.000000]  [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
      [    0.000000]  [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
      [    0.000000]  [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
      [    0.000000]  [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
      [    0.000000]  [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
      [    0.000000]  [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
      [    0.000000]  [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
      [    0.000000]  [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
      [    0.000000]  [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
      [    0.000000]  [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
      [    0.000000]  [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
      [    0.000000]  [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
      
      after patch will get
      [    0.000000]  [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
      [    0.000000]  [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
      [    0.000000]  [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
      
      -v2: change buf to vmemmap_buf instead according to Ingo
           also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
      -v3: according to Andrew, use sizeof(name) instead of hard coded 15
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      9bdac914
    • Yinghai Lu's avatar
      x86: Make 64 bit use early_res instead of bootmem before slab · 08677214
      Yinghai Lu authored
      
      
      Finally we can use early_res to replace bootmem for x86_64 now.
      
      Still can use CONFIG_NO_BOOTMEM to enable it or not.
      
      -v2: fix 32bit compiling about MAX_DMA32_PFN
      -v3: folded bug fix from LKML message below
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      08677214
  17. 01 Feb, 2010 1 commit
  18. 16 Jan, 2010 1 commit
    • David Howells's avatar
      nommu: fix shared mmap after truncate shrinkage problems · 7e660872
      David Howells authored
      
      
      Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
      over the end of a truncation.  The problem is that
      ramfs_nommu_check_mappings() checks that the reduced file size against the
      VMA tree, but not the vm_region tree.
      
      The following sequence of events can cause the problem:
      
      	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
      	ftruncate(fd, 32 * 1024);
      	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	munmap(a, 32 * 1024);
      	ftruncate(fd, 16 * 1024);
      	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
      sees that the vm_region from 'a' is covering the region it wants and so
      shares it, pinning it in memory.
      
      Mapping 'a' then goes away and the file is truncated to the end of VMA
      'b'.  However, the region allocated by 'a' is still in effect, and has
      _not_ been reduced.
      
      Mapping 'c' is then created, and because there's a vm_region covering the
      desired region, get_unmapped_area() is _not_ called to repeat the check,
      and the mapping is granted, even though the pages from the latter half of
      the mapping have been discarded.
      
      However:
      
      	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      
      Mapping 'd' should work, and should end up sharing the region allocated by
      'a'.
      
      To deal with this, we shrink the vm_region struct during the truncation,
      lest do_mmap_pgoff() take it as licence to share the full region
      automatically without calling the get_unmapped_area() file op again.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e660872
  19. 04 Jan, 2010 1 commit
    • Christoph Lameter's avatar
      this_cpu: Page allocator conversion · 99dcc3e5
      Christoph Lameter authored
      
      
      Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
      
      This drastically reduces the size of struct zone for systems with large
      amounts of processors and allows placement of critical variables of struct
      zone in one cacheline even on very large systems.
      
      Another effect is that the pagesets of one processor are placed near one
      another. If multiple pagesets from different zones fit into one cacheline
      then additional cacheline fetches can be avoided on the hot paths when
      allocating memory from multiple zones.
      
      Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
      are reduced and we can drop the zone_pcp macro.
      
      Hotplug handling is also simplified since cpu alloc can bring up and
      shut down cpu areas for a specific cpu as a whole. So there is no need to
      allocate or free individual pagesets.
      
      V7-V8:
      - Explain chicken egg dilemmna with percpu allocator.
      
      V4-V5:
      - Fix up cases where per_cpu_ptr is called before irq disable
      - Integrate the bootstrap logic that was separate before.
      
      tj: Build failure in pageset_cpuup_callback() due to missing ret
          variable fixed.
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      99dcc3e5
  20. 16 Dec, 2009 5 commits
    • Yinghai Lu's avatar
      x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Yinghai Lu authored
      Found one system that boot from socket1 instead of socket0, SRAT get rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      the early_node_map is not sorted because node0 with non zero start come first.
      
      so try to sort it right away after all regions are registered.
      
      also fixs refression by 8716273c
      
       (x86: Export srat physical topology)
      
      -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
      -v3: update comments.
      Reported-and-tested-by: default avatarJens Axboe <jens.axboe@oracle.com>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      32996250
    • Andi Kleen's avatar
      HWPOISON: Add soft page offline support · facb6011
      Andi Kleen authored
      
      
      This is a simpler, gentler variant of memory_failure() for soft page
      offlining controlled from user space.  It doesn't kill anything, just
      tries to invalidate and if that doesn't work migrate the
      page away.
      
      This is useful for predictive failure analysis, where a page has
      a high rate of corrected errors, but hasn't gone bad yet. Instead
      it can be offlined early and avoided.
      
      The offlining is controlled from sysfs, including a new generic
      entry point for hard page offlining for symmetry too.
      
      We use the page isolate facility to prevent re-allocation
      race. Normally this is only used by memory hotplug. To avoid
      races with memory allocation I am using lock_system_sleep().
      This avoids the situation where memory hotplug is about
      to isolate a page range and then hwpoison undoes that work.
      This is a big hammer currently, but the simplest solution
      currently.
      
      When the page is not free or LRU we try to free pages
      from slab and other caches. The slab freeing is currently
      quite dumb and does not try to focus on the specific slab
      cache which might own the page. This could be potentially
      improved later.
      
      Thanks to Fengguang Wu and Haicheng Li for some fixes.
      
      [Added fix from Andrew Morton to adapt to new migrate_pages prototype]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      facb6011
    • Wu Fengguang's avatar
      HWPOISON: Add unpoisoning support · 847ce401
      Wu Fengguang authored
      
      
      The unpoisoning interface is useful for stress testing tools to
      reclaim poisoned pages (to prevent OOM)
      
      There is no hardware level unpoisioning, so this
      cannot be used for real memory errors, only for software injected errors.
      
      Note that it may leak pages silently - those who have been removed from
      LRU cache, but not isolated from page cache/swap cache at hwpoison time.
      Especially the stress test of dirty swap cache pages shall reboot system
      before exhausting memory.
      
      AK: Fix comments, add documentation, add printks, rename symbol
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      847ce401
    • Andi Kleen's avatar
      HWPOISON: Turn ref argument into flags argument · 82ba011b
      Andi Kleen authored
      
      
      Now that "ref" is just a boolean turn it into
      a flags argument. First step is only a single flag
      that makes the code's intention more clear, but more
      may follow.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      82ba011b
    • Andi Kleen's avatar
      HWPOISON: Be more aggressive at freeing non LRU caches · 588f9ce6
      Andi Kleen authored
      
      
      shake_page handles more types of page caches than lru_drain_all()
      
      - per cpu page allocator pages
      - per CPU LRU
      
      Stops early when the page became free.
      
      Used in followon patches.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      588f9ce6
  21. 15 Dec, 2009 3 commits
    • Naoya Horiguchi's avatar
      mm hugetlb: add hugepage support to pagemap · 5dc37642
      Naoya Horiguchi authored
      
      
      This patch enables extraction of the pfn of a hugepage from
      /proc/pid/pagemap in an architecture independent manner.
      
      Details
      -------
      My test program (leak_pagemap) works as follows:
       - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
       - read()/write() something on it,
       - call page-types with option -p,
       - munmap() and unlink() the file on hugetlbfs
      
      Without my patches
      ------------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000086c         81        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          5        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total        101        0
      
      The output of page-types don't show any hugepage.
      
      With my patches
      ---------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000030000      51100      199  ________________TG________________ compound_tail,huge
      0x0000000000028018        100        0  ___UD__________H_G________________ uptodate,dirty,compound_head,huge
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000080c          1        0  __RU_______M______________________ referenced,uptodate,mmap
      0x000000000000086c         80        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          4        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total      51300      200
      
      The output of page-types shows 51200 pages contributing to hugepages,
      containing 100 head pages and 51100 tail pages as expected.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dc37642
    • Huang Shijie's avatar
      include/linux/mm.h: remove unneeded ifdef · f096e59e
      Huang Shijie authored
      
      
      The check code for CONFIG_SWAP is redundant, because there is a
      non-CONFIG_SWAP version for PageSwapCache() which just returns 0.
      Signed-off-by: default avatarHuang Shijie <shijie8@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f096e59e
    • Hugh Dickins's avatar
      mm: define PAGE_MAPPING_FLAGS · 3ca7b3c5
      Hugh Dickins authored
      
      
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ca7b3c5