1. 06 Aug, 2014 3 commits
  2. 30 Jul, 2014 1 commit
  3. 23 Jul, 2014 1 commit
  4. 04 Jun, 2014 5 commits
  5. 07 May, 2014 1 commit
  6. 25 Apr, 2014 1 commit
    • Linus Torvalds's avatar
      mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts · 1cf35d47
      Linus Torvalds authored
      
      
      The mmu-gather operation 'tlb_flush_mmu()' has done two things: the
      actual tlb flush operation, and the batched freeing of the pages that
      the TLB entries pointed at.
      
      This splits the operation into separate phases, so that the forced
      batched flushing done by zap_pte_range() can now do the actual TLB flush
      while still holding the page table lock, but delay the batched freeing
      of all the pages to after the lock has been dropped.
      
      This in turn allows us to avoid a race condition between
      set_page_dirty() (as called by zap_pte_range() when it finds a dirty
      shared memory pte) and page_mkclean(): because we now flush all the
      dirty page data from the TLB's while holding the pte lock,
      page_mkclean() will be held up walking the (recently cleaned) page
      tables until after the TLB entries have been flushed from all CPU's.
      
      Reported-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1cf35d47
  7. 22 Apr, 2014 1 commit
    • Linus Torvalds's avatar
      mm: make fixup_user_fault() check the vma access rights too · 1b17844b
      Linus Torvalds authored
      
      
      fixup_user_fault() is used by the futex code when the direct user access
      fails, and the futex code wants it to either map in the page in a usable
      form or return an error.  It relied on handle_mm_fault() to map the
      page, and correctly checked the error return from that, but while that
      does map the page, it doesn't actually guarantee that the page will be
      mapped with sufficient permissions to be then accessed.
      
      So do the appropriate tests of the vma access rights by hand.
      
      [ Side note: arguably handle_mm_fault() could just do that itself, but
        we have traditionally done it in the caller, because some callers -
        notably get_user_pages() - have been able to access pages even when
        they are mapped with PROT_NONE.  Maybe we should re-visit that design
        decision, but in the meantime this is the minimal patch. ]
      
      Found by Dave Jones running his trinity tool.
      
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b17844b
  8. 07 Apr, 2014 5 commits
    • Miklos Szeredi's avatar
      mm: remove unused arg of set_page_dirty_balance() · ed6d7c8e
      Miklos Szeredi authored
      There's only one caller of set_page_dirty_balance() and that will call it
      with page_mkwrite == 0.
      
      The page_mkwrite argument was unused since commit b827e496
      
       "mm: close
      page_mkwrite races".
      
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed6d7c8e
    • Michal Hocko's avatar
      memcg: rename high level charging functions · d715ae08
      Michal Hocko authored
      
      
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08
    • Kirill A. Shutemov's avatar
      mm: add debugfs tunable for fault_around_order · 1592eef0
      Kirill A. Shutemov authored
      
      
      Let's allow people to tweak faultaround at runtime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1592eef0
    • Kirill A. Shutemov's avatar
      mm: introduce vm_ops->map_pages() · 8c6e50b0
      Kirill A. Shutemov authored
      
      
      Here's new version of faultaround patchset.  It took a while to tune it
      and collect performance data.
      
      First patch adds new callback ->map_pages to vm_operations_struct.
      
      ->map_pages() is called when VM asks to map easy accessible pages.
      Filesystem should find and map pages associated with offsets from
      "pgoff" till "max_pgoff".  ->map_pages() is called with page table
      locked and must not block.  If it's not possible to reach a page without
      blocking, filesystem should skip it.  Filesystem should use do_set_pte()
      to setup page table entry.  Pointer to entry associated with offset
      "pgoff" is passed in "pte" field in vm_fault structure.  Pointers to
      entries for other offsets should be calculated relative to "pte".
      
      Currently VM use ->map_pages only on read page fault path.  We try to
      map FAULT_AROUND_PAGES a time.  FAULT_AROUND_PAGES is 16 for now.
      Performance data for different FAULT_AROUND_ORDER is below.
      
      TODO:
       - implement ->map_pages() for shmem/tmpfs;
       - modify get_user_pages() to be able to use ->map_pages() and implement
         mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
      
      =========================================================================
      Tested on 4-socket machine (120 threads) with 128GiB of RAM.
      
      Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
      somewhere between 3 and 5. Let's say 4 :)
      
      Linux build (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
      	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
      Linux rebuild (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
      	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
      Git test suite (make -j60 test)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
      	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413
      
      Two synthetic tests: access every word in file in sequential/random order.
      It doesn't improve much after FAULT_AROUND_ORDER == 4.
      
      Sequential access 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
      	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
       8 threads
      	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
      	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
       32 threads
      	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
      	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
       60 threads
      	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
      	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
       120 threads
      	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
      	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
      Random access 1GiB file
       1 thread
      	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
      	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
       8 threads
      	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
      	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
       32 threads
      	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
      	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
       60 threads
      	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
      	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
       120 threads
      	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
      	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594
      
      Touch only one page in page table in 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
      	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
       8 threads
      	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
      	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
       32 threads
      	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
      	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
       60 threads
      	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
      	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
       120 threads
      	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
      	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125
      
      This patch (of 2):
      
      Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
      accessible pages around fault address.
      
      On read page fault, if filesystem provides ->map_pages(), we try to map up
      to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
      number of minor page faults.
      
      We call ->map_pages first and use ->fault() as fallback if page by the
      offset is not ready to be mapped (cold page cache or something).
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c6e50b0
    • Davidlohr Bueso's avatar
      mm/memory.c: update comment in unmap_single_vma() · 7aa6b4ad
      Davidlohr Bueso authored
      
      
      The described issue now occurs inside mmap_region().  And unfortunately
      is still valid.
      
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7aa6b4ad
  9. 04 Apr, 2014 1 commit
    • Hugh Dickins's avatar
      mm: get_user_pages(write,force) refuse to COW in shared areas · cda540ac
      Hugh Dickins authored
      
      
      get_user_pages(write=1, force=1) has always had odd behaviour on write-
      protected shared mappings: although it demands FMODE_WRITE-access to the
      underlying object (do_mmap_pgoff sets neither VM_SHARED nor VM_MAYWRITE
      without that), it ends up with do_wp_page substituting private anonymous
      Copied-On-Write pages for the shared file pages in the area.
      
      That was long ago intentional, as a safety measure to prevent ptrace
      setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently
      corrupting the underlying executable.  Yet exec and dynamic loaders open
      the file read-only, and use MAP_PRIVATE rather than MAP_SHARED.
      
      The traditional odd behaviour still causes surprises and bugs in mm, and
      is probably not what any caller wants - even the comment on the flag
      says "You do not want this" (although it's undoubtedly necessary for
      overriding userspace protections in some contexts, and good when !write).
      
      Let's stop doing that.  But it would be dangerous to remove the long-
      standing safety at this stage, so just make get_user_pages(write,force)
      fail with EFAULT when applied to a write-protected shared area.
      Infiniband may in future want to force write through to underlying
      object: we can add another FOLL_flag later to enable that if required.
      
      Odd though the old behaviour was, there is no doubt that we may turn out
      to break userspace with this change, and have to revert it quickly.
      Issue a WARN_ON_ONCE to help debug the changed case (easily triggered by
      userspace, so only once to prevent spamming the logs); and delay a few
      associated cleanups until this change is proved.
      
      get_user_pages callers who might see trouble from this change:
        ptrace poking, or writing to /proc/<pid>/mem
        drivers/infiniband/
        drivers/media/v4l2-core/
        drivers/gpu/drm/exynos/exynos_drm_gem.c
        drivers/staging/tidspbridge/core/tiomap3430.c
      if they ever apply get_user_pages to write-protected shared mappings
      of an object which was opened for writing.
      
      I went to apply the same change to mm/nommu.c, but retreated.  NOMMU has
      no place for COW, and its VM_flags conventions are not the same: I'd be
      more likely to screw up NOMMU than make an improvement there.
      
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cda540ac
  10. 03 Apr, 2014 8 commits
  11. 25 Feb, 2014 2 commits
  12. 23 Jan, 2014 2 commits
    • Sasha Levin's avatar
      mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin authored
      
      
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      I've recently noticed based on the requests to add a small piece of code
      that dumps the page to various VM_BUG_ON sites that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      309381fe
    • Dave Hansen's avatar
      mm: print more details for bad_page() · f0b791a3
      Dave Hansen authored
      
      
      bad_page() is cool in that it prints out a bunch of data about the page.
      But, I can never remember which page flags are good and which are bad,
      or whether ->index or ->mapping is required to be NULL.
      
      This patch allows bad/dump_page() callers to specify a string about why
      they are dumping the page and adds explanation strings to a number of
      places.  It also adds a 'bad_flags' argument to bad_page(), which it
      then dumps out separately from the flags which are actually set.
      
      This way, the messages will show specifically why the page was bad,
      *specifically* which flags it is complaining about, if it was a page
      flag combination which was the problem.
      
      [akpm@linux-foundation.org: switch to pr_alert]
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0b791a3
  13. 21 Jan, 2014 2 commits
    • Kirill A. Shutemov's avatar
      mm: create a separate slab for page->ptl allocation · b35f1819
      Kirill A. Shutemov authored
      
      
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
      is 72 bytes.  For page->ptl they will be allocated from kmalloc-96 slab,
      so we loose 24 on each.  An average system can easily allocate few tens
      thousands of page->ptl and overhead is significant.
      
      Let's create a separate slab for page->ptl allocation to solve this.
      
      To make sure that it really works this time, some numbers from my test
      machine (just booted, no load):
      
      Before:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        kmalloc-96         31987  32190    128   30    1 : tunables  120   60    8 : slabdata   1073   1073     92
      After:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        page->ptl          27516  28143     72   53    1 : tunables  120   60    8 : slabdata    531    531      9
        kmalloc-96          3853   5280    128   30    1 : tunables  120   60    8 : slabdata    176    176      0
      
      Note that the patch is useful not only for debug case, but also for
      PREEMPT_RT, where spinlock_t is always bloated.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b35f1819
    • Dan Williams's avatar
      dma-debug: introduce debug_dma_assert_idle() · 0abdd7a8
      Dan Williams authored
      Record actively mapped pages and provide an api for asserting a given
      page is dma inactive before execution proceeds.  Placing
      debug_dma_assert_idle() in cow_user_page() flagged the violation of the
      dma-api in the NET_DMA implementation (see commit 77873803
      
       "net_dma:
      mark broken").
      
      The implementation includes the capability to count, in a limited way,
      repeat mappings of the same page that occur without an intervening
      unmap.  This 'overlap' counter is limited to the few bits of tag space
      in a radix tree.  This mechanism is added to mitigate false negative
      cases where, for example, a page is dma mapped twice and
      debug_dma_assert_idle() is called after the page is un-mapped once.
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0abdd7a8
  14. 20 Dec, 2013 2 commits
  15. 20 Nov, 2013 1 commit
  16. 14 Nov, 2013 4 commits