1. 20 Jan, 2016 13 commits
  2. 18 Jan, 2016 1 commit
  3. 17 Jan, 2016 1 commit
    • Doron Tsur's avatar
      net/mlx5_core: Fix trimming down IRQ number · 0b6e26ce
      Doron Tsur authored
      With several ConnectX-4 cards installed on a server, one may receive
      irqn > 255 from the kernel API, which we mistakenly trim to 8bit.
      
      This causes EQ creation failure with the following stack trace:
      [<ffffffff812a11f4>] dump_stack+0x48/0x64
      [<ffffffff810ace21>] __setup_irq+0x3a1/0x4f0
      [<ffffffff810ad7e0>] request_threaded_irq+0x120/0x180
      [<ffffffffa0923660>] ? mlx5_eq_int+0x450/0x450 [mlx5_core]
      [<ffffffffa0922f64>] mlx5_create_map_eq+0x1e4/0x2b0 [mlx5_core]
      [<ffffffffa091de01>] alloc_comp_eqs+0xb1/0x180 [mlx5_core]
      [<ffffffffa091ea99>] mlx5_dev_init+0x5e9/0x6e0 [mlx5_core]
      [<ffffffffa091ec29>] init_one+0x99/0x1c0 [mlx5_core]
      [<ffffffff812e2afc>] local_pci_probe+0x4c/0xa0
      
      Fixing it by changing of the irqn type from u8 to unsigned int to
      support values > 255
      
      Fixes: 61d0e73e ('net/mlx5_core: Use the the real irqn in eq->irqn')
      Reported-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDoron Tsur <doront@mellanox.com>
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0b6e26ce
  4. 16 Jan, 2016 6 commits
  5. 15 Jan, 2016 19 commits
    • Wang Xiaoqiang's avatar
      mm/mlock.c: change can_do_mlock return value type to boolean · 7f43add4
      Wang Xiaoqiang authored
      Since can_do_mlock only return 1 or 0, so make it boolean.
      
      No functional change.
      
      [akpm@linux-foundation.org: update declaration in mm.h]
      Signed-off-by: default avatarWang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f43add4
    • Kirill A. Shutemov's avatar
      memblock: fix section mismatch · 036fbb21
      Kirill A. Shutemov authored
      allmodconfig produces following warning for me:
      
        WARNING: vmlinux.o(.text.unlikely+0x10314): Section mismatch in reference from the function movable_node_is_enabled() to the variable .meminit.data:movable_node_enabled
        The function movable_node_is_enabled() references
        the variable __meminitdata movable_node_enabled.
        This is often because movable_node_is_enabled lacks a __meminitdata
        annotation or the annotation of movable_node_enabled is wrong.
      
      Let's mark the function with __meminit.  It fixes the warning.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      036fbb21
    • Dominik Dingel's avatar
      mm: bring in additional flag for fixup_user_fault to signal unlock · 4a9e1cda
      Dominik Dingel authored
      During Jason's work with postcopy migration support for s390 a problem
      regarding gmap faults was discovered.
      
      The gmap code will call fixup_user_fault which will end up always in
      handle_mm_fault.  Till now we never cared about retries, but as the
      userfaultfd code kind of relies on it.  this needs some fix.
      
      This patchset does not take care of the futex code.  I will now look
      closer at this.
      
      This patch (of 2):
      
      With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
      to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
      faulting we ever unlocked mmap_sem.
      
      This patch brings in the logic to handle retries as well as it cleans up
      the current documentation.  fixup_user_fault was not having the same
      semantics as filemap_fault.  It never indicated if a retry happened and
      so a caller wasn't able to handle that case.  So we now changed the
      behaviour to always retry a locked mmap_sem.
      Signed-off-by: default avatarDominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a9e1cda
    • Dan Williams's avatar
      mm, x86: get_user_pages() for dax mappings · 3565fce3
      Dan Williams authored
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
      return from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3565fce3
    • Dan Williams's avatar
      mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd · 5c7fb56e
      Dan Williams authored
      A dax-huge-page mapping while it uses some thp helpers is ultimately not
      a transparent huge page.  The distinction is especially important in the
      get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
      from pmd_huge() and pmd_trans_huge() which have slightly different
      semantics.
      
      Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
      pmd_devmap() checks.
      
      [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in  __split_huge_pmd()]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5c7fb56e
    • Dan Williams's avatar
      mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup · 5c2c2587
      Dan Williams authored
      get_dev_page() enables paths like get_user_pages() to pin a dynamically
      mapped pfn-range (devm_memremap_pages()) while the resulting struct page
      objects are in use.  Unlike get_page() it may fail if the device is, or
      is in the process of being, disabled.  While the initial lookup of the
      range may be an expensive list walk, the result is cached to speed up
      subsequent lookups which are likely to be in the same mapped range.
      
      devm_memremap_pages() now requires a reference counter to be specified
      at init time.  For pmem this means moving request_queue allocation into
      pmem_alloc() so the existing queue usage counter can track "device
      pages".
      
      ZONE_DEVICE pages always have an elevated count and will never be on an
      lru reclaim list.  That space in 'struct page' can be redirected for
      other uses, but for safety introduce a poison value that will always
      trip __list_add() to assert.  This allows half of the struct list_head
      storage to be reclaimed with some assurance to back up the assumption
      that the page count never goes to zero and a list_add() is never
      attempted.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5c2c2587
    • Dan Williams's avatar
      mm, dax: convert vmf_insert_pfn_pmd() to pfn_t · f25748e3
      Dan Williams authored
      Similar to the conversion of vm_insert_mixed() use pfn_t in the
      vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
      pfn is backed by a devm_memremap_pages() mapping.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f25748e3
    • Dan Williams's avatar
      mm, dax, gpu: convert vm_insert_mixed to pfn_t · 01c8f1c4
      Dan Williams authored
      Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
      evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
      _PAGE_DEVMAP to be set in the resulting pte.
      
      There are no functional changes to the gpu drivers as a result of this
      conversion.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01c8f1c4
    • Dan Williams's avatar
      hugetlb: fix compile error on tile · 888cdbc2
      Dan Williams authored
      Inlude asm/pgtable.h to get the definition for pud_t to fix:
      
        include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Liviu Dudau <liviu.dudau@arm.com>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      888cdbc2
    • Dan Williams's avatar
      x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc
      Dan Williams authored
      In support of providing struct page for large persistent memory
      capacities, use struct vmem_altmap to change the default policy for
      allocating memory for the memmap array.  The default vmemmap_populate()
      allocates page table storage area from the page allocator.  Given
      persistent memory capacities relative to DRAM it may not be feasible to
      store the memmap in 'System Memory'.  Instead vmem_altmap represents
      pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
      requests.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b94ffdc
    • Dan Williams's avatar
      mm: introduce find_dev_pagemap() · 9476df7d
      Dan Williams authored
      There are several scenarios where we need to retrieve and update
      metadata associated with a given devm_memremap_pages() mapping, and the
      only lookup key available is a pfn in the range:
      
      1/ We want to augment vmemmap_populate() (called via arch_add_memory())
         to allocate memmap storage from pre-allocated pages reserved by the
         device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
         rather than page allocator pages.  This is in support of
         devm_memremap_pages() mappings where the memmap is too large to fit in
         main memory (i.e. large persistent memory devices).
      
      2/ Taking a reference against the mapping when inserting device pages
         into the address_space radix of a given inode.  This facilitates
         unmap_mapping_range() and truncate_inode_pages() operations when the
         driver is tearing down the mapping.
      
      3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
         reference against the mapping so that the driver teardown path can
         revoke and drain usage of device pages.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9476df7d
    • Dan Williams's avatar
      mm: skip memory block registration for ZONE_DEVICE · 260ae3f7
      Dan Williams authored
      Prevent userspace from trying and failing to online ZONE_DEVICE pages
      which are meant to never be onlined.
      
      For example on platforms with a udev rule like the following:
      
        SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
      
      ...will generate futile attempts to online the ZONE_DEVICE sections.
      Example kernel messages:
      
          Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1004747
          Policy zone: Normal
          online_pages [mem 0x248000000-0x24fffffff] failed
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      260ae3f7
    • Dan Williams's avatar
      mm, dax, pmem: introduce pfn_t · 34c0fd54
      Dan Williams authored
      For the purpose of communicating the optional presence of a 'struct
      page' for the pfn returned from ->direct_access(), introduce a type that
      encapsulates a page-frame-number plus flags.  These flags contain the
      historical "page_link" encoding for a scatterlist entry, but can also
      denote "device memory".  Where "device memory" is a set of pfns that are
      not part of the kernel's linear mapping by default, but are accessed via
      the same memory controller as ram.
      
      The motivation for this new type is large capacity persistent memory
      that needs struct page entries in the 'memmap' to support 3rd party DMA
      (i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
      we also need it in support of maintaining a list of mapped inodes which
      need to be unmapped at driver teardown or freeze_bdev() time.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34c0fd54
    • Dan Williams's avatar
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams authored
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
    • Dan Williams's avatar
      dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() · b2e0d162
      Dan Williams authored
      The DAX implementation needs to protect new calls to ->direct_access()
      and usage of its return value against the driver for the underlying
      block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
      hold off blk_cleanup_queue() from proceeding, or otherwise fail new
      mapping requests if the request_queue is being torn down.
      
      This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
      through dax_map_atomic() to bdev_direct_access().
      
      [willy@linux.intel.com: fix read() of a hole]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2e0d162
    • Minchan Kim's avatar
      mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called · b8d3c4c3
      Minchan Kim authored
      We don't need to split THP page when MADV_FREE syscall is called if
      [start, len] is aligned with THP size.  The split could be done when VM
      decide to free it in reclaim path if memory pressure is heavy.  With
      that, we could avoid unnecessary THP split.
      
      For the feature, this patch changes pte dirtness marking logic of THP.
      Now, it marks every ptes of pages dirty unconditionally in splitting,
      which makes MADV_FREE void.  So, instead, this patch propagates pmd
      dirtiness to all pages via PG_dirty and restores pte dirtiness from
      PG_dirty.  With this, if pmd is clean(ie, MADV_FREEed) when split
      happens(e,g, shrink_page_list), all of pages are clean too so we could
      discard them.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8d3c4c3
    • Minchan Kim's avatar
      mm: move lazily freed pages to inactive list · 10853a03
      Minchan Kim authored
      MADV_FREE is a hint that it's okay to discard pages if there is memory
      pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
      them so there is no value keeping them in the active anonymous LRU so
      this patch moves them to inactive LRU list's head.
      
      This means that MADV_FREE-ed pages which were living on the inactive
      list are reclaimed first because they are more likely to be cold rather
      than recently active pages.
      
      An arguable issue for the approach would be whether we should put the
      page to the head or tail of the inactive list.  I chose head because the
      kernel cannot make sure it's really cold or warm for every MADV_FREE
      usecase but at least we know it's not *hot*, so landing of inactive head
      would be a comprimise for various usecases.
      
      This fixes suboptimal behavior of MADV_FREE when pages living on the
      active list will sit there for a long time even under memory pressure
      while the inactive list is reclaimed heavily.  This basically breaks the
      whole purpose of using MADV_FREE to help the system to free memory which
      is might not be used.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10853a03
    • Minchan Kim's avatar
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim authored
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • Vladimir Davydov's avatar
      mm: add page_check_address_transhuge() helper · 8749cfea
      Vladimir Davydov authored
      page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
      code for looking up pte of a (possibly transhuge) page.  Move this code
      to a new helper function, page_check_address_transhuge(), and make the
      above mentioned functions use it.
      
      This is just a cleanup, no functional changes are intended.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Reviewed-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8749cfea