1. 17 Dec, 2012 1 commit
  2. 12 Dec, 2012 1 commit
    • Hugh Dickins's avatar
      tmpfs: support SEEK_DATA and SEEK_HOLE (reprise) · 220f2ac9
      Hugh Dickins authored
      Revert 3.5's commit f21f8062 ("tmpfs: revert SEEK_DATA and
      SEEK_HOLE") to reinstate 4fb5ef08 ("tmpfs: support SEEK_DATA and
      SEEK_HOLE"), with the intervening additional arg to
      generic_file_llseek_size().
      
      In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
      SEEK_DATA and SEEK_HOLE support; and a good case has now been made for
      it on tmpfs, so let's join the party.
      
      It's quite easy for tmpfs to scan the radix_tree to support llseek's new
      SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are
      still on my mind (in particular, the !PageUptodate-ness of pages
      fallocated but still unwritten).
      
      [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: Jeff liu <jeff.liu@oracle.com>
      Cc: Paul Eggert <eggert@cs.ucla.edu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      220f2ac9
  3. 06 Dec, 2012 1 commit
    • Mel Gorman's avatar
      tmpfs: fix shared mempolicy leak · 18a2f371
      Mel Gorman authored
      This fixes a regression in 3.7-rc, which has since gone into stable.
      
      Commit 00442ad0 ("mempolicy: fix a memory corruption by refcount
      imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
      refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
      on expecting alloc_page_vma() to drop the refcount it had acquired.
      This deserves a rework: but for now fix the leak in shmem_alloc_page().
      
      Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
      the same refcounting there as in shmem_alloc_page(), delete its onstack
      mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
      those were invented to let swapin_readahead() make an unknown number of
      calls to alloc_pages_vma() with one mempolicy; but since 00442ad0,
      alloc_pages_vma() has kept refcount in balance, so now no problem.
      Reported-and-tested-by: default avatarTommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18a2f371
  4. 16 Nov, 2012 2 commits
    • Hugh Dickins's avatar
      tmpfs: change final i_blocks BUG to WARNING · 0f3c42f5
      Hugh Dickins authored
      Under a particular load on one machine, I have hit shmem_evict_inode()'s
      BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
      race between swapout and eviction.
      
      It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
      and the lack of coherent locking between mapping's nrpages and shmem's
      swapped count.  There's a window in shmem_writepage(), between lowering
      nrpages in shmem_delete_from_page_cache() and then raising swapped
      count, when the freed count appears to be +1 when it should be 0, and
      then the asymmetry stops it from being corrected with -1 before hitting
      the BUG.
      
      One answer is coherent locking: using tree_lock throughout, without
      info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
      used_blocks makes that messier than expected.  Another answer may be a
      further effort to eliminate the weird shmem_recalc_inode() altogether,
      but previous attempts at that failed.
      
      So far undecided, but for now change the BUG_ON to WARN_ON: in usual
      circumstances it remains a useful consistency check.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f3c42f5
    • Hugh Dickins's avatar
      tmpfs: fix shmem_getpage_gfp() VM_BUG_ON · 215c02bc
      Hugh Dickins authored
      Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
      has converted to WARNING) in shmem_getpage_gfp():
      
        WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
        Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
        Call Trace:
          warn_slowpath_common+0x7f/0xc0
          warn_slowpath_null+0x1a/0x20
          shmem_getpage_gfp+0xa5c/0xa70
          shmem_fault+0x4f/0xa0
          __do_fault+0x71/0x5c0
          handle_pte_fault+0x97/0xae0
          handle_mm_fault+0x289/0x350
          __do_page_fault+0x18e/0x530
          do_page_fault+0x2b/0x50
          page_fault+0x28/0x30
          tracesys+0xe1/0xe6
      
      Thanks to Johannes for pointing to truncation: free_swap_and_cache()
      only does a trylock on the page, so the page lock we've held since
      before confirming swap is not enough to protect against truncation.
      
      What cleanup is needed in this case? Just delete_from_swap_cache(),
      which takes care of the memcg uncharge.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      215c02bc
  5. 09 Oct, 2012 2 commits
    • Hugh Dickins's avatar
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 35c2a7f4
      Hugh Dickins authored
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
      Reported-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      35c2a7f4
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov authored
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  6. 24 Aug, 2012 1 commit
    • Aristeu Rozanski's avatar
      xattr: extract simple_xattr code from tmpfs · 38f38657
      Aristeu Rozanski authored
      Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.
      
      $ size vmlinux.o
         text    data     bss     dec     hex filename
      4658782  880729 5195032 10734543         a3cbcf vmlinux.o
      $ size vmlinux.o
         text    data     bss     dec     hex filename
      4658957  880729 5195032 10734718         a3cc7e vmlinux.o
      
      v7:
      - checkpatch warnings fixed
      - Implement the changes requested by Hugh Dickins:
      	- make simple_xattrs_init and simple_xattrs_free inline
      	- get rid of locking and list reinitialization in simple_xattrs_free,
      	  they're not needed
      v6:
      - no changes
      v5:
      - no changes
      v4:
      - move simple_xattrs_free() to fs/xattr.c
      v3:
      - in kmem_xattrs_free(), reinitialize the list
      - use simple_xattr_* prefix
      - introduce simple_xattr_add() to prevent direct list usage
      Original-patch-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarAristeu Rozanski <aris@redhat.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      38f38657
  7. 31 Jul, 2012 1 commit
  8. 14 Jul, 2012 1 commit
  9. 11 Jul, 2012 3 commits
    • Hugh Dickins's avatar
      shmem: cleanup shmem_add_to_page_cache · b065b432
      Hugh Dickins authored
      shmem_add_to_page_cache() has three callsites, but only one of them wants
      the radix_tree_preload() (an exceptional entry guarantees that the radix
      tree node is present in the other cases), and only that site can achieve
      mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the
      other cases).  We did it this way originally to reflect
      add_to_page_cache_locked(); but it's confusing now, so move the radix_tree
      preloading and mem_cgroup uncharging to that one caller.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b065b432
    • Hugh Dickins's avatar
      shmem: fix negative rss in memcg memory.stat · d1899228
      Hugh Dickins authored
      When adding the page_private checks before calling shmem_replace_page(), I
      did realize that there is a further race, but thought it too unlikely to
      need a hurried fix.
      
      But independently I've been chasing why a mem cgroup's memory.stat
      sometimes shows negative rss after all tasks have gone: I expected it to
      be a stats gathering bug, but actually it's shmem swapping's fault.
      
      It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
      the page may have been removed from swapcache before getting the lock; or
      it may have been freed and reused and be back in swapcache; and it can
      even be using the same swap location as before (page_private same).
      
      The swapoff case is already secure against this (swap cannot be reused
      until the whole area has been swapped off, and a new swapped on); and
      shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for
      the expected radix_tree entry - but a little too late.
      
      By that time, we might have already decided to shmem_replace_page(): I
      don't know of a problem from that, but I'd feel more at ease not to do so
      spuriously.  And we have already done mem_cgroup_cache_charge(), on
      perhaps the wrong mem cgroup: and this charge is not then undone on the
      error path, because PageSwapCache ends up preventing that.
      
      It's this last case which causes the occasional negative rss in
      memory.stat: the page is charged here as cache, but (sometimes) found to
      be anon when eventually it's uncharged - and in between, it's an
      undeserved charge on the wrong memcg.
      
      Fix this by adding an earlier check on the radix_tree entry: it's
      inelegant to descend the tree twice, but swapping is not the fast path,
      and a better solution would need a pair (try+commit) of memcg calls, and a
      rework of shmem_replace_page() to keep out of the swapcache.
      
      We can use the added shmem_confirm_swap() function to replace the
      find_get_page+page_cache_release we were already doing on the error path.
      And add a comment on that -EEXIST: it seems a peculiar errno to be using,
      but originates from its use in radix_tree_insert().
      
      [It can be surprising to see positive rss left in a memcg's memory.stat
      after all tasks have gone, since it is supposed to count anonymous but not
      shmem.  Aside from sharing anon pages via fork with a task in some other
      memcg, it often happens after swapping: because a swap page can't be freed
      while under writeback, nor while locked.  So it's not an error, and these
      residual pages are easily freed once pressure demands.]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1899228
    • Hugh Dickins's avatar
      tmpfs: revert SEEK_DATA and SEEK_HOLE · f21f8062
      Hugh Dickins authored
      Revert 4fb5ef08 ("tmpfs: support SEEK_DATA and SEEK_HOLE").  I believe
      it's correct, and it's been nice to have from rc1 to rc6; but as the
      original commit said:
      
      I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
      would be of any use to them on tmpfs.  This code adds 92 lines and 752
      bytes on x86_64 - is that bloat or worthwhile?
      
      Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to
      the dumb generic support for v3.5.  We can always reinstate it later if
      useful, and anyone needing it in a hurry can just get it out of git.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Jeff liu <jeff.liu@oracle.com>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f21f8062
  10. 13 Jun, 2012 1 commit
    • Eric Dumazet's avatar
      splice: fix racy pipe->buffers uses · 047fe360
      Eric Dumazet authored
      Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
      by splice_shrink_spd() called from vmsplice_to_pipe()
      
      commit 35f3d14d (pipe: add support for shrinking and growing pipes)
      added capability to adjust pipe->buffers.
      
      Problem is some paths don't hold pipe mutex and assume pipe->buffers
      doesn't change for their duration.
      
      Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
      use it in place of pipe->buffers where appropriate.
      
      splice_shrink_spd() loses its struct pipe_inode_info argument.
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Tom Herbert <therbert@google.com>
      Cc: stable <stable@vger.kernel.org> # 2.6.35
      Tested-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      047fe360
  11. 07 Jun, 2012 1 commit
    • Hugh Dickins's avatar
      shmem: replace_page must flush_dcache and others · 0142ef6c
      Hugh Dickins authored
      Commit bde05d1c ("shmem: replace page if mapping excludes its zone")
      is not at all likely to break for anyone, but it was an earlier version
      from before review feedback was incorporated.  Fix that up now.
      
      * shmem_replace_page must flush_dcache_page after copy_highpage [akpm]
      * Expand comment on why shmem_unuse_inode needs page_swapcount [akpm]
      * Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong]
      * Check page_private matches swap before calling shmem_replace_page [hughd]
      * shmem_replace_page allow for unexpected race in radix_tree lookup [hughd]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0142ef6c
  12. 29 May, 2012 10 commits
    • Al Viro's avatar
      ->encode_fh() API change · b0b0382b
      Al Viro authored
      pass inode + parent's inode or NULL instead of dentry + bool saying
      whether we want the parent or not.
      
      NOTE: that needs ceph fix folded in.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b0b0382b
    • Hugh Dickins's avatar
      tmpfs: support SEEK_DATA and SEEK_HOLE · 4fb5ef08
      Hugh Dickins authored
      It's quite easy for tmpfs to scan the radix_tree to support llseek's new
      SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still
      on my mind (in particular, the !PageUptodate-ness of pages fallocated but
      still unwritten).
      
      But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
      would be of any use to them on tmpfs.  This code adds 92 lines and 752
      bytes on x86_64 - is that bloat or worthwhile?
      
      [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Jeff liu <jeff.liu@oracle.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4fb5ef08
    • Hugh Dickins's avatar
      tmpfs: quit when fallocate fills memory · 1aac1400
      Hugh Dickins authored
      As it stands, a large fallocate() on tmpfs is liable to fill memory with
      pages, freed on failure except when they run into swap, at which point
      they become fixed into the file despite the failure.  That feels quite
      wrong, to be consuming resources precisely when they're in short supply.
      
      Go the other way instead: shmem_fallocate() indicate the range it has
      fallocated to shmem_writepage(), keeping count of pages it's allocating;
      shmem_writepage() reactivate instead of swapping out pages fallocated by
      this syscall (but happily swap out those from earlier occasions), keeping
      count; shmem_fallocate() compare counts and give up once the reactivated
      pages have started to coming back to writepage (approximately: some zones
      would in fact recycle faster than others).
      
      This is a little unusual, but works well: although we could consider the
      failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
      added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.
      
      (If there's no swap, an over-large fallocate() on tmpfs is limited in the
      same way as writing: stopped by rlimit, or by tmpfs mount size if that was
      set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or
      OVERCOMMIT_NEVER.  If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill
      others as writing would, but stops and frees if interrupted.)
      
      Now that everything is freed on failure, we can then skip updating ctime.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1aac1400
    • Hugh Dickins's avatar
      tmpfs: undo fallocation on failure · 1635f6a7
      Hugh Dickins authored
      In the previous episode, we left the already-fallocated pages attached to
      the file when shmem_fallocate() fails part way through.
      
      Now try to do better, by extending the earlier optimization of !Uptodate
      pages (then always under page lock) to !Uptodate pages (outside of page
      lock), representing fallocated pages.  And don't waste time clearing them
      at the time of fallocate(), leave that until later if necessary.
      
      Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing
      fallocate can recognize and remove precisely those !Uptodate allocations
      which it added (and were not independently allocated by racing tasks).
      
      But unless we start playing with swapfile.c and memcontrol.c too, once one
      of our fallocated pages reaches shmem_writepage(), we do then have to
      instantiate it as an ordinarily allocated page, before swapping out.  This
      is unsatisfactory, but improved in the next episode.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1635f6a7
    • Hugh Dickins's avatar
      tmpfs: support fallocate preallocation · e2d12e22
      Hugh Dickins authored
      The systemd plumbers expressed a wish that tmpfs support preallocation.
      Cong Wang wrote a patch, but several kernel guys expressed scepticism:
      https://lkml.org/lkml/2011/11/18/137
      
      Christoph Hellwig: What for exactly? Please explain why preallocating on
      tmpfs would make any sense.
      
      Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files
      on the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS [or
      -EOPNOTSUPP] on fallocate is just ugly.
      
      Hugh Dickins: If tmpfs is going to support
      fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the
      deallocation but fail the allocation.  Christoph Hellwig: Agreed.
      
      Now that we do have shmem_fallocate() for hole-punching, plumb in basic
      support for preallocation mode too.  It's fairly straightforward (though
      quite a few details needed attention), except for when it fails part way
      through.  What a pity that fallocate(2) was not specified to return the
      length allocated, permitting short fallocations!
      
      As it is, when it fails part way through, we ought to free what has just
      been allocated by this system call; but must be very sure not to free any
      allocated earlier, or any allocated by racing accesses (not all excluded
      by i_mutex).
      
      But we cannot distinguish them: so in this patch simply leak allocations
      on partial failure (they will be freed later if the file is removed).
      
      An attractive alternative approach would have been for fallocate() not to
      allocate pages at all, but note reservations by entries in the radix-tree.
       But that would give less assurance, and, critically, would be hard to fit
      with mem cgroups (who owns the reservations?): allocating pages lets
      fallocate() behave in just the same way as write().
      Based-on-patch-by: default avatarCong Wang <amwang@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2d12e22
    • Hugh Dickins's avatar
      mm/fs: remove truncate_range · 17cf28af
      Hugh Dickins authored
      Remove vmtruncate_range(), and remove the truncate_range method from
      struct inode_operations: only tmpfs ever supported it, and tmpfs has now
      converted over to using the fallocate method of file_operations.
      
      Update Documentation accordingly, adding (setlease and) fallocate lines.
      And while we're in mm.h, remove duplicate declarations of shmem_lock() and
      shmem_file_setup(): everyone is now using the ones in shmem_fs.h.
      Based-on-patch-by: default avatarCong Wang <amwang@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      17cf28af
    • Hugh Dickins's avatar
      tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE · 83e4fa9c
      Hugh Dickins authored
      tmpfs has supported hole-punching since 2.6.16, via
      madvise(,,MADV_REMOVE).
      
      But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is
      the agreed way to punch holes.
      
      So add shmem_fallocate() to support that, and tweak shmem_truncate_range()
      to support partial pages at both the beginning and end of range (never
      needed for madvise, which demands rounded addr and rounds up length).
      Based-on-patch-by: default avatarCong Wang <amwang@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Cong Wang <amwang@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83e4fa9c
    • Hugh Dickins's avatar
      tmpfs: optimize clearing when writing · ec9516fb
      Hugh Dickins authored
      Nick proposed years ago that tmpfs should avoid clearing its pages where
      write will overwrite them with new data, as ramfs has long done.  But I
      messed it up and just got bad data.  Tried again recently, it works
      fine.
      
      Here's time output for writing 4GiB 16 times on this Core i5 laptop:
      
      before: real	0m21.169s user	0m0.028s sys	0m21.057s
              real	0m21.382s user	0m0.016s sys	0m21.289s
              real	0m21.311s user	0m0.020s sys	0m21.217s
      
      after:  real	0m18.273s user	0m0.032s sys	0m18.165s
              real	0m18.354s user	0m0.020s sys	0m18.265s
              real	0m18.440s user	0m0.032s sys	0m18.337s
      
      ramfs:  real	0m16.860s user	0m0.028s sys	0m16.765s
              real	0m17.382s user	0m0.040s sys	0m17.273s
              real	0m17.133s user	0m0.044s sys	0m17.021s
      
      Yes, I have done perf reports, but they need more explanation than they
      deserve: in summary, clear_page vanishes, its cache loading shifts into
      copy_user_generic_unrolled; shmem_getpage_gfp goes down, and
      surprisingly mark_page_accessed goes way up - I think because they are
      respectively where the cache gets to be reloaded after being purged by
      clear or copy.
      Suggested-by: default avatarNick Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec9516fb
    • Hugh Dickins's avatar
      tmpfs: enable NOSEC optimization · 2f6e38f3
      Hugh Dickins authored
      Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
      overhead on most common writes): set MS_NOSEC on its superblocks.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f6e38f3
    • Hugh Dickins's avatar
      shmem: replace page if mapping excludes its zone · bde05d1c
      Hugh Dickins authored
      The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the
      backing RAM has to be below 4GB.  Not a problem while the boards
      supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
      their GMA3600 is managed by the GMA500 driver.
      
      shmem/tmpfs has never pretended to support hardware restrictions on the
      backing memory, but it might have appeared to do so before v3.1, and
      even now it works fine until a page is swapped out then back in.  When
      read_cache_page_gfp() supplied a freshly allocated page for copy, that
      compensated for whatever choice might have been made by earlier swapin
      readahead; but swapoff was likely to destroy the illusion.
      
      We'd like to continue to support GMA500, so now add a new
      shmem_should_replace_page() check on the zone when about to move a page
      from swapcache to filecache (in swapin and swapoff cases), with
      shmem_replace_page() to allocate and substitute a suitable page (given
      gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
      
      This does involve a minor extension to mem_cgroup_replace_page_cache()
      (the page may or may not have already been charged); and I've removed a
      comment and call to mem_cgroup_uncharge_cache_page(), which in fact is
      always a no-op while PageSwapCache.
      
      Also removed optimization of an unlikely path in shmem_getpage_gfp(),
      now that we need to check PageSwapCache more carefully (a racing caller
      might already have made the copy).  And at one point shmem_unuse_inode()
      needs to use the hitherto private page_swapcount(), to guard against
      racing with inode eviction.
      
      It would make sense to extend shmem_should_replace_page(), to cover
      cpuset and NUMA mempolicy restrictions too, but set that aside for now:
      needs a cleanup of shmem mempolicy handling, and more testing, and ought
      to handle swap faults in do_swap_page() as well as shmem.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephane Marchesin <marcheu@chromium.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bde05d1c
  13. 15 May, 2012 1 commit
  14. 05 May, 2012 1 commit
  15. 21 Mar, 2012 1 commit
  16. 20 Mar, 2012 3 commits
  17. 13 Feb, 2012 1 commit
  18. 23 Jan, 2012 2 commits
    • Hugh Dickins's avatar
      SHM_UNLOCK: fix Unevictable pages stranded after swap · 24513264
      Hugh Dickins authored
      Commit cc39c6a9 ("mm: account skipped entries to avoid looping in
      find_get_pages") correctly fixed an infinite loop; but left a problem
      that find_get_pages() on shmem would return 0 (appearing to callers to
      mean end of tree) when it meets a run of nr_pages swap entries.
      
      The only uses of find_get_pages() on shmem are via pagevec_lookup(),
      called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
      scan_mapping_unevictable_pages().  The first is already commented, and
      not worth worrying about; but the second can leave pages on the
      Unevictable list after an unusual sequence of swapping and locking.
      
      Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
      swap) instead of pagevec_lookup().
      
      But I don't want to contaminate vmscan.c with shmem internals, nor
      shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
      shmem.c, renaming it shmem_unlock_mapping(); and rename
      check_move_unevictable_page() to check_move_unevictable_pages(), looping
      down an array of pages, oftentimes under the same lock.
      
      Leave out the "rotate unevictable list" block: that's a leftover from
      when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
      handling involved looking at pages at tail of LRU.
      
      Was there significance to the sequence first ClearPageUnevictable, then
      test page_evictable, then SetPageUnevictable here? I think not, we're
      under LRU lock, and have no barriers between those.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24513264
    • Hugh Dickins's avatar
      SHM_UNLOCK: fix long unpreemptible section · 85046579
      Hugh Dickins authored
      scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
      evictable again once the shared memory is unlocked.  It does this with
      pagevec_lookup()s across the whole object (which might occupy most of
      memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
      PAGEVEC_SIZE pages would be good.
      
      However, KOSAKI-san points out that this is called under shmem.c's
      info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
      There is no strong reason for that: we need to take these pages off the
      unevictable list soonish, but those locks are not required for it.
      
      So move the call to scan_mapping_unevictable_pages() from shmem.c's
      unlock handling up to shm.c's unlock handling.  Remove the recently
      added barrier, not needed now we have spin_unlock() before the scan.
      
      Use get_file(), with subsequent fput(), to make sure we have a reference
      to mapping throughout scan_mapping_unevictable_pages(): that's something
      that was previously guaranteed by the shm_lock().
      
      Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
      time, and we lazily discover them to be Unevictable later, so it serves
      no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
      pages still on pagevec are not marked Unevictable.
      
      The original code avoided redundant rescans by checking VM_LOCKED flag
      at its level: now avoid them by checking shp's SHM_LOCKED.
      
      The original code called scan_mapping_unevictable_pages() on a locked
      area at shm_destroy() time: perhaps we once had accounting cross-checks
      which required that, but not now, so skip the overhead and just let
      inode eviction deal with them.
      
      Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
      under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
      more as comment than to save space; comment them used for SHM_UNLOCK.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85046579
  19. 06 Jan, 2012 1 commit
  20. 03 Jan, 2012 5 commits