1. 06 Nov, 2015 1 commit
    • mm: page_alloc: hide some GFP internals and document the bits and flag combinations · dd56b046
      Mel Gorman authored
      Andrew stated the following
      
      	We have quite a history of remote parts of the kernel using
      	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
      	to think that this is because we didn't adequately explain the
      	interface.
      
      	And I don't think that gfp.h really improved much in this area as
      	a result of this patchset.  Could you go through it some time and
      	decide if we've adequately documented all this stuff?
      
      This patch first moves some GFP flag combinations that are part of the MM
      internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
      bits under various headings and then documents the flag combinations. It
      will not help callers that are brain damaged but the clarity might motivate
      some fixes and avoid future mistakes.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 05 Nov, 2015 2 commits
    • tmpfs: avoid a little creat and stat slowdown · d0424c42
      Hugh Dickins authored
      LKP reports that v4.2 commit afa2db2f ("tmpfs: truncate prealloc
      blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
      benchmark.
      
      creat-clo does just what you'd expect from the name, and creat's O_TRUNC
      on 0-length file does indeed get into more overhead now shmem_setattr()
      tests "0 <= 0" instead of "0 < 0".
      
      I'm not sure how much we care, but I think it would not be too VW-like to
      add in a check for whether any pages (or swap) are allocated: if none are
      allocated, there's none to remove from the radix_tree.  At first I thought
      that check would be good enough for the unmaps too, but no: we should not
      skip the unlikely case of unmapping pages beyond the new EOF, which were
      COWed from holes which have now been reclaimed, leaving none.
      
      This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere, and
      running a debug config before and after: I hope those account for the
      lesser speedup.
      
      And probably someone has a benchmark where a thousand threads keep on
      stat'ing the same file repeatedly: forestall that report by adjusting v4.3
      commit 44a30220 ("shmem: recalculate file inode when fstat") not to
      take the spinlock in shmem_getattr() when there's no work to do.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Ying Huang <ying.huang@linux.intel.com>
      Tested-by: Ying Huang <ying.huang@linux.intel.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Hugh Dickins authored
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 08 Sep, 2015 1 commit
  4. 06 Aug, 2015 1 commit
    • ipc: use private shmem or hugetlbfs inodes for shm segments. · e1832f29
      Stephen Smalley authored
      The shm implementation internally uses shmem or hugetlbfs inodes for shm
      segments.  As these inodes are never directly exposed to userspace and
      only accessed through the shm operations which are already hooked by
      security modules, mark the inodes with the S_PRIVATE flag so that inode
      security initialization and permission checking is skipped.
      
      This was motivated by the following lockdep warning:
      
        ======================================================
         [ INFO: possible circular locking dependency detected ]
         4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
        -------------------------------------------------------
         httpd/1597 is trying to acquire lock:
         (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
         but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
         which lock already depends on the new lock.
         the existing dependency chain (in reverse order) is:
         -> #3 (&mm->mmap_sem){++++++}:
              lock_acquire+0xc7/0x270
              __might_fault+0x7a/0xa0
              filldir+0x9e/0x130
              xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
              xfs_readdir+0x1b4/0x330 [xfs]
              xfs_file_readdir+0x2b/0x30 [xfs]
              iterate_dir+0x97/0x130
              SyS_getdents+0x91/0x120
              entry_SYSCALL_64_fastpath+0x12/0x76
         -> #2 (&xfs_dir_ilock_class){++++.+}:
              lock_acquire+0xc7/0x270
              down_read_nested+0x57/0xa0
              xfs_ilock+0x167/0x350 [xfs]
              xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
              xfs_attr_get+0xbd/0x190 [xfs]
              xfs_xattr_get+0x3d/0x70 [xfs]
              generic_getxattr+0x4f/0x70
              inode_doinit_with_dentry+0x162/0x670
              sb_finish_set_opts+0xd9/0x230
              selinux_set_mnt_opts+0x35c/0x660
              superblock_doinit+0x77/0xf0
              delayed_superblock_init+0x10/0x20
              iterate_supers+0xb3/0x110
              selinux_complete_init+0x2f/0x40
              security_load_policy+0x103/0x600
              sel_write_load+0xc1/0x750
              __vfs_write+0x37/0x100
              vfs_write+0xa9/0x1a0
              SyS_write+0x58/0xd0
              entry_SYSCALL_64_fastpath+0x12/0x76
        ...
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Reported-by: Morten Stevens <mstevens@fedoraproject.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 24 Jun, 2015 1 commit
    • tmpfs: truncate prealloc blocks past i_size · afa2db2f
      Josef Bacik authored
      One of the rocksdb people noticed that when you do something like this
      
          fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
          pwrite(fd, buf, 5M, 0)
          ftruncate(fd, 5M)
      
      on tmpfs, the file would still take up 10M: which led to super fun
      issues because we were getting ENOSPC before we thought we should be
      getting ENOSPC.  This patch fixes the problem, and mirrors what all the
      other fs'es do (and was agreed to be the correct behaviour at LSF).
      
      I tested it locally to make sure it worked properly with the following
      
          xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file
      
      Without the patch we have "Blocks: 20480", with the patch we have the
      correct value of "Blocks: 10240".
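
The fallocate/pwrite/ftruncate sequence above can also be driven from C. This is a minimal sketch, assuming Linux with glibc; the helper name `blocks_after_truncate` and the scratch path passed to it are hypothetical, and the number it returns depends on the filesystem that holds the path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MiB (1024L * 1024L)

/*
 * Preallocate 10M beyond EOF, write 5M, truncate to 5M, then report
 * st_blocks (in 512-byte units). Returns -1 if any step fails.
 * The scratch file at `path` is created and removed by this helper.
 */
long blocks_after_truncate(const char *path)
{
	struct stat st;
	long blocks = -1;
	char *buf;
	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

	if (fd == -1)
		return -1;
	buf = malloc((size_t)(5 * MiB));
	if (!buf)
		goto out;
	memset(buf, 0xab, (size_t)(5 * MiB));	/* non-zero payload */

	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10 * MiB) == -1)
		goto out;
	if (pwrite(fd, buf, (size_t)(5 * MiB), 0) != (ssize_t)(5 * MiB))
		goto out;
	if (ftruncate(fd, 5 * MiB) == -1)
		goto out;
	if (fstat(fd, &st) == 0)
		blocks = (long)st.st_blocks;
out:
	free(buf);
	close(fd);
	unlink(path);
	return blocks;
}
```

Called with a path on a tmpfs mount, the figures quoted in this message suggest 20480 before the patch and 10240 after it; other filesystems may report slightly different values.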
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 Jun, 2015 1 commit
  7. 11 May, 2015 1 commit
  8. 10 May, 2015 3 commits
    • don't pass nameidata to ->follow_link() · 6e77137b
      Al Viro authored
      its only use is getting passed to nd_jump_link(), which can obtain
      it from current->nameidata
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new ->follow_link() and ->put_link() calling conventions · 680baacb
      Al Viro authored
      a) instead of storing the symlink body (via nd_set_link()) and returning
      an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
      that opaque pointer (into void * passed by address by caller) and returns
      the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
      symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
      is ignored in all cases except the last one.
      
      Storing NULL for opaque pointer (or not storing it at all) means no call
      of ->put_link().
      
      b) the body used to be passed to ->put_link() implicitly (via nameidata).
      Now only the opaque pointer is.  In the cases when we used the symlink body
      to free stuff, ->follow_link() now should store it as opaque pointer in addition
      to returning it.
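
The convention in (a) and (b) can be modelled in userspace. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, the signatures are simplified and do not reproduce the kernel's inode operations, and the error-pointer macros merely imitate the kernel's ERR_PTR()/IS_ERR() helpers.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace imitations of ERR_PTR()/IS_ERR(): errors live in the top 4095 addresses. */
#define TOY_ERR_PTR(err) ((const char *)(intptr_t)(err))
#define TOY_IS_ERR(ptr)  ((uintptr_t)(ptr) >= (uintptr_t)-4095)

/* Hypothetical stand-in for an object that owns a symlink target. */
struct toy_symlink { const char *target; };

/*
 * Model of the new ->follow_link(): return the symlink body and store an
 * opaque cookie through *cookie. Here the body was allocated for this call,
 * so it is stored as the cookie as well, letting toy_put_link() free it.
 */
const char *toy_follow_link(const struct toy_symlink *link, void **cookie)
{
	size_t len = strlen(link->target) + 1;
	char *body = malloc(len);

	if (!body)
		return TOY_ERR_PTR(-ENOMEM);
	memcpy(body, link->target, len);
	*cookie = body;
	return body;
}

/* Model of ->put_link(): it only ever sees the opaque cookie. */
void toy_put_link(void *cookie)
{
	free(cookie);
}

/* Caller side: a cookie left NULL means toy_put_link() is not called. */
size_t toy_symlink_length(const struct toy_symlink *link)
{
	void *cookie = NULL;
	const char *body = toy_follow_link(link, &cookie);
	size_t len = 0;

	if (TOY_IS_ERR(body))
		return 0;
	if (body)		/* NULL would mean "jump", with no body to walk */
		len = strlen(body);
	if (cookie)
		toy_put_link(cookie);
	return len;
}
```

An implementation whose body needs no cleanup would leave `*cookie` untouched, and the caller would then skip the put step entirely.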
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem: switch to simple_follow_link() · 60380f19
      Al Viro authored
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 15 Apr, 2015 1 commit
  10. 11 Apr, 2015 1 commit
  11. 25 Mar, 2015 1 commit
  12. 23 Feb, 2015 1 commit
    • mm: shmem: check for mapping owner before dereferencing · f0774d88
      Sasha Levin authored
      mapping->host can be NULL and shouldn't be dereferenced before being checked.
      
      [ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
      [ 1295.746387] Dumping ftrace buffer:
      [ 1295.748217]    (ftrace buffer empty)
      [ 1295.749527] Modules linked in:
      [ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
      [ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
      [ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
      [ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
      [ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
      [ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
      [ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
      [ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
      [ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
      [ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
      [ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
      [ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
      [ 1295.750268] Stack:
      [ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
      [ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
      [ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
      [ 1295.750268] Call Trace:
      [ 1295.750268] mincore_page (mm/mincore.c:61)
      [ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
      [ 1295.750268] mincore_pte_range (mm/mincore.c:151)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
      [ 1295.750268] walk_page_range (mm/pagewalk.c:275)
      [ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
      [ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
      [ 1295.750268] ? ptlock_free (mm/mincore.c:24)
      [ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
      [ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
      [ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48
      
      All code
      ========
         0:	e5 48                	in     $0x48,%eax
         2:	c1 ea 03             	shr    $0x3,%edx
         5:	53                   	push   %rbx
         6:	48 89 fb             	mov    %rdi,%rbx
         9:	48 83 ec 08          	sub    $0x8,%rsp
         d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
        11:	75 4f                	jne    0x62
        13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        1a:	fc ff df
        1d:	48 8b 1b             	mov    (%rbx),%rbx
        20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
        24:	48 89 fa             	mov    %rdi,%rdx
        27:	48 c1 ea 03          	shr    $0x3,%rdx
        2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
        2f:	75 3f                	jne    0x70
        31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        38:	fc ff df
        3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        3f:	48                   	rex.W
      	...
      
      Code starting with the faulting instruction
      ===========================================
         0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
         4:	75 3f                	jne    0x45
         6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
         d:	fc ff df
        10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        14:	48                   	rex.W
      	...
      [ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268]  RSP <ffff8803a4dcfbf8>
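
The rule stated at the top of this message (check mapping->host before dereferencing it) amounts to a guard ahead of the first dereference. Below is a minimal userspace model, assuming hypothetical `toy_*` stand-ins for the kernel structures; recognising a shmem mapping by comparing an ops pointer is an assumption of this sketch, not a quotation from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct toy_super   { const void *s_op; };
struct toy_inode   { struct toy_super *i_sb; };
struct toy_mapping { struct toy_inode *host; };

/* Only the address of this object matters; it plays the role of an ops table. */
static const int toy_shmem_ops = 0;

/* Return true only for mappings that have an owner and whose owner is "shmem". */
bool toy_shmem_mapping(const struct toy_mapping *mapping)
{
	if (!mapping->host)	/* no owner: nothing to dereference */
		return false;
	return mapping->host->i_sb->s_op == &toy_shmem_ops;
}
```

Without the early return, an ownerless mapping would make the second line read through a NULL pointer, which is the kind of access the trace above reports.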
      
      Fixes: 97b713ba ("fs: kill BDI_CAP_SWAP_BACKED")
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 22 Feb, 2015 1 commit
    • VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      David Howells authored
      Convert the following where appropriate:
      
       (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
      
       (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
      
       (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
      
      The following perl+coccinelle script was used:
      
      use strict;
      
      my @callers;
      open(my $fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      close($fd);
      unless (@callers) {
          print "No matches\n";
          exit(0);
      }
      
      my @cocci = (
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      close($fd);
      
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      }
      
      [AV: overlayfs parts skipped]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 11 Feb, 2015 1 commit
  15. 10 Feb, 2015 1 commit
  16. 05 Feb, 2015 1 commit
  17. 20 Jan, 2015 2 commits
  18. 17 Dec, 2014 1 commit
  19. 23 Oct, 2014 1 commit
    • shmem: support RENAME_WHITEOUT · 46fdb794
      Miklos Szeredi authored
      Allocate a dentry, initialize it with a whiteout and hash it in the place
      of the old dentry.  Later the old dentry will be moved away and the
      whiteout will remain.
      
      i_mutex protects against concurrent readdir.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
  20. 09 Oct, 2014 1 commit
  21. 26 Sep, 2014 1 commit
    • shmem: fix nlink for rename overwrite directory · b928095b
      Miklos Szeredi authored
      When rename overwrites an empty directory, the extra nlink of the
      overwritten directory needs to be dropped.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", (unsigned long)statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 07 Sep, 2014 1 commit
    • percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo authored
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  23. 08 Aug, 2014 5 commits
    • shm: wait for pins to be released when sealing · 05f65b5c
      David Herrmann authored
      If we set SEAL_WRITE on a file, we must make sure there cannot be any
      ongoing write-operations on the file.  For write() calls, we simply lock
      the inode mutex, for mmap() we simply verify there're no writable
      mappings.  However, there might be pages pinned by AIO, Direct-IO and
      similar operations via GUP.  We must make sure those do not write to the
      memfd file after we set SEAL_WRITE.
      
      As there is no way to notify GUP users to drop pages or to wait for them
      to be done, we implement the wait ourselves: when setting SEAL_WRITE, we
      check all pages for their ref-count.  If it's bigger than 1, we know
      there's some user of the page.  We then mark the page and wait for up to
      150ms for those ref-counts to be dropped.  If the ref-counts are not
      dropped in time, we refuse the seal operation.
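
The wait described above can be pictured as a bounded polling loop. This is a simplified userspace model, not the kernel code: the `toy_*` names, the fixed 10ms polling interval and the choice of -EBUSY as the refusal code are all assumptions of this sketch, and the "mark the page" step is left out. Only the ref-count threshold of 1 and the 150ms budget come from the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

#define TOY_WAIT_BUDGET_MS   150	/* total time we are willing to wait */
#define TOY_POLL_INTERVAL_MS 10	/* assumed pause between scans */

/* A page counts as pinned when somebody besides us holds a reference. */
static int toy_any_pinned(atomic_int *refcounts, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (atomic_load(&refcounts[i]) > 1)
			return 1;
	return 0;
}

/*
 * Rescan until no page is pinned or the budget is used up.
 * Returns 0 if the seal may proceed, -EBUSY if it has to be refused.
 */
int toy_wait_for_pins(atomic_int *refcounts, size_t n)
{
	const struct timespec pause = {
		.tv_sec = 0,
		.tv_nsec = TOY_POLL_INTERVAL_MS * 1000000L,
	};
	int waited_ms = 0;

	while (toy_any_pinned(refcounts, n)) {
		if (waited_ms >= TOY_WAIT_BUDGET_MS)
			return -EBUSY;
		nanosleep(&pause, NULL);
		waited_ms += TOY_POLL_INTERVAL_MS;
	}
	return 0;
}
```

In this model another thread would be the one dropping references while the loop sleeps; if nobody does, the call gives up once the budget is spent.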
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shm: add memfd_create() syscall · 9183df25
      David Herrmann authored
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
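
A short userspace illustration of the behaviour described above: the descriptor can be resized with ftruncate(), fstat() reports a regular file, and the memory can be mapped. This is a sketch assuming a glibc that provides the `memfd_create()` wrapper (2.27 or later); the helper name `memfd_roundtrip` and the name "example" given to the file are hypothetical.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a memfd of `size` bytes, store `msg` through a shared mapping and
 * read it back through the descriptor. Returns 0 on success, -1 otherwise.
 */
int memfd_roundtrip(const char *msg, size_t size)
{
	struct stat st;
	char readback[64] = "";
	size_t len = strlen(msg) + 1;	/* include the terminating NUL */
	int ret = -1;
	char *map;
	int fd;

	if (len > sizeof(readback) || len > size)
		return -1;
	fd = memfd_create("example", MFD_CLOEXEC);
	if (fd == -1)
		return -1;

	/* The raw shmem file starts empty; ftruncate() sizes the inode. */
	if (ftruncate(fd, (off_t)size) == -1)
		goto out;
	/* fstat() sees an ordinary regular file of the requested size. */
	if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode) ||
	    st.st_size != (off_t)size)
		goto out;

	map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	memcpy(map, msg, len);
	munmap(map, size);

	/* The data written via the mapping is visible through the fd. */
	if (pread(fd, readback, len, 0) == (ssize_t)len &&
	    strcmp(readback, msg) == 0)
		ret = 0;
out:
	close(fd);
	return ret;
}
```

No tmpfs mount point is involved at any step, which is the contrast with O_TMPFILE drawn above.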
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shm: add sealing API · 40e041a2
      David Herrmann authored
      If two processes share a common memory region, they usually want some
      guarantees to allow safe access. This often includes:
        - one side cannot overwrite data while the other reads it
        - one side cannot shrink the buffer while the other accesses it
        - one side cannot grow the buffer beyond previously set boundaries
      
      If there is a trust-relationship between both parties, there is no need
      for policy enforcement.  However, if there's no trust relationship (eg.,
      for general-purpose IPC) sharing memory-regions is highly fragile and
      often not possible without local copies.  Look at the following two
      use-cases:
      
        1) A graphics client wants to share its rendering-buffer with a
           graphics-server. The memory-region is allocated by the client for
           read/write access and a second FD is passed to the server. While
           scanning out from the memory region, the server has no guarantee that
           the client doesn't shrink the buffer at any time, requiring rather
           cumbersome SIGBUS handling.
        2) A process wants to perform an RPC on another process. To avoid huge
           bandwidth consumption, zero-copy is preferred. After a message is
           assembled in-memory and a FD is passed to the remote side, both sides
           want to be sure that neither modifies this shared copy, anymore. The
           source may have put sensitive data into the message without a separate
           copy and the target may want to parse the message inline, to avoid a
           local copy.
      
      While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
      ways to achieve most of this, the first one is disproportionately ugly to
      use in libraries and the latter two are broken/racy or even disabled due
      to denial of service attacks.
      
      This patch introduces the concept of SEALING.  If you seal a file, a
      specific set of operations is blocked on that file forever.  Unlike locks,
      seals can only be set, never removed.  Hence, once you verified a specific
      set of seals is set, you're guaranteed that no-one can perform the blocked
      operations on this file, anymore.
      
      An initial set of SEALS is introduced by this patch:
        - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
                  in size. This affects ftruncate() and open(O_TRUNC).
        - GROW: If SEAL_GROW is set, the file in question cannot be increased
                in size. This affects ftruncate(), fallocate() and write().
        - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
                 are possible. This affects fallocate(PUNCH_HOLE), mmap() and
                 write().
        - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
                This basically prevents the F_ADD_SEALS operation on a file and
                can be set to prevent others from adding further seals that you
                don't want.
      
      The described use-cases can easily use these seals to provide safe use
      without any trust-relationship:
      
        1) The graphics server can verify that a passed file-descriptor has
           SEAL_SHRINK set. This allows safe scanout, while the client is
           allowed to increase buffer size for window-resizing on-the-fly.
           Concurrent writes are explicitly allowed.
        2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
           SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
           process can modify the data while the other side parses it.
           Furthermore, it guarantees that even with writable FDs passed to the
           peer, it cannot increase the size to hit memory-limits of the source
           process (in case the file-storage is accounted to the source).
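
Use-case 2 can be sketched from both sides. The example assumes the interface as exposed by current Linux headers, where the seals are spelled `F_SEAL_SHRINK`, `F_SEAL_GROW` and `F_SEAL_WRITE`, and it creates the file with `memfd_create()`; the helper names `make_sealed_message` and `message_is_immutable` are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* The seal set a receiver wants to see before parsing a message in place. */
#define MESSAGE_SEALS (F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Sender: put the payload into a sealable file, then freeze size and content. */
int make_sealed_message(const void *data, size_t len)
{
	int fd = memfd_create("message", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd == -1)
		return -1;
	if (write(fd, data, len) != (ssize_t)len ||
	    fcntl(fd, F_ADD_SEALS, MESSAGE_SEALS) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Receiver: returns 1 if all required seals are set, 0 if some are missing,
 * and -1 if the file does not support sealing at all.
 */
int message_is_immutable(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals == -1)
		return -1;
	return (seals & MESSAGE_SEALS) == MESSAGE_SEALS;
}
```

A receiver would call `message_is_immutable()` on the descriptor it was handed and refuse to parse in place unless the result is 1.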
      
      The new API is an extension to fcntl(), adding two new commands:
        F_GET_SEALS: Return a bitset describing the seals on the file. This
                     can be called on any FD if the underlying file supports
                     sealing.
        F_ADD_SEALS: Change the seals of a given file. This requires WRITE
                     access to the file and F_SEAL_SEAL may not already be set.
                     Furthermore, the underlying file must support sealing and
                     there may not be any existing shared mapping of that file.
                     Otherwise, EBADF/EPERM is returned.
                     The given seals are _added_ to the existing set of seals
                     on the file. You cannot remove seals again.
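
The rules for the two fcntl() commands can be exercised step by step. This is a sketch against the interface as found in current Linux headers (`F_ADD_SEALS`, `F_GET_SEALS`, `F_SEAL_*`), using a memfd as the sealable file; the function name `seal_rules_demo` is hypothetical, and the EPERM expectations reflect behaviour documented for the merged API rather than anything stated in this message.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Walk through the sealing rules on a fresh memfd. Returns 0 if every step
 * behaved as described, otherwise the number of the first step that did not.
 */
int seal_rules_demo(void)
{
	const int all = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL;
	int step = 1;	/* 1: create a file that supports sealing */
	int seals;
	int fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd == -1)
		return step;

	step++;		/* 2: size the file, then forbid shrinking */
	if (ftruncate(fd, 8192) == -1 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) == -1)
		goto out;

	step++;		/* 3: shrinking is refused, growing is still allowed */
	errno = 0;
	if (ftruncate(fd, 4096) != -1 || errno != EPERM ||
	    ftruncate(fd, 16384) == -1)
		goto out;

	step++;		/* 4: new seals are added to the existing set */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SEAL) == -1)
		goto out;
	seals = fcntl(fd, F_GET_SEALS);
	if (seals == -1 || (seals & all) != all)
		goto out;

	step++;		/* 5: with F_SEAL_SEAL set, nothing more can be added */
	errno = 0;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) != -1 || errno != EPERM)
		goto out;

	step = 0;
out:
	close(fd);
	return step;
}
```

Returning the step number instead of a bare failure code makes it easy to see which rule a given kernel or file type does not follow.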
      
      The fcntl() handler is currently specific to shmem and disabled on all
      files. A file needs to explicitly support sealing for this interface to
      work. A separate syscall is added in a follow-up, which creates files that
      support sealing. There is no intention to support this on other
      file-systems. Semantics are unclear for non-volatile files and we lack any
      use-case right now. Therefore, the implementation is specific to shmem.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner authored
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
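
      The decoupling described above can be pictured with a toy userspace model.
      This is only a sketch: the mig_* types, flag values and functions are
      hypothetical stand-ins and not the kernel's page_cgroup code. The point
      it illustrates is that the charge bits travel with the migration, so the
      later uncharge of the old page finds nothing to release.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, named after the ones in the commit message. */
enum { MIG_PCG_USED = 1, MIG_PCG_MEM = 2, MIG_PCG_MEMSW = 4 };

struct mig_memcg { long memory; long memsw; };
struct mig_page_cgroup { struct mig_memcg *memcg; unsigned int flags; };

/* Charge a fresh page: one unit of memory and one of memory+swap. */
void mig_charge(struct mig_page_cgroup *pc, struct mig_memcg *cg)
{
	assert(!(pc->flags & MIG_PCG_USED));
	cg->memory++;
	cg->memsw++;
	pc->memcg = cg;
	pc->flags = MIG_PCG_USED | MIG_PCG_MEM | MIG_PCG_MEMSW;
}

/* Move association and charge bits to the new page; counters are untouched. */
void mig_migrate(struct mig_page_cgroup *oldpc, struct mig_page_cgroup *newpc)
{
	/* Nothing to move, or the target already carries its own charge. */
	if (!(oldpc->flags & MIG_PCG_USED) || (newpc->flags & MIG_PCG_USED))
		return;
	*newpc = *oldpc;
	oldpc->memcg = NULL;
	oldpc->flags = 0;
}

/* Called when the last reference is gone: release whatever is still held. */
void mig_uncharge(struct mig_page_cgroup *pc)
{
	if (!(pc->flags & MIG_PCG_USED))
		return;
	if (pc->flags & MIG_PCG_MEM)
		pc->memcg->memory--;
	if (pc->flags & MIG_PCG_MEMSW)
		pc->memcg->memsw--;
	pc->memcg = NULL;
	pc->flags = 0;
}
```

      In this model, uncharging the old page after mig_migrate() leaves both
      counters unchanged, and only the final uncharge of the new page drops
      them.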
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
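
      A toy userspace model of that hand-over, under the assumption that the
      memory+swap charge is simply re-owned by the swap entry. The swp_* names
      are hypothetical and the sketch ignores locking and statistics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, named after the ones in the commit message. */
enum { SWP_PCG_USED = 1, SWP_PCG_MEM = 2, SWP_PCG_MEMSW = 4 };

struct swp_memcg { long memory; long memsw; };
struct swp_page { struct swp_memcg *memcg; unsigned int flags; };
struct swp_entry { struct swp_memcg *owner; };

/* Hand the memory+swap charge to the swap entry; the page keeps PCG_MEM. */
void swp_swapout(struct swp_page *page, struct swp_entry *entry)
{
	if (!(page->flags & SWP_PCG_MEMSW))
		return;
	assert(entry->owner == NULL);
	entry->owner = page->memcg;
	page->flags &= ~SWP_PCG_MEMSW;
}

/* Final put of the page: drop only the charges the page still holds. */
void swp_uncharge_page(struct swp_page *page)
{
	if (!(page->flags & SWP_PCG_USED))
		return;
	if (page->flags & SWP_PCG_MEM)
		page->memcg->memory--;
	if (page->flags & SWP_PCG_MEMSW)
		page->memcg->memsw--;
	page->memcg = NULL;
	page->flags = 0;
}

/* Swap slot released: the memory+swap charge finally goes away. */
void swp_uncharge_swap(struct swp_entry *entry)
{
	if (!entry->owner)
		return;
	entry->owner->memsw--;
	entry->owner = NULL;
}
```

      After swp_swapout(), freeing the page lowers only the memory counter;
      the memory+swap counter stays charged until the swap entry itself is
      released.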
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner authored
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup
      (i.e. executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
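
      The transaction shape can be sketched in plain userspace C. This is a toy
      model only: the txn_* types and functions are hypothetical, the limit
      check stands in for reclaim, and the error value is a placeholder rather
      than the kernel's return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a memcg and a page. */
struct txn_memcg { long usage; long limit; };
struct txn_page { struct txn_memcg *memcg; };

/* Step 1: reserve one page's worth of charge, or fail if over the limit. */
int txn_try_charge(struct txn_memcg *cg, struct txn_memcg **reserved)
{
	if (cg->usage + 1 > cg->limit)
		return -1;	/* the real code would attempt reclaim first */
	cg->usage++;
	*reserved = cg;
	return 0;
}

/* Step 2a: bind the reserved charge to the page once its type is known. */
void txn_commit_charge(struct txn_page *page, struct txn_memcg *reserved)
{
	assert(page->memcg == NULL);
	page->memcg = reserved;
}

/* Step 2b: give the reservation back if the page could not be set up. */
void txn_cancel_charge(struct txn_memcg *reserved)
{
	assert(reserved->usage > 0);
	reserved->usage--;
}

/* A caller in the style of a fault or page cache insertion path. */
int txn_install_page(struct txn_memcg *cg, struct txn_page *page,
		     bool setup_fails)
{
	struct txn_memcg *reserved;

	if (txn_try_charge(cg, &reserved))
		return -1;
	if (setup_fails) {
		/* e.g. lost a race before the mapping could be established */
		txn_cancel_charge(reserved);
		return -1;
	}
	/* the mapping/rmap would be established here, then: */
	txn_commit_charge(page, reserved);
	return 0;
}
```

      Every caller follows the same three-step shape regardless of page type,
      which is what lets the per-type charge functions go away.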
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 07 Aug, 2014 2 commits
  25. 06 Aug, 2014 4 commits
  26. 23 Jul, 2014 2 commits
    • shmem: fix splicing from a hole while it's punched · b1a36650
      Hugh Dickins authored
      shmem_fault() is the actual culprit in trinity's hole-punch starvation,
      and the most significant cause of such problems: since a page faulted is
      one that then appears page_mapped(), needing unmap_mapping_range() and
      i_mmap_mutex to be unmapped again.
      
      But it is not the only way in which a page can be brought into a hole in
      the radix_tree while that hole is being punched; and Vlastimil's testing
      implies that if enough other processors are busy filling in the hole,
      then shmem_undo_range() can be kept from completing indefinitely.
      
      shmem_file_splice_read() is the main other user of SGP_CACHE, which can
      instantiate shmem pagecache pages in the read-only case (without holding
      i_mutex, so perhaps concurrently with a hole-punch).  Probably it's
      silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
      ought to be safe, but might bring surprises - not a change to be rushed.
      
      shmem_read_mapping_page_gfp() is an internal interface used by
      drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
      shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
      called internally by the kernel (perhaps for a stacking filesystem,
      which might rely on holes to be reserved): it's unclear whether it could
      be provoked to keep hole-punch busy or not.
      
      We could apply the same umbrella as now used in shmem_fault() to
      shmem_file_splice_read() and the others; but it looks ugly, and use over
      a range raises questions - should it actually be per page? can these get
      starved themselves?
      
      The origin of this part of the problem is my v3.1 commit d0823576
      ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
      into shmem.c.  It seemed like a nice idea at the time, to ensure
      (barring RCU lookup fuzziness) that there's an instant when the entire
      hole is empty; but the indefinitely repeated scans to ensure that make
      it vulnerable.
      
      Revert that "enhancement" to hole-punch from shmem_undo_range(), but
      retain the unproblematic rescanning when it's truncating; add a couple
      of comments there.
      
      Remove the "indices[0] >= end" test: that is now handled satisfactorily
      by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
      to be worth avoiding here.
      
      But if we do not always loop indefinitely, we do need to handle the case
      of swap swizzled back to page before shmem_free_swap() gets it: add a
      retry for that case, as suggested by Konstantin Khlebnikov; and for the
      case of page swizzled back to swap, as suggested by Johannes Weiner.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: fix faulting into a hole, not taking i_mutex · 8e205f77
      Hugh Dickins authored
      Commit f00cdc6d ("shmem: fix faulting into a hole while it's
      punched") was buggy: Sasha sent a lockdep report to remind us that
      grabbing i_mutex in the fault path is a no-no (write syscall may already
      hold i_mutex while faulting user buffer).
      
      We tried a completely different approach (see following patch) but that
      proved inadequate: good enough for a rational workload, but not good
      enough against trinity - which forks off so many mappings of the object
      that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
      into serious starvation when concurrent faults force the puncher to fall
      back to single-page unmap_mapping_range() searches of the i_mmap tree.
      
      So return to the original umbrella approach, but keep away from i_mutex
      this time.  We really don't want to bloat every shmem inode with a new
      mutex or completion, just to protect this unlikely case from trinity.
      So extend the original with wait_queue_head on stack at the hole-punch
      end, and wait_queue item on the stack at the fault end.
      
      This involves further use of i_lock to guard against the races: lockdep
      has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
      i_lock around wake_up_bit(), which is comparable to what we do here.
      i_lock is more convenient, but we could switch to shmem's info->lock.
      
      This issue has been tagged with CVE-2014-4171, which will require commit
      f00cdc6d and this and the following patch to be backported: we
      suggest to 3.1+, though in fact the trinity forkbomb effect might go
      back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
      not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
      Anyone running trinity on 3.0 and earlier? I don't think we need care.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 03 Jul, 2014 1 commit
    • shmem: fix init_page_accessed use to stop !PageLRU bug · 66d2f4d2
      Hugh Dickins authored
      Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
      in isolate_lru_pages() at mm/vmscan.c:1281!
      
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") looks like interrupted work-in-progress.
      
      mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
      - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
      when the page is always visible in radix_tree, and often already on LRU.
      
      Revert change to shmem_write_begin(), and use init_page_accessed() or
      mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
      
      SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
      before; but since many other filesystems use [__]page_symlink(), which did
      and does mark the page accessed, consider this as rectifying an oversight.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>