Skip to content
  • Dave Hansen's avatar
    fs: do not prefault sys_write() user buffer pages · 998ef75d
    Dave Hansen authored
    
    
    === Short summary ====
    
    iov_iter_fault_in_readable() works around a really rare case and we can
    avoid the deadlock it addresses in another way: disable page faults and
    work around copy failures by faulting after the copy in a slow path
    instead of before in a hot one.
    
    I have a little microbenchmark that does repeated, small writes to tmpfs.
    This patch speeds that micro up by 6.2%.
    
    === Long version ===
    
    When doing a sys_write() we have a source buffer in userspace and then a
    target file page.
    
    If both of those are the same physical page, there is a potential deadlock
    that we avoid.  It would happen something like this:
    
    1. We start the write to the file
    2. Allocate page cache page and set it !Uptodate
    3. Touch the userspace buffer to copy in the user data
    4. Page fault (since source of the write not yet mapped)
    5. Page fault code tries to lock the page and deadlocks
    
    (more details on this below)
    
    To avoid this, we prefault the page to guarantee that this fault does not
    occur.  But, this prefault comes at a cost.  It is one of the most
    expensive things that we do in a hot write() path (especially if we
    compare it to the read path).  It is working around a pretty rare case.
    
    To fix this, it's pretty simple.  We move the "prefault" code to run after
    we attempt the copy.  We explicitly disable page faults _during_ the copy,
    detect the copy failure, then execute the "prefault" ouside of where the
    page lock needs to be held.
    
    iov_iter_copy_from_user_atomic() actually already has an implicit
    pagefault_disable() inside of it (at least on x86), but we add an explicit
    one.  I don't think we can depend on every kmap_atomic() implementation to
    pagefault_disable() for eternity.
    
    ===================================================
    
    The stack trace when this happens looks like this:
    
      wait_on_page_bit_killable+0xc0/0xd0
      __lock_page_or_retry+0x84/0xa0
      filemap_fault+0x1ed/0x3d0
      __do_fault+0x41/0xc0
      handle_mm_fault+0x9bb/0x1210
      __do_page_fault+0x17f/0x3d0
      do_page_fault+0xc/0x10
      page_fault+0x22/0x30
      generic_perform_write+0xca/0x1a0
      __generic_file_write_iter+0x190/0x1f0
      ext4_file_write_iter+0xe9/0x460
      __vfs_write+0xaa/0xe0
      vfs_write+0xa6/0x1a0
      SyS_write+0x46/0xa0
      entry_SYSCALL_64_fastpath+0x12/0x6a
      0xffffffffffffffff
    
    (Note, this does *NOT* happen in practice today because
     the kmap_atomic() does a pagefault_disable().  The trace
     above was obtained by taking out the pagefault_disable().)
    
    You can trigger the deadlock with this little code snippet:
    
    	fd = open("foo", O_RDWR);
    	fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
    	write(fd, &fdmap[0], 1);
    
    Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Jens Axboe <axboe@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Cc: Paul Cassella <cassella@cray.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    998ef75d