1. 25 Oct, 2016 19 commits
  2. 22 Oct, 2016 21 commits
    • Linux 4.8.4 · a2b42342
      Greg Kroah-Hartman authored
    • cfq: fix starvation of asynchronous writes · 9362516c
      Glauber Costa authored

      commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.
      
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumstances, and that's confirmed with the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
      high "Dispatch wait" times, in the range of dozens of seconds and consistent with
      the open() times I saw in that run.

      Still from the blktrace output, we can, after searching a bit, identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself checks cfq_may_dispatch(), it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such a situation before we have the chance to execute anything (as seen in the
      trace above).

      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired).
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
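
The preemption rule described in this commit message can be sketched in user space as follows. This is only an illustrative model under stated assumptions, not the kernel code: struct toy_queue and every toy_* name are hypothetical, and deadlines are plain millisecond counters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a CFQ queue; not the kernel's struct cfq_queue. */
enum toy_class { TOY_CLASS_BE, TOY_CLASS_RT };

struct toy_queue {
    enum toy_class ioprio_class;
    bool sync;                /* queue issues synchronous requests */
    bool must_dispatch;       /* set once the oldest request has expired */
    uint64_t oldest_deadline; /* fifo deadline of the oldest request, in ms */
};

/* Pin the active queue once its oldest request is past its fifo deadline. */
static void toy_check_fifo(struct toy_queue *active, uint64_t now_ms)
{
    if (now_ms >= active->oldest_deadline)
        active->must_dispatch = true;
}

/* Decide whether a newly arrived queue may preempt the active one. */
static bool toy_should_preempt(const struct toy_queue *active,
                               const struct toy_queue *incoming)
{
    /* Real-time requests may still preempt a non-RT queue regardless. */
    if (incoming->ioprio_class == TOY_CLASS_RT &&
        active->ioprio_class != TOY_CLASS_RT)
        return true;
    /* An expired request pins the active queue until it has dispatched. */
    if (active->must_dispatch)
        return false;
    /* Otherwise keep the usual preference of sync over async. */
    return incoming->sync && !active->sync;
}

/* Clear the flag after one dispatch round so later requests are not pinned. */
static void toy_after_dispatch(struct toy_queue *active)
{
    active->must_dispatch = false;
}
```

In this model an async queue whose oldest request has expired can no longer be preempted by a best-effort sync queue, while an RT queue still can preempt it; after toy_after_dispatch() the ordinary sync-over-async preference applies again.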
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: Glauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: Glauber Costa <glauber@scylladb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • acpi, nfit: check for the correct event code in notifications · afac7081
      Vishal Verma authored

      commit c09f12186d6b03b798832d95289af76495990192 upstream.
      
      Commit 20985164 "acpi: nfit: Add support for hot-add" added
      support for _FIT notifications, but it neglected to verify the
      notification event code matches the one in the ACPI spec for
      "NFIT Update". Currently there is only one code in the spec, but
      once additional codes are added, older kernels (without this fix)
      will misbehave by assuming all event notifications are for an
      NFIT Update.
      
      Fixes: 20985164 ("acpi: nfit: Add support for hot-add")
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Linda Knippers <linda.knippers@hpe.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm: virtio: reinstate drm_virtio_set_busid() · 3245ff58
      Laszlo Ersek authored

      commit c2cbc38b9715bd8318062e600668fc30e5a3fbfa upstream.
      
      Before commit a3257256 ("drm: Lobotomize set_busid nonsense for !pci
      drivers"), several DRM drivers for platform devices used to expose an
      explicit "drm_driver.set_busid" callback, invariably backed by
      drm_platform_set_busid().
      
      Commit a3257256 removed drm_platform_set_busid(), along with the
      referring .set_busid field initializations. This was justified because
      interchangeable functionality had been implemented in drm_dev_alloc() /
      drm_dev_init(), which DRM_IOCTL_SET_VERSION would rely on going forward.
      
      However, commit a3257256 also removed drm_virtio_set_busid(), for
      which the same consolidation was not appropriate: this .set_busid callback
      had been implemented with drm_pci_set_busid(), and not
      drm_platform_set_busid(). The error regressed Xorg/xserver on QEMU's
      "virtio-vga" card; the drmGetBusid() function from libdrm would no longer
      return stable PCI identifiers like "pci:0000:00:02.0", but rather unstable
      platform ones like "virtio0".
      
      Reinstate drm_virtio_set_busid() with judicious use of

        git checkout -p a3257256^ -- drivers/gpu/drm/virtio
      
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Emil Velikov <emil.l.velikov@gmail.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Joachim Frieben <jfrieben@hotmail.com>
      Reported-by: Joachim Frieben <jfrieben@hotmail.com>
      Fixes: a3257256
      Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1366842

      Signed-off-by: Laszlo Ersek <lersek@redhat.com>
      Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] · 336f2e1e
      David Howells authored

      commit a818101d7b92e76db2f9a597e4830734767473b9 upstream.
      
      A NULL-pointer dereference happens in cachefiles_mark_object_inactive()
      when it tries to read i_blocks so that it can tell the cachefilesd daemon
      how much space it's making available.
      
      The problem is that cachefiles_drop_object() calls
      cachefiles_mark_object_inactive() after calling cachefiles_delete_object()
      because the object being marked active staves off attempts to (re-)use the
      file at that filename until after it has been deleted.  This means that
      d_inode is NULL by the time we come to try to access it.
      
      To fix the problem, have the caller of cachefiles_mark_object_inactive()
      supply the number of blocks freed up.
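
The shape of this fix can be sketched with a small user-space model. All toy_* types and functions below are hypothetical stand-ins, not the cachefiles API; the point is only that the caller samples the block count while the inode is still reachable and passes it down.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's inode and cache object. */
struct toy_inode { unsigned long long i_blocks; };

struct toy_object {
    struct toy_inode *inode;     /* becomes NULL once the file is deleted */
    unsigned long long released; /* blocks reported as freed */
};

/* Takes the block count as an argument and never touches obj->inode. */
static void toy_mark_inactive(struct toy_object *obj, unsigned long long i_blocks)
{
    obj->released += i_blocks;
}

/* Deleting the backing file detaches the inode from the object. */
static void toy_delete(struct toy_object *obj)
{
    obj->inode = NULL;
}

/* The caller reads i_blocks before the delete, then marks the object inactive. */
static void toy_drop_object(struct toy_object *obj)
{
    unsigned long long i_blocks = obj->inode ? obj->inode->i_blocks : 0;

    toy_delete(obj);
    toy_mark_inactive(obj, i_blocks);
}
```

Because toy_mark_inactive() no longer dereferences the inode, the delete-then-mark-inactive ordering stays safe.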
      
      Without this, the following oops may occur:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
      ...
      CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G          I    ------------   3.10.0-470.el7.x86_64 #1
      Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011
      Workqueue: fscache_object fscache_object_work_func [fscache]
      task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000
      RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
      RSP: 0018:ffff8800b77c3d70  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034
      RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8
      RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000
      R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600
      R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498
      FS:  0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0
       ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658
       ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600
      Call Trace:
       [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles]
       [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache]
       [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache]
       [<ffffffff810a605b>] process_one_work+0x17b/0x470
       [<ffffffff810a6e96>] worker_thread+0x126/0x410
       [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460
       [<ffffffff810ae64f>] kthread+0xcf/0xe0
       [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff81695418>] ret_from_fork+0x58/0x90
       [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
      
      The oopsing code shows:
      
      	callq  0xffffffff810af6a0 <wake_up_bit>
      	mov    0xf8(%r12),%rax
      	mov    0x30(%rax),%rax
      	mov    0x98(%rax),%rax   <---- oops here
      	lock add %rax,0x130(%rbx)
      
      where this is:
      
      	d_backing_inode(object->dentry)->i_blocks
      
      Fixes: a5b3a80b (CacheFiles: Provide read-and-reset release counters for cachefilesd)
      Reported-by: Jianhong Yin <jiyin@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Steve Dickson <steved@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vfs: move permission checking into notify_change() for utimes(NULL) · 7bf99896
      Miklos Szeredi authored

      commit f2b20f6ee842313a0d681dbbf7f87b70291a6a3b upstream.
      
      This fixes a bug where the permission was not properly checked in
      overlayfs.  The testcase is ltp/utimensat01.
      
      It is also cleaner and safer to do the permission checking in the vfs
      helper instead of the caller.
      
      This patch introduces an additional ia_valid flag ATTR_TOUCH (since
      touch(1) is the most obvious user of utimes(NULL)) that is passed into
      notify_change whenever the conditions for this special permission checking
      mode are met.
      Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Tested-by: Aihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dlm: free workqueues after the connections · 6b746940
      Marcelo Ricardo Leitner authored

      commit 3a8db79889ce16930aff19b818f5b09651bb7644 upstream.
      
      After backporting commit ee44b4bc ("dlm: use sctp 1-to-1 API")
      series to a kernel with an older workqueue which didn't use RCU yet, it
      was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
      too early as free_conn() will try to access that memory for canceling
      the queued works if any.
      
      This issue was introduced by commit 0d737a8c; before it, no such
      attempt to cancel the queued works was performed, so the issue was
      not present.
      
      This patch fixes it by simply inverting the free order.
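
Why the free order matters can be shown with a toy model. The toy_* types and functions are hypothetical, not the dlm lowcomms code, and an "alive" flag stands in for memory that would really have been freed.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; "alive" models whether the workqueue memory is valid. */
struct toy_workqueue { bool alive; };

struct toy_conn {
    struct toy_workqueue *wq;
    bool work_queued;
};

/* Cancels queued work; returns false if that would touch a destroyed workqueue. */
static bool toy_free_conn(struct toy_conn *c)
{
    if (c->work_queued) {
        if (!c->wq->alive)
            return false; /* this would be a use-after-free in real code */
        c->work_queued = false;
    }
    return true;
}

static void toy_destroy_workqueue(struct toy_workqueue *wq)
{
    wq->alive = false;
}

/* Fixed order: free the connections first, destroy the workqueue last. */
static bool toy_stop(struct toy_conn *c, struct toy_workqueue *wq)
{
    bool ok = toy_free_conn(c);

    toy_destroy_workqueue(wq);
    return ok;
}
```

Calling toy_destroy_workqueue() before toy_free_conn() makes toy_free_conn() report the invalid access, which is the ordering problem the commit describes.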
      
      Fixes: 0d737a8c ("dlm: fix race while closing connections")
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: vmx - Fix memory corruption caused by p8_ghash · 3fae7862
      Marcelo Cerri authored

      commit 80da44c29d997e28c4442825f35f4ac339813877 upstream.
      
      This patch changes the p8_ghash driver to use ghash-generic as a fixed
      fallback implementation. This allows the correct value of descsize to be
      defined directly in its shash_alg structure and avoids problems with
      incorrect buffer sizes when its state is exported or imported.
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Fixes: cc333cd6 ("crypto: vmx - Adding GHASH routines for VMX module")
      Signed-off-by: Marcelo Cerri <marcelo.cerri@canonical.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: ghash-generic - move common definitions to a new header file · e15e0b84
      Marcelo Cerri authored

      commit a397ba829d7f8aff4c90af3704573a28ccd61a59 upstream.
      
      Move common values and types used by ghash-generic to a new header file
      so drivers can directly use ghash-generic as a fallback implementation.
      
      Fixes: cc333cd6 ("crypto: vmx - Adding GHASH routines for VMX module")
      Signed-off-by: Marcelo Cerri <marcelo.cerri@canonical.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: unmap metadata when zeroing blocks · fb13b62a
      Jan Kara authored

      commit 9b623df614576680cadeaa4d7e0b5884de8f7c17 upstream.
      
      When zeroing blocks for DAX allocations, we also have to unmap aliases
      in the block device mappings.  Otherwise writeback can overwrite zeros
      with stale data from block device page cache.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: release bh in make_indexed_dir · ff50a724
      gmail authored

      commit e81d44778d1d57bbaef9e24c4eac7c8a7a401d40 upstream.
      
      The commit 6050d47a ("ext4: bail out from make_indexed_dir() on
      first error") could end up leaking bh2 in the error path.
      
      [ Also avoid renaming bh2 to bh, which just confuses things --tytso ]
      Signed-off-by: yangsheng <yngsion@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: allow DAX writeback for hole punch · 99fa4c50
      Ross Zwisler authored

      commit cca32b7eeb4ea24fa6596650e06279ad9130af98 upstream.
      
      Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
      This is because the logic around filemap_write_and_wait_range() in
      ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
      not for dirty DAX exceptional entries.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix memory leak when symlink decryption fails · 5373f6cc
      Eric Biggers authored

      commit dcce7a46c6f28f41447272fb44348ead8f584573 upstream.
      
      This bug was introduced in v4.8-rc1.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix memory leak in ext4_insert_range() · 7a689387
      Fabian Frederick authored

      commit edf15aa180d7b98fe16bd3eda42f9dd0e60dee20 upstream.
      
      Running xfstests generic/013 with kmemleak gives the following:
      
      unreferenced object 0xffff8801d3d27de0 (size 96):
        comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
          [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
          [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
          [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
          [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
          [<ffffffff81181334>] vfs_fallocate+0x134/0x210
          [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
          [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Problem seems mitigated by dropping refs and freeing path
      when there's no path[depth].p_ext
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: bugfix for mmaped pages in mpage_release_unused_pages() · 30ac674a
      wangguang authored

      commit 4e800c0359d9a53e6bf0ab216954971b2515247f upstream.
      
      After ext4 delayed block allocation fails, the buffers of the affected
      pages are cleared, but their pte_dirty flag is not. If such a page is
      then unmapped, unmap_page_range() may, based on pte_dirty, try to call
      __set_page_dirty(), which may lead to the BUG_ON at
      mpage_prepare_extent_to_map:head = page_buffers(page);.

      This patch calls clear_page_dirty_for_io() to clean pte_dirty
      in mpage_release_unused_pages() for mmapped pages.
      
      Steps to reproduce the bug:
      
      (1) mmap a file in ext4
      	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
      	       	            fd, 0);
      	memset(addr, 'i', 4096);
      
      (2) return EIO at
      
      	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent
      
      which causes this log message to be printed:
      
                      ext4_msg(sb, KERN_CRIT,
                              "Delayed block allocation failed for "
                              "inode %lu at logical offset %llu with"
                              " max blocks %u with error %d",
                              inode->i_ino,
                              (unsigned long long)map->m_lblk,
                              (unsigned)map->m_len, -err);
      
      (3) Unmapping addr causes a warning at
      
      	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));
      
      (4) Wait for a minute, then the BUG_ON triggers.
      Signed-off-by: wangguang <wangguang03@zte.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: reinforce check of i_dtime when clearing high fields of uid and gid · eac6c9e4
      Daeho Jeong authored

      commit 93e3b4e6631d2a74a8cf7429138096862ff9f452 upstream.
      
      Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
      of deleted and evicted inode to fix up interoperability with old
      kernels. However, it checks only i_dtime of an inode to determine
      whether the inode was deleted and evicted, and this is very risky,
      because i_dtime can be used for the pointer maintaining orphan inode
      list, too. We need to further check whether the i_dtime is being
      used for the orphan inode list even if the i_dtime is not NULL.
      
      We found that high 16-bit fields of uid/gid of inode are unintentionally
      and permanently cleared when the inode truncation is just triggered,
      but not finished, and the inode metadata, whose high uid/gid bits are
      cleared, is written on disk, and the sudden power-off follows that
      in order.
      Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: enforce online defrag restriction for encrypted files · ddcd9969
      Eric Whitney authored

      commit 14fbd4aa613bd5110556c281799ce36dc6f3ba97 upstream.
      
      Online defragging of encrypted files is not currently implemented.
      However, the move extent ioctl can still return successfully when
      called.  For example, this occurs when xfstest ext4/020 is run on an
      encrypted file system, resulting in a corrupted test file and a
      corresponding test failure.
      
      Until the proper functionality is implemented, fail the move extent
      ioctl if either the original or donor file is encrypted.
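
A minimal sketch of such a guard, in user-space C. The toy_* names are hypothetical, and the choice of -EOPNOTSUPP as the error code is an assumption made for illustration rather than something stated in this commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode with an encryption flag. */
struct toy_file { bool encrypted; };

/* Refuse the move when either file is encrypted; 0 means the move may proceed. */
static int toy_move_extents(const struct toy_file *orig,
                            const struct toy_file *donor)
{
    if (orig->encrypted || donor->encrypted)
        return -EOPNOTSUPP; /* assumed error code, see note above */
    return 0;
}
```

Failing early keeps the ioctl from reporting success for an operation that would otherwise corrupt the file.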
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • jbd2: fix lockdep annotation in add_transaction_credits() · 4cd2546a
      Jan Kara authored

      commit e03a9976afce6634826d56c33531dd10bb9a9166 upstream.
      
      Thomas has reported a lockdep splat hitting in
      add_transaction_credits(). The problem is that that function calls
      jbd2_might_wait_for_commit() while holding j_state_lock which is wrong
      (we do not really wait for transaction commit while holding that lock).
      
      Fix the problem by moving jbd2_might_wait_for_commit() into places where
      we are ready to wait for transaction commit and thus j_state_lock is
      unlocked.
      
      Fixes: 1eaa566d
      
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vfs,mm: fix a dead loop in truncate_inode_pages_range() · 3d549dcf
      Wei Fang authored

      commit c2a9737f45e27d8263ff9643f994bda9bac0b944 upstream.
      
      We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
      architecture with the test case below:
      
      	...
      	fd = open();
      	write(fd, buf, 4096);
      	preadv64(fd, &iovec, 1, 0xffffffff000);
      	ftruncate(fd, 0);
      	...
      
      After this, ftruncate() never returns.
      
      The filesystem used in this case is ubifs, but it can be triggered on
      many other filesystems.
      
      When preadv64() is called with offset=0xffffffff000, a page with
      index=0xffffffff will be added to the radix tree of ->mapping.  Then
      this page can be found in ->mapping with pagevec_lookup().  After that,
      truncate_inode_pages_range(), which is called in ftruncate(), will fall
      into an infinite loop:
      
       - find a page with index=0xffffffff, since index>=end, this page won't
         be truncated
      
       - index++, and index becomes 0
      
       - the page with index=0xffffffff will be found again
      
      The data type of index is unsigned long, so index won't overflow to 0 on
      a 64-bit architecture in this case, and the dead loop won't happen.

      Since truncate_inode_pages_range() is executed while holding
      inode->i_rwsem, any operation that needs this lock will be blocked,
      and a hung task will happen, e.g.:
      
        INFO: task truncate_test:3364 blocked for more than 120 seconds.
        ...
           call_rwsem_down_write_failed+0x17/0x30
           generic_file_write_iter+0x32/0x1c0
           ubifs_write_iter+0xcc/0x170
           __vfs_write+0xc4/0x120
           vfs_write+0xb2/0x1b0
           SyS_write+0x46/0xa0
      
      The page with index=0xffffffff added to ->mapping is useless.  Fix this
      by checking the read position before allocating pages.
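
The wraparound at the heart of this loop can be reproduced in user space. The sketch below models a 32-bit unsigned long with uint32_t; the toy_* names, the 4 KiB page size, and the max_bytes parameter (standing in for whatever size limit the read path checks) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Page index as a 32-bit kernel stores it (unsigned long is 32 bits there). */
typedef uint32_t toy_pgoff_t;

/* Page index for a byte offset, truncated to 32 bits. */
static toy_pgoff_t toy_page_index(uint64_t offset)
{
    return (toy_pgoff_t)(offset >> TOY_PAGE_SHIFT);
}

/* Advancing past the last representable index wraps around to 0. */
static toy_pgoff_t toy_next_index(toy_pgoff_t index)
{
    return (toy_pgoff_t)(index + 1u);
}

/* Sketch of the fix: reject read positions at or beyond a size limit
 * before any page is allocated. max_bytes is a hypothetical parameter. */
static bool toy_read_pos_ok(uint64_t pos, uint64_t max_bytes)
{
    return pos < max_bytes;
}
```

With offset 0xffffffff000 the page index is 0xffffffff, and incrementing it yields 0, so a scan that restarts from the incremented index finds the same page again.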
      
      Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
      
      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hugetlb: fix memory offline with hugepage size > memory block size · d9cf9c31
      Gerald Schaefer authored

      commit 2247bb335ab9c40058484cac36ea74ee652f3b7b upstream.
      
      Patch series "mm/hugetlb: memory offline issues with hugepages", v4.
      
      This addresses several issues with hugepages and memory offline.  While
      the first patch fixes a panic, and is therefore rather important, the
      last patch is just a performance optimization.
      
      The second patch fixes a theoretical issue with reserved hugepages,
      while still leaving some ugly usability issue, see description.
      
      This patch (of 3):
      
      dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
      list corruption and addressing exception when trying to set a memory
      block offline that is part (but not the first part) of a "gigantic"
      hugetlb page with a size > memory block size.
      
      When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
      will trigger directly.  In the other case we will run into an addressing
      exception later, because dissolve_free_huge_page() will not work on the
      head page of the compound hugetlb page which will result in a NULL
      hstate from page_hstate().
      
      To fix this, first remove the VM_BUG_ON() because it is wrong, and then
      use the compound head page in dissolve_free_huge_page().  This means
      that an unused pre-allocated gigantic page that has any part of itself
      inside the memory block that is going offline will be dissolved
      completely.  Losing an unused gigantic hugepage is preferable to failing
      the memory offline, for example in the situation where a (possibly
      faulty) memory DIMM needs to go offline.
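
The head-page lookup described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_page` and every `toy_*` function are hypothetical stand-ins invented for illustration, not the kernel's `struct page` or the actual patch. It shows why resolving a tail page to its compound head lets a block that covers only the later part of a huge page still dissolve the whole page.

```c
#include <stdbool.h>
#include <stddef.h>

enum { TOY_NPAGES = 16 };

/* Toy stand-in for struct page; hypothetical, not the kernel's layout. */
struct toy_page {
    size_t head;      /* index of the compound head (own index if not compound) */
    size_t nr_pages;  /* on a head page: pages in the compound page; otherwise 1 */
    bool   free_huge; /* on a head page: unused, pre-allocated huge page */
};

/* Reset every page to an ordinary, non-compound page. */
void toy_init(struct toy_page *pages)
{
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        pages[i].head = i;
        pages[i].nr_pages = 1;
        pages[i].free_huge = false;
    }
}

/* Build one free huge page of n base pages starting at index head. */
void toy_make_free_huge(struct toy_page *pages, size_t head, size_t n)
{
    for (size_t i = head; i < head + n; i++) {
        pages[i].head = head;
        pages[i].nr_pages = 1;
    }
    pages[head].nr_pages = n;
    pages[head].free_huge = true;
}

/* Dissolve the free huge page that contains idx, if there is one. */
bool toy_dissolve_free_huge_page(struct toy_page *pages, size_t idx)
{
    /* Resolve a possible tail page to its head before touching any state. */
    size_t head = pages[idx].head;

    if (!pages[head].free_huge)
        return false;

    size_t n = pages[head].nr_pages;
    pages[head].free_huge = false;
    pages[head].nr_pages = 1;
    for (size_t i = head; i < head + n; i++)
        pages[i].head = i; /* every constituent page becomes ordinary again */
    return true;
}

/* Walk the block [start, end) going offline; return how many huge pages were dissolved. */
int toy_dissolve_free_huge_pages(struct toy_page *pages, size_t start, size_t end)
{
    int dissolved = 0;

    for (size_t i = start; i < end; i++)
        if (toy_dissolve_free_huge_page(pages, i))
            dissolved++;
    return dissolved;
}
```

In this model, a free huge page spanning indices 0 to 7 and an offline block covering indices 4 to 7 dissolve the whole page exactly once, even though the block never contains the head page.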
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
      
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9cf9c31
    • Manfred Spraul's avatar
      ipc/sem.c: fix complex_count vs. simple op race · a9d465be
      Manfred Spraul authored
      commit 5864a2fd3088db73d47942370d0f7210a807b9bc upstream.
      
      Commit 6d07b68c ("ipc/sem.c: optimize sem_lock()") introduced a
      race:
      
      sem_lock has a fast path that allows parallel simple operations.
      There are two reasons why a simple operation cannot run in parallel:
       - a non-simple operation is ongoing (sma->sem_perm.lock held)
       - a complex operation is sleeping (sma->complex_count != 0)
      
      As both facts are stored independently, a thread can bypass the current
      checks by sleeping in the right positions.  See below for more details
      (or kernel bugzilla 105651).
      
      The patch fixes that by creating one variable (complex_mode)
      that tracks both reasons why parallel operations are not possible.
      
      The patch also updates stale documentation regarding the locking.
      
      With regard to stable kernels:
      The patch is required for all kernels that include the
      commit 6d07b68c ("ipc/sem.c: optimize sem_lock()") (3.10?)
      
      The alternative is to revert the patch that introduced the race.
      
      The patch is safe for backporting, i.e. it makes no assumptions
      about memory barriers in spin_unlock_wait().
      
      Background:
      Here is the race of the current implementation:
      
      Thread A: (simple op)
      - does the first "sma->complex_count == 0" test
      
      Thread B: (complex op)
      - does sem_lock(): This includes an array scan. But the scan can't
        find Thread A, because Thread A does not own sem->lock yet.
      - the thread does the operation, increases complex_count,
        drops sem_lock, sleeps
      
      Thread A:
      - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
      - sleeps before the complex_count test
      
      Thread C: (complex op)
      - does sem_lock (no array scan, complex_count==1)
      - wakes up Thread B.
      - decrements complex_count
      
      Thread A:
      - does the complex_count test
      
      Bug:
      Now both Thread A and Thread C operate on the same array, without
      any synchronization.
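
The interleaving above can be replayed deterministically in a single-threaded model. This is a sketch under assumptions, not the actual patch: `struct toy_sem_array`, `toy_complex_lock()`, `toy_complex_unlock()` and `toy_replay_race()` are hypothetical names, and the rule that the single flag is cleared on unlock only once no complex operation is sleeping is an assumption about how one combined variable could cover both conditions.

```c
#include <stdbool.h>

/* Toy model of the semaphore array state; hypothetical, not the kernel struct. */
struct toy_sem_array {
    bool global_locked; /* models sma->sem_perm.lock being held */
    int  complex_count; /* models sma->complex_count */
    bool complex_mode;  /* models the single combined flag */
};

/* A complex operation takes the global lock and enters complex mode. */
void toy_complex_lock(struct toy_sem_array *sma)
{
    sma->global_locked = true;
    sma->complex_mode = true;
}

/* Assumption: complex mode is left only when no complex op is sleeping. */
void toy_complex_unlock(struct toy_sem_array *sma)
{
    if (sma->complex_count == 0)
        sma->complex_mode = false;
    sma->global_locked = false;
}

/*
 * Replay the A/B/C interleaving from the commit message.
 * Returns true if Thread A ends up on the fast path while Thread C
 * still holds the global lock, i.e. if the race is hit.
 */
bool toy_replay_race(bool use_complex_mode)
{
    struct toy_sem_array sma = { false, 0, false };
    bool a_fast;

    /* Thread A: first, lockless test. */
    a_fast = use_complex_mode ? !sma.complex_mode : (sma.complex_count == 0);

    /* Thread B (complex op): lock, operate, bump complex_count, unlock, sleep. */
    toy_complex_lock(&sma);
    sma.complex_count++;
    toy_complex_unlock(&sma);

    /* Thread A: takes its per-semaphore lock, then re-checks. */
    if (use_complex_mode)
        a_fast = a_fast && !sma.complex_mode;  /* one flag covers both reasons */
    else
        a_fast = a_fast && !sma.global_locked; /* only sees that B dropped the lock */
    /* Thread A is preempted here, before its final test. */

    /* Thread C (complex op): lock, wake B, decrement; keeps holding the lock. */
    toy_complex_lock(&sma);
    sma.complex_count--;

    /* Thread A: final test. */
    a_fast = a_fast &&
             (use_complex_mode ? !sma.complex_mode : (sma.complex_count == 0));

    return a_fast && sma.global_locked;
}
```

With the two independent checks, `toy_replay_race(false)` reports the race. With the single flag, `toy_replay_race(true)` does not, because the flag stays set from Thread B's lock until a complex operation unlocks with no sleepers left.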
      
      Fixes: 6d07b68c ("ipc/sem.c: optimize sem_lock()")
      Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
      
      
      Reported-by: <felixh@informatik.uni-bremen.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <1vier1@web.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a9d465be