1. 26 Sep, 2014 6 commits
  2. 13 Sep, 2014 2 commits
    • Al Viro's avatar
      move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
      Al Viro authored
      
      
      and lock the right list there
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      6f18493e
    • Linus Torvalds's avatar
      vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds authored
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77
      
       ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      
      Reported-by: default avatarJosef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99d263d4
  3. 07 Aug, 2014 9 commits
  4. 11 Jun, 2014 1 commit
    • Al Viro's avatar
      lock_parent: don't step on stale ->d_parent of all-but-freed one · c2338f2d
      Al Viro authored
      
      
      Dentry that had been through (or into) __dentry_kill() might be seen
      by shrink_dentry_list(); that's normal, it'll be taken off the shrink
      list and freed if __dentry_kill() has already finished.  The problem
      is, its ->d_parent might be pointing to already freed dentry, so
      lock_parent() needs to be careful.
      
      We need to check that dentry hasn't already gone into __dentry_kill()
      *and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
      sure that whatever we see in ->d_parent after dropping ->d_lock it
      won't be freed until we drop rcu_read_lock().
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      c2338f2d
  5. 06 Jun, 2014 1 commit
  6. 31 May, 2014 1 commit
    • Linus Torvalds's avatar
      dcache: add missing lockdep annotation · 9f12600f
      Linus Torvalds authored
      lock_parent() very much on purpose does nested locking of dentries, and
      is careful to maintain the right order (lock parent first).  But because
      it didn't annotate the nested locking order, lockdep thought it might be
      a deadlock on d_lock, and complained.
      
      Add the proper annotation for the inner locking of the child dentry to
      make lockdep happy.
      
      Introduced by commit 046b961b
      
       ("shrink_dentry_list(): take parent's
      ->d_lock earlier").
      
      Reported-and-tested-by: default avatarJosh Boyer <jwboyer@fedoraproject.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f12600f
  7. 30 May, 2014 3 commits
    • Al Viro's avatar
      dentry_kill() doesn't need the second argument now · 8cbf74da
      Al Viro authored
      
      
      it's 1 in the only remaining caller.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      8cbf74da
    • Al Viro's avatar
      dealing with the rest of shrink_dentry_list() livelock · b2b80195
      Al Viro authored
      
      
      We have the same problem with ->d_lock order in the inner loop, where
      we are dropping references to ancestors.  Same solution, basically -
      instead of using dentry_kill() we use lock_parent() (introduced in the
      previous commit) to get that lock in a safe way, recheck ->d_count
      (in case if lock_parent() has ended up dropping and retaking ->d_lock
      and somebody managed to grab a reference during that window), trylock
      the inode->i_lock and use __dentry_kill() to do the rest.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b2b80195
    • Al Viro's avatar
      shrink_dentry_list(): take parent's ->d_lock earlier · 046b961b
      Al Viro authored
      
      
      The cause of livelocks there is that we are taking ->d_lock on
      dentry and its parent in the wrong order, forcing us to use
      trylock on the parent's one.  d_walk() takes them in the right
      order, and unfortunately it's not hard to create a situation
      when shrink_dentry_list() can't make progress since trylock
      keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
      keeps calling d_walk() disrupting the very shrink_dentry_list() it's
      waiting for.
      
      Solution is straightforward - if that trylock fails, let's unlock
      the dentry itself and take locks in the right order.  We need to
      stabilize ->d_parent without holding ->d_lock, but that's doable
      using RCU.  And we'd better do that in the very beginning of the
      loop in shrink_dentry_list(), since the checks on refcount, etc.
      would need to be redone anyway.
      
      That deals with a half of the problem - killing dentries on the
      shrink list itself.  Another one (dropping their parents) is
      in the next commit.
      
      locking parent is interesting - it would be easy to do rcu_read_lock(),
      lock whatever we think is a parent, lock dentry itself and check
      if the parent is still the right one.  Except that we need to check
      that *before* locking the dentry, or we are risking taking ->d_lock
      out of order.  Fortunately, once the D1 is locked, we can check if
      D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
      can start or stop pointing to D1 only under D1->d_lock, so taking
      D1->d_lock is enough.  In other words, the right solution is
      rcu_read_lock/lock what looks like parent right now/check if it's
      still our parent/rcu_read_unlock/lock the child.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      046b961b
  8. 29 May, 2014 2 commits
  9. 28 May, 2014 1 commit
  10. 03 May, 2014 3 commits
    • Miklos Szeredi's avatar
      dcache: don't need rcu in shrink_dentry_list() · 60942f2f
      Miklos Szeredi authored
      
      
      Since now the shrink list is private and nobody can free the dentry while
      it is on the shrink list, we can remove RCU protection from this.
      
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      60942f2f
    • Al Viro's avatar
      more graceful recovery in umount_collect() · 9c8c10e2
      Al Viro authored
      
      
      Start with shrink_dcache_parent(), then scan what remains.
      
      First of all, BUG() is very much an overkill here; we are holding
      ->s_umount, and hitting BUG() means that a lot of interesting stuff
      will be hanging after that point (sync(2), for example).  Moreover,
      in cases when there had been more than one leak, we'll be better
      off reporting all of them.  And more than just the last component
      of pathname - %pd is there for just such uses...
      
      That was the last user of dentry_lru_del(), so kill it off...
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      9c8c10e2
    • Al Viro's avatar
      don't remove from shrink list in select_collect() · fe91522a
      Al Viro authored
      
      
      	If we find something already on a shrink list, just increment
      data->found and do nothing else.  Loops in shrink_dcache_parent() and
      check_submounts_and_drop() will do the right thing - everything we
      did put into our list will be evicted and if there had been nothing,
      but data->found got non-zero, well, we have somebody else shrinking
      those guys; just try again.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      fe91522a
  11. 01 May, 2014 1 commit
    • Al Viro's avatar
      dentry_kill(): don't try to remove from shrink list · 41edf278
      Al Viro authored
      
      
      If the victim in on the shrink list, don't remove it from there.
      If shrink_dentry_list() manages to remove it from the list before
      we are done - fine, we'll just free it as usual.  If not - mark
      it with new flag (DCACHE_MAY_FREE) and leave it there.
      
      Eventually, shrink_dentry_list() will get to it, remove the sucker
      from shrink list and call dentry_kill(dentry, 0).  Which is where
      we'll deal with freeing.
      
      Since now dentry_kill(dentry, 0) may happen after or during
      dentry_kill(dentry, 1), we need to recognize that (by seeing
      DCACHE_DENTRY_KILLED already set), unlock everything
      and either free the sucker (in case DCACHE_MAY_FREE has been
      set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      41edf278
  12. 30 Apr, 2014 4 commits
  13. 19 Apr, 2014 1 commit
    • Al Viro's avatar
      fix races between __d_instantiate() and checks of dentry flags · 22213318
      Al Viro authored
      
      
      in non-lazy walk we need to be careful about dentry switching from
      negative to positive - both ->d_flags and ->d_inode are updated,
      and in some places we might see only one store.  The cases where
      dentry has been obtained by dcache lookup with ->i_mutex held on
      parent are safe - ->d_lock and ->i_mutex provide all the barriers
      we need.  However, there are several places where we run into
      trouble:
      	* do_last() fetches ->d_inode, then checks ->d_flags and
      assumes that inode won't be NULL unless d_is_negative() is true.
      Race with e.g. creat() - we might have fetched the old value of
      ->d_inode (still NULL) and new value of ->d_flags (already not
      DCACHE_MISS_TYPE).  Lin Ming has observed and reported the resulting
      oops.
      	* a bunch of places checks ->d_inode for being non-NULL,
      then checks ->d_flags for "is it a symlink".  Race with symlink(2)
      in case if our CPU sees ->d_inode update first - we see non-NULL
      there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
      DCACHE_SYMLINK_TYPE.  Result: false negative on "should we follow
      link here?", with subsequent unpleasantness.
      
      Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
      Reported-and-tested-by: default avatarLin Ming <minggr@gmail.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      22213318
  14. 01 Apr, 2014 1 commit
  15. 22 Mar, 2014 1 commit
  16. 16 Mar, 2014 1 commit
    • David Herrmann's avatar
      drm: add pseudo filesystem for shared inodes · 31bbe16f
      David Herrmann authored
      
      
      Our current DRM design uses a single address_space for all users of the
      same DRM device. However, there is no way to create an anonymous
      address_space without an underlying inode. Therefore, we wait for the
      first ->open() callback on a registered char-dev and take-over the inode
      of the char-dev. This worked well so far, but has several drawbacks:
       - We screw with FS internals and rely on some non-obvious invariants like
         inode->i_mapping being the same as inode->i_data for char-devs.
       - We don't have any address_space prior to the first ->open() from
         user-space. This leads to ugly fallback code and we cannot allocate
         global objects early.
      
      As pointed out by Al-Viro, fs/anon_inode.c is *not* supposed to be used by
      drivers for anonymous inode-allocation. Therefore, this patch follows the
      proposed alternative solution and adds a pseudo filesystem mount-point to
      DRM. We can then allocate private inodes including a private address_space
      for each DRM device at initialization time.
      
      Note that we could use:
        sysfs_get_inode(sysfs_mnt->mnt_sb, drm_device->dev->kobj.sd);
      to get access to the underlying sysfs-inode of a "struct device" object.
      However, most of this information is currently hidden and it's not clear
      whether this address_space is suitable for driver access. Thus, unless
      linux allows anonymous address_space objects or driver-core provides a
      public inode per device, we're left with our own private internal mount
      point.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid Herrmann <dh.herrmann@gmail.com>
      31bbe16f
  17. 26 Jan, 2014 2 commits