1. 10 Sep, 2013 4 commits
  2. 08 Sep, 2013 5 commits
    • vfs: fix dentry RCU to refcounting possibly sleeping dput() · e5c832d5
      Linus Torvalds authored
      This is the fix that the last two commits indirectly led up to - making
      sure that we don't call dput() in a bad context on the dentries we've
      looked up in RCU mode after the sequence count validation fails.
      This basically expands d_rcu_to_refcount() into the callers, and then
      fixes the callers to delay the dput() in the failure case until _after_
      we've dropped all locks and are no longer in an RCU-locked region.
      The case of 'complete_walk()' was trivial, since its failure case did
      the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
      and as such that is just a pure expansion of the function with a trivial
      movement of the resulting dput() to after 'unlock_rcu_walk()'.
      In contrast, the unlazy_walk() case was much more complicated, because
      not only does it convert two different dentries from RCU to be reference
      counted, but it used to not call unlock_rcu_walk() at all, and instead
      just returned an error and let the caller clean everything up in
      Happily, one of the dentries in question (called "parent" inside
      unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
      a refcount to anyway for the non-RCU case.
      So what the new and improved unlazy_walk() does is to first turn that
      dentry into a refcounted one, and once that is set up, the error cases
      can continue to use the terminate_walk() helper for cleanup, but for the
      non-RCU case.  Which makes it possible to drop out of RCU mode if we
      actually hit the sequence number failure case.
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
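      The ordering described in this commit message can be sketched in
      userspace C. This is only a toy model: every toy_* name is
      hypothetical, the RCU section is modelled as a counter, and the real
      fs/namei.c code differs in detail. It shows the failure path leaving
      the RCU section before the possibly sleeping put.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for kernel objects; this is not the real fs/namei.c code. */
struct toy_dentry {
    int refcount;   /* models the dentry reference count */
    unsigned seq;   /* models the dentry sequence count */
};

static int rcu_depth;         /* > 0 while inside the modelled RCU read section */
static int puts_under_rcu;    /* counts the bug: a possibly sleeping put under RCU */

static void toy_rcu_lock(void)   { rcu_depth++; }
static void toy_rcu_unlock(void) { assert(rcu_depth > 0); rcu_depth--; }

/* Models dput(): dropping a reference may sleep, so it must not run under RCU. */
static void toy_dput(struct toy_dentry *d)
{
    assert(d->refcount > 0);
    if (rcu_depth > 0)
        puts_under_rcu++;
    d->refcount--;
}

/*
 * Models the fixed failure path: take a speculative reference, validate the
 * sequence count, and on mismatch leave the RCU section first and only then
 * drop the reference. Returns true if the dentry was converted.
 */
static bool toy_legitimize(struct toy_dentry *d, unsigned seen_seq)
{
    d->refcount++;               /* speculative reference */
    if (d->seq != seen_seq) {
        toy_rcu_unlock();        /* drop out of RCU mode first */
        toy_dput(d);             /* a sleeping put is now harmless */
        return false;
    }
    toy_rcu_unlock();
    return true;
}
```

      A caller enters the modelled RCU section with toy_rcu_lock(), calls
      toy_legitimize(), and on failure finds the reference count unchanged
      and puts_under_rcu still at zero.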
    • introduce kern_path_mountpoint() · 2d864651
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • rename user_path_umountat() to user_path_mountpoint_at() · 197df04c
      Al Viro authored
      ... and move the extern from linux/namei.h to fs/internal.h,
      along with that of vfs_path_lookup().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • take unlazy_walk() into umount_lookup_last() · 35759521
      Al Viro authored
      ... and massage it a bit to reduce nesting
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: use lockref "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Linus Torvalds authored
      This simplifies the RCU to refcounting code in particular.
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      : "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
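      A minimal sketch of the "dead" flag idea, assuming a simplified model
      in which a negative count marks a dead object. The toy_* names are
      hypothetical and locking is omitted; the helpers are loosely modelled
      on the kernel's lockref_mark_dead() and lockref_get_not_dead(). The
      point is that a failed get takes no reference, so the failure path
      needs no put at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lockref: the real one pairs the count with a spinlock (omitted here). */
struct toy_lockref {
    int count;   /* >= 0: live reference count, < 0: marked dead */
};

/* Models marking an object dead: no new references can be taken afterwards. */
static void toy_mark_dead(struct toy_lockref *ref)
{
    ref->count = -1;   /* any negative value works for this model */
}

/* Models "get unless dead": on failure the count is left untouched. */
static bool toy_get_not_dead(struct toy_lockref *ref)
{
    if (ref->count < 0)
        return false;  /* dead: nothing was taken, nothing to undo */
    ref->count++;
    return true;
}
```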
  3. 03 Sep, 2013 1 commit
    • vfs: allow umount to handle mountpoints without revalidating them · 8033426e
      Jeff Layton authored
      Christopher reported a regression where he was unable to unmount a NFS
      filesystem where the root had gone stale. The problem is that
      d_revalidate handles the root of the filesystem differently from other
      dentries, but d_weak_revalidate does not. We could simply fix this by
      making d_weak_revalidate return success on IS_ROOT dentries, but there
      are cases where we do want to revalidate the root of the fs.
      A umount is really a special case. We generally aren't interested in
      anything but the dentry and vfsmount that's attached at that point. If
      the inode turns out to be stale we just don't care since the intent is
      to stop using it anyway.
      Try to handle this situation better by treating umount as a special
      case in the lookup code. Have it resolve the parent using normal
      means, and then do a lookup of the final dentry without revalidating
      it. In most cases, the final lookup will come out of the dcache, but
      the case where there's a trailing symlink or !LAST_NORM entry on the
      end complicates things a bit.
      Cc: Neil Brown <neilb@suse.de>
      Reported-by: Christopher T Vogan <cvogan@us.ibm.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 02 Sep, 2013 1 commit
    • vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Linus Torvalds authored
      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      Acked-by: Waiman Long <waiman.long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 28 Aug, 2013 2 commits
    • vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long authored
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
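      The "count plus spinlock in one structure" idea can be illustrated
      with a small userspace sketch. It assumes a C11 atomic_flag as a
      stand-in for the kernel spinlock; the toy_* names are hypothetical and
      this is not the kernel's struct lockref. As in the reference
      implementation the commit describes, every count update happens under
      the lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy lockref: one lock and one count that always travel together. */
struct toy_lockref {
    atomic_flag lock;   /* stands in for the kernel spinlock */
    int count;          /* reference count protected by the lock */
};

static void toy_lock(struct toy_lockref *ref)
{
    while (atomic_flag_test_and_set_explicit(&ref->lock, memory_order_acquire))
        ;   /* spin until the flag is released */
}

static void toy_unlock(struct toy_lockref *ref)
{
    atomic_flag_clear_explicit(&ref->lock, memory_order_release);
}

/* Takes a reference under the lock. */
static void toy_lockref_get(struct toy_lockref *ref)
{
    toy_lock(ref);
    ref->count++;
    toy_unlock(ref);
}

/* Drops a reference under the lock and returns the new count. */
static int toy_lockref_put(struct toy_lockref *ref)
{
    int left;

    toy_lock(ref);
    assert(ref->count > 0);
    left = --ref->count;
    toy_unlock(ref);
    return left;
}
```

      Keeping the two fields adjacent is what later allows both to be
      updated together without taking the lock, which is the change this
      commit prepares for.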
    • Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink" · f0cc6ffb
      Linus Torvalds authored
      This reverts commit bb2314b4
      It wasn't necessarily wrong per se, but we're still busily discussing
      the exact details of this all, so I'm going to revert it for now.
      It's true that you can already do flink() through /proc and that flink()
      isn't new.  But as Brad Spengler points out, some secure environments do
      not mount proc, and flink adds a new interface that can avoid path
      lookup of the source for those kinds of environments.
      We may re-do this (and even mark it for stable backporting back in 3.11
      and possibly earlier) once the whole discussion about the interface is done.
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 05 Aug, 2013 1 commit
    • fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink · bb2314b4
      Andy Lutomirski authored
      Every now and then someone proposes a new flink syscall, and this spawns
      a long discussion of whether it would be a security problem.  I think
      that this is missing the point: flink is *already* allowed without
      privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.
      Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
      write it, and link it in is very convenient.  The only problem is that
      it requires that /proc be mounted so that you can do:
      linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_FOLLOW)
      This sucks -- it's much nicer to do:
      linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)
      Let's allow it.
      If this turns out to be excessively scary, we could instead require
      that the inode in question be I_LINKABLE, but this seems pointless given
      the /proc situation.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
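      The /proc-based route described above can be written out as a short
      userspace sketch. It assumes Linux, a libc that exposes O_TMPFILE
      under _GNU_SOURCE, a filesystem that supports O_TMPFILE, and a mounted
      /proc. The helper name publish_tmpfile() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create an unnamed file in `dir`, write `len` bytes to it, then give it
 * the name `dest` by linking through /proc/self/fd.
 * Returns 0 on success, -1 on failure.
 */
static int publish_tmpfile(const char *dir, const char *dest,
                           const void *data, size_t len)
{
    char procpath[64];
    int fd = open(dir, O_TMPFILE | O_RDWR, 0600);

    if (fd < 0)
        return -1;   /* e.g. the filesystem lacks O_TMPFILE support */

    if (write(fd, data, len) != (ssize_t)len)
        goto fail;

    /* AT_SYMLINK_FOLLOW makes linkat() resolve the /proc magic link. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, AT_FDCWD, dest, AT_SYMLINK_FOLLOW) != 0)
        goto fail;

    close(fd);
    return 0;
fail:
    close(fd);
    return -1;
}
```

      The shorter form this commit proposes would replace the snprintf()
      and linkat() pair with a single linkat(fd, "", AT_FDCWD, dest,
      AT_EMPTY_PATH) call and would not depend on /proc.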
  7. 13 Jul, 2013 1 commit
    • Safer ABI for O_TMPFILE · bb458c64
      Al Viro authored
      [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE;
      that will fail on old kernels in a lot more cases than what I came up with.
      And make sure O_CREAT doesn't get there...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
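      The effect of folding O_DIRECTORY into O_TMPFILE can be checked from
      userspace. The sketch below assumes Linux and a libc that exposes
      O_TMPFILE under _GNU_SOURCE; both helper names are hypothetical. The
      second helper imitates what a kernel unaware of the O_TMPFILE-specific
      bit would be asked to do, namely open a directory for writing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the libc definition of O_TMPFILE carries the O_DIRECTORY bit. */
static int tmpfile_flag_includes_directory(void)
{
    return (O_TMPFILE & O_DIRECTORY) == O_DIRECTORY;
}

/*
 * Try to open a directory with O_DIRECTORY | O_RDWR.
 * Returns the errno of the attempt, or 0 if the open unexpectedly succeeded.
 */
static int open_dir_for_write_errno(const char *dir)
{
    int fd = open(dir, O_DIRECTORY | O_RDWR);

    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

      On Linux, opening a directory for writing is rejected with EISDIR,
      so a caller on an old kernel gets an error rather than a handle to
      the directory itself.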
  8. 29 Jun, 2013 5 commits
  9. 14 Jun, 2013 1 commit
  10. 07 May, 2013 1 commit
    • audit: vfs: fix audit_inode call in O_CREAT case of do_last · 33e2208a
      Jeff Layton authored
      Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
      In older kernels, creating a file with open(..., O_CREAT) created
      audit_name records that looked like this:
      type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      ...in recent kernels though, they look like this:
      type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      Richard bisected to determine that the problems started with commit
      , but the log messages have changed with some later
      audit-related patches.
      The problem is that this audit_inode call is passing in the parent of
      the dentry being opened, but audit_inode is being called with the parent
      flag false. This causes later audit_inode and audit_inode_child calls to
      match the wrong entry in the audit_names list.
      This patch simply sets the flag to properly indicate that this inode
      represents the parent. With this, the audit_names entries are back to
      looking like they did before.
      Cc: <stable@vger.kernel.org> # v3.7+
      Reported-by: Jiri Jaburek <jjaburek@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Test By: Richard Guy Briggs <rbriggs@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
  11. 08 Mar, 2013 1 commit
  12. 01 Mar, 2013 1 commit
  13. 26 Feb, 2013 1 commit
    • vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op · ecf3d1f1
      Jeff Layton authored
      The following set of operations on a NFS client and server will cause
          server# mkdir a
          client# cd a
          server# mv a a.bak
          client# sleep 30  # (or whatever the dir attrcache timeout is)
          client# stat .
          stat: cannot stat `.': Stale NFS file handle
      Obviously, we should not be getting an ESTALE error back there since the
      inode still exists on the server. The problem is that the lookup code
      will call d_revalidate on the dentry that "." refers to, because NFS has
      FS_REVAL_DOT set.
      nfs_lookup_revalidate will see that the parent directory has changed and
      will try to reverify the dentry by redoing a LOOKUP. That of course
      fails, so the lookup code returns ESTALE.
      The problem here is that d_revalidate is really a bad fit for this case.
      What we really want to know at this point is whether the inode is still
      good or not, but we don't really care what name it goes by or whether
      the dcache is still valid.
      Add a new d_op->d_weak_revalidate operation and have complete_walk call
      that instead of d_revalidate. The intent there is to allow for a
      "weaker" d_revalidate that just checks to see whether the inode is still
      good. This also gives us an opportunity to kill off the FS_REVAL_DOT
      special casing.
      [AV: changed method name, added note in porting, fixed confusion re
      having it possibly called from RCU mode (it won't be)]
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
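      The split between a strict and a weak revalidation hook can be
      modelled with a small operations table. This is a toy sketch: the
      toy_* names, the two state fields and the 0/1 return values are
      assumptions made for illustration, and the real struct
      dentry_operations hooks have different signatures and return values.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry;

/* Toy operations table with a strict and a weak check. */
struct toy_dentry_ops {
    int (*d_revalidate)(struct toy_dentry *d);        /* name and inode */
    int (*d_weak_revalidate)(struct toy_dentry *d);   /* inode only */
};

struct toy_dentry {
    const struct toy_dentry_ops *ops;
    int inode_alive;   /* 1 while the inode still exists on the server */
    int name_valid;    /* 0 once the directory was renamed on the server */
};

static int toy_strict(struct toy_dentry *d) { return d->inode_alive && d->name_valid; }
static int toy_weak(struct toy_dentry *d)   { return d->inode_alive; }

static const struct toy_dentry_ops toy_nfs_ops = {
    .d_revalidate = toy_strict,
    .d_weak_revalidate = toy_weak,
};

/*
 * Models the end of a walk that reached the dentry without a name lookup
 * (for example "."): only the weak check runs. Returns 1 if usable.
 */
static int toy_complete_walk(struct toy_dentry *d)
{
    if (!d->ops || !d->ops->d_weak_revalidate)
        return 1;   /* the filesystem did not ask for a check */
    return d->ops->d_weak_revalidate(d);
}
```

      With the directory renamed on the server (name_valid = 0) but its
      inode still present, toy_complete_walk() accepts the dentry while the
      strict check would reject it, which mirrors the "stat ." scenario in
      the commit message.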
  14. 22 Feb, 2013 6 commits
  15. 20 Dec, 2012 9 commits