1. 16 Mar, 2011 1 commit
  2. 15 Mar, 2011 2 commits
    • Al Viro's avatar
      Allow passing O_PATH descriptors via SCM_RIGHTS datagrams · 326be7b4
      Al Viro authored
      Just need to make sure that AF_UNIX garbage collector won't
      confuse O_PATHed socket on filesystem for real AF_UNIX opened
      socket.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      326be7b4
    • Al Viro's avatar
      New kind of open files - "location only". · 1abf0c71
      Al Viro authored
      New flag for open(2) - O_PATH.  Semantics:
      	* pathname is resolved, but the file itself is _NOT_ opened
      as far as filesystem is concerned.
      	* almost all operations on the resulting descriptors shall
      fail with -EBADF.  Exceptions are:
      	1) operations on descriptors themselves (i.e.
      		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
      		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
      		fcntl(fd, F_SETFD, ...))
      	2) fcntl(fd, F_GETFL), for a common non-destructive way to
      		check if descriptor is open
      	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
      		points of pathname resolution
      	* closing such descriptor does *NOT* affect dnotify or
      posix locks.
      	* permissions are checked as usual along the way to file;
      no permission checks are applied to the file itself.  Of course,
      giving such thing to syscall will result in permission checks (at
      the moment it means checking that starting point of ....at() is
      a directory and caller has exec permissions on it).
      
      fget() and fget_light() return NULL on such descriptors; use of
      fget_raw() and fget_raw_light() is needed to get them.  That protects
      existing code from dealing with those things.
      
      There are two things still missing (they come in the next commits):
      one is handling of symlinks (right now we refuse to open them that
      way; see the next commit for semantics related to those) and another
      is descriptor passing via SCM_RIGHTS datagrams.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      1abf0c71
  3. 10 Feb, 2011 1 commit
  4. 04 Feb, 2011 1 commit
  5. 17 Jan, 2011 1 commit
    • Steven Rostedt's avatar
      fs: Remove unlikely() from fget_light() · 3bc0ba43
      Steven Rostedt authored
      There's an unlikely() in fget_light() that assumes the file ref count
      will be 1. Running the annotate branch profiler on a desktop that is
      performing daily tasks (running firefox, evolution, xchat and is also part
      of a distcc farm), it shows that the ref count is not 1 that often.
      
       correct incorrect      %    Function                  File              Line
       ------- ---------      -    --------                  ----              ----
      1035099358 6209599193  85    fget_light              file_table.c         315
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      3bc0ba43
  6. 26 Oct, 2010 1 commit
    • Eric Dumazet's avatar
      fs: allow for more than 2^31 files · 518de9b3
      Eric Dumazet authored
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of atomic_t.
      
      get_max_files() is changed to return an unsigned long.  get_nr_files() is
      changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarDavid Miller <davem@davemloft.net>
      Reviewed-by: default avatarRobin Holt <holt@sgi.com>
      Tested-by: default avatarRobin Holt <holt@sgi.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      518de9b3
  7. 25 Oct, 2010 1 commit
    • Eric Dumazet's avatar
      fs: allow for more than 2^31 files · 7e360c38
      Eric Dumazet authored
      Andrew,
      
      Could you please review this patch, you probably are the right guy to
      take it, because it crosses fs and net trees.
      
      Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
      depend on previous patch (sysctl: fix min/max handling in
      __do_proc_doulongvec_minmax())
      
      Thanks !
      
      [PATCH V4] fs: allow for more than 2^31 files
      
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of
      atomic_t.
      
      get_max_files() is changed to return an unsigned long.
      get_nr_files() is changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarDavid Miller <davem@davemloft.net>
      Reviewed-by: default avatarRobin Holt <holt@sgi.com>
      Tested-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      7e360c38
  8. 18 Aug, 2010 2 commits
    • Nick Piggin's avatar
      fs: scale files_lock · 6416ccb7
      Nick Piggin authored
      fs: scale files_lock
      
      Improve scalability of files_lock by adding per-cpu, per-sb files lists,
      protected with an lglock. The lglock provides fast access to the per-cpu lists
      to add and remove files. It also provides a snapshot of all the per-cpu lists
      (although this is very slow).
      
      One difficulty with this approach is that a file can be removed from the list
      by another CPU. We must track which per-cpu list the file is on with a new
      variale in the file struct (packed into a hole on 64-bit archs). Scalability
      could suffer if files are frequently removed from different cpu's list.
      
      However loads with frequent removal of files imply short interval between
      adding and removing the files, and the scheduler attempts to avoid moving
      processes too far away. Also, even in the case of cross-CPU removal, the
      hardware has much more opportunity to parallelise cacheline transfers with N
      cachelines than with 1.
      
      A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
      degenerates to contending on a single lock, which is no worse than before. When
      more than one CPU are allocating files, even if they are always freed by
      different CPUs, there will be more parallelism than the single-lock case.
      
      Testing results:
      
      On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
      to remove the file, the number of times it is removed by the same CPU that
      added it, and the number of times it is removed by the same node that added it.
      
      Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
      kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
      dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
      
      So a file is removed from the same CPU it was added by over 90% of the time.
      It remains within the same node 95% of the time.
      
      Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
      
                      throughput
      2.6.34-rc2      24.5
      +patch          24.9
      
                      us      sys     idle    IO wait (in %)
      2.6.34-rc2      51.25   28.25   17.25   3.25
      +patch          53.75   18.5    19      8.75
      
      So significantly less CPU time spent in kernel code, higher idle time and
      slightly higher throughput.
      
      Single threaded performance difference was within the noise of microbenchmarks.
      That is not to say penalty does not exist, the code is larger and more memory
      accesses required so it will be slightly slower.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      6416ccb7
    • Nick Piggin's avatar
      fs: cleanup files_lock locking · ee2ffa0d
      Nick Piggin authored
      fs: cleanup files_lock locking
      
      Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
      manipulate the per-sb files list; unexport the files_lock spinlock.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      ee2ffa0d
  9. 12 Aug, 2010 1 commit
  10. 11 Aug, 2010 1 commit
  11. 28 Jul, 2010 1 commit
    • Eric Paris's avatar
      vfs/fsnotify: fsnotify_close can delay the final work in fput · c1e5c954
      Eric Paris authored
      fanotify almost works like so:
      
      user context calls fsnotify_* function with a struct file.
         fsnotify takes a reference on the struct path
      user context goes about it's buissiness
      
      at some later point in time the fsnotify listener gets the struct path
         fanotify listener calls dentry_open() to create a file which userspace can deal with
            listener drops the reference on the struct path
      at some later point the listener calls close() on it's new file
      
      With the switch from struct path to struct file this presents a problem for
      fput() and fsnotify_close().  fsnotify_close() is called when the filp has
      already reached 0 and __fput() wants to do it's cleanup.
      
      The solution presented here is a bit odd.  If an event is created from a
      struct file we take a reference on the file.  We check however if the f_count
      was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
      In __fput() (where we know the f_count hit 0 once) we check if the f_count is
      non-zero and if so we drop that 'extra' ref and return without destroying the
      file.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      c1e5c954
  12. 27 May, 2010 1 commit
    • Al Viro's avatar
      get rid of the magic around f_count in aio · d7065da0
      Al Viro authored
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  Drops a reference to file if it's safe to
      do in atomic (i.e. if that's not the last one), tells if
      it had been able to do that.  aio.c converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      d7065da0
  13. 06 Mar, 2010 1 commit
  14. 07 Feb, 2010 1 commit
  15. 22 Dec, 2009 1 commit
  16. 16 Dec, 2009 6 commits
  17. 24 Oct, 2009 1 commit
  18. 24 Sep, 2009 1 commit
  19. 11 Jun, 2009 2 commits
    • npiggin@suse.de's avatar
      fs: move mark_files_ro into file_table.c · 864d7c4c
      npiggin@suse.de authored
      This function walks the s_files lock, and operates primarily on the
      files in a superblock, so it better belongs here (eg. see also
      fs_may_remount_ro).
      
      [AV: ... and it shouldn't be static after that move]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      864d7c4c
    • npiggin@suse.de's avatar
      fs: introduce mnt_clone_write · 96029c4e
      npiggin@suse.de authored
      This patch speeds up lmbench lat_mmap test by about another 2% after the
      first patch.
      
      Before:
       avg = 462.286
       std = 5.46106
      
      After:
       avg = 453.12
       std = 9.58257
      
      (50 runs of each, stddev gives a reasonable confidence)
      
      It does this by introducing mnt_clone_write, which avoids some heavyweight
      operations of mnt_want_write if called on a vfsmount which we know already
      has a write count; and mnt_want_write_file, which can call mnt_clone_write
      if the file is open for write.
      
      After these two patches, mnt_want_write and mnt_drop_write go from 7% on
      the profile down to 1.3% (including mnt_clone_write).
      
      [AV: mnt_want_write_file() should take file alone and derive mnt from it;
      not only all callers have that form, but that's the only mnt about which
      we know that it's already held for write if file is opened for write]
      
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      96029c4e
  20. 30 Mar, 2009 1 commit
  21. 16 Mar, 2009 1 commit
  22. 05 Feb, 2009 1 commit
  23. 31 Dec, 2008 1 commit
  24. 13 Nov, 2008 3 commits
  25. 01 Nov, 2008 1 commit
    • Al Viro's avatar
      saner FASYNC handling on file close · 233e70f4
      Al Viro authored
      As it is, all instances of ->release() for files that have ->fasync()
      need to remember to evict file from fasync lists; forgetting that
      creates a hole and we actually have a bunch that *does* forget.
      
      So let's keep our lives simple - let __fput() check FASYNC in
      file->f_flags and call ->fasync() there if it's been set.  And lose that
      crap in ->release() instances - leaving it there is still valid, but we
      don't have to bother anymore.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      233e70f4
  26. 21 Oct, 2008 1 commit
  27. 26 Jul, 2008 1 commit
  28. 01 May, 2008 1 commit
  29. 18 Apr, 2008 2 commits