1. 17 Oct, 2006 1 commit
    • David M. Grimes's avatar
      [PATCH] knfsd: add nfs-export support to tmpfs · 91828a40
      David M. Grimes authored
      We need to encode a decode the 'file' part of a handle.  We simply use the
      inode number and generation number to construct the filehandle.
      The generation number is the time when the file was created.  As inode numbers
      cycle through the full 32 bits before being reused, there is no real chance of
      the same inum being allocated to different files in the same second so this is
      suitably unique.  Using time-of-day rather than e.g.  jiffies makes it less
      likely that the same filehandle can be created after a reboot.
      In order to be able to decode a filehandle we need to be able to lookup by
      inum, which means that the inode needs to be added to the inode hash table
      (tmpfs doesn't currently hash inodes as there is never a need to lookup by
      inum).  To avoid overhead when not exporting, we only hash an inode when it is
      first exported.  This requires a lock to ensure it isn't hashed twice.
      This code is separate from the patch posted in June06 from Atal Shargorodsky
      which provided the same functionality, but does borrow slightly from it.
      Locking comment: Most filesystems that hash their inodes do so at the point
      where the 'struct inode' is initialised, and that has suitable locking
      (I_NEW).  Here in shmem, we are hashing the inode later, the first time we
      need an NFS file handle for it.  We no longer have I_NEW to ensure only one
      thread tries to add it to the hash table.
      Cc: Atal Shargorodsky <atal@codefidence.com>
      Cc: Gilad Ben-Yossef <gilad@codefidence.com>
      Signed-off-by: default avatarDavid M. Grimes <dgrimes@navisite.com>
      Signed-off-by: default avatarNeil Brown <neilb@suse.de>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  2. 01 Oct, 2006 2 commits
  3. 29 Sep, 2006 1 commit
  4. 27 Sep, 2006 2 commits
  5. 26 Sep, 2006 1 commit
  6. 30 Jun, 2006 2 commits
  7. 28 Jun, 2006 1 commit
  8. 26 Jun, 2006 2 commits
  9. 23 Jun, 2006 3 commits
    • Christoph Lameter's avatar
      [PATCH] migration: remove unnecessary PageSwapCache checks · 3c5a87f4
      Christoph Lameter authored
      Remove two unnecessary PageSwapCache checks.  The page refcount is raised
      and therefore page migration cannot occur in both functions.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • David Howells's avatar
      [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry · 726c3342
      David Howells authored
      Give the statfs superblock operation a dentry pointer rather than a superblock
      This complements the get_sb() patch.  That reduced the significance of
      sb->s_root, allowing NFS to place a fake root there.  However, NFS does
      require a dentry to use as a target for the statfs operation.  This permits
      the root in the vfsmount to be used instead.
      linux/mount.h has been added where necessary to make allyesconfig build
      Interest has also been expressed for use with the FUSE and XFS filesystems.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • David Howells's avatar
      [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      David Howells authored
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
      The patch also makes the following changes:
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
       (*) If one of the convenience function is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees which results in
           dentries being left unculled.
           However, with the way NFS superblock sharing are currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
           [*] Anonymous until discovered from another tree.
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  10. 12 Jun, 2006 2 commits
  11. 09 Jun, 2006 1 commit
  12. 22 Apr, 2006 1 commit
    • Lee Schermerhorn's avatar
      [PATCH] add migratepage address space op to shmem · 304dbdb7
      Lee Schermerhorn authored
      Basic problem: pages of a shared memory segment can only be migrated once.
      In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a
      migratepage address space op.  Therefore, migrate_pages() falls back to
      default processing.  In this path, it will try to pageout() dirty pages.
      Once a shared memory page has been migrated it becomes dirty, so
      migrate_pages() will try to page it out.  However, because the page count
      is 3 [cache + current + pte], pageout() will return PAGE_KEEP because
      is_page_cache_freeable() returns false.  This will abort all subsequent
      This patch adds a migratepage address space op to shared memory segments to
      avoid taking the default path.  We use the "migrate_page()" function
      because it knows how to migrate dirty pages.  This allows shared memory
      segment pages to migrate, subject to other conditions such as # pte's
      referencing the page [page_mapcount(page)], when requested.
      I think this is safe.  If we're migrating a shared memory page, then we
      found the page via a page table, so it must be in memory.
      Can be verified with memtoy and the shmem-mbind-test script, both
      available at:  http://free.linux.hp.com/~lts/Tools/
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  13. 22 Mar, 2006 2 commits
  14. 21 Feb, 2006 1 commit
    • Hugh Dickins's avatar
      [PATCH] tmpfs: fix mount mpol nodelist parsing · b00dc3ad
      Hugh Dickins authored
      I've been dissatisfied with the mpol_nodelist mount option which was
      added to tmpfs earlier in -rc.  Replace it by mpol=policy:nodelist.
      And it was broken: a nodelist is a comma-separated list of numbers and
      ranges; the mount options are a comma-separated list of token=values.
      Whoops, blindly strsep'ing on commas doesn't work so well: since we've
      no numeric tokens, and unlikely to add them, use that to distinguish.
      Move the mpol= parsing to shmem_parse_mpol under CONFIG_NUMA, reject
      all its options as invalid if not NUMA.  /proc shows MPOL_PREFERRED
      as "prefer", so use that name for the policy instead of "preferred".
      Enforce that mpol=default has no nodelist; that mpol=prefer has one
      node only; that mpol=bind has a nodelist; but let mpol=interleave use
      node_online_map if no nodelist given.  Describe this in tmpfs.txt.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRobin Holt <holt@sgi.com>
      Acked-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  15. 01 Feb, 2006 1 commit
  16. 14 Jan, 2006 1 commit
    • Robin Holt's avatar
      [PATCH] Add tmpfs options for memory placement policies · 7339ff83
      Robin Holt authored
      Anything that writes into a tmpfs filesystem is liable to disproportionately
      decrease the available memory on a particular node.  Since there's no telling
      what sort of application (e.g.  dd/cp/cat) might be dropping large files
      there, this lets the admin choose the appropriate default behavior for their
      site's situation.
      Introduce a tmpfs mount option which allows specifying a memory policy and
      a second option to specify the nodelist for that policy.  With the default
      policy, tmpfs will behave as it does today.  This patch adds support for
      preferred, bind, and interleave policies.
      The default policy will cause pages to be added to tmpfs files on the node
      which is doing the writing.  Some jobs expect a single process to create
      and manage the tmpfs files.  This results in a node which has a
      significantly reduced number of free pages.
      With this patch, the administrator can specify the policy and nodes for
      that policy where they would prefer allocations.
      This patch was originally written by Brent Casavant and Hugh Dickins.  I
      added support for the bind and preferred policies and the mpol_nodelist
      mount option.
      Signed-off-by: default avatarBrent Casavant <bcasavan@sgi.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  17. 09 Jan, 2006 1 commit
  18. 06 Jan, 2006 2 commits
    • David Howells's avatar
      [PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU · b0e15190
      David Howells authored
      The attached patch makes the SYSV IPC shared memory facilities use the new
      ramfs facilities on a no-MMU kernel.
      The following changes are made:
       (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to
           allow the IPC SHM facilities to commune with the tiny-shmem and shmem
       (2) ramfs files now need resizing using do_truncate() rather than by modifying
           the inode size directly (see shmem_file_setup()). This causes ramfs to
           attempt to bind a block of pages of sufficient size to the inode.
       (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Badari Pulavarty's avatar
      [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store · f6b3ec23
      Badari Pulavarty authored
      Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
      given range of pages & its associated backing store.  Current
      implementation supports only shmfs/tmpfs and other filesystems return
      "Some app allocates large tmpfs files, then when some task quits and some
      client disconnect, some memory can be released.  However the only way to
      release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli
      Databases want to use this feature to drop a section of their bufferpool
      (shared memory segments) - without writing back to disk/swap space.
      This feature is also useful for supporting hot-plug memory on UML.
      Concerns raised by Andrew Morton:
      - "We have no plan for holepunching!  If we _do_ have such a plan (or
        might in the future) then what would the API look like?  I think
        sys_holepunch(fd, start, len), so we should start out with that."
      - Using madvise is very weird, because people will ask "why do I need to
        mmap my file before I can stick a hole in it?"
      - None of the other madvise operations call into the filesystem in this
        manner.  A broad question is: is this capability an MM operation or a
        filesytem operation?  truncate, for example, is a filesystem operation
        which sometimes has MM side-effects.  madvise is an mm operation and with
        this patch, it gains FS side-effects, only they're really, really
        significant ones."
      - Andrea suggested the fs operation too but then it's more efficient to
        have it as a mm operation with fs side effects, because they don't
        immediatly know fd and physical offset of the range.  It's possible to
        fixup in userland and to use the fs operation but it's more expensive,
        the vmas are already in the kernel and we can use them.
      Short term plan &  Future Direction:
      - We seem to need this interface only for shmfs/tmpfs files in the short
        term.  We have to add hooks into the filesystem for correctness and
        completeness.  This is what this patch does.
      - In the future, plan is to support both fs and mmap apis also.  This
        also involves (other) filesystem specific functions to be implemented.
      - Current patch doesn't support VM_NONLINEAR - which can be addressed in
        the future.
      Signed-off-by: default avatarBadari Pulavarty <pbadari@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  19. 03 Jan, 2006 1 commit
    • Zach Brown's avatar
      [PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE · 994fc28c
      Zach Brown authored
      readpage(), prepare_write(), and commit_write() callers are updated to
      understand the special return code AOP_TRUNCATED_PAGE in the style of
      writepage() and WRITEPAGE_ACTIVATE.  AOP_TRUNCATED_PAGE tells the caller that
      the callee has unlocked the page and that the operation should be tried again
      with a new page.  OCFS2 uses this to detect and work around a lock inversion in
      its aop methods.  There should be no change in behaviour for methods that don't
      return AOP_TRUNCATED_PAGE.
      WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
      made enums so that kerneldoc can be used to document their semantics.
      Signed-off-by: default avatarZach Brown <zach.brown@oracle.com>
  20. 29 Oct, 2005 3 commits
    • Hugh Dickins's avatar
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins authored
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Nick Piggin's avatar
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin authored
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      PageReserved special casing is removed from get_page and put_page.
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and is be
      reintroduced in future if required (Hugh has a proof of concept).
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Refcount bug fix for filemap_xip.c
      Signed-off-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Hugh Dickins's avatar
      [PATCH] mm: page fault handlers tidyup · 65500d23
      Hugh Dickins authored
      Impose a little more consistency on the page fault handlers do_wp_page,
      do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
      arguments in the same order, called the same names?
      break_cow is all very well, but what it did was inlined elsewhere: easier to
      compare if it's brought back into do_wp_page.
      do_file_page's fallback to do_no_page dates from a time when we were testing
      pte_file by using it wherever possible: currently it's peculiar to nonlinear
      vmas, so just check that.  BUG_ON if not?  Better not, it's probably page
      table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
      use that for do_wp_page's invalid pfn too.
      Hah!  Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
      restored (and say "pud" not "pmd" in its pud_ERROR).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  21. 28 Oct, 2005 1 commit
  22. 08 Oct, 2005 1 commit
  23. 09 Sep, 2005 2 commits
  24. 07 Sep, 2005 1 commit
  25. 05 Sep, 2005 2 commits
    • Stephen Smalley's avatar
      [PATCH] Generic VFS fallback for security xattrs · f549d6c1
      Stephen Smalley authored
      This patch modifies the VFS setxattr, getxattr, and listxattr code to fall
      back to the security module for security xattrs if the filesystem does not
      support xattrs natively.  This allows security modules to export the incore
      inode security label information to userspace even if the filesystem does
      not provide xattr storage, and eliminates the need to individually patch
      various pseudo filesystem types to provide such access.  The patch removes
      the existing xattr code from devpts and tmpfs as it is then no longer
      The patch restructures the code flow slightly to reduce duplication between
      the normal path and the fallback path, but this should only have one
      user-visible side effect - a program may get -EACCES rather than
      -EOPNOTSUPP if policy denied access but the filesystem didn't support the
      operation anyway.  Note that the post_setxattr hook call is not needed in
      the fallback case, as the inode_setsecurity hook call handles the incore
      inode security state update directly.  In contrast, we do call fsnotify in
      both cases.
      Signed-off-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarJames Morris <jmorris@namei.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Paolo 'Blaisorblade' Giarrusso's avatar
      [PATCH] shmem_populate: avoid an useless check, and some comments · d44ed4f8
      Paolo 'Blaisorblade' Giarrusso authored
      Either shmem_getpage returns a failure, or it found a page, or it was told
      it couldn't do any I/O.  So it's useless to check nonblock in the else
      branch.  We could add a BUG() there but I preferred to comment the
      offending function.
      This was taken out from one Ingo Molnar's old patch I'm resurrecting.
      Signed-off-by: default avatarPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  26. 19 Aug, 2005 1 commit
    • Linus Torvalds's avatar
      Fix nasty ncpfs symlink handling bug. · cc314eef
      Linus Torvalds authored
      This bug could cause oopses and page state corruption, because ncpfs
      used the generic page-cache symlink handlign functions.  But those
      functions only work if the page cache is guaranteed to be "stable", ie a
      page that was installed when the symlink walk was started has to still
      be installed in the page cache at the end of the walk.
      We could have fixed ncpfs to not use the generic helper routines, but it
      is in many ways much cleaner to instead improve on the symlink walking
      helper routines so that they don't require that absolute stability.
      We do this by allowing "follow_link()" to return a error-pointer as a
      cookie, which is fed back to the cleanup "put_link()" routine.  This
      also simplifies NFS symlink handling.
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  27. 21 Jun, 2005 1 commit
    • Hugh Dickins's avatar
      [PATCH] shmem: restore superblock info · 0edd73b3
      Hugh Dickins authored
      To improve shmem scalability, we allowed tmpfs instances which don't need
      their blocks or inodes limited not to count them, and not to allocate any
      sbinfo.  Which was okay when the only use for the sbinfo was accounting
      blocks and inodes; but since then a couple of unrelated projects extending
      tmpfs want to store other data in the sbinfo.  Whether either extension
      reaches mainline is beside the point: I'm guilty of a bad design decision,
      and should restore sbinfo to make any such future extensions easier.
      So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
      now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
      inodes.  Brent Casavant verified (many months ago) that this does not
      perceptibly impact the scalability (since the unlimited sbinfo cacheline is
      repeatedly accessed but only once dirtied).
      And merge shmem_set_size into its sole caller shmem_remount_fs.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>