1. 30 Oct, 2005 7 commits
    • Matt Mackall's avatar
      [PATCH] Error checks omitted in init_tmpfs() in mm/tiny-shmem.c · 5d57bd39
      Matt Mackall authored
      
      
      From: Hareesh Nagarajan <hnagar2@gmail.com>
      Signed-off-by: default avatarHareesh Nagarajan <hnagar2@gmail.com>
      Acked-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5d57bd39
    • Paul E. McKenney's avatar
      [PATCH] RCU torture-testing kernel module · a241ec65
      Paul E. McKenney authored
      This patch is a rewrite of the one submitted on October 1st, using modules
      (http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2
      
      ).
      
      This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an
      intense torture test of the RCU infratructure.  This is needed due to the
      continued changes to the RCU infrastructure to accommodate dynamic ticks,
      CPU hotplug, realtime, and so on.  Most of the code is in a separate file
      that is compiled only if the CONFIG variable is set.  Documentation on how
      to run the test and interpret the output is also included.
      
      This code has been tested on i386 and ppc64, and an earlier version of the
      code has received extensive testing on a number of architectures as part of
      the PREEMPT_RT patchset.
      Signed-off-by: default avatar"Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a241ec65
    • Tejun Heo's avatar
      [PATCH] vm: remove redundant assignment from __pagevec_release_nonlru() · c7e9dd4d
      Tejun Heo authored
      
      
      This patch removes redundant assignment from __pagevec_release_nonlru().
      pages_to_free.cold is set to pvec->cold by pagevec_init() call right above
      the assignment.
      Signed-off-by: default avatarTejun Heo <htejun@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c7e9dd4d
    • Tejun Heo's avatar
      [PATCH] fs: error case fix in __generic_file_aio_read · 39e88ca2
      Tejun Heo authored
      
      
      When __generic_file_aio_read() hits an error during reading, it reports the
      error iff nothing has successfully been read yet.  This is condition - when
      an error occurs, if nothing has been read/written, report the error code;
      otherwise, report the amount of bytes successfully transferred upto that
      point.
      
      This corner case can be exposed by performing readv(2) with the following
      iov.
      
       iov[0] = len0 @ ptr0
       iov[1] = len1 @ NULL (or any other invalid pointer)
       iov[2] = len2 @ ptr2
      
      When file size is enough, performing above readv(2) results in
      
       len0 bytes from file_pos @ ptr0
       len2 bytes from file_pos + len0 @ ptr2
      
      And the return value is len0 + len2.  Test program is attached to this
      mail.
      
      This patch makes __generic_file_aio_read()'s error handling identical to
      other functions.
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/uio.h>
      #include <errno.h>
      #include <string.h>
      
      int main(int argc, char **argv)
      {
      	const char *path;
      	struct stat stbuf;
      	size_t len0, len1;
      	void *buf0, *buf1;
      	struct iovec iov[3];
      	int fd, i;
      	ssize_t ret;
      
      	if (argc < 2) {
      		fprintf(stderr, "Usage: testreadv path (better be a "
      			"small text file)\n");
      		return 1;
      	}
      	path = argv[1];
      
      	if (stat(path, &stbuf) < 0) {
      		perror("stat");
      		return 1;
      	}
      
      	len0 = stbuf.st_size / 2;
      	len1 = stbuf.st_size - len0;
      
      	if (!len0 || !len1) {
      		fprintf(stderr, "Dude, file is too small\n");
      		return 1;
      	}
      
      	if ((fd = open(path, O_RDONLY)) < 0) {
      		perror("open");
      		return 1;
      	}
      
      	if (!(buf0 = malloc(len0)) || !(buf1 = malloc(len1))) {
      		perror("malloc");
      		return 1;
      	}
      
      	memset(buf0, 0, len0);
      	memset(buf1, 0, len1);
      
      	iov[0].iov_base = buf0;
      	iov[0].iov_len = len0;
      	iov[1].iov_base = NULL;
      	iov[1].iov_len = len1;
      	iov[2].iov_base = buf1;
      	iov[2].iov_len = len1;
      
      	printf("vector ");
      	for (i = 0; i < 3; i++)
      		printf("%p:%zu ", iov[i].iov_base, iov[i].iov_len);
      	printf("\n");
      
      	ret = readv(fd, iov, 3);
      	if (ret < 0)
      		perror("readv");
      
      	printf("readv returned %zd\nbuf0 = [%s]\nbuf1 = [%s]\n",
      	       ret, (char *)buf0, (char *)buf1);
      
      	return 0;
      }
      Signed-off-by: default avatarTejun Heo <htejun@gmail.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      39e88ca2
    • Paul Jackson's avatar
      [PATCH] cpusets: automatic numa mempolicy rebinding · 68860ec1
      Paul Jackson authored
      
      
      This patch automatically updates a tasks NUMA mempolicy when its cpuset
      memory placement changes.  It does so within the context of the task,
      without any need to support low level external mempolicy manipulation.
      
      If a system is not using cpusets, or if running on a system with just the
      root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
      task is moved between cpusets, or a cpusets memory placement is changed
      does the following apply.  Otherwise, the main routine below,
      rebind_policy() is not even called.
      
      When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
      essential role of cpusets is to place jobs (several related tasks) on a set
      of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
      manage a jobs processor placement within its allowed cpuset, and the
      essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
      memory placement within its allowed cpuset.
      
      However, CPU affinity and NUMA memory placement are managed within the
      kernel using absolute system wide numbering, not cpuset relative numbering.
      
      This is ok until a job is migrated to a different cpuset, or what's the
      same, a jobs cpuset is moved to different CPUs and Memory Nodes.
      
      Then the CPU affinity and NUMA memory placement of the tasks in the job
      need to be updated, to preserve their cpuset-relative position.  This can
      be done for CPU affinity using sched_setaffinity() from user code, as one
      task can modify anothers CPU affinity.  This cannot be done from an
      external task for NUMA memory placement, as that can only be modified in
      the context of the task using it.
      
      However, it easy enough to remap a tasks NUMA mempolicy automatically when
      a task is migrated, using the existing cpuset mechanism to trigger a
      refresh of a tasks memory placement after its cpuset has changed.  All that
      is needed is the old and new nodemask, and notice to the task that it needs
      to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
      tasks cpuset has the new mask, and the existing
      cpuset_update_current_mems_allowed() mechanism provides the notice.  The
      bitmap/cpumask/nodemask remap operators provide the cpuset relative
      calculations.
      
      This patch leaves open a couple of issues:
      
       1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
      
          These mempolicies may reference nodes outside of those allowed to
          the current task by its cpuset.  Tasks are migrated as part of jobs,
          which reside on what might be several cpusets in a subtree.  When such
          a job is migrated, all NUMA memory policy references to nodes within
          that cpuset subtree should be translated, and references to any nodes
          outside that subtree should be left untouched.  A future patch will
          provide the cpuset mechanism needed to mark such subtrees.  With that
          patch, we will be able to correctly migrate these other memory policies
          across a job migration.
      
       2) Updating cpuset, affinity and memory policies in user space:
      
          This is harder.  Any placement state stored in user space using
          system-wide numbering will be invalidated across a migration.  More
          work will be required to provide user code with a migration-safe means
          to manage its cpuset relative placement, while preserving the current
          API's that pass system wide numbers, not cpuset relative numbers across
          the kernel-user boundary.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      68860ec1
    • Paul Jackson's avatar
      [PATCH] cpusets: confine pdflush to its cpuset · 28a42b9e
      Paul Jackson authored
      
      
      This patch keeps pdflush daemons on the same cpuset as their parent, the
      kthread daemon.
      
      Some large NUMA configurations put as much as they can of kernel threads
      and other classic Unix load in what's called a bootcpuset, keeping the rest
      of the system free for dedicated jobs.
      
      This effort is thwarted by pdflush, which dynamically destroys and
      recreates pdflush daemons depending on load.
      
      It's easy enough to force the originally created pdflush deamons into the
      bootcpuset, at system boottime.  But the pdflush threads created later were
      allowed to run freely across the system, due to the necessary line in their
      startup kthread():
      
              set_cpus_allowed(current, CPU_MASK_ALL);
      
      By simply coding pdflush to start its threads with the cpus_allowed
      restrictions of its cpuset (inherited from kthread, its parent) we can
      ensure that dynamically created pdflush threads are also kept in the
      bootcpuset.
      
      On systems w/o cpusets, or w/o a bootcpuset implementation, the following
      will have no affect, leaving pdflush to run on any CPU, as before.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      28a42b9e
    • Jan Kara's avatar
      [PATCH] ext3: Fix unmapped buffers in transaction's lists · aaa4059b
      Jan Kara authored
      
      
      Fix the problem (BUG 4964) with unmapped buffers in transaction's
      t_sync_data list.  The problem is we need to call filesystem's own
      invalidatepage() from block_write_full_page().
      
      block_write_full_page() must call filesystem's invalidatepage().  Otherwise
      following nasty race can happen:
      
         proc 1                                        proc 2
         ------                                        ------
      - write some new data to 'offset'
        => bh gets to the transactions data list
                                                    - starts truncate
                                                      => i_size set to new size
      - mpage_writepages()
        - ext3_ordered_writepage() to 'offset'
          - block_write_full_page()
            - page->index > end_index+1
              - block_invalidatepage()
                - discard_buffer()
                  - clear_buffer_mapped()
      
      - commit triggers and finds unmapped buffer - BOOM!
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      aaa4059b
  2. 29 Oct, 2005 33 commits
    • Nikita Danilov's avatar
      [PATCH] mm/filemap.c:filemap_populate(): move export. · b1459461
      Nikita Danilov authored
      
      
      move EXPORT_SYMBOL(filemap_populate) to the proper place: just after
      function itself: it's easy to miss that function is exported otherwise.
      Signed-off-by: default avatarNikita Danilov <nikita@clusterfs.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b1459461
    • John Hawkes's avatar
      [PATCH] mm: wider use of for_each_*cpu() · 2f96996d
      John Hawkes authored
      
      
      In 'mm' change the explicit use of a for-loop using NR_CPUS into the
      general for_each_cpu() constructs.  This widens the scope of potential
      future optimizations of the general constructs, as well as takes advantage
      of the existing optimizations of first_cpu() and next_cpu(), which is
      advantageous when the true CPU count is much smaller than NR_CPUS.
      Signed-off-by: default avatarJohn Hawkes <hawkes@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2f96996d
    • Christoph Lameter's avatar
      [PATCH] Remove policy contextualization from mbind · 5fcbb230
      Christoph Lameter authored
      
      
      Policy contextualization is only useful for task based policies and not for
      vma based policies.  It may be useful to define allowed nodes that are not
      accessible from this thread because other threads may have access to these
      nodes.  Without this patch strange memory policy situations may cause an
      application to fail with out of memory.
      
      Example:
      
      Let's say we have two threads A and B that share the same address space and
      a huge array computational array X.
      
      Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is
      restricted by its cpuset to nodes 2 and 3.
      
      Thread A now wants to restrict allocations to the first node and thus
      applies a BIND policy on X to node 0 and 2.  The cpuset limits this to node
      0.  Thus pages for X must be allocated on node 0 now.
      
      Thread B now touches a page that has never been used in X and faults in a
      page.  According to the BIND policy of the vma for X the page must be
      allocated on page 0.  However, the cpuset of B does not allow allocation on
      0 and 1.  Now the application fails in alloc_pages with out of memory.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5fcbb230
    • Christoph Lameter's avatar
      [PATCH] Implement sys_* do_* layering in the memory policy layer. · 8bccd85f
      Christoph Lameter authored
      
      
      - Do a separation between do_xxx and sys_xxx functions. sys_xxx functions
        take variable sized bitmaps from user space as arguments. do_xxx functions
        take fixed sized nodemask_t as arguments and may be used from inside the
        kernel. Doing so simplifies the initialization code. There is no
        fs = kernel_ds assumption anymore.
      
      - Split up get_nodes into get_nodes (which gets the node list) and
        contextualize_policy which restricts the nodes to those accessible
        to the task and updates cpusets.
      
      - Add comments explaining limitations of bind policy
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8bccd85f
    • Dave Hansen's avatar
      [PATCH] memory hotplug: call setup_per_zone_pages_min after hotplug · 61b13993
      Dave Hansen authored
      
      
      From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
      > I found the tests does not work well with Dave's patchset.
      > I've found the followings:
      >
      > 	- setup_per_zone_pages_min() calls should be added in
      > 	   capture_page_range() and online_pages()
      > 	- lru_add_drain() should be called before try_to_migrate_pages()
      
      The following patch deals with the first item.
      Signed-off-by: default avatarIWAMOTO Toshihiro <iwamoto@valinux.co.jp>
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      61b13993
    • Dave Hansen's avatar
      [PATCH] memory hotplug: move section_mem_map alloc to sparse.c · 0b0acbec
      Dave Hansen authored
      
      
      This basically keeps up from having to extern __kmalloc_section_memmap().
      
      The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that
      header gets hard to work with, because it needs some arch-specific macros.
      Just stick it in here for now, instead of creating another header.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarLion Vollnhals <webmaster@schiggl.de>
      Signed-off-by: default avatarJiri Slaby <xslaby@fi.muni.cz>
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0b0acbec
    • Dave Hansen's avatar
      [PATCH] memory hotplug: sysfs and add/remove functions · 3947be19
      Dave Hansen authored
      
      
      This adds generic memory add/remove and supporting functions for memory
      hotplug into a new file as well as a memory hotplug kernel config option.
      
      Individual architecture patches will follow.
      
      For now, disable memory hotplug when swsusp is enabled.  There's a lot of
      churn there right now.  We'll fix it up properly once it calms down.
      Signed-off-by: default avatarMatt Tolentino <matthew.e.tolentino@intel.com>
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3947be19
    • Dave Hansen's avatar
      [PATCH] memory hotplug locking: zone span seqlock · bdc8cb98
      Dave Hansen authored
      
      
      See the "fixup bad_range()" patch for more information, but this actually
      creates a the lock to protect things making assumptions about a zone's size
      staying constant at runtime.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      bdc8cb98
    • Dave Hansen's avatar
      [PATCH] memory hotplug locking: node_size_lock · 208d54e5
      Dave Hansen authored
      
      
      pgdat->node_size_lock is basically only neeeded in one place in the normal
      code: show_mem(), which is the arch-specific sysrq-m printing function.
      
      Strictly speaking, the architectures not doing memory hotplug do no need this
      locking in show_mem().  However, they are all included for completeness.  This
      should also make any future consolidation of all of the implementations a
      little more straightforward.
      
      This lock is also held in the sparsemem code during a memory removal, as
      sections are invalidated.  This is the place there pfn_valid() is made false
      for a memory area that's being removed.  The lock is only required when doing
      pfn_valid() operations on memory which the user does not already have a
      reference on the page, such as in show_mem().
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      208d54e5
    • Dave Hansen's avatar
      [PATCH] memory hotplug prep: fixup bad_range() · c6a57e19
      Dave Hansen authored
      
      
      When doing memory hotplug operations, the size of existing zones can obviously
      change.  This means that zone->zone_{start_pfn,spanned_pages} can change.
      
      There are currently no locks that protect these structure members.  However,
      they are rarely accessed at runtime.  Outside of swsusp, the only place that I
      can find is bad_range().
      
      So, split bad_range() up into two pieces: one that needs to be locked and
      anther that doesn't.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c6a57e19
    • Dave Hansen's avatar
      [PATCH] memory hotplug prep: __section_nr helper · 4ca644d9
      Dave Hansen authored
      
      
      A little helper that we use in the hotplug code.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4ca644d9
    • Dave Hansen's avatar
      [PATCH] memory hotplug prep: break out zone initialization · ed8ece2e
      Dave Hansen authored
      
      
      If a zone is empty at boot-time and then hot-added to later, it needs to run
      the same init code that would have been run on it at boot.
      
      This patch breaks out zone table and per-cpu-pages functions for use by the
      hotplug code.  You can almost see all of the free_area_init_core() function on
      one page now.  :)
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ed8ece2e
    • Andrea Arcangeli's avatar
      [PATCH] .text page fault SMP scalability optimization · 1a44e149
      Andrea Arcangeli authored
      
      
      We had a problem on ppc64 where with more than 4 threads a large system
      wouldn't scale well while faulting in the .text (most of the time was spent
      in the kernel despite it was an userland compute intensive app).  The
      reason is the useless overwrite of the same pte from all cpu.
      
      I fixed it this way (verified on an older kernel but the forward port is
      almost identical).  This will benefit all archs not just ppc64.
      Signed-off-by: default avatarAndrea Arcangeli <andrea@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      1a44e149
    • Adam Litke's avatar
      [PATCH] hugetlb: demand fault handler · 4c887265
      Adam Litke authored
      
      
      Below is a patch to implement demand faulting for huge pages.  The main
      motivation for changing from prefaulting to demand faulting is so that huge
      page memory areas can be allocated according to NUMA policy.
      
      Thanks to consolidated hugetlb code, switching the behavior requires changing
      only one fault handler.  The bulk of the patch just moves the logic from
      hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4c887265
    • Hugh Dickins's avatar
      [PATCH] mm: update comments to pte lock · b8072f09
      Hugh Dickins authored
      
      
      Updated several references to page_table_lock in common code comments.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b8072f09
    • Hugh Dickins's avatar
      [PATCH] mm: fix rss and mmlist locking · f412ac08
      Hugh Dickins authored
      
      
      A couple of oddities were guarded by page_table_lock, no longer properly
      guarded when that is split.
      
      The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
      atomic64_t if the architecture supports it, in such a case.  Definitions by
      courtesy of Christoph Lameter: who spent considerable effort on more scalable
      ways of counting, but found insufficient benefit in practice.
      
      And adding an mm with swap to the mmlist for swapoff: the list is well-
      guarded by its own lock, but the list_empty check now has to be repeated
      inside it.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f412ac08
    • Hugh Dickins's avatar
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins authored
      
      
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4c21e2f2
    • Hugh Dickins's avatar
      [PATCH] mm: follow_page with inner ptlock · deceb6cd
      Hugh Dickins authored
      
      
      Final step in pushing down common core's page_table_lock.  follow_page no
      longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
      and so no page_table_lock is taken in get_user_pages itself.
      
      But get_user_pages (and get_futex_key) do then need follow_page to pin the
      page for them: take Daniel's suggestion of bitflags to follow_page.
      
      Need one for WRITE, another for TOUCH (it was the accessed flag before:
      vanished along with check_user_page_readable, but surely get_numa_maps is
      wrong to mark every page it finds as accessed), another for GET.
      
      And another, ANON to dispose of untouched_anonymous_page: it seems silly for
      that to descend a second time, let follow_page observe if there was no page
      table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
      make_pages_present ought to make readonly anonymous present.
      
      Give get_numa_maps a cond_resched while we're there.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      deceb6cd
    • Hugh Dickins's avatar
      [PATCH] mm: kill check_user_page_readable · c34d1b4d
      Hugh Dickins authored
      
      
      check_user_page_readable is a problematic variant of follow_page.  It's used
      only by oprofile's i386 and arm backtrace code, at interrupt time, to
      establish whether a userspace stackframe is currently readable.
      
      This is problematic, because we want to push the page_table_lock down inside
      follow_page, and later split it; whereas oprofile is doing a spin_trylock on
      it (in the i386 case, forgotten in the arm case), and needs that to pin
      perhaps two pages spanned by the stackframe (which might be covered by
      different locks when we split).
      
      I think oprofile is going about this in the wrong way: it doesn't need to know
      the area is readable (neither i386 nor arm uses read protection of user
      pages), it doesn't need to pin the memory, it should simply
      __copy_from_user_inatomic, and see if that succeeds or not.  Sorry, but I've
      not got around to devising the sparse __user annotations for this.
      
      Then we can eliminate check_user_page_readable, and return to a single
      follow_page without the __follow_page variants.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c34d1b4d
    • Hugh Dickins's avatar
      [PATCH] mm: rmap with inner ptlock · c0718806
      Hugh Dickins authored
      
      
      rmap's page_check_address descend without page_table_lock.  First just
      pte_offset_map in case there's no pte present worth locking for, then take
      page_table_lock for the full check, and pass ptl back to caller in the same
      style as pte_offset_map_lock.  __xip_unmap, page_referenced_one and
      try_to_unmap_one use pte_unmap_unlock.  try_to_unmap_cluster also.
      
      page_check_address reformatted to avoid progressive indentation.  No use is
      made of its one error code, return NULL when it fails.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c0718806
    • Hugh Dickins's avatar
      [PATCH] mm: xip_unmap ZERO_PAGE fix · 67b02f11
      Hugh Dickins authored
      
      
      Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends on
      address, so __xip_unmap is wrong to initialize page with that before address
      is initialized; and in fact must re-evaluate it each iteration.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      67b02f11
    • Hugh Dickins's avatar
      [PATCH] mm: unmap_vmas with inner ptlock · 508034a3
      Hugh Dickins authored
      
      
      Remove the page_table_lock from around the calls to unmap_vmas, and replace
      the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
      now safe to descend without page_table_lock.
      
      Don't attempt fancy locking for hugepages, just take page_table_lock in
      unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
      zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
      does unmap_vmas have much use for its mm arg now.
      
      The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
      page_table_lock: if they're implemented at all, they typically come down to
      flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
      (which we already audited for the mprotect case).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      508034a3
    • Hugh Dickins's avatar
      [PATCH] mm: unlink vma before pagetables · 8f4f8c16
      Hugh Dickins authored
      
      
      In most places the descent from pgd to pud to pmd to pte holds mmap_sem
      (exclusively or not), which ensures that free_pgtables cannot be freeing page
      tables from any level at the same time.  But truncation and reverse mapping
      descend without mmap_sem.
      
      No problem: just make sure that a vma is unlinked from its prio_tree (or
      nonlinear list) and from its anon_vma list, after zapping the vma, but before
      freeing its page tables.  Then neither vmtruncate nor rmap can reach that vma
      whose page tables are now volatile (nor do they need to reach it, since all
      its page entries have been zapped by this stage).
      
      The i_mmap_lock and anon_vma->lock already serialize this correctly; but the
      locking hierarchy is such that we cannot take them while holding
      page_table_lock.  Well, we're trying to push that down anyway.  So in this
      patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the
      same time as moving page_table_lock around calls to unmap_vmas.
      
      tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but
      we made them preempt_disable and preempt_enable earlier; and a long source
      audit of all the architectures has shown no problem with removing
      page_table_lock from them.  free_pgtables doesn't need page_table_lock for
      itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by
      page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented
      with exclusive mmap_sem, or mm_users 0.  update_hiwater_rss and
      vm_unacct_memory don't need page_table_lock either.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8f4f8c16
    • Hugh Dickins's avatar
      [PATCH] mm: pte_offset_map_lock loops · 705e87c0
      Hugh Dickins authored
      
      
      Convert those common loops using page_table_lock on the outside and
      pte_offset_map within to use just pte_offset_map_lock within instead.
      
      These all hold mmap_sem (some exclusively, some not), so at no level can a
      page table be whipped away from beneath them.  But whereas pte_alloc loops
      tested with the "atomic" pmd_present, these loops are testing with pmd_none,
      which on i386 PAE tests both lower and upper halves.
      
      That's now unsafe, so add a cast into pmd_none to test only the vital lower
      half: we lose a little sensitivity to a corrupt middle directory, but not
      enough to worry about.  It appears that i386 and UML were the only
      architectures vulnerable in this way, and pgd and pud no problem.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      705e87c0
    • Hugh Dickins's avatar
      [PATCH] mm: page fault handler locking · 8f4e2101
      Hugh Dickins authored
      
      
      On the page fault path, the patch before last pushed acquiring the
      page_table_lock down to the head of handle_pte_fault (though it's also taken
      and dropped earlier when a new page table has to be allocated).
      
      Now delete that line, read "entry = *pte" without it, and go off to this or
      that page fault handler on the basis of this unlocked peek.  Usually the
      handler can proceed without the lock, relying on the subsequent locked
      pte_same or pte_none test to back out when necessary; though do_wp_page needs
      the lock immediately, and do_file_page doesn't check (if there's a race,
      install_page just zaps the entry and reinstalls it).
      
      But on those architectures (notably i386 with PAE) whose pte is too big to be
      read atomically, if SMP or preemption is enabled, do_swap_page and
      do_file_page might cause irretrievable damage if passed a Frankenstein entry
      stitched together from unrelated parts.  In those configs, "pte_unmap_same"
      has to take page_table_lock, validate orig_pte still the same, and drop
      page_table_lock before unmapping, before proceeding.
      
      Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock
      avoidance leaves more lone maps and unmaps than elsewhere.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8f4e2101
    • Hugh Dickins's avatar
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins authored
      
      
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c74df32c
    • Hugh Dickins's avatar
      [PATCH] mm: ptd_alloc inline and out · 1bb3630e
      Hugh Dickins authored
      
      
      It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
      calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
      pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
      add a little to kernel size, change them to macros testing inline, calling
      __pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
      as fastcalls, leave that to CONFIG_REGPARM or not.
      
      It also seems more natural for the out-of-line functions to leave the offset
      calculation and map to the inline, which has to do it anyway for the common
      case.  At least mremap move wants __pte_alloc without _map.
      
      Macros rather than inline functions, certainly to avoid the header file issues
      which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
      architectures I haven't built would have other such problems.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      1bb3630e
    • Hugh Dickins's avatar
      [PATCH] mm: init_mm without ptlock · 872fec16
      Hugh Dickins authored
      
      
      First step in pushing down the page_table_lock.  init_mm.page_table_lock has
      been used throughout the architectures (usually for ioremap): not to serialize
      kernel address space allocation (that's usually vmlist_lock), but because
      pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.
      
      Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
      architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
      and drop it when allocating a new one, to check lest a racing task already
      did.  Similarly no page_table_lock in vmalloc's map_vm_area.
      
      Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
      user mms, which are converted only by a later patch, for now they have to lock
      differently according to whether or not it's init_mm.
      
      If sources get muddled, there's a danger that an arch source taking
      init_mm.page_table_lock will be mixed with common source also taking it (or
      neither take it).  So break the rules and make another change, which should
      break the build for such a mismatch: remove the redundant mm arg from
      pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).
      
      Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
      used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
      pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
      map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
      took page_table_lock for no good reason.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      872fec16
    • Hugh Dickins's avatar
      [PATCH] mm: ia64 use expand_upwards · 46dea3d0
      Hugh Dickins authored
      
      
      ia64 has expand_backing_store function for growing its Register Backing Store
      vma upwards.  But more complete code for this purpose is found in the
      CONFIG_STACK_GROWSUP part of mm/mmap.c.  Uglify its #ifdefs further to provide
      expand_upwards for ia64 as well as expand_stack for parisc.
      
      The Register Backing Store vma should be marked VM_ACCOUNT.  Implement the
      intention of growing it only a page at a time, instead of passing an address
      outside of the vma to handle_mm_fault, with unknown consequences.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      46dea3d0
    • Hugh Dickins's avatar
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins authored
      
      
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      365e9c87
    • Hugh Dickins's avatar
      [PATCH] mm: zap_pte out of line · 861f2fb8
      Hugh Dickins authored
      
      
      There used to be just one call to zap_pte, but it shouldn't be inline now
      there are two.  Check for the common case pte_none before calling, and move
      its rss accounting up into install_page or install_file_pte - which helps the
      next patch.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      861f2fb8
    • Hugh Dickins's avatar
      [PATCH] mm: do_mremap current mm · d0de32d9
      Hugh Dickins authored
      
      
      Cleanup: relieve do_mremap from its surfeit of current->mms.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d0de32d9
    • Hugh Dickins's avatar
      [PATCH] mm: do_swap_page race major · 9e9bef07
      Hugh Dickins authored
      
      
      Small adjustment: do_swap_page should report its !pte_same race as a major
      fault if it had to read into swap cache, because whatever raced with it will
      have found page already in cache and reported minor fault.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9e9bef07