1. 20 Oct, 2008 3 commits
  2. 26 Jul, 2008 1 commit
  3. 28 Apr, 2008 2 commits
    • Miklos Szeredi's avatar
      mm: rotate_reclaimable_page() cleanup · ac6aadb2
      Miklos Szeredi authored
      Clean up messy conditional calling of test_clear_page_writeback() from both
      rotate_reclaimable_page() and end_page_writeback().
      The only user of rotate_reclaimable_page() is end_page_writeback() so this is
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman authored
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      The forth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      This patch:
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 13 Feb, 2008 1 commit
  5. 07 Feb, 2008 2 commits
  6. 05 Feb, 2008 3 commits
    • Hugh Dickins's avatar
      tmpfs: move swap swizzling into shmem · 73b1262f
      Hugh Dickins authored
      move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
      between tmpfs page cache and swap cache, to avoid page copying) are only used
      by shmem.c; and our subsequent fix for unionfs needs different treatments in
      the two instances of move_from_swap_cache.  Move them from swap_state.c into
      their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
      add_to_swap_cache externally visible.
      shmem.c likes to say set_page_dirty where swap_state.c liked to say
      SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
      makes moot (and implies we should lose that "shift page from clean_pages to
      dirty_pages list" comment: it's on neither).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Hugh Dickins's avatar
      swapin needs gfp_mask for loop on tmpfs · 02098fea
      Hugh Dickins authored
      Building in a filesystem on a loop device on a tmpfs file can hang when
      swapping, the loop thread caught in that infamous throttle_vm_writeout.
      In theory this is a long standing problem, which I've either never seen in
      practice, or long ago suppressed the recollection, after discounting my load
      and my tmpfs size as unrealistically high.  But now, with the new aops, it has
      become easy to hang on one machine.
      Loop used to grab_cache_page before the old prepare_write to tmpfs, which
      seems to have been enough to free up some memory for any swapin needed; but
      the new write_begin lets tmpfs find or allocate the page (much nicer, since
      grab_cache_page missed tmpfs pages in swapcache).
      When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
      has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
      break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in,
      read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
      mapping_gfp_mask - hence the hang.
      So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
      swapin_readahead to read_swap_cache_async to add_to_swap_cache.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Hugh Dickins's avatar
      swapin_readahead: move and rearrange args · 46017e95
      Hugh Dickins authored
      swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
      beside its kindred read_swap_cache_async.  Why were its args in a different
      order?  rearrange them.  And since it was always followed by a
      read_swap_cache_async of the target page, fold that in and return struct
      page*.  Then CONFIG_SWAP=n no longer needs valid_swaphandles and
      read_swap_cache_async stubs.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 31 Jan, 2008 1 commit
    • Ingo Molnar's avatar
      asm-generic/tlb.h: build fix · 62152d0e
      Ingo Molnar authored
      bring back the avr32, blackfin, sh, sparc architectures into working order,
      by reverting the effects of this change that came in via the x86 tree:
         commit a5a19c63
         Author: Jeremy Fitzhardinge <jeremy@goop.org>
         Date:   Wed Jan 30 13:33:39 2008 +0100
             x86: demacro asm-x86/pgalloc_32.h
      Sorry about that!
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
  8. 30 Jan, 2008 1 commit
    • Jeremy Fitzhardinge's avatar
      x86: demacro asm-x86/pgalloc_32.h · a5a19c63
      Jeremy Fitzhardinge authored
      Convert macros into inline functions, for better type-checking.
      This patch required a little bit of fiddling with headers in order to
      make __(pte|pmd)_free_tlb inline rather than macros.
      asm-generic/tlb.h includes asm/pgalloc.h, though it doesn't directly
      use any pgalloc definitions.  I removed this include to avoid an
      include cycle, but it may cause secondary compile failures by things
      depending on the indirect inclusion; arch/x86/mm/hugetlbpage.c was one
      such place; there may be others.
      Signed-off-by: default avatarJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  9. 17 Oct, 2007 1 commit
  10. 10 Oct, 2007 1 commit
  11. 17 Jul, 2007 1 commit
    • Andy Whitcroft's avatar
      Lumpy Reclaim V4 · 5ad333eb
      Andy Whitcroft authored
      When we are out of memory of a suitable size we enter reclaim.  The current
      reclaim algorithm targets pages in LRU order, which is great for fairness at
      order-0 but highly unsuitable if you desire pages at higher orders.  To get
      pages of higher order we must shoot down a very high proportion of memory;
      >95% in a lot of cases.
      This patch set adds a lumpy reclaim algorithm to the allocator.  It targets
      groups of pages at the specified order anchored at the end of the active and
      inactive lists.  This encourages groups of pages at the requested orders to
      move from active to inactive, and active to free lists.  This behaviour is
      only triggered out of direct reclaim when higher order pages have been
      This patch set is particularly effective when utilised with an
      anti-fragmentation scheme which groups pages of similar reclaimability
      This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
      the foundation.  Credit to Mel Gorman for sanitity checking.
      Mel said:
        The patches have an application with hugepage pool resizing.
        When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can
        be resized with greater reliability.  Testing on a desktop machine with 2GB
        of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own
        was very slow as the success rate was quite low.  Without lumpy-reclaim,
        each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
        With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
      [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
      [bunk@stusta.de: static declarations for internal functions]
      [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 11 Feb, 2007 2 commits
  13. 06 Jan, 2007 1 commit
  14. 07 Dec, 2006 3 commits
    • Rafael J. Wysocki's avatar
      [PATCH] swsusp: use block device offsets to identify swap locations · 3aef83e0
      Rafael J. Wysocki authored
      Make swsusp use block device offsets instead of swap offsets to identify swap
      locations and make it use the same code paths for writing as well as for
      reading data.
      This allows us to use the same code for handling swap files and swap
      partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().
      Signed-off-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Rafael J. Wysocki's avatar
      [PATCH] swsusp: use partition device and offset to identify swap areas · 915bae9e
      Rafael J. Wysocki authored
      The Linux kernel handles swap files almost in the same way as it handles swap
      partitions and there are only two differences between these two types of swap
      (1) swap files need not be contiguous,
      (2) the header of a swap file is not in the first block of the partition
          that holds it.  From the swsusp's point of view (1) is not a problem,
          because it is already taken care of by the swap-handling code, but (2) has
          to be taken into consideration.
      In principle the location of a swap file's header may be determined with the
      help of appropriate filesystem driver.  Unfortunately, however, it requires
      the filesystem holding the swap file to be mounted, and if this filesystem is
      journaled, it cannot be mounted during a resume from disk.  For this reason we
      need some other means by which swap areas can be identified.
      For example, to identify a swap area we can use the partition that holds the
      area and the offset from the beginning of this partition at which the swap
      header is located.
      The following patch allows swsusp to identify swap areas this way.  It changes
      swap_type_of() so that it takes an additional argument representing an offset
      of the swap header within the partition represented by its first argument.
      Signed-off-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: default avatarPavel Machek <pavel@ucw.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Ashwin Chaugule's avatar
      [PATCH] new scheme to preempt swap token · 7602bdf2
      Ashwin Chaugule authored
      The new swap token patches replace the current token traversal algo.  The old
      algo had a crude timeout parameter that was used to handover the token from
      one task to another.  This algo, transfers the token to the tasks that are in
      need of the token.  The urgency for the token is based on the number of times
      a task is required to swap-in pages.  Accordingly, the priority of a task is
      incremented if it has been badly affected due to swap-outs.  To ensure that
      the token doesnt bounce around rapidly, the token holders are given a priority
      boost.  The priority of tasks is also decremented, if their rate of swap-in's
      keeps reducing.  This way, the condition to check whether to pre-empt the swap
      token, is a matter of comparing two task's priority fields.
      [akpm@osdl.org: cleanups]
      Signed-off-by: default avatarAshwin Chaugule <ashwin.chaugule@celunite.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  15. 26 Sep, 2006 5 commits
    • Andrew Morton's avatar
      [PATCH] swsusp: read speedup · 546e0d27
      Andrew Morton authored
      Implement async reads for swsusp resuming.
      Crufty old PIII testbox:
      	15.7 MB/s -> 20.3 MB/s
      Sony Vaio:
      	14.6 MB/s -> 33.3 MB/s
      I didn't implement the post-resume bio_set_pages_dirty().  I don't really
      understand why resume needs to run set_page_dirty() against these pages.
      It might be a worry that this code modifies PG_Uptodate, PG_Error and
      PG_Locked against the image pages.  Can this possibly affect the resumed-into
      kernel?  Hopefully not, if we're atomically restoring its mem_map?
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Andrew Morton's avatar
      [PATCH] swsusp: write speedup · ab954160
      Andrew Morton authored
      Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.
      Crufty old PIII testbox:
      	12.9 MB/s -> 20.9 MB/s
      Sony Vaio:
      	14.7 MB/s -> 26.5 MB/s
      The implementation is crude.  A better one would use larger BIOs, but wouldn't
      gain any performance.
      The memcpys will be mostly pipelined with the IO and basically come for free.
      The ENOMEM path has not been tested.  It should be.
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Christoph Lameter's avatar
      [PATCH] zone_reclaim: dynamic slab reclaim · 0ff38490
      Christoph Lameter authored
      Currently one can enable slab reclaim by setting an explicit option in
      /proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
      option if the freeing of unmapped file backed pages is not enough to free
      enough pages to allow a local allocation.
      However, that means that the slab can grow excessively and that most memory
      of a node may be used by slabs.  We have had a case where a machine with
      46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
      dealing with pagecache pages.  However, slab reclaim was only done during
      global reclaim (which is a bit rare on NUMA systems).
      This patch implements slab reclaim during zone reclaim.  Zone reclaim
      occurs if there is a danger of an off node allocation.  At that point we
      1. Shrink the per node page cache if the number of pagecache
         pages is more than min_unmapped_ratio percent of pages in a zone.
      2. Shrink the slab cache if the number of the nodes reclaimable slab pages
         (patch depends on earlier one that implements that counter)
         are more than min_slab_ratio (a new /proc/sys/vm tunable).
      The shrinking of the slab cache is a bit problematic since it is not node
      specific.  So we simply calculate what point in the slab we want to reach
      (current per node slab use minus the number of pages that neeed to be
      allocated) and then repeately run the global reclaim until that is
      unsuccessful or we have reached the limit.  I hope we will have zone based
      slab reclaim at some point which will make that easier.
      The default for the min_slab_ratio is 5%
      Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
      [akpm@osdl.org: cleanups]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Martin Schwidefsky's avatar
      [PATCH] out of memory notifier · 8bc719d3
      Martin Schwidefsky authored
      Add a notifer chain to the out of memory killer.  If one of the registered
      callbacks could release some memory, do not kill the process but return and
      retry the allocation that forced the oom killer to run.
      The purpose of the notifier is to add a safety net in the presence of
      memory ballooners.  If the resource manager inflated the balloon to a size
      where memory allocations can not be satisfied anymore, it is better to
      deflate the balloon a bit instead of killing processes.
      The implementation for the s390 ballooner is included.
      [akpm@osdl.org: cleanups]
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Christoph Lameter's avatar
      [PATCH] reduce MAX_NR_ZONES: move HIGHMEM counters into highmem.c/.h · c1f60a5a
      Christoph Lameter authored
      Move totalhigh_pages and nr_free_highpages() into highmem.c/.h
      Move the totalhigh_pages definition into highmem.c/.h.  Move the
      nr_free_highpages function into highmem.c
      [yoichi_yuasa@tripeaks.co.jp: build fix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarYoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  16. 03 Jul, 2006 1 commit
    • Christoph Lameter's avatar
      [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Christoph Lameter authored
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      The problem is that zone reclaim runs too frequently when the page cache is
      used for file I/O (read write and therefore unmapped pages!) alone and we have
      almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
      pages.  File I/O will use these pages for the next read/write requests and the
      unmapped pages increase.  After the zone has filled up again zone reclaim will
      remove it again after only 32 pages.  This cycle is too inefficient and there
      are potentially too many zone reclaim cycles.
      With the 1% boundary we may still remove all unmapped pages for file I/O in
      zone reclaim pass.  However.  it will take a large number of read and writes
      to get back to 1% again where we trigger zone reclaim again.
      The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
      second timeout.
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  17. 30 Jun, 2006 1 commit
  18. 27 Jun, 2006 1 commit
  19. 23 Jun, 2006 5 commits
    • Andrew Morton's avatar
      [PATCH] initialise total_memory() earlier · bd1e22b8
      Andrew Morton authored
      Initialise total_memory earlier in boot.  Because if for some reason we run
      page reclaim early in boot, we don't want total_memory to be zero when we use
      it as a divisor.
      And rename total_memory to vm_total_pages to avoid naming clashes with
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Con Kolivas's avatar
      [PATCH] mm: fix swap unused warning · bd96b9eb
      Con Kolivas authored
      If CONFIG_SWAP is not defined we get:
      mm/vmscan.c: In function ‘remove_mapping’:
      mm/vmscan.c:387: warning: unused variable ‘swap’
      Convert defines in swap.h into blank inline functions to fix this warning
      and be consistent.
      Signed-off-by: default avatarCon Kolivas <kernel@kolivas.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Christoph Lameter's avatar
      [PATCH] More page migration: use migration entries for file pages · 04e62a29
      Christoph Lameter authored
      This implements the use of migration entries to preserve ptes of file backed
      pages during migration.  Processes can therefore be migrated back and forth
      without loosing their connection to pagecache pages.
      Note that we implement the migration entries only for linear mappings.
      Nonlinear mappings still require the unmapping of the ptes for migration.
      And another writepage() ugliness shows up.  writepage() can drop the page
      lock.  Therefore we have to remove migration ptes before calling writepages()
      in order to avoid having migration entries point to unlocked pages.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Christoph Lameter's avatar
      [PATCH] Swapless page migration: add R/W migration entries · 0697212a
      Christoph Lameter authored
      Implement read/write migration ptes
      We take the upper two swapfiles for the two types of migration ptes and define
      a series of macros in swapops.h.
      The VM is modified to handle the migration entries.  migration entries can
      only be encountered when the page they are pointing to is locked.  This limits
      the number of places one has to fix.  We also check in copy_pte_range and in
      mprotect_pte_range() for migration ptes.
      We check for migration ptes in do_swap_cache and call a function that will
      then wait on the page lock.  This allows us to effectively stop all accesses
      to apge.
      Migration entries are created by try_to_unmap if called for migration and
      removed by local functions in migrate.c
      From: Hugh Dickins <hugh@veritas.com>
        Several times while testing swapless page migration (I've no NUMA, just
        hacking it up to migrate recklessly while running load), I've hit the
        BUG_ON(!PageLocked(p)) in migration_entry_to_page.
        This comes from an orphaned migration entry, unrelated to the current
        correctly locked migration, but hit by remove_anon_migration_ptes as it
        checks an address in each vma of the anon_vma list.
        Such an orphan may be left behind if an earlier migration raced with fork:
        copy_one_pte can duplicate a migration entry from parent to child, after
        remove_anon_migration_ptes has checked the child vma, but before it has
        removed it from the parent vma.  (If the process were later to fault on this
        orphaned entry, it would hit the same BUG from migration_entry_wait.)
        This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
        not.  There's no such problem with file pages, because vma_prio_tree_add
        adds child vma after parent vma, and the page table locking at each end is
        enough to serialize.  Follow that example with anon_vma: add new vmas to the
        tail instead of the head.
        (There's no corresponding problem when inserting migration entries,
        because a missed pte will leave the page count and mapcount high, which is
        allowed for.  And there's no corresponding problem when migrating via swap,
        because a leftover swap entry will be correctly faulted.  But the swapless
        method has no refcounting of its entries.)
      From: Ingo Molnar <mingo@elte.hu>
        pte_unmap_unlock() takes the pte pointer as an argument.
      From: Hugh Dickins <hugh@veritas.com>
        Several times while testing swapless page migration, gcc has tried to exec
        a pointer instead of a string: smells like COW mappings are not being
        properly write-protected on fork.
        The protection in copy_one_pte looks very convincing, until at last you
        realize that the second arg to make_migration_entry is a boolean "write",
        and SWP_MIGRATION_READ is 30.
        Anyway, it's better done like in change_pte_range, using
        is_write_migration_entry and make_migration_entry_read.
      From: Hugh Dickins <hugh@veritas.com>
        Remove unnecessary obfuscation from sys_swapon's range check on swap type,
        which blew up causing memory corruption once swapless migration made
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarChristoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      From: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    • Andreas Dilger's avatar
      [PATCH] reserve space for swap label · e8f03d02
      Andreas Dilger authored
      Reserve space in the swap disk header for a LABEL and UUID to be specified.
       This has been possible with util-linux-2.12b (via e2fsprogs 1.36
      libblkid), and is used by at least FC3 and later.  The kernel doesn't
      really care about this, but the space shouldn't accidentally be used by
      something else either.
      Also make the on-disk structures be fixed-size types, instead of "int",
      though I don't know of any architecture in use where an "int" isn't the
      same size as a "__u32" (all current kernel arches have it as "unsigned
      Signed-off-by: default avatarAndreas Dilger <adilger@shaw.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  20. 15 May, 2006 1 commit
  21. 26 Apr, 2006 1 commit
  22. 11 Apr, 2006 1 commit
    • Hideo AOKI's avatar
      [PATCH] overcommit: add calculate_totalreserve_pages() · cb45b0e9
      Hideo AOKI authored
      These patches are an enhancement of OVERCOMMIT_GUESS algorithm in
      - why the kernel needed patching
        When the kernel can't allocate anonymous pages in practice, currnet
        OVERCOMMIT_GUESS could return success. This implementation might be
        the cause of oom kill in memory pressure situation.
        If the Linux runs with page reservation features like
        /proc/sys/vm/lowmem_reserve_ratio and without swap region, I think
        the oom kill occurs easily.
      - the overall design approach in the patch
        When the OVERCOMMET_GUESS algorithm calculates number of free pages,
        the reserved free pages are regarded as non-free pages.
        This change helps to avoid the pitfall that the number of free pages
        become less than the number which the kernel tries to keep free.
      - testing results
        I tested the patches using my test kernel module.
        If the patches aren't applied to the kernel, __vm_enough_memory()
        returns success in the situation but autual page allocation is
        On the other hand, if the patches are applied to the kernel, memory
        allocation failure is avoided since __vm_enough_memory() returns
        failure in the situation.
        I checked that on i386 SMP 16GB memory machine. I haven't tested on
        nommu environment currently.
      This patch adds totalreserve_pages for __vm_enough_memory().
      Calculate_totalreserve_pages() checks maximum lowmem_reserve pages and
      pages_high in each zone. Finally, the function stores the sum of each
      zone to totalreserve_pages.
      The totalreserve_pages is calculated when the VM is initilized.
      And the variable is updated when /proc/sys/vm/lowmem_reserve_raito
      or /proc/sys/vm/min_free_kbytes are changed.
      Signed-off-by: default avatarHideo Aoki <haoki@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  23. 23 Mar, 2006 1 commit