  1. May 21, 2006
    • [PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary · e984bb43
      Bob Picco authored
      
      Andy added code to the buddy allocator which does not require the zone's
      endpoints to be aligned to MAX_ORDER.  The issue is that the buddy
      allocator still requires the node_mem_map's endpoints to be MAX_ORDER
      aligned; otherwise __page_find_buddy could compute a buddy outside
      node_mem_map for partial MAX_ORDER regions at the zone's endpoints.
      page_is_buddy will detect that these pages at the endpoints are not
      PG_buddy (they were zeroed out by the bootmem allocator and are not part
      of the zone).  The downside is that we may waste a little memory; the
      upside is that all the old checks for zone boundary conditions can be
      eliminated.

      SPARSEMEM won't encounter this issue because of its MAX_ORDER size
      constraint.  ia64's VIRTUAL_MEM_MAP doesn't need the logic either,
      because its holes and endpoints are handled differently.  This leaves
      checking alloc_remap and the other arches which privately allocate
      node_mem_map.
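
      The alignment itself amounts to rounding the map's start pfn down, and
      its end pfn up, to a MAX_ORDER block before node_mem_map is allocated.
      A minimal sketch of that rounding, assuming MAX_ORDER_NR_PAGES is the
      size in pages of the largest buddy block (the helper name and context
      are illustrative, not the patch itself):

        /* Size in pages of the largest buddy block: 1 << (MAX_ORDER - 1). */
        #define MAX_ORDER           11
        #define MAX_ORDER_NR_PAGES  (1UL << (MAX_ORDER - 1))

        /*
         * Round [*start_pfn, *end_pfn) out to MAX_ORDER boundaries so that
         * __page_find_buddy can never compute a buddy outside node_mem_map.
         */
        static void align_mem_map_range(unsigned long *start_pfn,
                                        unsigned long *end_pfn)
        {
            *start_pfn &= ~(MAX_ORDER_NR_PAGES - 1);
            *end_pfn = (*end_pfn + MAX_ORDER_NR_PAGES - 1) &
                       ~(MAX_ORDER_NR_PAGES - 1);
        }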
      
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Cpuset: might sleep checking zones allowed fix · bdd804f4
      Paul Jackson authored
      Fix a couple of infrequently encountered "sleeping function called from
      invalid context" warnings in the cpuset hooks in __alloc_pages(): the
      code could sleep while interrupts were disabled.

      The routine cpuset_zone_allowed() is called from __alloc_pages() in
      mm/page_alloc.c to determine whether a zone is allowed in the current
      task's cpuset.  For certain GFP_KERNEL allocations this routine can
      sleep, if the zone is on a memory node not allowed in the current
      cpuset but possibly allowed in a parent cpuset.
      
      But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
      GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).
      
      The rule was intended to be:
        Don't call cpuset_zone_allowed() if you can't sleep, unless you
        pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
        the code that might scan up ancestor cpusets and sleep.
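
      A minimal sketch of what that rule looks like at a call site;
      cpuset_zone_allowed(), __GFP_WAIT and __GFP_HARDWALL are the real
      symbols, but the wrapper below is hypothetical:

        /*
         * Hypothetical caller illustrating the rule: when we cannot sleep,
         * pass __GFP_HARDWALL so cpuset_zone_allowed() stays on its atomic
         * path and never scans up the ancestor cpusets (which could sleep).
         */
        static int zone_allowed_here(struct zone *z, gfp_t gfp_mask)
        {
            if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
                return cpuset_zone_allowed(z, gfp_mask | __GFP_HARDWALL);

            /* May sleep: full check, possibly walking up the cpuset hierarchy. */
            return cpuset_zone_allowed(z, gfp_mask);
        }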
      
      This rule was being violated in a couple of places, due to a bogus
      change made (by myself, pj) to __alloc_pages() as part of the November
      2005 effort to clean up its logic, and also due to a later fix to
      constrain which swap daemons were awoken.
      
      The bogus change can be seen at:
        http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
        [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags
      
      This was first noticed on a tight memory system, in code that was disabling
      interrupts and doing allocation requests with __GFP_WAIT not set, which
      resulted in __might_sleep() writing complaints to the log "Debug: sleeping
      function called ...", when the code in cpuset_zone_allowed() tried to take
      the callback_sem cpuset semaphore.
      
      We haven't seen a system hang on this 'might_sleep' yet, but we are at
      decent risk of seeing it fairly soon, especially since the additional
      cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
      March 2006.
      
      Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
      to Nick Piggin who warned me of this back in Nov 2005, before I was ready
      to listen.
      
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. Apr 11, 2006
    • [PATCH] overcommit: add calculate_totalreserve_pages() · cb45b0e9
      Hideo AOKI authored
      
      These patches are an enhancement of the OVERCOMMIT_GUESS algorithm in
      __vm_enough_memory().

      - why the kernel needed patching

        When the kernel can't actually allocate anonymous pages, the current
        OVERCOMMIT_GUESS algorithm could still return success.  This can be
        the cause of an OOM kill under memory pressure.

        If Linux runs with a page reservation feature such as
        /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think
        the OOM kill occurs easily.
      
      - the overall design approach in the patch
      
        When the OVERCOMMIT_GUESS algorithm calculates the number of free
        pages, the reserved free pages are regarded as non-free.

        This change helps to avoid the pitfall that the number of free pages
        becomes less than the number the kernel tries to keep free.
      
      - testing results
      
        I tested the patches using my test kernel module.
      
        If the patches aren't applied to the kernel, __vm_enough_memory()
        returns success in this situation, but the actual page allocation
        fails.

        On the other hand, if the patches are applied, the memory allocation
        failure is avoided since __vm_enough_memory() returns failure in this
        situation.

        I checked that on an i386 SMP machine with 16GB of memory.  I haven't
        tested on a nommu environment yet.
      
      This patch adds totalreserve_pages for __vm_enough_memory().
      
      calculate_totalreserve_pages() checks the maximum lowmem_reserve pages
      and pages_high in each zone, and stores the sum over all zones in
      totalreserve_pages.

      totalreserve_pages is calculated when the VM is initialized, and the
      variable is updated whenever /proc/sys/vm/lowmem_reserve_ratio or
      /proc/sys/vm/min_free_kbytes is changed.
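
      A sketch of the calculation, assuming the 2.6-era zone fields
      (lowmem_reserve[], pages_high) and iteration helpers; it illustrates
      the idea rather than reproducing the patch:

        /*
         * Sum, over every zone, the largest lowmem_reserve entry plus the
         * pages_high watermark: memory the kernel tries to keep free and
         * should therefore not hand out under OVERCOMMIT_GUESS.
         */
        static void calculate_totalreserve_pages(void)
        {
            struct pglist_data *pgdat;
            unsigned long reserve_pages = 0;

            for_each_online_pgdat(pgdat) {
                int i, j;

                for (i = 0; i < MAX_NR_ZONES; i++) {
                    struct zone *zone = pgdat->node_zones + i;
                    unsigned long max = 0;

                    /* Largest reserve this zone keeps for any allocation class. */
                    for (j = i; j < MAX_NR_ZONES; j++)
                        if (zone->lowmem_reserve[j] > max)
                            max = zone->lowmem_reserve[j];

                    max += zone->pages_high;
                    reserve_pages += max;
                }
            }
            totalreserve_pages = reserve_pages;  /* global added by this patch */
        }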
      
      Signed-off-by: Hideo Aoki <haoki@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. Mar 09, 2006
    • [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. · 8fce4d8e
      Christoph Lameter authored
      
      The cache reaper currently tries to free all alien caches and all remote
      per cpu pages in each pass of cache_reap.  For machines with a large
      number of nodes (such as Altix) this may lead to sporadic delays of
      around ~10ms.  Interrupts are disabled while reclaiming, creating
      unacceptable delays.

      This patch changes that behavior by adding a per cpu reap_node variable.
      Instead of attempting to free all caches, we free only one alien cache
      and the per cpu pages from one remote node.  That reduces the time spent
      in cache_reap.  However, doing so lengthens the time it takes to
      completely drain all remote per cpu pagesets and all alien caches; the
      time needed grows with the number of nodes in the system.  All caches
      are drained when they overflow their respective capacity, so the only
      drawback is that a bit of memory may be wasted for a while longer.
      
      Details:
      
      1. Rename drain_remote_pages to drain_node_pages to allow specifying
         the node whose pcp pages are to be drained.
      
      2. Add additional functions init_reap_node and next_reap_node for NUMA
         that manage a per cpu reap_node counter (sketched after this list).
      
      3. Add a reap_alien function that reaps only from the current reap_node.
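
      A sketch of the rotor, close to but not necessarily identical to the
      patch; the per-CPU and nodemask helpers are the standard kernel ones:

        /*
         * Each CPU remembers one node to reap from; every cache_reap() pass
         * handles that node only and then advances to the next online node.
         */
        static DEFINE_PER_CPU(int, reap_node);

        static void init_reap_node(int cpu)
        {
            int node = next_node(cpu_to_node(cpu), node_online_map);

            if (node == MAX_NUMNODES)
                node = first_node(node_online_map);
            per_cpu(reap_node, cpu) = node;
        }

        static void next_reap_node(void)
        {
            int node = __get_cpu_var(reap_node);

            /* Advance to the next online node, wrapping around at the end. */
            node = next_node(node, node_online_map);
            if (node >= MAX_NUMNODES)
                node = first_node(node_online_map);
            __get_cpu_var(reap_node) = node;
        }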
      
      For us this seems to be a critical issue.  Holdoffs of an average of
      ~7ms cause some HPC benchmarks to slow down significantly.  For example,
      NAS parallel slows down dramatically: it has a 12-16 second runtime
      without the rotor compared to 5.8 seconds with the rotor patches, and
      gets down to 5.05 seconds with the additional interrupt holdoff
      reductions.
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  4. Feb 20, 2006
    • [PATCH] Terminate process that fails on a constrained allocation · 9b0f8b04
      Christoph Lameter authored
      
      Some allocations are restricted to a limited set of nodes (due to memory
      policies or cpuset constraints).  If the page allocator is not able to find
      enough memory then that does not mean that overall system memory is low.
      
      In particular, going postal and more or less randomly shooting at
      processes is not likely to help the situation; it may just lead to
      suicide (the whole system coming down).
      
      It is better to signal to the process that no memory exists given the
      constraints that the process (or the configuration of the process) has
      placed on the allocation behavior.  The process may be killed but then the
      sysadmin or developer can investigate the situation.  The solution is
      similar to what we do when running out of hugepages.
      
      This patch adds a check before we kill processes.  At that point
      performance considerations do not matter much so we just scan the zonelist
      and reconstruct a list of nodes.  If the list of nodes does not contain all
      online nodes then this is a constrained allocation and we should kill the
      current process.
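
      A sketch of that check, assuming the 2.6-era NULL-terminated zonelist
      layout; the function name and exact test are illustrative:

        /*
         * Collect the nodes the failed allocation was allowed to use.  If
         * they are fewer than all online nodes, the failure stems from a
         * cpuset or mempolicy constraint, so kill current rather than
         * hunting for another victim.
         */
        static int is_constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
        {
            nodemask_t nodes = NODE_MASK_NONE;
            struct zone **z;

            for (z = zonelist->zones; *z; z++)
                if (cpuset_zone_allowed(*z, gfp_mask))
                    node_set((*z)->zone_pgdat->node_id, nodes);

            return !nodes_equal(nodes, node_online_map);
        }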
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  5. Feb 14, 2006
    • [PATCH] compound page: default destructor · d98c7a09
      Hugh Dickins authored
      
      Somehow I imagined that calling a NULL destructor would free a compound page
      rather than oopsing.  No, we must supply a default destructor, __free_pages_ok
      using the order noted by prep_compound_page.  hugetlb can still replace this
      as before with its own free_huge_page pointer.
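
      A sketch of what the default destructor amounts to, assuming the order
      is kept in the first tail page as described in the companion patch
      below; it is illustrative, not the patch itself:

        /*
         * Default compound-page destructor: free the whole block at the
         * order that prep_compound_page() recorded in the first tail page.
         */
        static void free_compound_page(struct page *page)
        {
            __free_pages_ok(page, (unsigned long)page[1].lru.prev);
        }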
      
      The case that needs this is not common: rarely does put_compound_page's
      put_page_testzero bring the count down to 0.  But if get_user_pages is applied
      to some part of a compound page, without immediate release (e.g.  AIO or
      Infiniband), then it's possible for its put_page to come after the containing
      vma has been unmapped and the driver done its free_pages.
      
      That's just the kind of case compound pages are supposed to be guarding
      against (though, as Nick points out, PageReserved did not handle this
      right either).
      
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] compound page: use page[1].lru · 41d78ba5
      Hugh Dickins authored
      
      If a compound page has its own put_page_testzero destructor (the only current
      example is free_huge_page), that is noted in page[1].mapping of the compound
      page.  But that's rather a poor place to keep it: functions which call
      set_page_dirty_lock after get_user_pages (e.g.  Infiniband's
      __ib_umem_release) ought to be checking first, otherwise set_page_dirty is
      liable to crash on what's not the address of a struct address_space.
      
      And now I'm about to make that worse: it turns out that every compound page
      needs a destructor, so we can no longer rely on hugetlb pages going their own
      special way, to avoid further problems of page->mapping reuse.  For example,
      not many people know that: on 50% of i386 -Os builds, the first tail page of a
      compound page purports to be PageAnon (when its destructor has an odd
      address), which surprises page_add_file_rmap.
      
      Keep the compound page destructor in page[1].lru.next instead.  And to free up
      the common pairing of mapping and index, also move compound page order from
      index to lru.prev.  Slab reuses page->lru too: but if we ever need slab to use
      compound pages, it can easily stack its use above this.
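
      A sketch of the accessors this layout implies, with the destructor in
      page[1].lru.next and the order in page[1].lru.prev; the helper names
      are illustrative:

        typedef void compound_page_dtor(struct page *);

        /* The destructor lives in the first tail page's lru.next ... */
        static inline void set_compound_page_dtor(struct page *page,
                                                  compound_page_dtor *dtor)
        {
            page[1].lru.next = (void *)dtor;
        }

        static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
        {
            return (compound_page_dtor *)page[1].lru.next;
        }

        /* ... and the order in its lru.prev, leaving mapping and index free. */
        static inline void set_compound_order(struct page *page, unsigned long order)
        {
            page[1].lru.prev = (void *)order;
        }

        static inline unsigned long compound_order(struct page *page)
        {
            return (unsigned long)page[1].lru.prev;
        }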
      
      (akpm: decoded version of the above: the tail pages of a compound page
      now have ->mapping == NULL, so there's no need for set_page_dirty[_lock]()
      callers to check that they're not compound pages before dirtying them.)
      
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  6. Jan 08, 2006
    • [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Paul Jackson authored
      
      Provide a simple per-cpuset metric of memory pressure, tracking the
      rate at which the tasks in a cpuset call try_to_free_pages(), the
      synchronous (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure that job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than allowed to them.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
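
      A freestanding sketch of such a filter (locking omitted; the constants
      and names are illustrative): the value decays by a fixed factor each
      second, giving roughly a 10 second half-life, and each reclaim event
      bumps it so that the steady-state reading is events per second times
      1000.

        #include <time.h>

        #define FM_COEF   933   /* decay per second: 0.933^10 ~= 0.5, ~10s half-life */
        #define FM_SCALE  1000  /* reported value = events/sec * FM_SCALE */

        struct fmeter {
            int    cnt;   /* events recorded since the last decay step */
            int    val;   /* filtered rate: events/sec, scaled by FM_SCALE */
            time_t time;  /* last decay step, in seconds; set to "now" at init */
        };

        /* Apply one decay step per elapsed second, then fold in pending events. */
        static void fmeter_update(struct fmeter *fmp, time_t now)
        {
            time_t ticks = now - fmp->time;

            while (ticks-- > 0)
                fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
            fmp->time = now;

            fmp->val += (FM_SCALE - FM_COEF) * fmp->cnt;
            fmp->cnt = 0;
        }

        /* Called from the direct reclaim path: record one event. */
        static void fmeter_markevent(struct fmeter *fmp, time_t now)
        {
            fmeter_update(fmp, now);
            fmp->cnt++;
        }

        /* Read the current rate: reclaims per second, times FM_SCALE. */
        static int fmeter_getrate(struct fmeter *fmp, time_t now)
        {
            fmeter_update(fmp, now);
            return fmp->val;
        }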
      
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: free_pages opt · 48db57f8
      Nick Piggin authored
      
      Try to streamline free_pages_bulk by ensuring callers don't pass in a
      'count' that exceeds the list size.
      
      Some cleanups:
      Rename __free_pages_bulk to __free_one_page.
      Put the page list manipulation from __free_pages_ok into free_one_page.
      Make __free_pages_ok static.
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: cleanup zone_pcp · 23316bc8
      Nick Piggin authored
      
      Use zone_pcp everywhere, even though the NUMA code "knows" the internal
      details of the zone.  This stops other people from copying those
      details, and it looks nicer.

      Also, only print the pagesets of online cpus in zoneinfo.
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "Seth, Rohit" <rohit.seth@intel.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>