1. 16 Oct, 2007 4 commits
    • Split the free lists for movable and unmovable allocations · b2a0ac88
      Mel Gorman authored

      This patch adds the core of the fragmentation reduction strategy.  It groups
      pages together based on their ability to migrate or be reclaimed.  It does
      so by splitting each free list in zone->free_area into MIGRATE_TYPES
      separate lists.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Add a bitmap that is used to track flags affecting a block of pages · 835c134e
      Mel Gorman authored

      Here is the latest revision of the anti-fragmentation patches.  Of particular
      note in this version is special treatment of high-order atomic allocations.
      Care is taken to group them together and avoid grouping pages of other types
      near them.  Artificial tests imply that it works.  I'm trying to get the
      hardware together that would allow setting up of a "real" test.  If anyone
      already has a setup and test that can trigger the atomic-allocation problem,
      I'd appreciate a test of these patches and a report.  The second major change
      is that these patches will apply cleanly with patches that implement
      anti-fragmentation through zones.
      
      kernbench shows effectively no performance difference varying between -0.2%
      and +2% on a variety of test machines.  Success rates for huge page allocation
      are dramatically increased.  For example, on a ppc64 machine, the vanilla
      kernel was only able to allocate 1% of memory as a hugepage and this was due
      to a single hugepage reserved as min_free_kbytes.  With these patches applied,
      17% was allocatable as superpages.  With reclaim-related fixes from Andy
      Whitcroft, it was 40% and further reclaim-related improvements should increase
      this further.
      
      Changelog Since V28
      o Group high-order atomic allocations together
      o It is no longer required to set min_free_kbytes to 10% of memory. A value
        of 16384 in most cases will be sufficient
      o Now applied with zone-based anti-fragmentation
      o Fix incorrect VM_BUG_ON within buffered_rmqueue()
      o Reorder the stack so later patches do not back out work from earlier patches
      o Fix bug where journal pages were being treated as movable
      o Bias placement of non-movable pages to lower PFNs
      o More aggressive clustering of reclaimable pages in reaction to workloads
        like updatedb that flood the size of inode caches
      
      Changelog Since V27
      
      o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
        the mistaken impression that it was the 100% solution for high order
        allocations. Instead, it greatly increases the chances high-order
        allocations will succeed and lays the foundation for defragmentation and
        memory hot-remove to work properly
      o Redefine page groupings based on ability to migrate or reclaim instead of
        basing on reclaimability alone
      o Get rid of spurious inits
      o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
        searched for a page of the appropriate type
      o Added more explanation commentary
      o Fix up bug in pageblock code where bitmap was used before being initialised
      
      Changelog Since V26
      o Fix double init of lists in setup_pageset
      
      Changelog Since V25
      o Fix loop order of for_each_rclmtype_order so that order of loop matches args
      o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
      o Rename get_pageblock_type() to get_page_rclmtype()
      o Fix alignment problem in move_freepages()
      o Add mechanism for assigning flags to blocks of pages instead of page->flags
      o On fallback, do not examine the preferred list of free pages a second time
      
      The purpose of these patches is to reduce external fragmentation by grouping
      pages of related types together.  When pages are migrated (or reclaimed under
      memory pressure), large contiguous pages will be freed.
      
      This patch works by categorising allocations by their ability to migrate:
      
      Movable - The pages may be moved with the page migration mechanism. These are
      	generally userspace pages.
      
      Reclaimable - These are allocations for some kernel caches that are
      	reclaimable or allocations that are known to be very short-lived.
      
      Unmovable - These are pages that are allocated by the kernel that
      	are not trivially reclaimed. For example, the memory allocated for a
      	loaded module would be in this category. By default, allocations are
      	considered to be of this type
      
      HighAtomic - These are high-order allocations belonging to callers that
      	cannot sleep or perform any IO. In practice, this is restricted to
      	jumbo frame allocation for network receive. It is assumed that the
      	allocations are short-lived
      
      Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
      there is one for each type of reclaimability.  Once a 2^MAX_ORDER block of
      pages is split for a type of allocation, it is added to the free-lists for
      that type, in effect reserving it.  Hence, over time, pages of the different
      types can be clustered together.
      
      When the preferred freelists are exhausted, the largest possible block is taken
      from an alternative list.  Buddies that are split from that large block are
      placed on the preferred allocation-type freelists to mitigate fragmentation.
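      The split free lists and the fallback rule described above can be sketched
      as follows.  This is a simplified model under stated assumptions, not the
      code from the patch: it keeps a count of free blocks per order and type
      instead of real list heads, every sk_-prefixed name is hypothetical, and
      the fallback simply walks the other types in enum order.

```c
#include <stddef.h>

/* Hypothetical, simplified model: counters stand in for the free lists. */
enum sk_migrate_type {
    SK_UNMOVABLE,
    SK_RECLAIMABLE,
    SK_MOVABLE,
    SK_HIGHATOMIC,
    SK_MIGRATE_TYPES
};

#define SK_MAX_ORDER 11 /* assumed value, for illustration only */

struct sk_free_area {
    /* One "free list" per migrate type instead of a single list. */
    unsigned long nr_free[SK_MIGRATE_TYPES];
};

struct sk_zone {
    struct sk_free_area free_area[SK_MAX_ORDER];
};

/* Split a block of order 'high' down to 'low'; the unused buddies are
 * placed on the lists of the type that triggered the split. */
static void sk_expand(struct sk_zone *z, int low, int high,
                      enum sk_migrate_type mt)
{
    while (high > low) {
        high--;
        z->free_area[high].nr_free[mt]++;
    }
}

/* Returns the order of the block the allocation was carved from,
 * or -1 if nothing suitable is free. */
int sk_alloc(struct sk_zone *z, int order, enum sk_migrate_type mt)
{
    /* 1. Preferred lists: take the smallest block that fits. */
    for (int o = order; o < SK_MAX_ORDER; o++) {
        if (z->free_area[o].nr_free[mt]) {
            z->free_area[o].nr_free[mt]--;
            sk_expand(z, order, o, mt);
            return o;
        }
    }

    /* 2. Fallback: take the largest block from another type, and put
     *    the split-off buddies on the preferred type's lists. */
    for (int o = SK_MAX_ORDER - 1; o >= order; o--) {
        for (int other = 0; other < SK_MIGRATE_TYPES; other++) {
            if (other == (int)mt || !z->free_area[o].nr_free[other])
                continue;
            z->free_area[o].nr_free[other]--;
            sk_expand(z, order, o, mt);
            return o;
        }
    }
    return -1;
}
```

      In this model, an order-0 unmovable request against a zone that only has
      movable blocks free takes the largest movable block, so later unmovable
      requests are served from the leftovers of that one block rather than from
      further movable blocks.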
      
      This implementation gives best-effort for low fragmentation in all zones.
      Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
      (MAX_ORDER-1)) pages in most cases.  This would be 16384 on x86 and x86_64 for
      example.
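      The arithmetic behind the 16384 figure can be checked with a small helper.
      This is a sketch only: the function name is hypothetical, and the values
      MAX_ORDER = 11 and a 4096-byte page are assumptions about the x86
      configuration of the time rather than something stated in this message.

```c
/* Suggested min_free_kbytes: 4 * (1 << (MAX_ORDER-1)) pages, expressed
 * in kilobytes.  'page_size_bytes' is assumed to be a multiple of 1024. */
unsigned long sk_suggested_min_free_kbytes(unsigned int max_order,
                                           unsigned long page_size_bytes)
{
    unsigned long pages = 4UL * (1UL << (max_order - 1));

    return pages * (page_size_bytes / 1024);
}
```

      With max_order = 11 and 4096-byte pages this gives 4 * 1024 = 4096 pages,
      or 16384 kilobytes.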
      
      Our tests show that about 60-70% of physical memory can be allocated on a
      desktop after a few days uptime.  In benchmarks and stress tests, we are
      finding that 80% of memory is available as contiguous blocks at the end of the
      test.  To compare, a standard kernel was getting < 1% of memory as large pages
      on a desktop and about 8-12% of memory as large pages at the end of stress
      tests.
      
      Following this email are 12 patches that implement the page grouping feature.
      The first patch introduces a mechanism for storing flags related to a whole
      block of pages.  Then allocations are split between movable and all other
      allocations.  Following that are patches to deal with per-cpu pages and make
      the mechanism configurable.  The next patch moves free pages between lists
      when partially allocated blocks are used for pages of another migrate type.
      The second last patch groups reclaimable kernel allocations such as inode
      caches together.  The final patch related to groupings keeps high-order atomic
      allocations together.
      
      The last two patches are more concerned with control of fragmentation.  The
      second last patch biases placement of non-movable allocations towards the
      start of memory.  This is with a view of supporting memory hot-remove of DIMMs
      with higher PFNs in the future.  The biasing could be enforced much more
      strictly, but at a cost.  The last patch aggressively clusters reclaimable
      pages like inode caches together.
      
      The fragmentation reduction strategy needs to track if pages within a block
      can be moved or reclaimed so that pages are freed to the appropriate list.
      This patch adds a bitmap for flags affecting a whole MAX_ORDER block of
      pages.
      
      In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
      allocated during initialisation.  SPARSEMEM statically allocates the bitmap in
      a struct mem_section so that bitmaps do not have to be resized during memory
      hotadd.  This wastes a small amount of memory per unused section (usually
      sizeof(unsigned long)) but the complexity of dynamically allocating the memory
      is quite high.
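      A per-block flag bitmap of this kind can be sketched as below.  This is an
      illustration under assumptions, not the interface added by the patch: the
      sk_ names, the block order of 10 and the three flag bits per block are all
      hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SK_BLOCK_ORDER    10 /* assumed: pages per block = 1 << 10 */
#define SK_BITS_PER_BLOCK 3  /* assumed number of flag bits per block */
#define SK_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

/* Position in the bitmap of flag bit 'bit' for the block holding 'pfn'. */
static size_t sk_bit_index(unsigned long pfn, unsigned int bit)
{
    return (size_t)(pfn >> SK_BLOCK_ORDER) * SK_BITS_PER_BLOCK + bit;
}

/* Store the low SK_BITS_PER_BLOCK bits of 'flags' for the whole block. */
void sk_set_block_flags(unsigned long *bitmap, unsigned long pfn,
                        unsigned int flags)
{
    for (unsigned int bit = 0; bit < SK_BITS_PER_BLOCK; bit++) {
        size_t idx = sk_bit_index(pfn, bit);
        unsigned long mask = 1UL << (idx % SK_BITS_PER_LONG);

        if (flags & (1U << bit))
            bitmap[idx / SK_BITS_PER_LONG] |= mask;
        else
            bitmap[idx / SK_BITS_PER_LONG] &= ~mask;
    }
}

/* Read back the flags of the block that contains 'pfn'. */
unsigned int sk_get_block_flags(const unsigned long *bitmap,
                                unsigned long pfn)
{
    unsigned int flags = 0;

    for (unsigned int bit = 0; bit < SK_BITS_PER_BLOCK; bit++) {
        size_t idx = sk_bit_index(pfn, bit);

        if (bitmap[idx / SK_BITS_PER_LONG] & (1UL << (idx % SK_BITS_PER_LONG)))
            flags |= 1U << bit;
    }
    return flags;
}
```

      Every page in a block maps to the same bits, so the flags cost a few bits
      per block rather than space in each page's page->flags.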
      
      Additional credit to Andy Whitcroft who reviewed an earlier implementation
      of the mechanism and suggested how to make it a *lot* cleaner.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memoryless nodes: Fix GFP_THISNODE behavior · 523b9458
      Christoph Lameter authored

      GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
      first zone of a nodelist.  That only works if the node has memory.  A
      memoryless node will have its first zone on another pgdat (node).
      
      GFP_THISNODE currently just returns memory from the first pgdat.  On a
      memoryless node it therefore returns memory from other nodes.  GFP_THISNODE
      should fail if there is no local memory on a node.
      
      Add a new set of zonelists for each node that contain only the zones that
      belong to the node itself so that no fallback is possible.

      Then modify gfp_type to pick up the right zone based on the presence of
      __GFP_THISNODE.

      Drop the existing GFP_THISNODE checks from the page allocator's hot path.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sparsemem: record when a section has a valid mem_map · 540557b9
      Andy Whitcroft authored

      We have a flag to indicate whether a section actually has a valid mem_map
      associated with it.  It is never set and we rely solely on the present bit
      to indicate a section is valid.  By definition a section is not valid if it
      has no mem_map and there is a window during init where the present bit is set
      but there is no mem_map, during which pfn_valid() will return true
      incorrectly.
      
      Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
      mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add new
      present_section{,_nr} and pfn_present() interfaces for those users who care to
      know that a section is going to be valid.
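      The distinction between "present" and "valid" can be sketched as two flag
      tests on the same word.  This is a simplified illustration, not the kernel
      definitions: the struct, the sk_ names and the bit positions are
      assumptions made for the example.

```c
#include <stddef.h>

/* Assumed bit positions, for illustration only. */
#define SK_SECTION_MARKED_PRESENT (1UL << 0)
#define SK_SECTION_HAS_MEM_MAP    (1UL << 1)

struct sk_mem_section {
    unsigned long section_mem_map; /* flag bits live in the low bits */
};

/* The section has been marked present; it may not have a mem_map yet. */
int sk_present_section(const struct sk_mem_section *s)
{
    return s != NULL && (s->section_mem_map & SK_SECTION_MARKED_PRESENT) != 0;
}

/* The section has a mem_map; only then may its pfns be reported valid. */
int sk_valid_section(const struct sk_mem_section *s)
{
    return s != NULL && (s->section_mem_map & SK_SECTION_HAS_MEM_MAP) != 0;
}
```

      During the init window the message describes, only the present bit is set,
      so sk_present_section() is true while sk_valid_section() is still false.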
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 Aug, 2007 1 commit
    • Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE · b377fd39
      Mel Gorman authored

      The NUMA layer only supports NUMA policies for the highest zone.  When
      ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
      ZONE_MOVABLE.  The result is that policies are only applied to allocations
      like anonymous pages and page cache allocated from ZONE_MOVABLE when the
      zone is used.
      
      This patch applies policies to the two highest zones when the highest zone
      is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
      zone, it's always functionally equivalent.
      
      The patch has been tested on a variety of machines both NUMA and non-NUMA
      covering x86, x86_64 and ppc64.  No abnormal results were seen in
      kernbench, tbench, dbench or hackbench.  It passes regression tests from
      the numactl package with and without kernelcore= once numactl tests are
      patched to wait for vmstat counters to update.
      
      akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
      ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
      the mobility or get the other solution that Mel is working on.  That solution
      would only use a single zonelist per node and filter on the fly.  That may
      help performance and also help to make memory policies work better."
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 31 Jul, 2007 1 commit
  4. 17 Jul, 2007 2 commits
    • Lumpy Reclaim V4 · 5ad333eb
      Andy Whitcroft authored

      When we are out of memory of a suitable size we enter reclaim.  The current
      reclaim algorithm targets pages in LRU order, which is great for fairness at
      order-0 but highly unsuitable if you desire pages at higher orders.  To get
      pages of higher order we must shoot down a very high proportion of memory;
      >95% in a lot of cases.
      
      This patch set adds a lumpy reclaim algorithm to the allocator.  It targets
      groups of pages at the specified order anchored at the end of the active and
      inactive lists.  This encourages groups of pages at the requested orders to
      move from active to inactive, and active to free lists.  This behaviour is
      only triggered out of direct reclaim when higher order pages have been
      requested.
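      One way to picture "groups of pages at the specified order" is as the
      naturally aligned block of page frames surrounding the page taken from the
      tail of the LRU.  The helper below is a sketch under that assumption; the
      name sk_lumpy_range is hypothetical and the message itself does not spell
      out how the group is chosen.

```c
/* Given the pfn of the anchor page and the requested order, compute the
 * aligned range [*start, *end) of pfns that forms the candidate group. */
void sk_lumpy_range(unsigned long pfn, unsigned int order,
                    unsigned long *start, unsigned long *end)
{
    unsigned long nr_pages = 1UL << order;

    *start = pfn & ~(nr_pages - 1); /* round down to a block boundary */
    *end = *start + nr_pages;
}
```

      For an anchor at pfn 1000 and order 4, the candidate group is pfns 992 to
      1007; reclaiming all of them would free one contiguous order-4 block.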
      
      This patch set is particularly effective when utilised with an
      anti-fragmentation scheme which groups pages of similar reclaimability
      together.
      
      This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
      the foundation.  Credit to Mel Gorman for sanity checking.
      
      Mel said:
      
        The patches have an application with hugepage pool resizing.
      
        When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
        be resized with greater reliability.  Testing on a desktop machine with 2GB
        of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
        was very slow as the success rate was quite low.  Without lumpy-reclaim,
        each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
        With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
      
      [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
      [bunk@stusta.de: static declarations for internal functions]
      [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Create the ZONE_MOVABLE zone · 2a1e274a
      Mel Gorman authored

      The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
      that is only usable by allocations that specify both __GFP_HIGHMEM and
      __GFP_MOVABLE.  This has the effect of keeping all non-movable pages within a
      single memory partition while allowing movable allocations to be satisfied
      from either partition.  The patches may be applied with the list-based
      anti-fragmentation patches that group pages together based on mobility.
      
      The size of the zone is determined by a kernelcore= parameter specified at
      boot-time.  This specifies how much memory is usable by non-movable
      allocations and the remainder is used for ZONE_MOVABLE.  Any range of pages
      within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
      
      When selecting a zone to take pages from for ZONE_MOVABLE, there are two
      things to consider.  First, only memory from the highest populated zone is
      used for ZONE_MOVABLE.  On the x86, this is probably going to be ZONE_HIGHMEM
      but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64.  Second,
      the amount of memory usable by the kernel will be spread evenly throughout
      NUMA nodes where possible.  If the nodes are not of equal size, the amount of
      memory usable by the kernel on some nodes may be greater than others.
      
      By default, the zone is not as useful for hugetlb allocations because they are
      pinned and non-migratable (currently at least).  A sysctl is provided that
      allows huge pages to be allocated from that zone.  This means that the huge
      page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
      the system assuming that pages are not mlocked.  Despite huge pages being
      non-movable, we do not introduce additional external fragmentation of note as
      huge pages are always the largest contiguous block we care about.
      
      Credit goes to Andy Whitcroft for catching a large variety of problems during
      review of the patches.
      
      This patch creates an additional zone, ZONE_MOVABLE.  This zone is only usable
      by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE.  Hot-added
      memory continues to be placed in its existing destination as there is no
      mechanism to redirect it to a specific zone.
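      The rule that only allocations with both flags may use the new zone can be
      sketched as below.  The flag values, the reduced zone list and the sk_
      names are hypothetical and chosen only for the example; the real flag to
      zone mapping covers more zones.

```c
/* Hypothetical flag bits, for illustration only. */
#define SK_GFP_HIGHMEM 0x02u
#define SK_GFP_MOVABLE 0x08u

enum sk_zone { SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_ZONE_MOVABLE };

/* Highest zone an allocation with these flags is allowed to use. */
enum sk_zone sk_highest_zone(unsigned int gfp_flags)
{
    const unsigned int both = SK_GFP_HIGHMEM | SK_GFP_MOVABLE;

    if ((gfp_flags & both) == both)
        return SK_ZONE_MOVABLE; /* both flags set: movable zone allowed */
    if (gfp_flags & SK_GFP_HIGHMEM)
        return SK_ZONE_HIGHMEM;
    return SK_ZONE_NORMAL;      /* __GFP_MOVABLE alone is not enough */
}
```

      An allocation that sets only one of the two flags stays out of the movable
      zone, which is what keeps non-movable pages confined to the other zones.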
      
      [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 16 Jul, 2007 1 commit
    • change zonelist order: zonelist order selection logic · f0c0b2b8
      KAMEZAWA Hiroyuki authored

      Make zonelist creation policy selectable from sysctl/boot option v6.
      
      This patch makes NUMA's zonelist (of pgdat) order selectable.
      Available orders are Default (automatic) / Node-based / Zone-based.
      
      [Default Order]
      The kernel selects Node-based or Zone-based order automatically.
      
      [Node-based Order]
      This policy treats the locality of memory as the most important parameter.
      Zonelist order is created by each zone's locality. This means lower zones
      (e.g. ZONE_DMA) can be used before higher zones (e.g. ZONE_NORMAL) are
      exhausted.  In other words, ZONE_DMA will be in the middle of the zonelist.
      The current 2.6.21 kernel uses this order.
      
      Pros.
       * A user can expect local memory as much as possible.
      Cons.
       * Lower zones will be exhausted before higher zones. This may cause OOM_KILL.

      Maybe suitable if ZONE_DMA is relatively big, you never see OOM_KILL
      because of ZONE_DMA exhaustion, and you need the best locality.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      [Zone-based order]
      This policy treats the zone type as the most important parameter.
      Zonelist order is created by zone-type order. This means a lower zone is
      never used before the higher zones are exhausted.
      In other words, ZONE_DMA will always be at the tail of the zonelist.
      
      Pros.
       * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
      Cons.
       * memory locality may not be best.
      
      (example)
      assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
      
      *node(0)'s memory allocation order:
      
       node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
      
      *node(1)'s memory allocation order:
      
       node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
      
      The boot option "numa_zonelist_order=" and a proc/sysctl interface are supported.
      
      command:
      %echo N > /proc/sys/vm/numa_zonelist_order
      
      Will rebuild zonelist in Node-based order.
      
      command:
      %echo Z > /proc/sys/vm/numa_zonelist_order
      
      Will rebuild zonelist in Zone-based order.
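      The two orderings differ only in which loop is outermost, as the sketch
      below tries to show.  It is an illustration under assumptions rather than
      the patch's code: the sk_ names are hypothetical, only two zone types are
      modelled, and the caller supplies the nodes already sorted nearest first.

```c
#include <stddef.h>

enum sk_zone_type { SK_DMA, SK_NORMAL, SK_NR_ZONES };

struct sk_zoneref {
    int node;
    enum sk_zone_type zone;
};

/* Node-based order: for each node (nearest first), highest zone first. */
size_t sk_build_node_order(const int *node_order, size_t nr_nodes,
                           int populated[][SK_NR_ZONES],
                           struct sk_zoneref *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nr_nodes; i++) {
        for (int z = SK_NR_ZONES - 1; z >= 0; z--) {
            if (!populated[node_order[i]][z])
                continue;
            out[n].node = node_order[i];
            out[n].zone = (enum sk_zone_type)z;
            n++;
        }
    }
    return n;
}

/* Zone-based order: for each zone type (highest first), nearest node first. */
size_t sk_build_zone_order(const int *node_order, size_t nr_nodes,
                           int populated[][SK_NR_ZONES],
                           struct sk_zoneref *out)
{
    size_t n = 0;

    for (int z = SK_NR_ZONES - 1; z >= 0; z--) {
        for (size_t i = 0; i < nr_nodes; i++) {
            if (!populated[node_order[i]][z])
                continue;
            out[n].node = node_order[i];
            out[n].zone = (enum sk_zone_type)z;
            n++;
        }
    }
    return n;
}
```

      Fed the two-node example from this message (node 0 with DMA and NORMAL,
      node 1 with NORMAL only), the first function places node 0's DMA zone in
      the middle of node 0's list and the second places it at the tail.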
      
      Thanks to Lee Schermerhorn, who gave me much help and code.
      
      [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 09 May, 2007 1 commit
    • Move remote node draining out of slab allocators · 4037d452
      Christoph Lameter authored

      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that it is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add an expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
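      The expire-counter scheme can be sketched as a function called once per
      statistics update pass.  This is a simplified model, not the patch's code:
      the sk_ names are hypothetical, and the reload value of 3 passes is an
      assumption chosen to match the "every 3 seconds" figure on the further
      assumption that a pass runs about once per second.

```c
#define SK_EXPIRE_PASSES 3 /* assumed reload value */

struct sk_pcp {
    int count;  /* pages currently held in the pageset */
    int batch;  /* pages drained at a time */
    int expire; /* passes left before the next drain */
};

/* One vm statistics update pass for a remote pageset.  'counters_dirty'
 * is non-zero if counters had to be folded in this pass.  Returns the
 * number of pages drained (0 if none). */
int sk_stats_pass(struct sk_pcp *p, int counters_dirty)
{
    int drained;

    if (counters_dirty)
        p->expire = SK_EXPIRE_PASSES; /* recent activity: re-arm */

    if (!p->expire || !p->count)
        return 0;                     /* nothing armed or nothing held */

    if (--p->expire)
        return 0;                     /* not expired yet */

    drained = p->count < p->batch ? p->count : p->batch;
    p->count -= drained;
    return drained;
}
```

      Because a drain itself changes the counters, the caller would see them
      dirty on the following pass, which re-arms the counter and leads to
      another batch being drained a few passes later until the pageset is empty.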
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for large amounts of memory being
      caught in them.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 07 May, 2007 1 commit
  8. 11 Feb, 2007 5 commits
  9. 11 Jan, 2007 1 commit
  10. 07 Dec, 2006 3 commits
    • [PATCH] struct seq_operations and struct file_operations constification · 15ad7cdc
      Helge Deller authored

       - move some file_operations structs into the .rodata section
      
       - move static strings from policy_types[] array into the .rodata section
      
       - fix generic seq_operations usages, so that those structs may be defined
         as "const" as well
      
      [akpm@osdl.org: couple of fixes]
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory page_alloc zonelist caching reorder structure · 7253f4ef
      Paul Jackson authored

      Rearrange the struct members in the 'struct zonelist_cache' structure, so
      as to put the readonly (once initialized) z_to_n[] array first, where it
      will come right after the zones[] array in struct zonelist.
      
      This pretty much eliminates the chance that the two frequently written
      elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
      times, will end up on the same cache line as the performance sensitive,
      frequently read, never (after init) written zones[] array.
      
      Keeping frequently written data off frequently read cache lines is good for
      performance.
      
      Thanks to Rohit Seth for the suggestion.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson authored

      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cacheline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
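
The "full zone" hint described in the bullets above can be sketched in user-space C as follows. This is an illustration only, not the patch's code. The field names fullzones and last_full_zap and the one second self-correction interval come from the description above; the struct layout, the 64-zone limit, the explicit "now" tick argument, TICKS_PER_SEC and the helper names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICKS_PER_SEC 1000UL   /* assumption: stand-in for the kernel's HZ */

/* Simplified model of the cache; at most 64 zones fit in the bitmask here. */
struct zonelist_cache_model {
	uint64_t fullzones;           /* bit i set: zone i was recently found full */
	unsigned long last_full_zap;  /* tick at which fullzones was last cleared */
};

/* Forget all hints once they are more than one second old. */
static void zlc_maybe_zap(struct zonelist_cache_model *zlc, unsigned long now)
{
	if (now - zlc->last_full_zap > TICKS_PER_SEC) {
		zlc->fullzones = 0;
		zlc->last_full_zap = now;
	}
}

/* Cheap test: should the allocator bother looking at zone i at all? */
static bool zlc_worth_trying(struct zonelist_cache_model *zlc,
			     unsigned int i, unsigned long now)
{
	zlc_maybe_zap(zlc, now);
	return !(zlc->fullzones & (UINT64_C(1) << i));
}

/* Record that an allocation attempt found zone i full. */
static void zlc_mark_full(struct zonelist_cache_model *zlc, unsigned int i)
{
	zlc->fullzones |= UINT64_C(1) << i;
}
```

Nothing in this sketch takes a lock, which mirrors the point made above: a lost or stale bit only costs one extra look at a zone, or one skipped zone until the next zap.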
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows, as a
      percentage, how much lower the sys time is with zonelist caching
      than on the stock *-mm kernel.
      
            number     2.6.18-mm3        zonelist-cache     delta (< 0 good)    percent
       GBs    N     -----------------   -----------------   -----------------   systime
       mem threads    sys  user  real     sys  user  real     sys  user  real    better
        12       1    153    24   177     151    24   176      -2     0    -1        1%
        12      19     99    22     8      99    22     8       0     0     0        0%
        12      37    111    25     6     112    25     6       1     0     0       -0%
        12      55    115    25     5     110    23     5      -5    -2     0        4%
        38       1    502    74   576     497    73   570      -5    -1    -6        0%
        38      19    426    78    48     373    76    39     -53    -2    -9       12%
        38      37    544    83    36     547    82    36       3    -1     0       -0%
        38      55    501    77    23     511    80    24      10     3     1       -1%
        64       1    917   125  1042     890   124  1014     -27    -1   -28        2%
        64      19   1118   138   119     965   141   103    -153     3   -16       13%
        64      37   1202   151    94    1136   150    81     -66    -1   -13        5%
        64      55   1118   141    61    1072   140    58     -46    -1    -3        4%
        90       1   1342   177  1519    1275   174  1450     -67    -3   -69        4%
        90      19   2392   199   192    2116   189   176    -276   -10   -16       11%
        90      37   3313   238   175    2972   225   145    -341   -13   -30       10%
        90      55   1948   210   104    1843   213   100    -105     3    -4        5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long way away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, when asking for a lot of memory
          and so being forced to go to a few neighboring nodes, improved the most
          with this zonelist caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         by up to ten percent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9276b1bc
  11. 28 Oct, 2006 1 commit
    • [PATCH] vmscan: Fix temp_priority race · 3bb1a852
      Martin Bligh authored
      
      
      The temp_priority field in zone is racy, as we can walk through a reclaim
      path, and just before we copy it into prev_priority, it can be overwritten
      (say with DEF_PRIORITY) by another reclaimer.
      
      The same bug is contained in both try_to_free_pages and balance_pgdat, but
      it is fixed slightly differently.  In balance_pgdat, we keep a separate
      priority record per zone in a local array.  In try_to_free_pages there is
      no need to do this, as the priority level is the same for all zones that we
      reclaim from.
      
      Impact of this bug is that temp_priority is copied into prev_priority, and
      setting this artificially high causes reclaimers to set distress
      artificially low.  They then fail to reclaim mapped pages, when they are,
      in fact, under severe memory pressure (their priority may be as low as 0).
      This causes the OOM killer to fire incorrectly.
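
The balance_pgdat side of the fix, keeping the per-zone priority in a local array and publishing it only once, can be illustrated with a small user-space model. This is a sketch of the pattern only: the struct, NR_ZONES, the toy scan_zone() helper and the value 12 used for DEF_PRIORITY are assumptions made for the example, not the kernel's code.

```c
#define DEF_PRIORITY 12   /* assumption: lowest-urgency starting priority */
#define NR_ZONES 3        /* assumption: small fixed zone count */

struct zone_model {
	int prev_priority;          /* published result of the last reclaim pass */
	unsigned long reclaimable;  /* toy stand-in for pages that could be freed */
};

/* Toy scan: lower priority values scan harder and free more pages. */
static unsigned long scan_zone(struct zone_model *z, int priority)
{
	unsigned long got = z->reclaimable >> priority;

	z->reclaimable -= got;
	return got;
}

/*
 * The in-progress priority lives in a local array, so a concurrent
 * reclaimer cannot overwrite it before it is copied to prev_priority.
 */
static unsigned long balance_zones(struct zone_model zones[NR_ZONES],
				   unsigned long want)
{
	int temp_priority[NR_ZONES];
	unsigned long reclaimed = 0;
	int i, priority;

	for (i = 0; i < NR_ZONES; i++)
		temp_priority[i] = DEF_PRIORITY;

	for (priority = DEF_PRIORITY; priority >= 0 && reclaimed < want; priority--) {
		for (i = 0; i < NR_ZONES; i++) {
			temp_priority[i] = priority;
			reclaimed += scan_zone(&zones[i], priority);
		}
	}

	/* Publish once, at the end, from private data. */
	for (i = 0; i < NR_ZONES; i++)
		zones[i].prev_priority = temp_priority[i];

	return reclaimed;
}
```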
      
      From: Andrew Morton <akpm@osdl.org>
      
      __zone_reclaim() isn't modifying zone->prev_priority.  But zone->prev_priority
      is used in the decision whether or not to bring mapped pages onto the inactive
      list.  Hence there's a risk here that __zone_reclaim() will fail because
      zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
      stuck on the active list.
      
      Fix that up by decreasing (ie making more urgent) zone->prev_priority as
      __zone_reclaim() scans the zone's pages.
      
      This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created.  It should be
      possible to remove that now, and to just start out at DEF_PRIORITY?
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3bb1a852
  12. 21 Oct, 2006 1 commit
  13. 27 Sep, 2006 4 commits
    • [PATCH] Add node to zone for the NUMA case · d5f541ed
      Christoph Lameter authored
      
      
      Add the node in order to optimize zone_to_nid.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d5f541ed
    • [PATCH] own header file for struct page · 5b99cd0e
      Heiko Carstens authored
      
      
      This moves the definition of struct page from mm.h to its own header file
      page-struct.h.  This is a prereq to fix SetPageUptodate which is broken on
      s390:
      
      #define SetPageUptodate(_page) \
             do { \
                     struct page *__page = (_page); \
                     if (!test_and_set_bit(PG_uptodate, &__page->flags)) \
                             page_test_and_clear_dirty(_page); \
             } while (0)
      
      _page gets used twice in this macro which can cause subtle bugs.  Using
      __page for the page_test_and_clear_dirty call doesn't work since it causes
      yet another problem with the page_test_and_clear_dirty macro as well.
      
      In order to avoid all these problems caused by macros it seems to be a good
      idea to get rid of them and convert them to static inline functions.
      Because of header file include order it's necessary to have a separate
      header file for the struct page definition.
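
The double-evaluation hazard, and why a static inline function avoids it, can be shown with a reduced user-space example. Everything here is a simplified stand-in chosen for illustration: struct page_model, PG_UPTODATE_BIT, the dirty_cleared field and both helper names are hypothetical and are not the s390 definitions.

```c
#include <stdbool.h>

#define PG_UPTODATE_BIT 0x1UL   /* hypothetical flag bit */

struct page_model {
	unsigned long flags;
	bool dirty_cleared;     /* stands in for page_test_and_clear_dirty() */
};

/* Non-atomic model of test_and_set_bit(): returns the old state of the bit. */
static inline bool test_and_set_flag(unsigned long *flags, unsigned long bit)
{
	bool old = (*flags & bit) != 0;

	*flags |= bit;
	return old;
}

/* Macro form: 'p' is expanded twice, so its side effects happen twice. */
#define SET_UPTODATE_MACRO(p) \
	do { \
		if (!test_and_set_flag(&(p)->flags, PG_UPTODATE_BIT)) \
			(p)->dirty_cleared = true; \
	} while (0)

/* Inline function form: the argument is evaluated exactly once. */
static inline void set_uptodate_inline(struct page_model *p)
{
	if (!test_and_set_flag(&p->flags, PG_UPTODATE_BIT))
		p->dirty_cleared = true;
}
```

Calling SET_UPTODATE_MACRO(cursor++) advances cursor twice and touches two different pages, while set_uptodate_inline(cursor++) advances it once and touches one.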
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5b99cd0e
    • [PATCH] vm: add per-zone writeout counter · e129b5c2
      Andrew Morton authored
      
      
      The VM is supposed to minimise the number of pages which get written off the
      LRU (for IO scheduling efficiency, and for high reclaim-success rates).  But
      we don't actually have a clear way of showing how true this is.
      
      So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
      pages which have been written by the vm scanner in this zone and globally.
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e129b5c2
    • [PATCH] Introduce mechanism for registering active regions of memory · c713216d
      Mel Gorman authored
      
      
      At a basic level, architectures define structures to record where active
      ranges of page frames are located.  Once located, the code to calculate zone
      sizes and holes in each architecture is very similar.  Some of this zone and
      hole sizing code is difficult to read for no good reason.  This set of patches
      eliminates the similar-looking architecture-specific code.
      
      The patches introduce a mechanism where architectures register where the
      active ranges of page frames are with add_active_range().  When all areas have
      been discovered, free_area_init_nodes() is called to initialise the pgdat and
      zones.  The zone sizes and holes are then calculated in an architecture
      independent manner.
      
      Patch 1 introduces the mechanism for registering and initialising PFN ranges
      Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
      Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
      Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
      Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
      Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
      	It adjusts the watermarks slightly
      
      Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
      gensparse_defconfig and defconfig.  Bob Picco has also tested and debugged on
      IA64.  Jack Steiner successfully boot tested on a mammoth SGI IA64-based
      machine.  These were on patches against 2.6.17-rc1 and release 3 of these
      patches but there have been no ia64-changes since release 3.
      
      There are differences in the zone sizes for x86_64 as the arch-specific code
      for x86_64 accounts the kernel image and the starting mem_maps as memory holes
      but the architecture-independent code accounts the memory as present.
      
      The big benefit of this set of patches is a sizable reduction of
      architecture-specific code, some of which is very hairy.  There should be a
      greater reduction when other architectures use the same mechanisms for zone
      and hole sizing but I lack the hardware to test on.
      
      Additional credit:
      	Dave Hansen for the initial suggestion and comments on early patches
      	Andy Whitcroft for reviewing early versions and catching numerous
      		errors
      	Tony Luck for testing and debugging on IA64
      	Bob Picco for fixing bugs related to pfn registration, reviewing a
      		number of patch revisions, providing a number of suggestions
      		on future direction and testing heavily
      	Jack Steiner and Robin Holt for testing on IA64 and clarifying
      		issues related to memory holes
      	Yasunori for testing on IA64
      	Andi Kleen for reviewing and feeding back about x86_64
      	Christian Kujau for providing valuable information related to ACPI
      		problems on x86_64 and testing potential fixes
      
      This patch:
      
      Define the structure to represent an active range of page frames within a node
      in an architecture independent manner.  Architectures are expected to register
      active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
      free_area_init_nodes() passing the PFNs of the end of each zone.
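
As a rough user-space model of the idea, the sketch below registers PFN ranges per node and then derives present pages and holes for a zone from those ranges alone. It is not the kernel implementation: struct active_range, the fixed-size table, the _model suffix and the present_pages()/hole_pages() helpers are hypothetical names for this example, and it assumes registered ranges do not overlap.

```c
#include <stddef.h>

#define MAX_ACTIVE_RANGES 32   /* assumption: small fixed table for the sketch */

/* Hypothetical record of one registered range of page frames on a node. */
struct active_range {
	int nid;
	unsigned long start_pfn;   /* inclusive */
	unsigned long end_pfn;     /* exclusive */
};

static struct active_range ranges[MAX_ACTIVE_RANGES];
static size_t nr_ranges;

/* Model of add_active_range(nid, start_pfn, end_pfn); returns 0 on success. */
static int add_active_range_model(int nid, unsigned long start_pfn,
				  unsigned long end_pfn)
{
	if (nr_ranges == MAX_ACTIVE_RANGES || start_pfn >= end_pfn)
		return -1;
	ranges[nr_ranges++] = (struct active_range){ nid, start_pfn, end_pfn };
	return 0;
}

/* Pages of node nid that are present inside the zone [zone_start, zone_end). */
static unsigned long present_pages(int nid, unsigned long zone_start,
				   unsigned long zone_end)
{
	unsigned long present = 0;
	size_t i;

	for (i = 0; i < nr_ranges; i++) {
		unsigned long lo, hi;

		if (ranges[i].nid != nid)
			continue;
		/* Clamp the registered range to the zone boundaries. */
		lo = ranges[i].start_pfn > zone_start ? ranges[i].start_pfn : zone_start;
		hi = ranges[i].end_pfn < zone_end ? ranges[i].end_pfn : zone_end;
		if (lo < hi)
			present += hi - lo;
	}
	return present;
}

/* Holes are whatever part of the zone span no registered range covers. */
static unsigned long hole_pages(int nid, unsigned long zone_start,
				unsigned long zone_end)
{
	return (zone_end - zone_start) - present_pages(nid, zone_start, zone_end);
}
```

In this model the architecture only has to report where memory is; the sizing arithmetic is shared, which is the point of the series.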
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Keith Mannthey" <kmannth@gmail.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c713216d
  14. 26 Sep, 2006 7 commits
  15. 01 Sep, 2006 1 commit
    • [PATCH] ZVC: Scale thresholds depending on the size of the system · df9ecaba
      Christoph Lameter authored
      
      
      The ZVC counter update threshold is currently set to a fixed value of 32.
      This patch sets up the threshold depending on the number of processors and
      the sizes of the zones in the system.
      
      With the current threshold of 32, I was able to observe slight contention
      when more than 130-140 processors concurrently updated the counters.  The
      contention vanished when I either increased the threshold to 64 or used
      Andrew's idea of overstepping the interval (see ZVC overstep patch).
      
      However, we saw contention again at 220-230 processors.  So we need higher
      values for larger systems.
      
      But the current default is already a bit of an overkill for smaller
      systems.  Some systems have tiny zones where precision matters.  For
      example i386 and x86_64 have 16M DMA zones and either 900M ZONE_NORMAL or
      ZONE_DMA32.  These are even present on SMP and NUMA systems.
      
      The patch here sets up a threshold based on the number of processors in the
      system and the size of the zone that these counters are used for.  The
      threshold should grow logarithmically, so we use fls() as an easy
      approximation.
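
A logarithmic threshold of this kind could look roughly like the sketch below. The exact formula, the 128 MB scaling unit and the use of 125 as an upper cap are assumptions made for illustration (the text only says the threshold depends on processor count and zone size, grows logarithmically via fls(), and ends up at 125 on the test system); fls_model() is a portable stand-in for the kernel's fls().

```c
/* Position of the most significant set bit, counting from 1; 0 for x == 0. */
static int fls_model(unsigned long x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * Hypothetical threshold: grows with log2 of the CPU count and with log2
 * of the zone size (in assumed 128 MB units), capped at an assumed 125.
 */
static int zvc_threshold_model(unsigned int cpus, unsigned long zone_mbytes)
{
	unsigned long mem = zone_mbytes / 128;
	int threshold = 2 * fls_model(cpus) * (1 + fls_model(mem));

	return threshold < 125 ? threshold : 125;
}
```

Under these assumed constants a 16 MB zone on a single CPU gets a threshold of 2, which keeps tiny zones precise, while a very large zone on a 1024-processor machine reaches the cap of 125.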
      
      Results of tests on a system with 1024 processors (4TB RAM)
      
      The following output is from a test allocating 1GB of memory concurrently
      on each processor (forking the process, so contention on mmap_sem and the
      pte locks is not a factor):
      
                             X                   MIN
      TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU
      fork                   1      0.552      0.552      0.540    0.012      0.552
      fork                   4      0.552      0.548      2.164    0.036      2.200
      fork                  16      0.564      0.548      8.812    0.164      8.976
      fork                 128      0.580      0.572     72.204    1.208     73.412
      fork                 256      1.300      0.660    310.400    2.160    312.560
      fork                 512      3.512      0.696   1526.836    4.816   1531.652
      fork                1020     20.024      0.700  17243.176    6.688  17249.863
      
      So a threshold of 32 is fine up to 128 processors. At 256 processors contention
      becomes a factor.
      
      Overstepping the counter (earlier patch) improves the numbers a bit:
      
      fork                   4      0.552      0.548      2.164    0.040      2.204
      fork                  16      0.552      0.548      8.640    0.148      8.788
      fork                 128      0.556      0.548     69.676    0.956     70.632
      fork                 256      0.876      0.636    212.468    2.108    214.576
      fork                 512      2.276      0.672    997.324    4.260   1001.584
      fork                1020     13.564      0.680  11586.436    6.088  11592.523
      
      Still contention at 512 and 1020. Contention at 1020 is down by a third.
      256 still has a slight bit of contention.
      
      After this patch the counter threshold will be set to 125 which reduces
      contention significantly:
      
      fork                 128      0.560      0.548     69.776    0.932     70.708
      fork                 256      0.636      0.556    143.460    2.036    145.496
      fork                 512      0.640      0.548    284.244    4.236    288.480
      fork                1020      1.500      0.588   1326.152    8.892   1335.044
      
      [akpm@osdl.org: !SMP build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      df9ecaba
  16. 03 Jul, 2006 1 commit
    • [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O · 9614634f
      Christoph Lameter authored
      
      
      It turns out that it is advantageous to leave a small portion of unmapped file
      backed pages if all of a zone's pages (or almost all pages) are allocated and
      so the page allocator has to go off-node.
      
      This allows recently used file I/O buffers to stay on the node and
      reduces the times that zone reclaim is invoked if file I/O occurs
      when we run out of memory in a zone.
      
      The problem is that zone reclaim runs too frequently when the page cache is
      used only for file I/O (read/write, and therefore unmapped pages!) and
      almost all pages of the zone are allocated.  Zone reclaim may remove 32
      unmapped pages.  File I/O will use these pages for the next read/write
      requests and the number of unmapped pages increases again.  Once the zone
      has filled up, which takes only 32 pages, zone reclaim runs again and removes
      them again.  This cycle is too inefficient and there are potentially too
      many zone reclaim cycles.
      
      With the 1% boundary we may still remove all unmapped pages for file I/O in
      a zone reclaim pass.  However, it will then take a large number of reads and
      writes to get back to 1%, where we trigger zone reclaim again.
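
The 1% boundary amounts to a simple guard in front of zone reclaim, sketched below in user-space C. The struct, its field names, the percentage parameter and the function name are assumptions for illustration and not the kernel's identifiers.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters, all in pages. */
struct zone_counts {
	unsigned long present_pages;  /* total pages in the zone */
	unsigned long file_pages;     /* file backed pagecache pages */
	unsigned long file_mapped;    /* of those, pages mapped into processes */
};

/*
 * Zone reclaim is only worth running while unmapped pagecache exceeds the
 * given percentage of the zone; at or below it, the allocation goes off-node.
 */
static bool zone_reclaim_worthwhile(const struct zone_counts *z,
				    unsigned int min_unmapped_percent)
{
	unsigned long unmapped = z->file_pages - z->file_mapped;
	unsigned long floor = z->present_pages * min_unmapped_percent / 100;

	return unmapped > floor;
}
```

With a floor of 1%, a zone of 100000 pages stops being reclaimed once it is down to 1000 or fewer unmapped pagecache pages, instead of being re-entered every 32 pages.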
      
      Zone reclaim in 2.6.16/17 does not show this behavior because it has a 30
      second timeout.
      
      [akpm@osdl.org: rename the /proc file and the variable]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9614634f
  17. 30 Jun, 2006 5 commits