1. 26 Sep, 2006 6 commits
    • Christoph Lameter's avatar
      [PATCH] Define easier to handle GFP_THISNODE · 980128f2
      Christoph Lameter authored
      
      
      In many places we will need to use the same combination of flags.  Specify
      a single GFP_THISNODE definition for ease of use in gfp.h.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      980128f2
    • Christoph Lameter's avatar
      [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore... · 9b819d20
      Christoph Lameter authored
      
      [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
      
      Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
      flag is essential if a kernel component requires memory to be located on a
      certain node.  It will be needed for alloc_pages_node() to force allocation
      on the indicated node and for alloc_pages() to force allocation on the
      current node.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9b819d20
    • Christoph Lameter's avatar
      [PATCH] linearly index zone->node_zonelists[] · 19655d34
      Christoph Lameter authored
      
      
      I wonder why we need this bitmask indexing into zone->node_zonelists[]?
      
      We always start with the highest zone and then include all lower zones
      if we build zonelists.
      
      Are there really cases where we need allocation from ZONE_DMA or
      ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
      of highest_zone() makes that already impossible.
      
      If we go linear on the index then gfp_zone() == highest_zone() and a lot
      of definitions fall by the wayside.
      
      We can now revert back to the use of gfp_zone() in mempolicy.c ;-)
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      19655d34
    • Christoph Lameter's avatar
      [PATCH] mempolicies: fix policy_zone check · 4e4785bc
      Christoph Lameter authored
      
      
      There is a check in zonelist_policy that compares pieces of the bitmap
      obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
      zonelist_policy().
      
      The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
      The policy_zone is a zone number with the possible values of ZONE_DMA,
      ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
      of values.
      
      For some reason seemed to work before the zone reduction patchset (It
      definitely works on SGI boxes since we just have one zone and the check
      cannot fail).
      
      With the zone reduction patchset this check definitely fails on systems
      with two zones if the system actually has memory in both zones.
      
      This is because ZONE_NORMAL is selected using no __GFP flag at
      all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
      is set. __GFP_DMA is 0x01.  So gfp_zone(gfpmask) == 1.
      
      policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
      populated.
      
      For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
      policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
      allocations!
      
      Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
      to DMA allocations!
      
      What we realy want in that place is to establish the highest allowable
      zone for a given gfp_mask. If the highest zone is higher or equal to the
      policy_zone then memory policies need to be applied. We have such
      a highest_zone() function in page_alloc.c.
      
      So move the highest_zone() function from mm/page_alloc.c into
      include/linux/gfp.h.  On the way we simplify the function and use the new
      zone_type that was also introduced with the zone reduction patchset plus we
      also specify the right type for the gfp flags parameter.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4e4785bc
    • Christoph Lameter's avatar
      [PATCH] reduce MAX_NR_ZONES: make ZONE_HIGHMEM optional · e53ef38d
      Christoph Lameter authored
      
      
      Make ZONE_HIGHMEM optional
      
      - ifdef out code and definitions related to CONFIG_HIGHMEM
      
      - __GFP_HIGHMEM falls back to normal allocations if there is no
        ZONE_HIGHMEM
      
      - GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
        zone.
      
      [jdike@addtoit.com: build fix]
      Signed-off-by: default avatarJeff Dike <jdike@addtoit.com>
      Signed-off-by: default avatarChristoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e53ef38d
    • Christoph Lameter's avatar
      [PATCH] reduce MAX_NR_ZONES: make ZONE_DMA32 optional · fb0e7942
      Christoph Lameter authored
      
      
      Make ZONE_DMA32 optional
      
      - Add #ifdefs around ZONE_DMA32 specific code and definitions.
      
      - Add CONFIG_ZONE_DMA32 config option and use that for x86_64
        that alone needs this zone.
      
      - Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
        for ia64 and fix up the way per node ZVCs are calculated.
      
      - Fall back to prior GFP_ZONEMASK of 0x03 if there is no
        DMA32 zone.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fb0e7942
  2. 26 Apr, 2006 1 commit
  3. 11 Apr, 2006 1 commit
  4. 09 Mar, 2006 1 commit
    • Christoph Lameter's avatar
      [PATCH] slab: Node rotor for freeing alien caches and remote per cpu pages. · 8fce4d8e
      Christoph Lameter authored
      
      
      The cache reaper currently tries to free all alien caches and all remote
      per cpu pages in each pass of cache_reap.  For a machines with large number
      of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
      Interrupts are disabled while reclaiming creating unacceptable delays.
      
      This patch changes that behavior by adding a per cpu reap_node variable.
      Instead of attempting to free all caches, we free only one alien cache and
      the per cpu pages from one remote node.  That reduces the time spend in
      cache_reap.  However, doing so will lengthen the time it takes to
      completely drain all remote per cpu pagesets and all alien caches.  The
      time needed will grow with the number of nodes in the system.  All caches
      are drained when they overflow their respective capacity.  So the drawback
      here is only that a bit of memory may be wasted for awhile longer.
      
      Details:
      
      1. Rename drain_remote_pages to drain_node_pages to allow the specification
         of the node to drain of pcp pages.
      
      2. Add additional functions init_reap_node, next_reap_node for NUMA
         that manage a per cpu reap_node counter.
      
      3. Add a reap_alien function that reaps only from the current reap_node.
      
      For us this seems to be a critical issue.  Holdoffs of an average of ~7ms
      cause some HPC benchmarks to slow down significantly.  F.e.  NAS parallel
      slows down dramatically.  NAS parallel has a 12-16 seconds runtime w/o rotor
      compared to 5.8 secs with the rotor patches.  It gets down to 5.05 secs with
      the additional interrupt holdoff reductions.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8fce4d8e
  5. 11 Jan, 2006 2 commits
  6. 22 Nov, 2005 1 commit
    • Linus Torvalds's avatar
      Fix up GFP_ZONEMASK for GFP_DMA32 usage · ac3461ad
      Linus Torvalds authored
      
      
      There was some confusion about the different zone usage, this should fix
      up the resulting mess in the GFP zonemask handling.
      
      The different zone usage is still confusing (it's very easy to mix up
      the individual zone numbers with the GFP zone _list_ numbers), so we
      might want to clean up some of this in the future, but in the meantime
      this should fix the actual problems.
      Acked-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ac3461ad
  7. 14 Nov, 2005 1 commit
    • Andi Kleen's avatar
      [PATCH] x86_64: Add 4GB DMA32 zone · a2f1b424
      Andi Kleen authored
      
      
      Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.
      
      As a bit of historical background: when the x86-64 port
      was originally designed we had some discussion if we should
      use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or
      both. Both was ruled out at this point because it was in early
      2.4 when VM is still quite shakey and had bad troubles even
      dealing with one DMA zone.  We settled on the 16MB DMA zone mainly
      because we worried about older soundcards and the floppy.
      
      But this has always caused problems since then because
      device drivers had trouble getting enough DMA able memory. These days
      the VM works much better and the wide use of NUMA has proven
      it can deal with many zones successfully.
      
      So this patch adds both zones.
      
      This helps drivers who need a lot of memory below 4GB because
      their hardware is not accessing more (graphic drivers - proprietary
      and free ones, video frame buffer drivers, sound drivers etc.).
      Previously they could only use IOMMU+16MB GFP_DMA, which
      was not enough memory.
      
      Another common problem is that hardware who has full memory
      addressing for >4GB misses it for some control structures in memory
      (like transmit rings or other metadata).  They tended to allocate memory
      in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent,
      but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory
      (even on AMD systems the IOMMU tends to be quite small) especially if you have
      many devices.  With the new zone pci_alloc_consistent can just put
      this stuff into memory below 4GB which works better.
      
      One argument was still if the zone should be 4GB or 2GB. The main
      motivation for 2GB would be an unnamed not so unpopular hardware
      raid controller (mostly found in older machines from a particular four letter
      company) who has a strange 2GB restriction in firmware. But
      that one works ok with swiotlb/IOMMU anyways, so it doesn't really
      need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
      it seems to be the most common restriction.
      
      The new zone is so far added only for x86-64.
      
      For other architectures who don't set up this
      new zone nothing changes. Architectures can set a compatibility
      define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32
      as GFP_DMA. Otherwise it's a nop because on 32bit architectures
      it's normally not needed because GFP_NORMAL (=0) is DMA able
      enough.
      
      One problem is still that GFP_DMA means different things on different
      architectures. e.g. some drivers used to have #ifdef ia64  use GFP_DMA
      (trusting it to be 4GB) #elif __x86_64__ (use other hacks like
      the swiotlb because 16MB is not enough) ... . This was quite
      ugly and is now obsolete.
      
      These should be now converted to use GFP_DMA32 unconditionally. I haven't done
      this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent
      which will use GFP_DMA32 transparently.
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a2f1b424
  8. 13 Nov, 2005 1 commit
  9. 28 Oct, 2005 1 commit
    • Al Viro's avatar
      [PATCH] gfp_t: infrastructure · af4ca457
      Al Viro authored
      
      
      Beginning of gfp_t annotations:
      
       - -Wbitwise added to CHECKFLAGS
       - old __bitwise renamed to __bitwise__
       - __bitwise defined to either __bitwise__ or nothing, depending on
         __CHECK_ENDIAN__ being defined
       - gfp_t switched from __nocast to __bitwise__
       - force cast to gfp_t added to __GFP_... constants
       - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts
         the result to int
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      af4ca457
  10. 08 Oct, 2005 1 commit
  11. 07 Sep, 2005 1 commit
    • Paul Jackson's avatar
      [PATCH] cpusets: new __GFP_HARDWALL flag · f90b1d2f
      Paul Jackson authored
      
      
      Add another GFP flag: __GFP_HARDWALL.
      
      A subsequent "cpuset_zone_allowed" patch will use this flag to mark GFP_USER
      allocations, and distinguish them from GFP_KERNEL allocations.
      
      Allocations (such as GFP_USER) marked GFP_HARDWALL are constrainted to the
      current tasks cpuset.  Other allocations (such as GFP_KERNEL) can steal from
      the possibly larger nearest mem_exclusive cpuset ancestor, if memory is tight
      on every node in the current cpuset.
      
      This patch collides with Mel Gorman's patch to reduce fragmentation in the
      standard buddy allocator, which adds two GFP flags.  This was discussed on
      linux-mm in July.  Most likely, one of his flags for user reclaimable memory
      can be the same as my __GFP_HARDWALL flag, under some generic name meaning its
      user address space memory.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f90b1d2f
  12. 07 Jul, 2005 1 commit
  13. 21 Jun, 2005 2 commits
    • Christoph Lameter's avatar
      [PATCH] Periodically drain non local pagesets · 4ae7c039
      Christoph Lameter authored
      
      
      The pageset array can potentially acquire a huge amount of memory on large
      NUMA systems.  F.e.  on a system with 512 processors and 256 nodes there
      will be 256*512 pagesets.  If each pageset only holds 5 pages then we are
      talking about 655360 pages.With a 16K page size on IA64 this results in
      potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
      cases are much less for smaller systems but there is still the potential of
      memory being trapped in off node pagesets.  Off node memory may be rarely
      used if local memory is available and so we may potentially have memory in
      seldom used pagesets without this patch.
      
      The slab allocator flushes its per cpu caches every 2 seconds.  The
      following patch flushes the off node pageset caches in the same way by
      tying into the slab flush.
      
      The patch also changes /proc/zoneinfo to include the number of pages
      currently in each pageset.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4ae7c039
    • Martin Hicks's avatar
      [PATCH] VM: add __GFP_NORECLAIM · 0c35bbad
      Martin Hicks authored
      
      
      When using the early zone reclaim, it was noticed that allocating new pages
      that should be spread across the whole system caused eviction of local pages.
      
      This adds a new GFP flag to prevent early reclaim from happening during
      certain allocation attempts.  The example that is implemented here is for page
      cache pages.  We want page cache pages to be spread across the whole system,
      and we don't want page cache pages to evict other pages to get local memory.
      Signed-off-by: default avatarMartin Hicks <mort@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0c35bbad
  14. 01 May, 2005 1 commit
    • Nick Piggin's avatar
      [PATCH] mempool: NOMEMALLOC and NORETRY · b84a35be
      Nick Piggin authored
      
      
      Mempools have 2 problems.
      
      The first is that mempool_alloc can possibly get stuck in __alloc_pages
      when they should opt to fail, and take an element from their reserved pool.
      
      The second is that it will happily eat emergency PF_MEMALLOC reserves
      instead of going to their reserved pools.
      
      Fix the first by passing __GFP_NORETRY in the allocation calls in
      mempool_alloc.  Fix the second by introducing a __GFP_MEMPOOL flag which
      directs the page allocator not to allocate from the reserve pool.
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b84a35be
  15. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4