1. 27 Jul, 2010 1 commit
  2. 14 Aug, 2009 1 commit
    • Tejun Heo's avatar
      vmalloc: implement pcpu_get_vm_areas() · ca23e405
      Tejun Heo authored
      To directly use spread NUMA memories for percpu units, percpu
      allocator will be updated to allow sparsely mapping units in a chunk.
      As the distances between units can be very large, this makes
      allocating single vmap area for each chunk undesirable.  This patch
      implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which
      allocates and frees sparse congruent vmap areas.
      pcpu_get_vm_areas() take @offsets and @sizes array which define
      distances and sizes of vmap areas.  It scans down from the top of
      vmalloc area looking for the top-most address which can accomodate all
      the areas.  The top-down scan is to avoid interacting with regular
      vmallocs which can push up these congruent areas up little by little
      ending up wasting address space and page table.
      To speed up top-down scan, the highest possible address hint is
      maintained.  Although the scan is linear from the hint, given the
      usual large holes between memory addresses between NUMA nodes, the
      scanning is highly likely to finish after finding the first hole for
      the last unit which is scanned first.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Nick Piggin <npiggin@suse.de>
  3. 23 Feb, 2009 1 commit
    • Tejun Heo's avatar
      vmalloc: add @align to vm_area_register_early() · c0c0a293
      Tejun Heo authored
      Impact: allow larger alignment for early vmalloc area allocation
      Some early vmalloc users might want larger alignment, for example, for
      custom large page mapping.  Add @align to vm_area_register_early().
      While at it, drop docbook comment on non-existent @size.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
  4. 20 Feb, 2009 2 commits
    • Tejun Heo's avatar
      vmalloc: add un/map_kernel_range_noflush() · 8fc48985
      Tejun Heo authored
      Impact: two more public map/unmap functions
      Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
      These functions respectively map and unmap address range in kernel VM
      area but doesn't do any vcache or tlb flushing.  These will be used by
      new percpu allocator.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    • Tejun Heo's avatar
      vmalloc: implement vm_area_register_early() · f0aa6617
      Tejun Heo authored
      Impact: allow multiple early vm areas
      There are places where kernel VM area needs to be allocated before
      vmalloc is initialized.  This is done by allocating static vm_struct,
      initializing several fields and linking it to vmlist and later vmalloc
      initialization picking up these from vmlist.  This is currently done
      manually and if there's more than one such areas, there's no defined
      way to arbitrate who gets which address.
      This patch implements vm_area_register_early(), which takes vm_area
      struct with flags and size initialized, assigns address to it and puts
      it on the vmlist.  This way, multiple early vm areas can determine
      which addresses they should use.  The only current user - alpha mm
      init - is converted to use it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  5. 18 Feb, 2009 1 commit
  6. 06 Jan, 2009 1 commit
  7. 23 Oct, 2008 1 commit
  8. 20 Oct, 2008 1 commit
    • Nick Piggin's avatar
      mm: rewrite vmap layer · db64fe02
      Nick Piggin authored
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
       This gives terrible quadratic scalability characteristics.
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on an 4 core, 2 socket opteron.  Results are
      in nanoseconds per map+touch+unmap.
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      So with a 8 cores, the rewritten version is already 25x faster.
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  9. 16 Aug, 2008 1 commit
  10. 28 Apr, 2008 2 commits
    • Christoph Lameter's avatar
      vmallocinfo: add caller information · 23016969
      Christoph Lameter authored
      Add caller information so that /proc/vmallocinfo shows where the allocation
      request for a slice of vmalloc memory originated.
      Results in output like this:
      0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
      0xffffc20000801000-0xffffc20000806000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
      0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
      0xffffc20000c07000-0xffffc20000c0a000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
      0xffffc20000c0a000-0xffffc20000c0c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c0c000-0xffffc20000c0f000   12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
      0xffffc20000c10000-0xffffc20000c15000   20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
      0xffffc20000c16000-0xffffc20000c18000    8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
      0xffffc20000c18000-0xffffc20000c1a000    8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
      0xffffc20000c1a000-0xffffc20000c1c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c1c000-0xffffc20000c1e000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c1e000-0xffffc20000c20000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c20000-0xffffc20000c22000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c22000-0xffffc20000c24000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c24000-0xffffc20000c26000    8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
      0xffffc20000c26000-0xffffc20000c28000    8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
      0xffffc20000c28000-0xffffc20000c2d000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
      0xffffc20000c2d000-0xffffc20000c31000   16384 tcp_init+0xd5/0x31c pages=3 vmalloc
      0xffffc20000c31000-0xffffc20000c34000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
      0xffffc20000c34000-0xffffc20000c36000    8192 init_vdso_vars+0xde/0x1f1
      0xffffc20000c36000-0xffffc20000c38000    8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
      0xffffc20000c38000-0xffffc20000c3a000    8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
      0xffffc20000c3a000-0xffffc20000c3e000   16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
      0xffffc20000c40000-0xffffc20000c61000  135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
      0xffffc20000c61000-0xffffc20000c6a000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c6a000-0xffffc20000c73000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c73000-0xffffc20000c7c000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c7c000-0xffffc20000c7f000   12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
      0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
      0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
      0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
      0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
      0xffffc20002204000-0xffffc2000220d000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc2000220d000-0xffffc20002216000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002216000-0xffffc2000221f000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc2000221f000-0xffffc20002228000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002228000-0xffffc20002231000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002231000-0xffffc20002234000   12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
      0xffffc20002240000-0xffffc20002261000  135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
      0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
      0xffffffffa0000000-0xffffffffa0022000  139264 module_alloc+0x4f/0x55 pages=33 vmalloc
      0xffffffffa0022000-0xffffffffa0029000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc
      0xffffffffa002b000-0xffffffffa0034000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
      0xffffffffa0034000-0xffffffffa003d000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
      0xffffffffa003d000-0xffffffffa0049000   49152 module_alloc+0x4f/0x55 pages=11 vmalloc
      0xffffffffa0049000-0xffffffffa0050000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Christoph Lameter's avatar
      vmalloc: show vmalloced areas via /proc/vmallocinfo · a10aa579
      Christoph Lameter authored
      Implement a new proc file that allows the display of the currently allocated
      vmalloc memory.
      It allows to see the users of vmalloc.  That is important if vmalloc space is
      scarce (i386 for example).
      And it's going to be important for the compound page fallback to vmalloc.
      Many of the current users can be switched to use compound pages with fallback.
       This means that the number of users of vmalloc is reduced and page tables no
      longer necessary to access the memory.  /proc/vmallocinfo allows to review how
      that reduction occurs.
      If memory becomes fragmented and larger order allocations are no longer
      possible then /proc/vmallocinfo allows to see which compound page allocations
      fell back to virtual compound pages.  That is important for new users of
      virtual compound pages.  Such as order 1 stack allocation etc that may
      fallback to virtual compound pages in the future.
      /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
      information leakage.
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: CONFIG_MMU=n build fix]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 05 Feb, 2008 1 commit
  12. 21 Jul, 2007 1 commit
    • Jeremy Fitzhardinge's avatar
      i386: fix iounmap's use of vm_struct's size field · 9585116b
      Jeremy Fitzhardinge authored
      get_vm_area always returns an area with an adjacent guard page.  That guard
      page is included in vm_struct.size.  iounmap uses vm_struct.size to
      determine how much address space needs to have change_page_attr applied to
      it, which will BUG if applied to the guard page.
      This patch adds a helper function - get_vm_area_size() in linux/vmalloc.h -
      to return the actual size of a vm area, and uses it to make iounmap do the
      right thing.  There are probably other places which should be using
      Thanks to Dave Young <hidave.darkstar@gmail.com> for debugging the
      [ Andi, it wasn't clear to me whether x86_64 needs the same fix. ]
      Signed-off-by: default avatarJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Cc: Chuck Ebbert <cebbert@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 18 Jul, 2007 1 commit
  14. 14 Jun, 2007 1 commit
  15. 08 May, 2007 1 commit
    • Christoph Hellwig's avatar
      move die notifier handling to common code · 1eeb66a1
      Christoph Hellwig authored
      This patch moves the die notifier handling to common code.  Previous
      various architectures had exactly the same code for it.  Note that the new
      code is compiled unconditionally, this should be understood as an appel to
      the other architecture maintainer to implement support for it aswell (aka
      sprinkling a notify_die or two in the proper place)
      arm had a notifiy_die that did something totally different, I renamed it to
      arm_notify_die as part of the patch and made it static to the file it's
      declared and used at.  avr32 used to pass slightly less information through
      this interface and I brought it into line with the other architectures.
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
      [bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarBryan Wu <bryan.wu@analog.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 13 Nov, 2006 1 commit
  17. 28 Oct, 2006 1 commit
  18. 27 Sep, 2006 1 commit
  19. 26 Sep, 2006 1 commit
  20. 14 Jul, 2006 1 commit
  21. 23 Jun, 2006 1 commit
  22. 29 Oct, 2005 1 commit
  23. 08 Oct, 2005 1 commit
  24. 05 Sep, 2005 1 commit
    • Deepak Saxena's avatar
      [PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDER · fd195c49
      Deepak Saxena authored
      Version 6 of the ARM architecture introduces the concept of 16MB pages
      (supersections) and 36-bit (40-bit actually, but nobody uses this) physical
      addresses.  36-bit addressed memory and I/O and ARMv6 can only be mapped
      using supersections and the requirement on these is that both virtual and
      physical addresses be 16MB aligned.  In trying to add support for ioremap()
      of 36-bit I/O, we run into the issue that get_vm_area() allows for a
      maximum of 512K alignment via the IOREMAP_MAX_ORDER constant.  To work
      around this, we can:
      - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER))
        and then align the pointer ourselves, but this ends up with 512K of
        wasted VM per ioremap().
      - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit
        on top of this. I did this and it works but I don't like the idea
        adding another VM API just for this one case.
      - My preferred solution which is to allow the architecture to override
        the IOREMAP_MAX_ORDER constant with it's own version.
      Signed-off-by: default avatarDeepak Saxena <dsaxena@plexity.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  25. 20 May, 2005 1 commit
  26. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      Let it rip!