    mm: rewrite vmap layer · db64fe02
    Nick Piggin authored
    Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).
    The biggest problem with vmap is actually vunmap.  Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush their TLBs.  This is all done under a global lock.  As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of each global TLB
    flush.  This gives terrible quadratic scalability characteristics.
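    The quadratic claim can be made concrete with a toy cost model.  This is
    only a sketch under stated assumptions, not kernel code: every CPU
    performs the same number of vunmaps, and one unit of work is one CPU
    handling one flush IPI.  The function name is hypothetical.

```c
#include <assert.h>

/*
 * Toy model: total TLB-flush work when each of ncpus CPUs performs
 * unmaps_per_cpu vunmaps and every vunmap interrupts all CPUs.
 */
unsigned long eager_flush_work(unsigned long ncpus, unsigned long unmaps_per_cpu)
{
    assert(ncpus > 0);
    unsigned long total_unmaps = ncpus * unmaps_per_cpu; /* grows with CPU count */
    return total_unmaps * ncpus;                         /* each one reaches every CPU */
}
```

    In this model, doubling the CPU count quadruples the total flush work,
    which is the quadratic behaviour described above.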
    Another problem is that the entire vmap subsystem works under a single
    lock.  It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.
    This is a rewrite of vmap subsystem to solve those problems.  The existing
    vmalloc API is implemented on top of the rewritten subsystem.
    The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not use them again (that would be a
    use-after-free) until they are reallocated.  So the addresses simply
    aren't allocated again until after a subsequent TLB flush.  A single TLB
    flush can then cover multiple vunmaps from each CPU.
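    The batching idea can be sketched in userspace as a pair of counters.
    This is a minimal model, not the code in this patch: struct lazy_vmap,
    lazy_unmap(), lazy_purge() and the LAZY_MAX threshold are all
    hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX 4   /* hypothetical batch threshold, chosen for illustration */

struct lazy_vmap {
    size_t pending;        /* unmapped ranges whose TLB entries may be stale */
    size_t global_flushes; /* number of expensive global TLB flushes issued */
    size_t reusable;       /* ranges that are safe to hand out again */
};

/* Flush once and make every pending range reusable. */
void lazy_purge(struct lazy_vmap *v)
{
    assert(v != NULL);
    if (v->pending == 0)
        return;
    v->global_flushes++;
    v->reusable += v->pending;
    v->pending = 0;
}

/* Unmap a range: defer the flush until enough ranges have accumulated. */
void lazy_unmap(struct lazy_vmap *v)
{
    assert(v != NULL);
    v->pending++;
    if (v->pending >= LAZY_MAX)
        lazy_purge(v);
}
```

    Ten calls to lazy_unmap() on a zeroed struct trigger two flushes rather
    than ten.  A caller that cannot tolerate stale aliases would call
    lazy_purge() directly, which is roughly the role this message gives to
    vm_unmap_aliases().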
    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.
    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.
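    To illustrate why an ordered structure helps, here is a lookup by
    address over extents kept sorted by start address.  Note the
    substitution: this sketch uses binary search over a sorted array as a
    stand-in for the rbtree, since both give O(log n) lookups where a linked
    list needs O(n).  struct extent and find_extent() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
    unsigned long start; /* inclusive */
    unsigned long end;   /* exclusive */
};

/*
 * Find the extent containing addr in an array sorted by start address with
 * no overlaps.  Returns its index, or -1 if addr lies in no extent.
 */
long find_extent(const struct extent *ext, size_t n, unsigned long addr)
{
    size_t lo = 0, hi = n;

    assert(ext != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < ext[mid].start)
            hi = mid;            /* addr is below this extent */
        else if (addr >= ext[mid].end)
            lo = mid + 1;        /* addr is above this extent */
        else
            return (long)mid;    /* start <= addr < end */
    }
    return -1;
}
```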
    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.
    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap.  Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).
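    How a per-CPU frontend amortizes global locking can be modelled with a
    counter.  This is a single-threaded sketch under assumptions, not the
    allocator in this patch: the block size, CPU count and all names below
    are hypothetical, and the "global lock" is only counted, not taken.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4   /* hypothetical CPU count for the model */
#define BLOCK_SLOTS   64  /* hypothetical number of small mappings per block */

struct percpu_model {
    size_t free_slots[NR_CPUS_MODEL]; /* slots left in each CPU's current block */
    size_t global_lock_taken;         /* times the shared allocator was entered */
};

/* Allocate one small mapping on behalf of cpu; returns slots left in its block. */
size_t model_alloc(struct percpu_model *m, unsigned int cpu)
{
    assert(m != NULL);
    assert(cpu < NR_CPUS_MODEL);
    if (m->free_slots[cpu] == 0) {
        /* Slow path: carve a fresh block out of the shared address space. */
        m->global_lock_taken++;
        m->free_slots[cpu] = BLOCK_SLOTS;
    }
    /* Fast path: only this CPU's state is touched. */
    return --m->free_slots[cpu];
}
```

    With these numbers, 128 allocations on one CPU enter the shared
    allocator only twice, so the global cost is spread over 64 allocations
    at a time.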
    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages.  Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron.  Results are
    in nanoseconds per map+touch+unmap.
    threads           vanilla         vmap rewrite
    1                 14700           2900
    2                 33600           3000
    4                 49500           2800
    8                 70631           2900
    So with 8 cores, the rewritten version is already about 25x faster.
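    The per-row speedups can be checked from the table with simple integer
    arithmetic.  The helper below is a hypothetical convenience, not part of
    the test harness; it returns the ratio in tenths, truncated.

```c
#include <assert.h>

/* Speedup as vanilla_ns / rewrite_ns, in tenths (integer math, truncated). */
unsigned long speedup_tenths(unsigned long vanilla_ns, unsigned long rewrite_ns)
{
    assert(rewrite_ns > 0);
    return vanilla_ns * 10 / rewrite_ns;
}
```

    Applied to the four rows this gives 50, 112, 176 and 243, that is about
    5x, 11x, 18x and 24x, consistent with the rounded 8-thread figure quoted
    above.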
    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram...  along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system.  I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now.  vmap is pretty well blown off the
    profiles.  Before:

    1352059 total                                      0.1401
    798784 _write_lock                              8320.6667 <- vmlist_lock
    529313 default_idle                             1181.5022
     15242 smp_call_function                         15.8771  <- vmap tlb flushing
      2472 __get_vm_area_node                         1.9312  <- vmap
      1762 remove_vm_area                             4.5885  <- vunmap
       316 map_vm_area                                0.2297  <- vmap
       312 kfree                                      0.1950
       300 _spin_lock                                 3.1250
       252 sn_send_IPI_phys                           0.4375  <- tlb flushing
       238 vmap                                       0.8264  <- vmap
       216 find_lock_page                             0.5192
       196 find_next_bit                              0.3603
       136 sn2_send_IPI                               0.2024
       130 pio_phys_write_mmr                         2.0312
       118 unmap_kernel_range                         0.1229

    After:

     78406 total                                      0.0081
     40053 default_idle                              89.4040
     33576 ia64_spinlock_contention                 349.7500
      1650 _spin_lock                                17.1875
       319 __reg_op                                   0.5538
       281 _atomic_dec_and_lock                       1.0977
       153 mutex_unlock                               1.5938
       123 iget_locked                                0.1671
       117 xfs_dir_lookup                             0.1662
       117 dput                                       0.1406
       114 xfs_iget_core                              0.0268
        92 xfs_da_hashname                            0.1917
        75 d_alloc                                    0.0670
        68 vmap_page_range                            0.0462 <- vmap
        58 kmem_cache_alloc                           0.0604
        57 memset                                     0.0540
        52 rb_next                                    0.1625
        50 __copy_user                                0.0208
        49 bitmap_find_free_region                    0.2188 <- vmap
        46 ia64_sn_udelay                             0.1106
        45 find_inode_fast                            0.1406
        42 memcmp                                     0.2188
        42 finish_task_switch                         0.1094
        42 __d_lookup                                 0.0410
        40 radix_tree_lookup_slot                     0.1250
        37 _spin_unlock_irqrestore                    0.3854
        36 xfs_bmapi                                  0.0050
        36 kmem_cache_free                            0.0256
        35 xfs_vn_getattr                             0.0322
        34 radix_tree_lookup                          0.1062
        33 __link_path_walk                           0.0035
        31 xfs_da_do_buf                              0.0091
        30 _xfs_buf_find                              0.0204
        28 find_get_page                              0.0875
        27 xfs_iread                                  0.0241
        27 __strncpy_from_user                        0.2812
        26 _xfs_buf_initialize                        0.0406
        24 _xfs_buf_lookup_pages                      0.0179
        24 vunmap_page_range                          0.0250 <- vunmap
        23 find_lock_page                             0.0799
        22 vm_map_ram                                 0.0087 <- vmap
        20 kfree                                      0.0125
        19 put_page                                   0.0330
        18 __kmalloc                                  0.0176
        17 xfs_da_node_lookup_int                     0.0086
        17 _read_lock                                 0.0885
        17 page_waitqueue                             0.0664

    vmap has gone from being in the top 5 on the profiles and flushing the
    crap out of all TLBs, to using less than 1% of kernel time.
    [akpm@linux-foundation.org: cleanups, section fix]
    [akpm@linux-foundation.org: fix build on alpha]
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Cc: Jeremy Fitzhardinge <jeremy@goop.org>
    Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>