    pack-revindex: radix-sort the revindex · 8b8dfd51
    Jeff King authored
    
    
    The pack revindex stores the offsets of the objects in the
    pack in sorted order, allowing us to easily find the on-disk
    size of each object. To compute it, we populate an array
    with the offsets from the sha1-sorted idx file, and then use
    qsort to order it by offsets.
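
    The baseline approach described above can be sketched as follows. This
    is a minimal illustration, not the actual pack-revindex code: the
    `struct entry` layout and the name `sort_by_offset_qsort` are
    assumptions made for the example, and `uint64_t` stands in for a
    non-negative off_t.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry type: a pack offset plus the object's idx position. */
struct entry {
	uint64_t offset; /* stands in for a non-negative 64-bit off_t */
	unsigned int nr;
};

/* Comparison callback; qsort invokes it O(n log n) times. */
static int cmp_offset(const void *a_, const void *b_)
{
	const struct entry *a = a_;
	const struct entry *b = b_;
	return (a->offset < b->offset) ? -1 : (a->offset > b->offset);
}

/* Order the entries by ascending offset using the generic library sort. */
static void sort_by_offset_qsort(struct entry *entries, size_t n)
{
	qsort(entries, n, sizeof(*entries), cmp_offset);
}
```

    Every comparison goes through a function pointer, which is the kind of
    overhead the profile attributes to cmp_offset.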
    
    That does O(n log n) offset comparisons, and profiling shows
    that we spend most of our time in cmp_offset. However, since
    we are sorting on a simple off_t, we can use numeric sorts
    that perform better. A radix sort can run in O(k*n), where k
    is the number of "digits" in our number. For a 64-bit off_t,
    using 16-bit "digits" gives us k=4.
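
    As a rough sketch of the idea, here is a least-significant-digit radix
    sort over 16-bit digits. It is an illustration under stated
    assumptions rather than the patch itself: `struct entry`,
    `sort_by_offset_radix`, `DIGIT_BITS` and `BUCKETS` are names invented
    for the example, `uint64_t` stands in for off_t, and this version
    always processes all four digits.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGIT_BITS 16
#define BUCKETS (1u << DIGIT_BITS)

/* Hypothetical entry type: a pack offset plus the object's idx position. */
struct entry {
	uint64_t offset; /* stands in for a non-negative 64-bit off_t */
	unsigned int nr;
};

/*
 * Sort entries by ascending offset with an LSD radix sort.
 * Returns 0 on success, -1 if scratch memory cannot be allocated.
 */
static int sort_by_offset_radix(struct entry *entries, size_t n)
{
	struct entry *tmp = malloc(n ? n * sizeof(*tmp) : 1);
	size_t *pos = malloc(BUCKETS * sizeof(*pos));
	struct entry *from = entries, *to = tmp;
	unsigned shift;
	size_t i;

	if (!tmp || !pos) {
		free(tmp);
		free(pos);
		return -1;
	}

	for (shift = 0; shift < 64; shift += DIGIT_BITS) {
		struct entry *swap;
		size_t sum = 0;

		/* Pass 1: count how many entries fall into each bucket. */
		memset(pos, 0, BUCKETS * sizeof(*pos));
		for (i = 0; i < n; i++)
			pos[(from[i].offset >> shift) & (BUCKETS - 1)]++;

		/* Convert the counts into each bucket's starting position. */
		for (i = 0; i < BUCKETS; i++) {
			size_t count = pos[i];
			pos[i] = sum;
			sum += count;
		}

		/* Pass 2: scatter in input order, keeping the sort stable. */
		for (i = 0; i < n; i++) {
			size_t b = (from[i].offset >> shift) & (BUCKETS - 1);
			to[pos[b]++] = from[i];
		}

		/* The output of this digit becomes the input of the next. */
		swap = from;
		from = to;
		to = swap;
	}

	/* Four digits mean an even number of swaps: the result is in entries. */
	free(tmp);
	free(pos);
	return 0;
}
```

    No comparison function is called at all; each digit costs one counting
    pass and one scatter pass over the list, plus a sweep over the 65536
    buckets.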
    
    On the linux.git repo, with about 3M objects to sort, this
    yields a roughly 4x speedup. Here are the best-of-five numbers for
    running
    
      echo HEAD | git cat-file --batch-check="%(objectsize:disk)"
    
    on a fully packed repository, which is dominated by time
    spent building the pack revindex:
    
              before     after
      real    0m0.834s   0m0.204s
      user    0m0.788s   0m0.164s
      sys     0m0.040s   0m0.036s
    
    This matches our algorithmic expectations. log2(3M) is ~21.5,
    so a traditional sort is ~21.5n. Our radix sort runs in k*n,
    where k is the number of radix digits. In the worst case,
    this is k=4 for a 64-bit off_t, but we can quit early when
    the largest value to be sorted is smaller. For any
    repository under 4G, k=2. Our algorithm makes two passes
    over the list per radix digit, so we end up with 4n. That
    should yield ~5.3x speedup. We see 4x here; the difference
    is probably due to the extra bucket book-keeping the radix
    sort has to do.
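
    The early-quit condition can be sketched as a small helper that counts
    how many 16-bit digits the largest offset actually occupies. The name
    `radix_digits_needed` is hypothetical and `uint64_t` again stands in
    for off_t; this only illustrates the arithmetic, not the patch's own
    loop.

```c
#include <stdint.h>

/* Number of 16-bit digit passes needed to cover values up to max_offset. */
static unsigned radix_digits_needed(uint64_t max_offset)
{
	unsigned digits = 0;

	while (max_offset) {
		digits++;
		max_offset >>= 16;
	}
	return digits;
}
```

    For a largest offset just under 4G (0xffffffff) this gives 2, and it
    only reaches 4 once an offset needs more than 48 bits. A sort that
    stops after an odd number of digits leaves its result in the scratch
    buffer, so it would need a final copy back into the original array.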
    
    On a smaller repo, the difference is less impressive, as
    log(n) is smaller. For git.git, with 173K objects (but still
    k=2), we see a 2.7x improvement:
    
              before     after
      real    0m0.046s   0m0.017s
      user    0m0.036s   0m0.012s
      sys     0m0.008s   0m0.000s
    
    On even tinier repos (e.g., a few hundred objects), the
    speedup goes away entirely, as the small advantage of the
    radix sort gets erased by the book-keeping costs (and at
    those sizes, the cost to generate the rev-index gets
    lost in the noise anyway).
    
    Signed-off-by: Jeff King <peff@peff.net>
    Reviewed-by: Brandon Casey <drafnel@gmail.com>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>