    generalized-allocator: Sketch out data structures and interfaces. · 4d114c0e
    Charlie Jacobsen authored and Vikram Narayanan committed
    I'm introducing two new data structures: a page allocator and
    a resource tree.
    
    page allocator motivation: We need some kind of data structure for
    tracking regions of the guest physical address space. For example,
    you may want to dedicate the portion of physical address space from
    (1 << 20) -- (16 << 20) (from the first megabyte to the sixteenth
    megabyte) for ioremaps. Someone will say: give me 16 pages of
    uncacheable addresses I can use to ioremap this device memory; the
    page allocator will find 16 free pages of guest physical address space.
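    To make the interface concrete, here is a minimal sketch of that
    "give me 16 pages from this region" operation. The names (pa_init,
    pa_alloc_pages) and the naive first-fit body are assumptions for
    illustration only; the planned implementation is a buddy allocator
    (see below), not this linear scan.

    ```c
    /* Hypothetical page-allocator interface sketch; names and the
     * first-fit body are illustrative assumptions, not the buddy plan. */
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 12
    #define REGION_PAGES 4096            /* 16 MB region of 4 KB pages */

    struct page_allocator {
            uint64_t base;               /* guest physical base, e.g. 1 << 20 */
            uint8_t used[REGION_PAGES];  /* 1 = allocated */
    };

    static void pa_init(struct page_allocator *pa, uint64_t base)
    {
            pa->base = base;
            memset(pa->used, 0, sizeof(pa->used));
    }

    /* Find n contiguous free pages; return their guest physical
     * address, or 0 on failure. */
    static uint64_t pa_alloc_pages(struct page_allocator *pa, unsigned n)
    {
            for (unsigned i = 0; i + n <= REGION_PAGES; i++) {
                    unsigned j;
                    for (j = 0; j < n && !pa->used[i + j]; j++)
                            ;
                    if (j == n) {
                            memset(&pa->used[i], 1, n);
                            return pa->base + ((uint64_t)i << PAGE_SHIFT);
                    }
            }
            return 0;
    }
    ```

    A caller managing an ioremap region based at 1 MB would do
    pa_init(&pa, 1 << 20) and then pa_alloc_pages(&pa, 16) to get 16
    pages of guest physical address space.
    
    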
    
    resource tree motivation/what it is: This data structure is an
    interval tree used to resolve a guest physical address to the
    cptr/capability for the memory object that is mapped at that
    address. For example, you may need the cptr for the page that
    backs a certain guest physical address so that you can share or
    free the page (you need the cptr for the page because the microkernel
    interface only uses cptrs).
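    The lookup the resource tree provides is just "gpa in, cptr out."
    A real implementation would be an interval tree; the linear scan and
    all the names below (resource_node, resource_tree_lookup) are
    hypothetical, kept only to show the shape of the interface.

    ```c
    /* Sketch of the gpa -> cptr resolution the resource tree provides.
     * Linear scan stands in for the interval tree; names are assumptions. */
    #include <assert.h>
    #include <stdint.h>

    typedef uint64_t cptr_t;        /* capability pointer (opaque slot id) */

    struct resource_node {
            uint64_t start, last;   /* inclusive gpa interval */
            cptr_t cptr;            /* capability for the backing object */
    };

    /* Resolve a guest physical address to the cptr of the memory object
     * mapped there; returns 0 (null cptr) if nothing is mapped. */
    static cptr_t resource_tree_lookup(const struct resource_node *nodes,
                                       unsigned n, uint64_t gpa)
    {
            for (unsigned i = 0; i < n; i++)
                    if (gpa >= nodes[i].start && gpa <= nodes[i].last)
                            return nodes[i].cptr;
            return 0;
    }
    ```
    
    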
    
    page allocator planned implementation:
    
    I plan to adapt the buddy allocator algorithm from Linux. After
    reviewing the code, I found the algorithm to be simple enough that this
    is realistic. In addition, the allocator will provide a means for
    doing "microkernel page allocs" in a more coarse-grained fashion. For
    example, the page allocator will call out to the microkernel to get
    chunks of 1 MB machine pages, and then allocate from that at page
    (4 KB) granularity. This means fewer VM exits. (Right now, every page
    alloc/free results in a VM exit; the page allocator calls out to the
    microkernel for every page alloc/free; it doesn't try to do
    coarse-grained alloc/frees and then track those bigger chunks.)
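    The coarse-grained scheme above can be sketched as follows: take
    1 MB chunks from the microkernel (each grab costs a VM exit, which is
    simulated here by a counter) and hand out 4 KB pages from the current
    chunk, so only 1 in 256 page allocs actually exits the VM. All names
    are assumptions for illustration.

    ```c
    /* Sketch: amortize VM exits by sub-allocating 4 KB pages out of
     * 1 MB chunks. The microkernel call is simulated by a counter. */
    #include <assert.h>
    #include <stdint.h>

    #define PAGE_SIZE (4 * 1024)
    #define CHUNK_SIZE (1024 * 1024)
    #define PAGES_PER_CHUNK (CHUNK_SIZE / PAGE_SIZE)   /* 256 */

    static unsigned vm_exits;                  /* simulated exit count */
    static uint64_t next_chunk_gpa = 16ULL << 20;

    /* Stand-in for the microkernel call granting a 1 MB machine chunk. */
    static uint64_t microkernel_alloc_chunk(void)
    {
            uint64_t gpa = next_chunk_gpa;
            next_chunk_gpa += CHUNK_SIZE;
            vm_exits++;
            return gpa;
    }

    static uint64_t cur_chunk;
    static unsigned pages_used = PAGES_PER_CHUNK;  /* force initial grab */

    static uint64_t page_alloc(void)
    {
            if (pages_used == PAGES_PER_CHUNK) {
                    cur_chunk = microkernel_alloc_chunk();
                    pages_used = 0;
            }
            return cur_chunk + (uint64_t)pages_used++ * PAGE_SIZE;
    }

    /* Allocate n pages and report the running VM-exit count. */
    static unsigned alloc_pages_count_exits(unsigned n)
    {
            for (unsigned i = 0; i < n; i++)
                    page_alloc();
            return vm_exits;
    }
    ```

    With this, 256 page allocs cost one VM exit instead of 256; the 257th
    alloc triggers the second exit.
    
    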
    
    I also plan to allow the page allocator to "embed" its metadata in the
    address space that it is managing, to cover some heap bootstrap issues.
    (This embedding won't work for some use cases, like tracking uncacheable
    memory - we wouldn't want to embed the RAM that contains the page
    allocator metadata inside the address space region and make it
    uncacheable.)
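    The bootstrap arithmetic for embedding is simple: carve the first few
    pages of the region out for the per-page metadata, and start
    allocating after them. The helper names and the 8-bytes-per-page
    figure below are illustrative assumptions.

    ```c
    /* Sketch of embedding allocator metadata in the managed region:
     * reserve the leading pages for metadata, allocate after them.
     * Names and sizes are illustrative assumptions. */
    #include <assert.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* Pages of metadata needed to describe region_pages pages, given
     * bytes_per_page of metadata for each page. */
    static unsigned metadata_pages(unsigned region_pages,
                                   unsigned bytes_per_page)
    {
            uint64_t bytes = (uint64_t)region_pages * bytes_per_page;
            return (unsigned)((bytes + PAGE_SIZE - 1) / PAGE_SIZE);
    }

    /* First allocatable gpa once metadata is embedded at the region base. */
    static uint64_t first_free_gpa(uint64_t base, unsigned region_pages,
                                   unsigned bytes_per_page)
    {
            return base + (uint64_t)metadata_pages(region_pages,
                                                   bytes_per_page) * PAGE_SIZE;
    }
    ```

    For a 16 MB region (4096 pages) at 8 bytes of metadata per page, the
    embedded metadata occupies 8 pages at the front of the region.
    
    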
    
    Finally, I plan to allow the page allocator to use varying granularity
    for "microkernel allocs" (if applicable) and allocs for higher levels
    (e.g., page allocator allocates 4 MB chunks from the microkernel, but allows
    higher level code inside an LCD to alloc at 4 KB granularity).
    
    The page allocator data structure (there can be multiple instances)
    will be used exclusively inside an LCD for guest physical address
    space management.
    
    resource tree planned implementation:
    
    I plan to re-use the interval tree data structure in Linux. Google
    developed a nice API. (It replaced the priority tree that was once
    used in the vma code.)
    
    The resource tree will be used in isolated and non-isolated environments
    (physical address -> cptr translation is needed in both).
    
    Some alternatives/discussion:
    
    I could use a simpler bitmap first-fit algorithm for page allocation,
    but it is slow (it is what we use now). I wondered whether the majority
    of page allocs would be on the control path, and whether we could
    tolerate this inefficiency (all data path operations would involve just
    ring buffer operations on shared memory that is set up beforehand). But
    I suspect this won't be the case. There could be some slab allocations that
    happen on the data path for internal data; if the slab needs to shrink
    or grow, this may trigger a call into the page allocator, which could
    be slow (if it triggered a VM exit, a call on the data path could be
    bloated to 2000 cycles). Maybe this is not true and my concerns are
    unfounded.
    
    It may also seem silly to have multiple page allocator instances inside
    an LCD; why not just one allocator that manages the entire address
    space? First, some of the dynamic regions are independent of each other:
    The heap region and the ioremap region are for different purposes; having
    a single allocator for both regions might be complex and error prone.
    Second, you wouldn't want one allocator to track the entire
    address space, since it is huge (the amount of allocator metadata could
    be enormous, depending on the design). My intent is to abstract over
    common needs from all regions (tracking free guest physical address space)
    and provide some hooks for specific cases.
    
    An alternative to the resource tree is to just use a giant array of
    cptrs, one slot for each page (this is what we do now for the heap
    inside an LCD). You would then translate a physical
    address to an offset into the array to get the cptr for the resource
    that contains that physical address (e.g., a page). There are a couple
    of issues with this: First, the array could be huge (for a region of
    16 GB at 4 KB granularity, the array would be 32 MB). Second, even
    if this is tolerable inside an LCD, the non-isolated code needs the
    same functionality (address -> cptr translation), and setting up a
    giant array for the entire host physical address space is obviously
    dumb (and would need to be done *per thread* since each thread uses
    its own cspace). A tree handles the sparsity a lot better.
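    For comparison, here is the array alternative, with the sizing math
    that rules it out for large or sparse spaces. The names
    (gpa_to_cptr, array_bytes) are hypothetical.

    ```c
    /* Sketch of the giant-array alternative: one cptr slot per page,
     * indexed by (gpa - base) >> PAGE_SHIFT. Names are hypothetical. */
    #include <assert.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define HEAP_PAGES 1024         /* small demo region */

    typedef uint64_t cptr_t;

    static cptr_t cptr_array[HEAP_PAGES];

    /* O(1) translation, but the array scales with region size. */
    static cptr_t gpa_to_cptr(uint64_t base, uint64_t gpa)
    {
            return cptr_array[(gpa - base) >> PAGE_SHIFT];
    }

    /* The sizing that makes this impractical for big regions:
     * 16 GB of 4 KB pages -> 4M slots * 8 bytes = 32 MB of array. */
    static uint64_t array_bytes(uint64_t region_bytes)
    {
            return (region_bytes >> PAGE_SHIFT) * sizeof(cptr_t);
    }
    ```
    
    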
    
    Finally, it is worth considering how KVM handles backing guest physical
    memory with real host/machine pages. I believe they use some
    sophisticated demand paging triggered by EPT faults, and all of this
    is hooked into the host's page cache. This seems too scary and
    complex for our humble microkernel (that we want to keep simple).
    
    I hope you enjoyed this giant commit message.