I'm introducing two new data structures: a page allocator and a resource tree.

Page allocator motivation: We need some kind of data structure for tracking regions of the guest physical address space. For example, you may want to dedicate the portion of physical address space from (1 << 20) to (16 << 20) (from the first megabyte to the sixteenth megabyte) to ioremaps. Someone will say: give me 16 pages of uncacheable addresses I can use to ioremap this device memory; the page allocator will find 16 free pages of guest physical address space.

Resource tree motivation/what it is: This data structure is an interval tree used to resolve a guest physical address to the cptr/capability for the memory object that is mapped at that address. For example, you may need the cptr for the page that backs a certain guest physical address so that you can share or free the page (you need the cptr because the microkernel interface only deals in cptrs).

Page allocator planned implementation: I plan to adapt the buddy allocator algorithm from Linux. After reviewing the code, I found the algorithm to be simple enough that this is realistic. In addition, the allocator will provide a means for doing "microkernel page allocs" in a more coarse-grained fashion. For example, the page allocator will call out to the microkernel to get 1 MB chunks of machine pages, and then allocate from those at page (4 KB) granularity. This means fewer VM exits. (Right now, every page alloc/free results in a VM exit; the page allocator calls out to the microkernel for every page alloc/free; it doesn't try to do coarse-grained allocs/frees and then track those bigger chunks.) I also plan to allow the page allocator to "embed" its metadata in the address space it is managing, to cover some heap bootstrap issues.
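To make the coarse-grained scheme concrete, here is a rough userspace sketch of the idea: pay for one microkernel call (VM exit) per 1 MB chunk, then satisfy 4 KB page allocs/frees locally. All names here (struct chunk, mk_alloc_chunk, etc.) are illustrative stand-ins, not the actual LCD/microkernel interface, and the per-chunk tracking is a trivial byte map rather than the planned buddy algorithm.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT      12                      /* 4 KB pages               */
#define CHUNK_ORDER     8                       /* 2^8 pages = 1 MB chunk   */
#define PAGES_PER_CHUNK (1u << CHUNK_ORDER)

struct chunk {
	uint64_t base_gpa;                      /* guest physical base      */
	uint8_t  used[PAGES_PER_CHUNK];         /* trivial per-page map     */
	unsigned free_pages;
};

/* Stand-in for the expensive microkernel call: one VM exit per *chunk*,
 * not per page. Hands out chunks from a pretend region starting at 1 MB. */
static uint64_t next_chunk_gpa = 1u << 20;
static int mk_alloc_chunk(struct chunk *c)
{
	c->base_gpa = next_chunk_gpa;
	next_chunk_gpa += (uint64_t)PAGES_PER_CHUNK << PAGE_SHIFT;
	memset(c->used, 0, sizeof(c->used));
	c->free_pages = PAGES_PER_CHUNK;
	return 0;
}

/* Cheap local page alloc: first fit within the chunk, no VM exit.
 * Returns 0 when the chunk is exhausted (caller would grab another chunk). */
static uint64_t chunk_alloc_page(struct chunk *c)
{
	for (unsigned i = 0; i < PAGES_PER_CHUNK; i++) {
		if (!c->used[i]) {
			c->used[i] = 1;
			c->free_pages--;
			return c->base_gpa + ((uint64_t)i << PAGE_SHIFT);
		}
	}
	return 0;
}

/* Cheap local page free, also no VM exit. */
static void chunk_free_page(struct chunk *c, uint64_t gpa)
{
	unsigned i = (gpa - c->base_gpa) >> PAGE_SHIFT;
	c->used[i] = 0;
	c->free_pages++;
}
```

The point is only the cost split: mk_alloc_chunk is the one call that exits to the microkernel, while chunk_alloc_page/chunk_free_page touch local metadata only.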
(This embedding won't work for some use cases, like tracking uncacheable memory - we wouldn't want to embed the RAM that contains the page allocator metadata inside the address space region and make it uncacheable.) Finally, I plan to allow the page allocator to use varying granularity for "microkernel allocs" (if applicable) and for allocs from higher levels (e.g., the page allocator allocates 4 MB chunks from the microkernel, but allows higher-level code inside an LCD to alloc at 4 KB granularity). The page allocator data structure (there can be multiple instances) will be used exclusively inside an LCD for guest physical address space management.

Resource tree planned implementation: I plan to re-use the interval tree data structure in Linux. Google developed a nice API. (It replaced the priority tree that was once used in the vma code.) The resource tree will be used in both isolated and non-isolated environments (physical address -> cptr translation is needed in both).

Some alternatives/discussion: I could use an easier bitmap first-fit algorithm for page allocation, but it is slow (this is what we use now). I wondered whether the majority of page allocs would be on the control path, so that we could tolerate this inefficiency (all data path operations would involve just ring buffer operations on shared memory set up beforehand). But I suspect this won't be the case. There could be some slab allocations on the data path for internal data; if the slab needs to shrink or grow, this may trigger a call into the page allocator, which could be slow (if it triggered a VM exit, a call on the data path could be bloated to 2000 cycles). Maybe this is not true and my concerns are unfounded.

It may also seem silly to have multiple page allocator instances inside an LCD; why not just one allocator that manages the entire address space?
First, some of the dynamic regions are independent of each other: the heap region and the ioremap region serve different purposes, and a single allocator covering both might be complex and error prone. Second, you wouldn't want one allocator to track the entire address space, since it's huge (the amount of allocator metadata could be enormous, depending on the design). My intent is to abstract over the common need of all regions (tracking free guest physical address space) and provide some hooks for the specific cases.

An alternative to the resource tree is to just use a giant array of cptrs, one slot for each page (this is what we do now for the heap inside an LCD). You would then translate a physical address to an offset into the array to get the cptr for the resource that contains that physical address (e.g., a page). There are a couple of issues with this. First, the array could be huge (for a 16 GB region at 4 KB granularity, the array would be 32 MB). Second, even if this is tolerable inside an LCD, the non-isolated code needs the same functionality (address -> cptr translation), and setting up a giant array for the entire host physical address space is obviously dumb (and it would need to be done *per thread*, since each thread uses its own cspace). A tree handles the sparsity a lot better.

Finally, it is worth considering how KVM handles backing guest physical memory with real host/machine pages. I believe it uses some sophisticated demand paging triggered by EPT faults, all hooked into the host's page cache. This seems too scary and complex for our humble microkernel (which we want to keep simple).

I hope you enjoyed this giant commit message.
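P.S. For the curious, the address -> cptr lookup the resource tree provides looks roughly like this. The real implementation would sit on Linux's interval tree API (interval_tree_insert()/interval_tree_iter_first()); this sketch uses a plain linked list as a stand-in so the semantics are visible without the rbtree plumbing, and cptr_t plus the node layout are illustrative, not the actual types.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t cptr_t;                /* capability pointer (opaque)       */

struct resource_node {
	uint64_t start, last;           /* closed interval [start, last] of  */
	                                /* guest physical addresses          */
	cptr_t   cptr;                  /* capability for the backing object */
	struct resource_node *next;
};

/* Record that the object named by n->cptr is mapped at [start, last].
 * (No overlap checking in this sketch; the interval tree would give us
 * O(log n) insertion and stabbing queries.) */
static void resource_insert(struct resource_node **head,
			    struct resource_node *n)
{
	n->next = *head;
	*head = n;
}

/* Resolve a guest physical address to the cptr for the object mapped
 * there; returns 0 on a miss. This list walk is O(n); the tree handles a
 * large, sparse address space without the giant per-page array. */
static cptr_t resource_lookup(struct resource_node *head, uint64_t gpa)
{
	for (struct resource_node *n = head; n; n = n->next)
		if (gpa >= n->start && gpa <= n->last)
			return n->cptr;
	return 0;
}
```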
4d114c0e