I'm introducing two new data structures: a page allocator and a resource tree.

Page allocator motivation: We need some kind of data structure for tracking regions of the guest physical address space. For example, you may want to dedicate the portion of physical address space from (1 << 20) to (16 << 20) (from the first megabyte to the sixteenth megabyte) to ioremaps. Someone will say: give me 16 pages of uncacheable addresses I can use to ioremap this device memory; the page allocator will find 16 free pages of guest physical address space.

Resource tree motivation/what it is: This data structure is an interval tree used to resolve a guest physical address to the cptr/capability for the memory object that is mapped at that address. For example, you may need the cptr for the page that backs a certain guest physical address so that you can share or free the page (you need the cptr because the microkernel interface only deals in cptrs).

Page allocator planned implementation: I plan to adapt the buddy allocator algorithm from Linux. After reviewing the code, I found the algorithm to be simple enough that this is realistic. In addition, the allocator will provide a means for doing "microkernel page allocs" in a more coarse-grained fashion. For example, the page allocator will call out to the microkernel to get 1 MB chunks of machine pages, and then allocate from those at page (4 KB) granularity. This means fewer VM exits. (Right now, every page alloc/free results in a VM exit; the page allocator calls out to the microkernel for every page alloc/free; it doesn't try to do coarse-grained allocs/frees and then track those bigger chunks.) I also plan to allow the page allocator to "embed" its metadata in the address space it is managing, to cover some heap bootstrap issues.
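To make the coarse-grained scheme concrete, here is a rough userspace sketch of the idea: pay for one microkernel call (VM exit) per 1 MB chunk, then satisfy 4 KB page allocs/frees locally. All names here (struct chunk, mk_alloc_chunk, etc.) are illustrative stand-ins, not the actual LCD/microkernel interface, and the per-chunk tracking is a trivial byte map rather than the planned buddy algorithm.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT      12                      /* 4 KB pages               */
#define CHUNK_ORDER     8                       /* 2^8 pages = 1 MB chunk   */
#define PAGES_PER_CHUNK (1u << CHUNK_ORDER)

struct chunk {
	uint64_t base_gpa;                      /* guest physical base      */
	uint8_t  used[PAGES_PER_CHUNK];         /* trivial per-page map     */
	unsigned free_pages;
};

/* Stand-in for the expensive microkernel call: one VM exit per *chunk*,
 * not per page. Hands out chunks from a pretend region starting at 1 MB. */
static uint64_t next_chunk_gpa = 1u << 20;
static int mk_alloc_chunk(struct chunk *c)
{
	c->base_gpa = next_chunk_gpa;
	next_chunk_gpa += (uint64_t)PAGES_PER_CHUNK << PAGE_SHIFT;
	memset(c->used, 0, sizeof(c->used));
	c->free_pages = PAGES_PER_CHUNK;
	return 0;
}

/* Cheap local page alloc: first fit within the chunk, no VM exit.
 * Returns 0 when the chunk is exhausted (caller would grab another chunk). */
static uint64_t chunk_alloc_page(struct chunk *c)
{
	for (unsigned i = 0; i < PAGES_PER_CHUNK; i++) {
		if (!c->used[i]) {
			c->used[i] = 1;
			c->free_pages--;
			return c->base_gpa + ((uint64_t)i << PAGE_SHIFT);
		}
	}
	return 0;
}

/* Cheap local page free, also no VM exit. */
static void chunk_free_page(struct chunk *c, uint64_t gpa)
{
	unsigned i = (gpa - c->base_gpa) >> PAGE_SHIFT;
	c->used[i] = 0;
	c->free_pages++;
}
```

The point is only the cost split: mk_alloc_chunk is the one call that exits to the microkernel, while chunk_alloc_page/chunk_free_page touch local metadata only.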
(This embedding won't work for some use cases, like tracking uncacheable memory - we wouldn't want to embed the RAM that contains the page allocator metadata inside the address space region and make it uncacheable.) Finally, I plan to allow the page allocator to use varying granularity for "microkernel allocs" (if applicable) and for allocs from higher levels (e.g., the page allocator allocates 4 MB chunks from the microkernel, but allows higher-level code inside an LCD to alloc at 4 KB granularity). The page allocator data structure (there can be multiple instances) will be used exclusively inside an LCD for guest physical address space management.

Resource tree planned implementation: I plan to re-use the interval tree data structure in Linux. Google developed a nice API. (It replaced the priority tree that was once used in the vma code.) The resource tree will be used in both isolated and non-isolated environments (physical address -> cptr translation is needed in both).

Some alternatives/discussion: I could use an easier bitmap first-fit algorithm for page allocation, but it is slow (this is what we use now). I wondered whether the majority of page allocs would be on the control path, so that we could tolerate this inefficiency (all data path operations would involve just ring buffer operations on shared memory set up beforehand). But I suspect this won't be the case. There could be some slab allocations on the data path for internal data; if the slab needs to shrink or grow, this may trigger a call into the page allocator, which could be slow (if it triggered a VM exit, a call on the data path could be bloated to 2000 cycles). Maybe this is not true and my concerns are unfounded.

It may also seem silly to have multiple page allocator instances inside an LCD; why not just one allocator that manages the entire address space?
First, some of the dynamic regions are independent of each other: the heap region and the ioremap region serve different purposes, and a single allocator covering both might be complex and error prone. Second, you wouldn't want one allocator to track the entire address space, since it's huge (the amount of allocator metadata could be enormous, depending on the design). My intent is to abstract over the common need of all regions (tracking free guest physical address space) and provide some hooks for the specific cases.

An alternative to the resource tree is to just use a giant array of cptrs, one slot for each page (this is what we do now for the heap inside an LCD). You would then translate a physical address to an offset into the array to get the cptr for the resource that contains that physical address (e.g., a page). There are a couple of issues with this. First, the array could be huge (for a 16 GB region at 4 KB granularity, the array would be 32 MB). Second, even if this is tolerable inside an LCD, the non-isolated code needs the same functionality (address -> cptr translation), and setting up a giant array for the entire host physical address space is obviously dumb (and it would need to be done *per thread*, since each thread uses its own cspace). A tree handles the sparsity a lot better.

Finally, it is worth considering how KVM handles backing guest physical memory with real host/machine pages. I believe it uses some sophisticated demand paging triggered by EPT faults, all hooked into the host's page cache. This seems too scary and complex for our humble microkernel (which we want to keep simple).

I hope you enjoyed this giant commit message.
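P.S. For the curious, the address -> cptr lookup the resource tree provides looks roughly like this. The real implementation would sit on Linux's interval tree API (interval_tree_insert()/interval_tree_iter_first()); this sketch uses a plain linked list as a stand-in so the semantics are visible without the rbtree plumbing, and cptr_t plus the node layout are illustrative, not the actual types.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t cptr_t;                /* capability pointer (opaque)       */

struct resource_node {
	uint64_t start, last;           /* closed interval [start, last] of  */
	                                /* guest physical addresses          */
	cptr_t   cptr;                  /* capability for the backing object */
	struct resource_node *next;
};

/* Record that the object named by n->cptr is mapped at [start, last].
 * (No overlap checking in this sketch; the interval tree would give us
 * O(log n) insertion and stabbing queries.) */
static void resource_insert(struct resource_node **head,
			    struct resource_node *n)
{
	n->next = *head;
	*head = n;
}

/* Resolve a guest physical address to the cptr for the object mapped
 * there; returns 0 on a miss. This list walk is O(n); the tree handles a
 * large, sparse address space without the giant per-page array. */
static cptr_t resource_lookup(struct resource_node *head, uint64_t gpa)
{
	for (struct resource_node *n = head; n; n = n->next)
		if (gpa >= n->start && gpa <= n->last)
			return n->cptr;
	return 0;
}
```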
4d114c0e