Commit 4d114c0e authored by Charlie Jacobsen, committed by Vikram Narayanan

generalized-allocator: Sketch out data structures and interfaces.

I'm introducing two new data structures: a page allocator and
a resource tree.

page allocator motivation: We need some kind of data structure for
tracking regions of the guest physical address space. For example,
you may want to dedicate the portion of physical address space from
(1 << 20) -- (16 << 20) (from the first megabyte to the sixteenth
megabyte) for ioremap's. Someone will say: give me 16 pages of
uncacheable addresses I can use to ioremap this device memory; the
page allocator will find 16 free pages of guest physical address.

resource tree motivation/what it is: This data structure is an
interval tree used to resolve a guest physical address to the
cptr/capability for the memory object that is mapped at that
address. For example, you may need the cptr for the page that
backs a certain guest physical address so that you can share or
free the page (you need the cptr for the page because the microkernel
interface only uses cptr's).
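
To illustrate the lookup the resource tree is meant to provide, here is a
minimal sketch that resolves an address to a cptr. The names `demo_cptr_t`,
`demo_resource`, and `demo_resource_lookup` are hypothetical, and the linear
scan is only a stand-in for the interval tree; it shows the shape of the
query, not the planned implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real cptr_t from libcap. */
typedef struct { unsigned long cptr; } demo_cptr_t;

/* One mapped memory object covering [start, last], both inclusive. */
struct demo_resource {
	unsigned long start;
	unsigned long last;
	demo_cptr_t cptr;
};

/*
 * Return the resource that contains addr, or NULL if nothing is mapped
 * there. This is an O(n) scan; an interval tree would answer the same
 * query in O(log n).
 */
static const struct demo_resource *
demo_resource_lookup(const struct demo_resource *tbl, size_t n,
		unsigned long addr)
{
	size_t i;

	assert(tbl != NULL);
	for (i = 0; i < n; i++) {
		if (addr >= tbl[i].start && addr <= tbl[i].last)
			return &tbl[i];
	}
	return NULL;
}
```

A caller that wants to share or free the page backing an address would look
up the resource first and then hand the returned cptr to the microkernel.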

page allocator planned implementation:

I plan to adapt the buddy allocator algorithm from Linux. After
reviewing the code, I found the algorithm to be simple enough that this
is realistic. In addition, the allocator will provide a means for
doing "microkernel page allocs" in a more coarse-grained fashion. For
example, the page allocator will call out to the microkernel to get
chunks of 1 MB machine pages, and then allocate from that at page
(4 KB) granularity. This means fewer VM exits. (Right now, every page
alloc/free results in a VM exit: the current page allocator calls out to
the microkernel for every page alloc/free; it doesn't try to do
coarse-grained alloc/frees and then track those bigger chunks.)
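
Two small calculations sit behind this paragraph, sketched below under
stated assumptions. `demo_buddy_index` is the usual buddy-system index
trick (flip the bit that corresponds to the block's order), and
`demo_nr_backing_calls` counts how many calls to the microkernel are needed
when backing memory is fetched in chunks of 2^chunk_order pages. Both names
are hypothetical and nothing here is taken from the Linux code.

```c
#include <assert.h>

/*
 * Page index of the buddy of the 2^order-page block that starts at page
 * index idx. The block must be aligned to its own size.
 */
static unsigned long demo_buddy_index(unsigned long idx, unsigned int order)
{
	assert((idx & ((1UL << order) - 1)) == 0);
	return idx ^ (1UL << order);
}

/*
 * Number of calls out to the microkernel needed to back nr_pages pages
 * when each call fetches 2^chunk_order pages (rounded up).
 */
static unsigned long demo_nr_backing_calls(unsigned long nr_pages,
		unsigned int chunk_order)
{
	unsigned long chunk_pages = 1UL << chunk_order;

	assert(nr_pages > 0);
	return (nr_pages + chunk_pages - 1) / chunk_pages;
}
```

With 4 KB pages, a 1 MB chunk is 2^8 pages, so backing 1000 pages takes
1000 calls at page granularity (chunk_order 0) but only 4 calls at 1 MB
granularity (chunk_order 8).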

I also plan to allow the page allocator to "embed" its metadata in the
address space that it is managing, to cover some heap bootstrap issues.
(This embedding won't work for some use cases, like tracking uncacheable
memory - we wouldn't want to embed the RAM that contains the page
allocator metadata inside the address space region and make it
uncacheable.)

Finally, I plan to allow the page allocator to use varying granularity
for "microkernel allocs" (if applicable) and allocs for higher levels
(e.g., page allocator allocates 4 MB chunks from the microkernel, but allows
higher level code inside an LCD to alloc at 4 KB granularity).
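
The relationship between the two granularities can be sketched as follows,
assuming 4 KB pages. `demo_size_to_order` is a hypothetical helper, similar
in spirit to Linux's `get_order()`, and `demo_allocs_per_backing_chunk` is
likewise only an illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL /* assumed 4 KB pages */

/* Smallest order such that 2^order pages hold at least `bytes` bytes. */
static unsigned int demo_size_to_order(uint64_t bytes)
{
	unsigned int order = 0;

	assert(bytes > 0 && bytes <= (1ULL << 62));
	while ((DEMO_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

/* How many minimum-order allocations fit in one backing chunk. */
static uint64_t demo_allocs_per_backing_chunk(unsigned int backing_order,
		unsigned int min_order)
{
	assert(backing_order >= min_order);
	return 1ULL << (backing_order - min_order);
}
```

For the example in the text, a 4 MB chunk is order 10 and a 4 KB page is
order 0, so one call to the microkernel can serve 1024 page-sized allocs.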

The page allocator data structure (there can be multiple instances)
will be used exclusively inside an LCD for guest physical address
space management.

resource tree planned implementation:

I plan to re-use the interval tree data structure in Linux. Google
developed a nice API. (It replaced the priority tree that was once
used in the vma code.)

The resource tree will be used in isolated and non-isolated environments
(physical address -> cptr translation is needed in both).

Some alternatives/discussion:

I could use an easier bitmap first-fit algorithm for page allocation,
but this is slow (this is what we use now). I wondered whether the majority
of page allocs will be on the control path, so that we could tolerate
this inefficiency (all data path operations would involve just ring
buffer operations on shared memory that is set up beforehand). But
I suspect this won't be the case. There could be some slab allocations that
happen on the data path for internal data; if the slab needs to shrink
or grow, this may trigger a call into the page allocator, which could
be slow (if it triggered a VM exit, a call on the data path could be
bloated to 2000 cycles). Maybe this is not true and my concerns are
unfounded.
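
For comparison, a first-fit allocator looks roughly like the sketch below.
It is not the allocator currently in the tree: the names are hypothetical
and it uses one `bool` per page instead of a packed bitmap to keep the code
short. The point is that every allocation scans from the start of the
region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_PAGES 64

struct demo_bitmap_alloc {
	bool used[DEMO_NR_PAGES]; /* true = page is allocated */
};

/*
 * Find the first run of nr contiguous free pages, mark it used, and
 * return the index of its first page. Returns -1 if no run fits.
 */
static long demo_first_fit(struct demo_bitmap_alloc *a, size_t nr)
{
	size_t i, j, run = 0;

	assert(a != NULL && nr > 0);
	for (i = 0; i < DEMO_NR_PAGES; i++) {
		run = a->used[i] ? 0 : run + 1;
		if (run == nr) {
			size_t start = i + 1 - nr;

			for (j = start; j <= i; j++)
				a->used[j] = true;
			return (long)start;
		}
	}
	return -1;
}
```

The scan is linear in the size of the region, whereas a buddy allocator only
has to look at one free list per order.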

It may also seem silly to have multiple page allocator instances inside
an LCD; why not just one allocator that manages the entire address
space? First, some of the dynamic regions are independent of each other:
The heap region and the ioremap region are for different purposes; having
a single allocator for both regions might be complex and error prone.
Second, you wouldn't want one allocator to track the entire
address space since it's huge (the amount of allocator metadata could
be enormous, depending on the design). My intent is to abstract over
common needs from all regions (tracking free guest physical address space)
and provide some hooks for specific cases.

An alternative to the resource tree is to just use a giant array of
cptr's, one slot for each page (this is what we do now for the heap
inside an LCD). You would then translate a physical
address to an offset into the array to get the cptr for the resource
that contains that physical address (e.g. a page). There are a couple of
issues with this: First, the array could be huge (for a region of
16 GBs at 4 KB granularity, the array would be 32 MBs: 4M slots at
8 bytes each). Second, even
if this is tolerable inside an LCD, the non-isolated code needs the
same functionality (address -> cptr translation), and setting up a
giant array for the entire host physical address space is obviously
dumb (and would need to be done *per thread* since each thread uses
its own cspace). A tree handles the sparsity a lot better.
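
The array-size estimate can be checked with a few lines of arithmetic. The
sketch below assumes each cptr occupies 8 bytes, which is what the 32 MB
figure implies; `demo_cptr_array_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size in bytes of a flat array with one cptr slot per page of a region.
 * cptr_bytes is an assumption about the size of one slot.
 */
static uint64_t demo_cptr_array_bytes(uint64_t region_bytes,
		uint64_t page_bytes, uint64_t cptr_bytes)
{
	assert(page_bytes > 0 && region_bytes % page_bytes == 0);
	return (region_bytes / page_bytes) * cptr_bytes;
}
```

A 16 GB region at 4 KB granularity has 2^22 slots, so with 8-byte slots the
array is 2^25 bytes, or 32 MB, and that cost would be paid per thread in the
non-isolated case.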

Finally, it is worth considering how KVM handles backing guest physical
memory with real host/machine pages. I believe they use some
sophisticated demand paging triggered by EPT faults, and all of this
is hooked into the host's page cache. This seems too scary and
complex for our humble microkernel (that we want to keep simple).

I hope you enjoyed this giant commit message.
parent 0ca33457
/*
* allocator.h
*
* Code for creating instances of general buddy
* allocators. Allocator metadata embedded inside memory
* region (but not in the alloc'd chunks).
*
* Can't fault pages in on demand, because this would be an EPT violation.
* (Maybe can have LCD handle EPT fault as virtualization exception,
* but that's for another day ... and that may not be efficient for
* a general allocator? would also need background thingamajig for
* shrinking faulted in mem that's not being used)
*/
#include <linux/page.h>
#include <linux/list.h>
/* CAN'T ALLOC IN MEM REGION? - WHAT IF REGION IS UNCACHEABLE? */
/* Make it optional / possible to insert into mem region during init.
* -- initializer calls out to get memory for metadata
* -- callee allocs mem and maps it; maybe return value signals
* to insert into itself - takes mem, zeros it out, init resource
* tree, insert mem into tree; what about when allocator is freed?
* - during destroy,
* -- problem - alloc tree nodes during heap init . */
struct lcd_mem_chunk {
	struct list_head buddy_list;
	unsigned int order;
	struct lcd_resource_node *n;
};
struct lcd_free_lists {
	unsigned int min_order, notify_order, max_order;
	struct lcd_mem_chunk *free_lists;
	unsigned int nr_free;
	/*
	 * Called when trying to alloc mc on a 2^(backing order) boundary
	 * and n is null (e.g. not backed by memory). This function
	 * can be null.
	 */
	int (*alloc_notify)(struct lcd_free_lists *fl,
			struct lcd_mem_chunk *mc);
	/*
	 * Called when freeing mc, n is non-null, and mc's order is >=
	 * n's order. This gives the user a chance to e.g. free the mem
	 * that was backing this chunk of memory.
	 */
	void (*free_notify)(struct lcd_free_lists *fl,
			struct lcd_mem_chunk *mc);
};
struct lcd_page_allocator {
	struct lcd_free_lists fl;
};
struct allocator_callbacks {
	/* alloc notify */
	/* free notify */
	/* init alloc */
	/* exit dealloc */
};
int mk_page_allocator(
	unsigned long nr_pages_order,
	unsigned int min_order,
	unsigned int backing_order,
	unsigned int max_order,
	const struct allocator_callbacks *cbs);
/* Alloc 2^order mem chunks */
struct lcd_mem_chunk *lcd_page_allocator_alloc(struct lcd_page_allocator *a,
					unsigned int order);
void lcd_page_allocator_free(struct lcd_page_allocator *a,
			struct lcd_mem_chunk *base,
			unsigned int order);
void lcd_page_allocator_destroy(struct lcd_page_allocator *a);
unsigned long to_offset(struct lcd_mem_chunk *c);
struct lcd_mem_chunk *to_chunk(unsigned long offset);
/*
* chicken and egg - tree root and node(s) before allocator is set up; provide
* special slab user can alloc tree root and nodes from.
*
* what about tear down? user may malloc later nodes, so will use kfree; and
* as user is tearing down tree, what if they delete the cap to pages that
* contain nodes in the tree? ans: need to use flags on nodes to indicate
* how they were allocated; if node is flagged as e.g. "embedded", don't use
* kfree, only way to really "free" it is to destroy the page allocator
* and return pages to microkernel.
*/
/* Usage examples */
/* heap setup: nr pages order = 24 (16 mb's); min order = 1,
* backing order = 10, max order = 11
*
* init alloc: allocator says give me 2^10 pages of mem;
* 2^10 = 1 MB which is enough to hold meta data with
* some left over
*
* allocator then writes metadata to pages, then
* calls alloc on itself for required pages to hold
* metadata
*
* after allocator init'd, heap alloc's enough mem for struct page array
* for all 16 MBs, zero's it out.
*
* alloc notify: allocator calls when tries to alloc page on
* 2^(notify order) boundary; heap will alloc
* 2^(notify order) pages from mk, put in resource
* tree, and store in mem chunk
*
* free notify: free them
*
*
* lcd_page_alloc --> lcd_page_allocator_alloc, then heap does
* to_offset on chunk, then idx's into struct page array
* note that size of chunks must be 1
*
* lcd_page_free --> get page idx, then calc offset into heap, then
* use to_chunk to get lcd_mem_chunk; then lcd_page_allocator_free,
* passing along order
*
* virt_to_head_page --> virt = phys; phys - heap base = offset;
* offset >> PAGE_SHIFT; then idx into page array
*
* page_address --> (page - page base) * 4096 = offset;
* phys = heap base + offset; virt = phys
*/
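
The `virt_to_head_page` and `page_address` conversions described in the
comment above can be sketched as below. The `demo_` names and the
`demo_heap` struct are hypothetical, and the sketch assumes 4 KB pages and
an identity virtual-to-physical mapping, as the comment does.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12 /* assumed 4 KB pages */

struct demo_page { int unused; }; /* stand-in for struct page */

struct demo_heap {
	unsigned long base;      /* guest physical base of the heap */
	struct demo_page *pages; /* one entry per heap page */
	unsigned long nr_pages;
};

/* virt = phys; phys - heap base = offset; offset >> PAGE_SHIFT = index. */
static struct demo_page *demo_virt_to_page(const struct demo_heap *h,
		unsigned long virt)
{
	unsigned long idx;

	assert(virt >= h->base);
	idx = (virt - h->base) >> DEMO_PAGE_SHIFT;
	assert(idx < h->nr_pages);
	return &h->pages[idx];
}

/* (page - page base) * page size = offset; phys = heap base + offset. */
static unsigned long demo_page_address(const struct demo_heap *h,
		const struct demo_page *p)
{
	unsigned long idx = (unsigned long)(p - h->pages);

	assert(idx < h->nr_pages);
	return h->base + (idx << DEMO_PAGE_SHIFT);
}
```

The two functions are inverses on page-aligned addresses, which is what lets
the heap move between a `struct page` and the address it describes without
any lookup structure.
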
/*
* ioremap: nr pages order = 34 (16 gb's); min order = 10 (1 MB),
* backing order = 12 (don't alloc notify), max order = 16 (64 MBs)
*
* (I don't know now that ioremap needs to be this big ...)
*
* alloc_notify = null
*
* ioremap_phys --> lcd_page_allocator_alloc( size to fit map )
* then insert resource into tree, attach node to lcd_mem_chunk
* that is returned
*
* free_notify = ioremapper removes node from tree
*
* iounmap_phys: phys addr - ioremap base = offset; to_chunk(offset) = c;
* lcd_page_allocator_free(c); fires free_notify, we get tree
* lookup fast and free (of course, we could have done the removal
* ourselves after looking up c)
*/
/*
* resource_tree.h
*
* Data structure for doing address -> cptr translation.
*
* We re-use interval trees from Linux.
*
* Copyright: University of Utah
*/
#include <libcap.h>
#include <linux/interval_tree.h>
struct lcd_resource_node {
	struct interval_tree_node resource_tree_node;
	unsigned long nr_pages_order;
	cptr_t cptr;
};
struct lcd_resource_tree {
	struct rb_root root;
};
/* what about when setting up heap? can't kmalloc data */
int lcd_resource_tree_alloc(struct lcd_resource_tree **t_out);
int lcd_resource_tree_init(struct lcd_resource_tree *t);
int lcd_resource_tree_destroy(struct lcd_resource_tree *t);
int lcd_resource_tree_free(struct lcd_resource_tree *t);
int lcd_resource_tree_insert(struct lcd_resource_tree *t,
			struct lcd_resource_node *n);
int lcd_resource_tree_search(struct lcd_resource_tree *t,
			unsigned long addr,
			struct lcd_resource_node **n_out);
void lcd_resource_tree_delete(struct lcd_resource_tree *t,
			struct lcd_resource_node *n);