Commit 8198c2fb authored by Charlie Jacobsen, committed by Vikram Narayanan

Major overhaul of build process.

Full kernel build no longer required. Yay! This should
cut down on dev time a lot.

I moved all of the LCD source into $(kernel-src)/lcd-domains/,
so it's all in one spot. There is now a top-level makefile in
there that triggers building liblcd, the microkernel, and the
examples. This is built as an *external* build now, even
though the directory is in the kernel source. The build now takes
under a minute to do everything LCD related.

This should also make verification easier in the future (e.g.
building with clang) if we aren't ensnared in the kernel

Of course, to use the microkernel and examples, you have to
build the patched kernel and install it. But now when you
make a few lines of changes in e.g. an example, you don't have
to trigger a top-level kernel build to rebuild it. Running
the full kernel build takes on average about 3-4 minutes
(some files are generated every time, linking is done, and so
on), and can take upwards of 30 minutes for a full build if you

Which brings me to my other change: no more config for LCDs
in menuconfig. If we create menu entries for every example
and so on, we end up changing the config too often, and this
triggers full kernel rebuilds == waste of time. We can use
macros by setting them via compiler flags (e.g., -DSOME_FLAG).
Furthermore, it wasn't making sense to me to do conditional
compilation for LCD support (we always want to compile for that).
Yes, changes aren't clearly delineated with macros, but you can
see changes made by just doing 'git diff v3.10.14 some-file-or-dir'.

The wiki has been fully updated with instructions for building,
and other relevant parts (updated paths to files).

I also took the opportunity to clean up some old stuff lying around
that is dead (like lcdguest). I incorporated all of the documentation
in Documentation/lcd-domains into the wiki so it's all in one
spot now (including some helpful debug tips).
parent af74008b
An LCD refers to objects managed by the microkernel - like a page of memory -
using an integer identifier, similar to a file descriptor; these are called
*capability pointers*, or cptr_t's.
The microkernel resolves these integer identifiers to the objects and the
LCD's rights using a capability space, or *cspace*. Each LCD has a capability
A cspace is a radix-tree-like data structure (but the look up is a bit
different). Each node in the tree contains multiple slots that contain
pointers to further nodes, or object and rights data. The
slots are called *cnodes* and the nodes in the tree are called *cnode tables*.
So, each cnode is either empty, contains a pointer to another cnode table, or
contains object and rights data.
An LCD can grant a capability to another LCD. Suppose LCD A grants a capability
to LCD B. When this happens, the cnode in LCD A's cspace becomes the parent
of the cnode in LCD B's cspace, in a *capability derivation tree*, or cdt.
If LCD B grants the rights on the same object to LCD C, the cnode in LCD C's
cspace becomes a child of LCD B's cnode in the cdt, and so on. If LCD A
decides to revoke rights, the microkernel will recursively delete the
capabilities in LCD B's and LCD C's cspaces, using the cdt.
So, cspaces are contained in an LCD, but cdts have pointers that span across
include/lcd-domains/types.h - cptr definition, simple functions,
cspace configuration macros
virt/lcd-domains/{internal.h, cap.c} - cspace implementation
virt/lcd-domains/kliblcd.c - cptr cache implementation for
arch/x86/lcd-domains/liblcd.c - cptr cache implementation for
regular lcd's
cspace init/destroy
Called by the microkernel when the LCD is created/destroyed.
When an object is first created / introduced into the microkernel, it is
inserted into a cnode in the calling LCD's cspace, using lcd_cap_insert. The
calling LCD should provide a cptr to indicate where the object is placed. The
microkernel will initialize a cdt with the cnode at the root.
Grant can occur at two times: when an lcd is being created, and during ipc.
If LCD A is creating LCD B, A can grant capabilities to B using lcd_cap_grant.
LCD A is responsible for notifying B where the capabilities are in B's
cspace, by some agreed upon protocol.
During ipc, LCD A can grant rights to LCD B. This is how it works:
-- Suppose LCD A has a capability to an object already. The capability
is stored in a cnode and referenced by cptr1.
-- Suppose LCD B has an empty cnode slot in its cspace referred to by
-- LCD B puts cptr2 in its ipc buffer and does an ipc receive, to indicate
it will accept a granted capability into the cnode at cptr2.
-- LCD A puts cptr1 in its ipc buffer and does an ipc send, to indicate it
will grant the capability that is in the cnode at cptr1.
-- The microkernel matches A and B up, and invokes lcd_cap_grant. This will
copy the object data from A's cspace to B's cspace, and make the cnode
at cptr2 a child of the cnode at cptr1 in the cdt for the object.
This convoluted technique is used to ensure that LCD A and LCD B are agreeing
to share rights - LCD A is willing to grant, and LCD B is willing to receive
the grant and make room in its cspace.
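The handshake above can be sketched as a tiny model. This is not the microkernel's code: the flat array standing in for a cspace, the struct, and the name sketch_ipc_grant are all hypothetical, and the real lcd_cap_grant also links the new cnode into the cdt. It only shows the two-sided agreement: the sender must hold something at cptr1, and the receiver must have left cptr2 empty.

```c
#include <assert.h>

#define SKETCH_CSPACE_SIZE 8 /* hypothetical; a real cspace is a tree */

struct sketch_ipc_lcd {
	int cspace[SKETCH_CSPACE_SIZE]; /* 0 = empty cnode, else an object id */
	unsigned ipc_cptr;              /* cptr the LCD put in its ipc buffer */
};

/*
 * Microkernel side, run once a send and a receive have been matched.
 * Returns 0 on success, -1 if either side did not hold up its end.
 */
static int sketch_ipc_grant(struct sketch_ipc_lcd *sender,
			    struct sketch_ipc_lcd *receiver)
{
	unsigned src = sender->ipc_cptr;   /* cptr1 in the text */
	unsigned dst = receiver->ipc_cptr; /* cptr2 in the text */

	if (src == 0 || src >= SKETCH_CSPACE_SIZE ||
	    dst == 0 || dst >= SKETCH_CSPACE_SIZE)
		return -1; /* cptr 0 is the null cptr; others out of range */
	if (sender->cspace[src] == 0)
		return -1; /* sender has nothing to grant at cptr1 */
	if (receiver->cspace[dst] != 0)
		return -1; /* receiver did not leave cptr2 empty */

	/* Copy, don't move: the sender keeps its own capability. */
	receiver->cspace[dst] = sender->cspace[src];
	return 0;
}
```

A second grant into the same destination slot fails in this model, because the receiver no longer has an empty cnode there.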
At some point later, LCD A may want to revoke the rights it granted to
LCD B. It does so using a call into the microkernel, which, in turn, will
invoke lcd_cap_revoke. This will delete LCD B's cnode from the cdt for the
object, clear the cnode for later use, and update the state of LCD B to
reflect the change in rights (e.g., if the capability was for a page, and LCD B
had mapped the page, the page will become unmapped).
** Note: a revoke does not delete the capability itself, only its children. This
is what seL4 does, for example. **
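A minimal sketch of that revoke rule, assuming a simple in-memory tree: the struct layout, the fixed child array, and the sketch_ names are made up for illustration, and locking and microkernel state updates (such as unmapping pages) are left out. The point is only that revoke clears every descendant but leaves the revoking cnode itself populated.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CHILDREN 4 /* hypothetical limit, for the sketch only */

struct sketch_cdt_node {
	int populated; /* 1 while the cnode holds a capability */
	struct sketch_cdt_node *children[SKETCH_MAX_CHILDREN];
	size_t num_children;
};

/* Grant: the child cnode becomes a child of the parent in the cdt. */
static int sketch_grant(struct sketch_cdt_node *parent,
			struct sketch_cdt_node *child)
{
	if (!parent->populated || child->populated ||
	    parent->num_children == SKETCH_MAX_CHILDREN)
		return -1;
	child->populated = 1;
	child->num_children = 0;
	parent->children[parent->num_children++] = child;
	return 0;
}

/* Revoke: recursively clear all descendants, but not the node itself. */
static void sketch_revoke(struct sketch_cdt_node *node)
{
	size_t i;

	for (i = 0; i < node->num_children; i++) {
		struct sketch_cdt_node *c = node->children[i];

		sketch_revoke(c);  /* clear the child's own descendants first */
		c->populated = 0;  /* clear the child's cnode for later use */
	}
	node->num_children = 0;
}
```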
The microkernel may invoke lcd_cnode_get when LCDs invoke methods on certain
objects to confirm they have rights to do so. Invoking lcd_cnode_get will
lock the cnode, but not the underlying object. It should be matched with
lcd_cnode_put. Because we are using mutexes (see "Locking" below), interrupts
are not disabled while in a "cnode critical section" - so beware! We may
change the lock types in the future - but mutexes are easier and more forgiving
than hard spinlocks.
[ 1 ]
It is possible for an LCD to acquire multiple capabilities to the same object.
For example, an LCD could get 3 capabilities to the same page. Each time a
capability to the page is deleted, the microkernel will unmap the page if
necessary. If the page is mapped 3 times to 3 different places or the same
place, the microkernel will try to unmap it each time, but won't crash.
The microkernel is carefully designed to handle this weird case for the
objects it currently manages, but if you add new kinds of objects, beware!
The cspace/cdt data structures can accommodate this weird case too.
(Note: It's not possible for an object to be inserted multiple times - and
lead to multiple cdts - because the microkernel does the insertion.)
[ 1 ]
Unlike seL4, cspaces cannot be shared, and the microkernel manages
all aspects of cspace creation. In seL4, threads can create cnodes using
untype/retype, and dynamically build a cspace, using the seL4 microkernel
To make things doable, we don't allow LCDs to manage microkernel objects at the
granularity that seL4 allows - seL4 allows threads to create page tables, cnodes
in cspaces, and many other kinds of microkernel objects. Why don't we allow
this? Because the threads do all of the set up, so the microkernel doesn't
know (or has to do a lot of work and maintain extra state to know) how to tear
it all down. For example, if an LCD sets up a big guest physical paging
hierarchy, and loses rights on one of the page directories, the microkernel
would have to know which directories to unmap it from, etc. seL4 handles all
of this using some tricks, but the developers list caveats and limitations,
and in general it's hard.
Punchline: it's complicated. Maybe this will change in the future.
[ 2 ]
In seL4, threads populate cspaces using mint and copy. Because we aren't
having threads / LCDs create and set up cspaces to the extent that seL4
does, we do the copying on behalf of the LCDs when rights are granted. (As
mentioned in "Warnings", it is still possible for an LCD to have multiple
capabilities to the same object.)
[ 3 ]
In our microkernel, an LCD can grant capabilities to an LCD it is creating,
using lcd_cap_grant.
kliblcd (and soon, liblcd) contain cptr cache implementations, so that the
other code inside an LCD doesn't need to track which cnodes are used. Not every
integer is a valid cptr, so the allocation is a bit complicated. See the
"Capability Space Radix Tree" below for an overview of cptr indexing.
The cache contains a bitmap for each level of the cspace tree. This makes it
easy to generate valid cptr's and track allocation/free's.
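Here is one way a per-level bitmap cache could look. This is a sketch under assumptions, not the kliblcd implementation: it assumes all three cspace parameters are 2 bits, that a cptr is laid out as level bits, then fanout path, then slot index (as in the radix tree description), and that unused fanout bits are zero. All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_FANOUT_BITS 2
#define SKETCH_SLOT_BITS   2
#define SKETCH_LEVELS      4 /* 2 depth bits */
#define SKETCH_LEVEL_SHIFT \
	(SKETCH_SLOT_BITS + SKETCH_FANOUT_BITS * (SKETCH_LEVELS - 1))
#define SKETCH_MAX_PER_LEVEL (1u << SKETCH_LEVEL_SHIFT)

struct sketch_cptr_cache {
	/* one bitmap per level; a set bit means the cptr is in use */
	unsigned char used[SKETCH_LEVELS][SKETCH_MAX_PER_LEVEL / 8];
};

/* Number of cap slots reachable at a level: slot bits plus path bits. */
static unsigned sketch_slots_at_level(unsigned level)
{
	return 1u << (SKETCH_SLOT_BITS + level * SKETCH_FANOUT_BITS);
}

static void sketch_cache_init(struct sketch_cptr_cache *c)
{
	memset(c, 0, sizeof(*c));
	c->used[0][0] |= 1; /* cptr 0 is the null cptr; never hand it out */
}

/* Returns a fresh cptr, or 0 if every slot at every level is taken. */
static unsigned long sketch_cptr_alloc(struct sketch_cptr_cache *c)
{
	unsigned level, i;

	for (level = 0; level < SKETCH_LEVELS; level++) {
		for (i = 0; i < sketch_slots_at_level(level); i++) {
			if (c->used[level][i / 8] & (1u << (i % 8)))
				continue;
			c->used[level][i / 8] |= 1u << (i % 8);
			return ((unsigned long)level << SKETCH_LEVEL_SHIFT) | i;
		}
	}
	return 0;
}

static void sketch_cptr_free(struct sketch_cptr_cache *c, unsigned long cptr)
{
	unsigned level = (cptr >> SKETCH_LEVEL_SHIFT) & (SKETCH_LEVELS - 1);
	unsigned i = cptr & (sketch_slots_at_level(level) - 1);

	if (cptr == 0)
		return; /* keep the null cptr reserved */
	c->used[level][i / 8] &= ~(1u << (i % 8));
}
```

In this model the root table only has slots 1 through 3 to give out, so the fourth allocation moves to level 1. That is why not every integer is a valid cptr: the level bits decide how many of the lower bits are meaningful.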
The cspace is for resolving a cptr, or index, to a capability / cnode.
A single lock in the struct cspace protects "radix tree traversal", and
individual locks for each cnode (slots in the cnode table) protect each
Each node is a *cnode table* of slots with the following layout:
 ______ cap slots _______   ___ table id slots _____
/                        \ /                        \
+---+---+-- ... --+---+---+---+---+-- ... --+---+---+
|   |   |         |   |   |   |   |         |   |   |
+---+---+-- ... --+---+---+---+---+-- ... --+---+---+
There are LCD_CNODE_TABLE_NUM_SLOTS slots - half are slots for capabilities,
the other half for further pointers to more tables (so the number of slots
should be a power of 2). The cspace contains a root cnode table to start it off.
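The layout can be sketched as a C struct. The names and the slot count below are hypothetical (the real definitions are in internal.h and cap.c and may differ); the sketch only illustrates the half-and-half split and the three things a cnode can be: empty, a pointer to another cnode table, or object and rights data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value; stands in for LCD_CNODE_TABLE_NUM_SLOTS. */
#define SKETCH_TABLE_NUM_SLOTS 8
#define SKETCH_CAP_SLOTS       (SKETCH_TABLE_NUM_SLOTS / 2)
#define SKETCH_TABLE_SLOTS     (SKETCH_TABLE_NUM_SLOTS / 2)

enum sketch_cnode_type {
	SKETCH_CNODE_FREE = 0, /* empty slot */
	SKETCH_CNODE_TABLE,    /* points to a deeper cnode table */
	SKETCH_CNODE_CAP       /* holds object and rights data */
};

struct sketch_cnode_table;

struct sketch_cnode {
	enum sketch_cnode_type type;
	union {
		struct sketch_cnode_table *table;
		struct {
			void *object;
			unsigned rights;
		} cap;
	} u;
};

struct sketch_cnode_table {
	/* first half: cap slots; second half: table slots */
	struct sketch_cnode slots[SKETCH_TABLE_NUM_SLOTS];
};

static struct sketch_cnode *
sketch_cap_slot(struct sketch_cnode_table *t, unsigned i)
{
	return i < SKETCH_CAP_SLOTS ? &t->slots[i] : NULL;
}

static struct sketch_cnode *
sketch_table_slot(struct sketch_cnode_table *t, unsigned i)
{
	return i < SKETCH_TABLE_SLOTS ? &t->slots[SKETCH_CAP_SLOTS + i] : NULL;
}
```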
The cspace is built dynamically by the microkernel as slots are referenced by
an LCD. (This is different from seL4 - threads in seL4 are responsible for
building the cspace using the interface provided by the microkernel.)
[See notes at the end if you're wondering why this is so complicated.]
The shape of the cspace is controlled by three parameters (in internal.h):
LCD_CPTR_DEPTH_BITS - controls number of levels in cspace
0 bits : only root table (2^0 levels)
1 bit : root and one more level (2^1 levels)
2 bits : root and three more levels (2^2 levels)
LCD_CPTR_FANOUT_BITS - controls how many table slots are in the
cnode tables - in other words, the fanout
0 bits : one table slot (2^0)
1 bit : two table slots (2^1)
2 bits : four table slots (2^2)
LCD_CPTR_SLOT_BITS - controls how many cap slots are in the
cnode tables; similar to table slots above
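To make the arithmetic concrete, here is a small helper that turns the three parameters into table sizes. The helper and struct are hypothetical, and the cptr width formula is an inference from the bit layout described in this document (slot bits, one fanout group per non-root level, then level bits), not something taken from internal.h.

```c
#include <assert.h>

struct sketch_cspace_shape {
	unsigned levels;      /* number of levels, root included */
	unsigned table_slots; /* table slots per cnode table (fanout) */
	unsigned cap_slots;   /* cap slots per cnode table */
	unsigned cptr_bits;   /* bits a cptr needs under this layout */
};

static struct sketch_cspace_shape
sketch_shape(unsigned depth_bits, unsigned fanout_bits, unsigned slot_bits)
{
	struct sketch_cspace_shape s;

	s.levels = 1u << depth_bits;
	s.table_slots = 1u << fanout_bits;
	s.cap_slots = 1u << slot_bits;
	/* slot index + one fanout group per non-root level + level field */
	s.cptr_bits = slot_bits + fanout_bits * (s.levels - 1) + depth_bits;
	return s;
}
```

With all three parameters set to 2, this gives 4 levels, 4 table slots, 4 cap slots, and a 10-bit cptr.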
An index, or cptr, encodes the location of a cnode in the cspace. The encoding
includes the level of the table that contains the slot; the fanout "path" to
get to that table; and the slot index inside the table.
The lookup is just like a radix tree lookup (starting from LSB instead of MSB),
but the level bits tell us how far to go / how many fanout bits are meaningful.
If all of the parameters are 2 (2 bits each), the cptr bit layout is the
                             ____ cap slot index bits
               00 00 00 00 00
              /   |  |  |
             /    |  |  |
 level -----'   fanout path (like a radix tree path)
So, in LSB order, the slot index bits come first, then the fanout path, then
the level. The level and slot index bits are always in the same position. The
interpretation of the fanout path bits depends on the level.
For example, if the cptr is
11 11 00 10 01
the slot index is 01 = 1, and the level of the table is 11 = 3. All three
pairs of fanout bits are used to traverse from the root cnode table to the
final cnode table:
[1] From the root cnode table, we look at the first pair of fanout
bits (in LSB order) = 10 = 2; so we follow the 2nd table slot
(zero indexed) in the root cnode table to level 1
[2] From the level 1 cnode table, we look at the next pair of fanout
bits = 00 = 0; so we follow the 0th table slot in this cnode
table to level 2
[3] Finally, from level 2, we look at the next pair of fanout
bits = 11 = 3; so we follow the 3rd table slot to arrive
at the cnode table in level 3
We now use the cap slot index bits = 01 = 1 to look up the capability.
Another example: if the cptr is
01 00 00 01 11
the level is 1, and only the first pair (01) of fanout bits are meaningful.
Starting in the root cnode table,
[1] we see that the level = 01 = 1 > 0, so we look at the first pair of
fanout bits = 01 = 1; we follow the pointer in the 1st table slot
to level 1
[2] we now see that the level of the cptr = the level we are at, 1, so
we now use the slot bits to look up the slot in the table (11 = 3).
It looks more complicated than it is. It's just radix tree traversal with
a depth check.
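The two worked examples can be checked with a short decoder. This is a sketch assuming all three parameters are 2 bits; the macro, struct, and function names are hypothetical and not the ones in types.h.

```c
#include <assert.h>

#define SKETCH_DEPTH_BITS  2
#define SKETCH_FANOUT_BITS 2
#define SKETCH_SLOT_BITS   2
#define SKETCH_MAX_LEVELS  (1u << SKETCH_DEPTH_BITS)

struct sketch_cptr_path {
	unsigned level; /* level of the table that holds the slot */
	unsigned slot;  /* cap slot index inside that table */
	/* table slot to follow at each level; only [0, level) is meaningful */
	unsigned fanout[SKETCH_MAX_LEVELS - 1];
};

static struct sketch_cptr_path sketch_cptr_decode(unsigned long cptr)
{
	struct sketch_cptr_path p = {0};
	unsigned level_shift =
		SKETCH_SLOT_BITS + SKETCH_FANOUT_BITS * (SKETCH_MAX_LEVELS - 1);
	unsigned i;

	/* slot index: lowest bits, always in the same position */
	p.slot = cptr & ((1u << SKETCH_SLOT_BITS) - 1);
	/* level: highest bits, always in the same position */
	p.level = (cptr >> level_shift) & ((1u << SKETCH_DEPTH_BITS) - 1);
	/* the depth check: read only as many fanout pairs as the level says */
	for (i = 0; i < p.level; i++)
		p.fanout[i] = (cptr >> (SKETCH_SLOT_BITS + i * SKETCH_FANOUT_BITS))
			      & ((1u << SKETCH_FANOUT_BITS) - 1);
	return p;
}
```

The first example, 11 11 00 10 01, is 0x3C9 and decodes to level 3, path 2, 0, 3, slot 1. The second, 01 00 00 01 11, is 0x107 and decodes to level 1, path 1, slot 3.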
We tried a similar technique, but without using the level bits. This led to
problems: since a cptr may refer to different levels, it became hard to
know when to stop the radix tree-like traversal.
As an alternative, we tried taking triples of bits from LSB to MSB: If the
high bit was set in a triple, this indicated to keep traversing and look at the
next triple; otherwise, stop. But it's hard to convert a number like 15 to an
index of this type. And this makes alloc/free of cptrs hard (talk to Charlie
for more details).
When an object is created and inserted into the LCD A's cspace, a new cdt is
created, and the first cnode is made the root. If LCD A grants rights to
LCD B, the cnode in LCD B's cspace is made a child of the cnode in A's
cspace. And so on. The cdt can look like this:
   - LCD A's cspace -
  |                  |           - LCD B's cspace -
  |      .           |          |                  |
  |      .           |          |                  |
  |     /            |          |                  |
  |  +-+             |          |       .          |
  |  +-+             |          |       .          |
  `-    \ -----------'          |        \         |
         `------------------------------> +-+      |
         |                      |         +-+      |
         |                      `----------|-------'
         |     - LCD C's cspace -          |
         |    |                  |         |
         |    |      .           |         |       - LCD D's cspace -
         |    |       \          |         |      |                  |
         `-----------> +-+       |         |      |         .        |
              |        +-+       |         |      |         .        |
              `------------------'         |      |        /         |
                                           |      |     +-+          |
                                           `----------> +-+          |
If LCD A does a revoke, it will delete the capabilities (clear the cnodes)
in B, C, and D's cspaces.
Each populated cnode that contains a capability also contains a pointer to
the cdt (so if the cnode is cleared, it can be removed from the cdt, etc.).
The cdt is protected by a lock.
We use mutexes with interruptible locking throughout - for now - so that we
don't get into nasty deadlocks that require a reboot.
There are three primary locks - one lock per cspace, one lock per cdt, and
one lock per cnode.
cspace locks
The lock on the cspace provides mutual exclusion on cspace radix tree traversal
and modifications. For example, an insert and grant cannot concurrently
traverse the cspace, because an insert may create cnode tables on the fly, and
the grant may incorrectly traverse those new tables. (It may not be possible
for calls to insert and grant that involve the same thread to happen
concurrently. The grant would require an lcd to be in an endpoint queue, so
unless we had asynchronous endpoints, the insert couldn't be called
concurrently. But the next example shows that the lock is necessary.)
As another example, an lcd may be in an endpoint queue while it is being
torn down. Suppose it gets dequeued, and capabilities are being transferred.
This requires lookups in the lcd's cspace. These lookups should fail.
Eventually, once the tear down process reaches the cnode that refers to the
endpoint, the code will ensure the lcd is not in the queue; but until then,
all of this can happen. This is why we mark the cspace as invalid and release
the lock, so that other threads like this can make progress but see that the
lcd is going away.
cdt locks
The lock on a cdt provides mutual exclusion on cdt traversal and updates. For
example, lcd 1 may be granting rights to lcd 2, but lcd 3 might be revoking
rights to lcd 1; these operations need to be serialized (we only want to see
two possible outcomes: lcd 1 grant succeeds, and then lcd 3 revokes all rights;
or lcd 3 revoke succeeds, and then lcd 1 grant fails).
During a revoke, when cnodes are removed from a cdt, the microkernel state is
updated to reflect that change in rights. ** It must be possible to update the
microkernel state while the cdt is locked. ** For the current objects, this is
ok, but beware.
cnode locks
The lock on a cnode provides mutual exclusion on it, and also prevents the
containing cdt and cspace from going away while it is in use (they can't go
away until the cnode is locked and removed from the cspace and/or cdt).
cptr 0 = null, always invalid
cptr 1 = capability to lcd's endpoint for receiving replies
cptr 2 = (dynamic) capability to caller's reply endpoint, during call/reply
-- If your code fails, but e.g. modprobe or insmod hangs, the cpu may
be stuck in VMX Non-root mode or something. You will need to reboot.
-- If you get a page fault inside the lcd, confirm you put the __init
compiler flag on your module_init routine. If you don't do that, the
init routine won't be linked with the module.
-- You can rate limit printk so you're e.g. not printing an error message
after every vm exit.
-- If you get vm exits from nmi's, you should be ok - you probably have
the nmi watchdog turned on. nmi's fire periodically, and the nmi watchdog
just does some routine checks. It will print out warnings if there's actually
a problem.
-- Beware of putting printk's inside nmi handlers. Doing a printk inside an
nmi is in general not safe, because printk uses locks - if code takes the
lock and gets interrupted by an nmi, the nmi will block trying to take the
lock. And nmi's won't fire again until that nmi handler returns, so you
get a hard lock up. (More recent kernel versions use safer printk handling
inside nmi's, if I'm not mistaken - deferring the printk until it's safe to
do so.)
-- There may be a bug in the Broadcom bnx2 ethernet driver that was fixed
in the upstream kernel after we branched off version 3.10.14. You might
see - watchdog: timeout on eth0 (bnx2) etc. etc. You may also lose connectivity
and possibly hit a hang (if you're trying to access a file via nfs).
-- There may be bad interactions with KVM code if you load it. This might
be the source of the bad hang, but I'm not sure.
-- See also some of the tips in liblcd.txt: Notes & Suggestions when debugging
page faults, etc. inside an LCD.
-- If you get linking errors or redefined symbol errors, you might be using
a different configuration than what I used when I set up liblcd. You will
need to either change your configuration, or modify liblcd to resolve the
symbol errors. (This is one reason why we should build liblcd in a separate
tree, in the future.)
-- If you have lockdep turned on with `proving correctness', you will
get some warnings when you load the LCD module. This is because the code
in main.c and cap.c uses some wild locking that could possibly lead to
deadlocks (it hasn't yet). So lockdep dumps warnings. I haven't bothered
inserting the code to prevent lockdep from complaining.
Recall that LCDs refer to capabilities in their cspace using
integer identifiers (similar to a file descriptor); these are
capability pointers, or cptr_t's.
An LCD has 8 64-bit general registers and 8 capability pointer (cptr_t)
registers. General registers are for scalar arguments. Capability pointer
registers are for granting capabilities. An LCD accesses its registers via:
u64 lcd_r0(void)
... reading general registers
u64 lcd_r7(void)
void lcd_set_r0(u64 val)
... writing to general registers
void lcd_set_r7(u64 val)
cptr_t lcd_cr0(void)
... reading capability registers
cptr_t lcd_cr7(void)
void lcd_set_cr0(cptr_t val)
... writing to capability registers
void lcd_set_cr7(cptr_t val)
I will explain by example.
Suppose LCD A has:
-- a send capability to a rendezvous point for communicating with LCD B,
referenced by cptr_t c1
-- a capability to a page referenced by cptr_t c2
and that LCD B has:
-- a receive capability on the same rendezvous point, referenced by cptr_t c3
Suppose LCD A wants to grant the page capability to LCD B, and LCD B is
expecting to be granted this capability, and wants to reference the granted
capability via cptr_t c4. A few things need to happen.
First, LCD B needs to allocate a cnode in its cspace:
c4 = lcd_cnode_alloc();
Second, LCD B needs to do a receive, and put c4 in its capability register:
Third, LCD A needs to invoke a send:
The microkernel will match up the send and receive. It will copy the page
capability referred to in *LCD A's cspace* by c2 to the cnode in *LCD B's
cspace* referred to by c4.
LCD A could also pass along scalar arguments to LCD B during the same
send invocation.
Call/reply takes the place of two send/recv pairs. Instead of:
lcd_send( ... )
lcd_recv( ... )
lcd_recv( ... )
lcd_send( ... )
the two LCD's can do:
lcd_call( ... )
lcd_reply( ... )
The code is inside virt/lcd-domains/kliblcd.c. The header (for non-isolated
kernel code to use) is in include/lcd-domains/kliblcd.h.
A kernel thread can "enter/exit into lcd mode" (similar to cap_enter in
Capsicum) by invoking klcd_enter/klcd_exit. A kernel thread that has entered
lcd mode is called a *kernel lcd* or *klcd*. The functions you see with
klcd_ instead of lcd_ are only part of the kliblcd interface and only
available to non-isolated lcd's.
Upon entering lcd mode, a kernel thread can invoke the functions in the
kliblcd interface for creating lcd's, allocating pages, loading modules, etc.
A klcd has a cspace and utcb for message passing, but does not have an
underlying hardware vm (the thread runs unisolated).
See the kliblcd header for a detailed description of the interface. See the
test cases for examples.
An lcd can be in one of four states:
E = Embryo - just after it is created, not configured with a starting
stack pointer, etc.
C = Configured - stack pointer, starting program counter configured
R = Running - kthread is runnable or running, and may be running
inside vm
D = Dead - kthread has stopped or will soon stop; lcd may be in
the process of being torn down
                               /             lcd_destroy             \
                              |                                       |
       lcd_run                |         lcd_run, lcd_config           |
         .__.                 ^                .__.                   |
         |  |    .----------->|                |  |                   |
         \  |   / lcd_destroy |                \  |                   |
          \ |  /              ^                 \ |                   |
           \V /               |                  \V                   V
 create  +---+  lcd_config  +---+   lcd_run    +---+  lcd_destroy   +---+
-------->| E |------------->| C |------------->| R |--------------->| D |
         +---+           .->+---+              +---+                +---+
                        /   /                                        ^ \
                       /   /                                         |  \
                      /   /                                          |   \
                     '---'                                           '---'
                  lcd_config                                        lcd_run
The following transitions are an error (return non-zero), and have no effect:
E: lcd_run - you must configure the lcd first
C: lcd_config - lcd already configured
R: lcd_run, lcd_config - lcd already running and config'd
D: all - lcd is dead
Some of these may be too restrictive, and could change in the future (e.g.,
allow re-config).
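The legal and illegal transitions can be written down as a small state machine. This is a sketch of the rules as listed, not the kliblcd code: the enum and the sketch_ functions are hypothetical, and they take a bare state pointer rather than an lcd. It assumes lcd_destroy is accepted in every state except Dead.

```c
#include <assert.h>

enum sketch_lcd_state {
	SKETCH_LCD_EMBRYO,
	SKETCH_LCD_CONFIGURED,
	SKETCH_LCD_RUNNING,
	SKETCH_LCD_DEAD
};

/* Each function returns 0 on success; on error the state is unchanged. */

static int sketch_lcd_config(enum sketch_lcd_state *s)
{
	if (*s != SKETCH_LCD_EMBRYO)
		return -1; /* already configured, running, or dead */
	*s = SKETCH_LCD_CONFIGURED;
	return 0;
}

static int sketch_lcd_run(enum sketch_lcd_state *s)
{
	if (*s != SKETCH_LCD_CONFIGURED)
		return -1; /* not configured yet, already running, or dead */
	*s = SKETCH_LCD_RUNNING;
	return 0;
}

static int sketch_lcd_destroy(enum sketch_lcd_state *s)
{
	if (*s == SKETCH_LCD_DEAD)
		return -1; /* every operation on a dead lcd is an error */
	*s = SKETCH_LCD_DEAD;
	return 0;
}
```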
This code is the most complicated part of kliblcd.c. We package up all of the
context and data for setting up a module LCD inside struct lcd_info. This
contains lists of pages we've mapped in the LCD, the temporary cptr
cache we're using to set up the LCD's cspace, and so on. This is done so
that we can properly boot the LCD and tear everything down later.
There are two main parts: loading the module and setting up the VM.
Loading the module happens in: