Commit 6ee9a51f authored by Charlie Jacobsen, committed by Vikram Narayanan

Muktesh's capabilities fully incorporated. Capsicum-style enter/exit.

Builds, but not fully tested. Good tests for capability subsystem, some tests
for kliblcd.

Non-isolated kernel threads can "enter" the lcd system by doing
klcd_enter / klcd_exit. They can create other lcd's, set them up, etc. They
use the same interface that regular lcd's will use, so such code could be
moved to an lcd, as we had planned. Will document this in Documentation folder
tomorrow ( == today ).

Capability system does checks now when a capability is deleted/revoked: for
example, if it's for a page, the microkernel checks if the page is mapped, and
unmaps it. If the last capability goes away, the page is freed. Documentation
is in Documentation/lcd-domains/cap.txt.

IPC code is in place, but not tested yet (pray for me).

Debug is taking some time. Sometimes requires a power cycle which adds an
extra 5 - 10 minutes. Build is slow the first time after reboot. Give me a user
level program and I'll debug it in 30 seconds! argc

Main arch-independent files:

    include/lcd-domains/kliblcd.h, types.h

       This is what non-isolated kernel code should include to use the
       kliblcd interface to the microkernel.

    virt/lcd-domains/main.c, kliblcd.c, cap.c, ipc.c, internal.h

       The microkernel, broken up into pieces.

    virt/lcd-domains/tests/

       The tests, in progress.

Some old files are still hanging around in virt/lcd-domains and will be
incorporated/cleaned up soon.

I couldn't squash over the merge from the decomposition branch, so there's a
bunch of junk commits coming over. (I should've just copied Muktesh's files.)

Conflicts:
	drivers/Kconfig
	drivers/lcd-cspace/test.h
	include/lcd-domains/cap.h
	include/lcd-prototype/lcd.h
	include/lcd/console.h
	include/lcd/elfnote.h
	include/linux/init_task.h
	include/linux/module.h
	include/linux/sched.h
	virt/lcd-domains/cap.c
	virt/lcd-domains/ipc.c
	virt/lcd-domains/lcd-cspace-tests2.c
Resolved-by: Vikram Narayanan <vikram186@gmail.com>
parent 3365b018
========================================
OVERVIEW
========================================
An LCD refers to objects managed by the microkernel - like a page of memory -
using an integer identifier, similar to a file descriptor; these are called
*capability pointers*, or cptr_t's.
The microkernel resolves these integer identifiers to the objects and the
LCD's rights using a capability space, or *cspace*. Each LCD has a capability
space.
A cspace is a radix-tree-like data structure (but the look up is a bit
different). Each node in the tree contains multiple slots that contain
pointers to further nodes, or object and rights data. The
slots are called *cnodes* and the nodes in the tree are called *cnode tables*.
So, each cnode is either empty, contains a pointer to another cnode table, or
contains object and rights data.
An LCD can grant a capability to another LCD. Suppose LCD A grants a capability
to LCD B. When this happens, the cnode in LCD A's cspace becomes the parent
of the cnode in LCD B's cspace, in a *capability derivation tree*, or cdt.
If LCD B grants the rights on the same object to LCD C, the cnode in LCD C's
cspace becomes a child of LCD B's cnode in the cdt, and so on. If LCD A
decides to revoke rights, the microkernel will recursively delete the
capabilities in LCD B's and LCD C's cspaces, using the cdt.
So, cspaces are contained in an LCD, but cdts have pointers that span across
cspaces.
========================================
OPERATIONS
========================================
cspace init/destroy
-------------------
Called by the microkernel when the LCD is created/destroyed.
insert
------
When an object is first created / introduced into the microkernel, it is
inserted into a cnode in the calling LCD's cspace, using lcd_cap_insert. The
calling LCD should provide a cptr to indicate where the object is placed. The
microkernel will initialize a cdt with the cnode at the root.
grant
-----
During ipc (and only during ipc), LCD A can grant rights to LCD B. This is how
it works:
  -- Suppose LCD A has a capability to an object already. The capability
     is stored in a cnode and referenced by cptr1.
  -- Suppose LCD B has an empty cnode slot in its cspace referred to by
     cptr2.
  -- LCD B puts cptr2 in its ipc buffer and does an ipc receive, to indicate
     it will accept a granted capability into the cnode at cptr2.
  -- LCD A puts cptr1 in its ipc buffer and does an ipc send, to indicate it
     will grant the capability that is in the cnode at cptr1.
  -- The microkernel matches A and B up, and invokes lcd_cap_grant. This will
     copy the object data from A's cspace to B's cspace, and make the cnode
     at cptr2 a child of the cnode at cptr1 in the cdt for the object.
This convoluted technique is used to ensure that LCD A and LCD B are agreeing
to share rights - LCD A is willing to grant, and LCD B is willing to receive
the grant and make room in its cspace.
revoke
------
At some point later, LCD A may want to revoke the rights it granted to
LCD B. It does so using a call into the microkernel, which, in turn, will
invoke lcd_cap_revoke. This will delete LCD B's cnode from the cdt for the
object, clear the cnode for later use, and update the state of LCD B to
reflect the change in rights (e.g., if the capability was for a page, and LCD B
had mapped the page, the page will become unmapped).
** Note: a revoke does not delete the capability itself, only its children. This
is what seL4 does, for example. **
get/put
-------
The microkernel may invoke lcd_cnode_get when LCDs invoke methods on certain
objects to confirm they have rights to do so. Invoking lcd_cnode_get will
lock the cnode, but not the underlying object. It should be matched with
lcd_cnode_put. Because we are using mutexes (see "Locking" below), interrupts
are not disabled while in a "cnode critical section" - so beware! We may
change the lock types in the future - but mutexes are easier and more forgiving
than hard spinlocks.
========================================
WARNINGS
========================================
[ 1 ]
It is possible for an LCD to acquire multiple capabilities to the same object.
For example, an LCD could get 3 capabilities to the same page. Each time a
capability to the page is deleted, the microkernel will unmap the page if
necessary. If the page is mapped 3 times to 3 different places or the same
place, the microkernel will try to unmap it each time, but won't crash.
The microkernel is carefully designed to handle this weird case for the
objects it currently manages, but if you add new kinds of objects, beware!
The cspace/cdt data structures can also accommodate this weird case.
(Note: It's not possible for an object to be inserted multiple times - and
lead to multiple cdts - because the microkernel does the insertion.)
[ 2 ]
Two different cptr's can refer to the same cap slot. For example, with
a cnode table size of 8, 0b000011 and 0b001011 refer to the same slot.
The cptr cache allocator will ensure it generates cptr's for unique slots. If
an LCD manipulates a cptr, it does so at its own peril - cptr's are meant to
be opaque.
Cptr `aliasing' still allows for plenty of cptr's - for a cnode table of
size 8, and 64 bit cptr's, you can get roughly:
4 + 4^2 + 4^3 + 4^4 + ...
cptr's (i.e., plenty).
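To put a number on "plenty", here is a small sketch that adds up the series.
It assumes the lookup may consume every full 3-bit group of a 64-bit cptr
(21 levels); the real microkernel may cap the depth lower. The function name
count_cap_slots is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the distinct cap slots reachable when each table has
 * 2^level_bits slots (half cap slots, half table slots) and a cptr
 * has cptr_bits bits. ASSUMPTION: every full group of level_bits in
 * the cptr can be used as one level of the tree.
 */
uint64_t count_cap_slots(unsigned level_bits, unsigned cptr_bits)
{
    assert(level_bits >= 1 && level_bits < 32);

    unsigned levels = cptr_bits / level_bits;
    uint64_t half = (UINT64_C(1) << level_bits) / 2; /* cap slots per table */
    uint64_t tables = 1;                             /* tables at this level */
    uint64_t total = 0;

    for (unsigned i = 0; i < levels; i++) {
        total += tables * half;   /* cap slots found at this level */
        tables *= half;           /* each table slot leads to one more table */
    }
    return total;
}
```

Under that assumption, count_cap_slots(3, 64) is 4 + 4^2 + ... + 4^21, which
is on the order of 10^12.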
Note that the cptr cache allocator needs to ensure it also doesn't hand out
a cptr like 0xFFFFFFFFFF .... since this cannot resolve to a cap slot (it will
always follow a table pointer).
See the section "Capability Space Radix Tree" for more details about cspace
traversal.
========================================
COMPARISON TO seL4
========================================
[ 1 ]
Unlike seL4, cspaces cannot be shared, and the microkernel manages
all aspects of cspace creation. In seL4, threads can create cnodes using
untyped/retype, and dynamically build a cspace, using the seL4 microkernel
interfaces.
To make things doable, we don't allow LCDs to manage microkernel objects at the
granularity that seL4 allows - seL4 allows threads to create page tables, cnodes
in cspaces, and many other kinds of microkernel objects. Why don't we allow
this? Because the threads do all of the set up, so the microkernel doesn't
know (or has to do a lot of work and maintain extra state to know) how to tear
it all down. For example, if an LCD sets up a big guest physical paging
hierarchy, and loses rights on one of the page directories, the microkernel
would have to know which directories to unmap it from, etc. seL4 handles all
of this using some tricks, but the developers list caveats and limitations,
and in general it's hard.
Punchline: it's complicated. Maybe this will change in the future.
[ 2 ]
In seL4, threads populate cspaces using mint and copy. Because we aren't
having threads / LCDs create and set up cspaces to the extent that seL4
does, we do the copying on behalf of the LCDs when rights are granted. (As
mentioned in "Warnings", it is still possible for an LCD to have multiple
capabilities to the same object.)
========================================
CPTR CACHE
========================================
For now, so that the LCD doesn't need to track which cnodes are used, we
give it a cptr cache for getting a fresh cptr to an unused cnode. This should
be moved at some point to liblcd.
========================================
CAPABILITY SPACE RADIX TREE
========================================
The cspace is for resolving a cptr, or index, to a capability / cnode.
A single lock in the struct cspace protects "radix tree traversal", and
individual locks for each cnode (slots in the cnode table) protect each
cnode.
Each node is a *cnode table* of slots with the following layout:
 ______ cap slots _______   ___ table id slots _____
/                        \ /                        \
+---+---+-- ... --+---+---+---+---+-- ... --+---+---+
|   |   |         |   |   |   |   |         |   |   |
+---+---+-- ... --+---+---+---+---+-- ... --+---+---+
There are LCD_CNODE_TABLE_NUM_SLOTS slots - half are slots for capabilities,
the other half for further pointers to more tables (so the number of slots
should be a power of 2). The cspace contains a root cnode table to start it off.
The cspace is built dynamically by the microkernel as slots are referenced by
an LCD. (This is different from seL4 - threads in seL4 are responsible for
building the cspace using the interface provided by the microkernel.)
An index, or cptr, is resolved using a radix-tree-style look up, but from the
least significant bit rather than the most significant bit. If the bits
resolve to a cap slot, the search is done; otherwise, the table pointer is
followed to the next level, and the next set of bits are considered. Some
examples follow. 0b1101 means binary 1101.
Suppose the number of slots per table is 8 - so there are 4 cap slots and
4 table slots.
Index = 3 = 0b000011:
Since there are 8 entries in the root table, we look at the first three
bits - 011. This indexes into the 3rd cap slot (zero indexed) in the
root cnode table:
                    |
                  |
                  V |
    +---+---+---+---+---+---+---+---+
    |   |   |   |011|   |   |   |   |   level 0 (root table)
    +---+---+---+---+---+---+---+---+
                    |
        cap slots      table slots
Index = 6 = 0b000110
First three bits are 110 - this is a table slot, so we follow the pointer
to the next level.
Next three bits are 000 - this is a cap slot - we're done.
                    |
    +---+---+---+---+---+---+---+---+
    |   |   |   |   |   |   |110|   |   level 0
    +---+---+---+---+---+---+-|-+---+
                    |         |
                              |
                              V
                    |
    +---+---+---+---+---+---+---+---+
    |000|   |   |   |   |   |   |   |   level 1
    +---+---+---+---+---+---+---+---+
      ^             |
      |
      |
    final slot
Index = 11 = 0b001011
First three bits are 011 - this is the 3rd cap slot again.
This demonstrates that two different indexes can refer to the same slot!
                    |
                  |
                  V |
    +---+---+---+---+---+---+---+---+
    |   |   |   |011|   |   |   |   |
    +---+---+---+---+---+---+---+---+
                    |
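The three lookups above can be sketched in a few lines of C. This is only an
illustration of the least-significant-bits-first walk, not the code in cap.c:
the 8-slot table size matches the examples, and the names resolve_cptr and
struct slot_ref are made up. It also assumes the walk may use up to 21 levels
of a 64-bit cptr.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 8 slots per table, as in the examples. */
#define TABLE_SLOTS 8u                /* stands in for LCD_CNODE_TABLE_NUM_SLOTS */
#define CAP_SLOTS   (TABLE_SLOTS / 2) /* lower half: cap slots */
#define LEVEL_BITS  3u                /* log2(TABLE_SLOTS) */

/* Hypothetical result type: where a cptr lands. */
struct slot_ref {
    unsigned level;     /* depth of the table holding the slot (root = 0) */
    unsigned slot;      /* cap slot index inside that table */
    uint64_t used_bits; /* low-order cptr bits the walk actually consumed */
};

/*
 * Walk the cptr from its least significant bits. Returns 0 and fills
 * *out when a cap slot is reached; returns -1 if every level picks a
 * table slot (e.g., an all-ones cptr).
 */
int resolve_cptr(uint64_t cptr, struct slot_ref *out)
{
    uint64_t rest = cptr;
    unsigned level;

    for (level = 0; (level + 1) * LEVEL_BITS <= 64; level++) {
        unsigned idx = (unsigned)(rest & (TABLE_SLOTS - 1));

        if (idx < CAP_SLOTS) {
            unsigned nbits = (level + 1) * LEVEL_BITS; /* at most 63 */

            out->level = level;
            out->slot = idx;
            out->used_bits = cptr & ((UINT64_C(1) << nbits) - 1);
            return 0;
        }
        rest >>= LEVEL_BITS; /* table slot: follow it to the next level */
    }
    return -1;
}
```

Two cptr's alias exactly when their used_bits match: 3 and 11 both stop in the
root table with used_bits == 3, while 6 descends one level and stops at cap
slot 0 there.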
========================================
CAPABILITY DERIVATION TREE
========================================
When an object is created and inserted into LCD A's cspace, a new cdt is
created, and the first cnode is made the root. If LCD A grants rights to
LCD B, the cnode in LCD B's cspace is made a child of the cnode in A's
cspace. And so on. The cdt can look like this:
- LCD A's cspace -
| | - LCD B's cspace -
| . | | |
| . | | |
| / | | |
| +-+ | | . |
| +-+ | | . |
`- \ ------------' | \ |
`------------------------------> +-+ |
| | +-+ |
| `-------|-------'
| - LCD C's cspace - |
| | | |
| | . | | - LCD D's cspace -
| | \ | | | |
`---------> +-+ | | | . |
| +-+ | | | . |
`----------------' | | / |
| | +-+ |
`-------> +-+ |
`----------------'
If LCD A does a revoke, it will delete the capabilities (clear the cnodes)
in B, C, and D's cspaces.
Each populated cnode that contains a capability also contains a pointer to
the cdt (so if the cnode is cleared, it can be removed from the cdt, etc.).
The cdt is protected by a lock.
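A minimal sketch of how insert, grant, and revoke shape the cdt. It is a
simplification under stated assumptions: the struct layout and the function
names cap_insert / cap_grant / cap_revoke are hypothetical (they are not the
lcd_cap_* signatures), there is no locking, and the microkernel state update
on revoke (e.g., unmapping a page) is only marked with a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down cnode: just the cdt links and object data. */
struct cnode {
    int in_use;                 /* 1 if the slot holds a capability */
    void *object;               /* object data, copied on grant */
    struct cnode *parent;
    struct cnode *first_child;
    struct cnode *next_sibling;
};

/* Insert: the cnode becomes the root of a fresh cdt. */
void cap_insert(struct cnode *c, void *object)
{
    c->in_use = 1;
    c->object = object;
    c->parent = c->first_child = c->next_sibling = NULL;
}

/* Grant: copy the object data and make dst a child of src in the cdt. */
int cap_grant(struct cnode *src, struct cnode *dst)
{
    if (!src->in_use || dst->in_use)
        return -1;  /* nothing to grant, or no room at the destination */
    dst->in_use = 1;
    dst->object = src->object;
    dst->parent = src;
    dst->first_child = NULL;
    dst->next_sibling = src->first_child;
    src->first_child = dst;
    return 0;
}

/* Revoke: clear every descendant of c; c itself keeps its capability. */
void cap_revoke(struct cnode *c)
{
    struct cnode *child = c->first_child;

    while (child) {
        struct cnode *next = child->next_sibling;

        cap_revoke(child);  /* grandchildren first */
        /* the real microkernel would update state here (e.g., unmap) */
        child->in_use = 0;
        child->object = NULL;
        child->parent = child->next_sibling = NULL;
        child = next;
    }
    c->first_child = NULL;
}
```

With zero-initialized cnodes a, b, c, d: insert into a, grant a->b, a->c, and
b->d, then revoke on a clears b, c, and d but leaves a in place, matching the
picture above.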
========================================
LOCKING
========================================
We use mutexes with interruptible locking throughout - for now - so that we
don't get into nasty deadlocks that require a reboot.
There are three primary locks - one lock per cspace, one lock per cdt, and
one lock per cnode.
cspace locks
------------
The lock on the cspace provides mutual exclusion on cspace radix tree traversal
and modifications. For example, an insert and grant cannot concurrently
traverse the cspace, because an insert may create cnode tables on the fly, and
the grant may incorrectly traverse those new tables. (It may not be possible
for an insert and a grant that involve the same thread to happen
concurrently: the grant requires the lcd to be in an endpoint queue, so
unless we had asynchronous endpoints, the insert couldn't be called
concurrently. But the next example shows that the lock is necessary.)
As another example, an lcd may be in an endpoint queue while it is being
torn down. Suppose it gets dequeued, and capabilities are being transferred.
This requires lookups in the lcd's cspace. These lookups should fail.
Eventually, once the tear down process reaches the cnode that refers to the
endpoint, the code will ensure the lcd is not in the queue; but until then,
all of this can happen. This is why we mark the cspace as invalid and release
the lock, so that other threads like this can make progress but see that the
lcd is going away.
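The "mark invalid, then release" idea can be sketched in user space. This is
an analogy only: it uses a pthread mutex in place of the kernel mutex, and the
struct and function names are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical cspace: just the lock and the validity flag. */
struct cspace {
    pthread_mutex_t lock;   /* protects traversal and the valid flag */
    bool valid;             /* false once tear down has begun */
};

void cspace_init(struct cspace *cs)
{
    pthread_mutex_init(&cs->lock, NULL);
    cs->valid = true;
}

/* Lookup: fails (returns -1) if the cspace is going away. */
int cspace_lookup(struct cspace *cs)
{
    int ret = 0;

    pthread_mutex_lock(&cs->lock);
    if (!cs->valid)
        ret = -1;
    /* otherwise the radix tree traversal would happen here, under the lock */
    pthread_mutex_unlock(&cs->lock);
    return ret;
}

/* Tear down, step one: flag the cspace and drop the lock right away. */
void cspace_mark_invalid(struct cspace *cs)
{
    pthread_mutex_lock(&cs->lock);
    cs->valid = false;
    pthread_mutex_unlock(&cs->lock);
    /* the rest of tear down proceeds without holding the cspace lock */
}
```

A thread that dequeues the dying lcd and tries to transfer capabilities takes
the lock, sees valid == false, and backs out instead of blocking for the whole
tear down.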
cdt locks
---------
The lock on a cdt provides mutual exclusion on cdt traversal and updates. For
example, lcd 1 may be granting rights to lcd 2, but lcd 3 might be revoking
rights to lcd 1; these operations need to be serialized (we only want to see
two possible outcomes: lcd 1 grant succeeds, and then lcd 3 revokes all rights;
or lcd 3 revoke succeeds, and then lcd 1 grant fails).
During a revoke, when cnodes are removed from a cdt, the microkernel state is
updated to reflect that change in rights. ** It must be possible to update the
microkernel state while the cdt is locked. ** For the current objects, this is
ok, but beware.
cnode locks
-----------
The lock on a cnode provides mutual exclusion on it, and also prevents the
containing cdt and cspace from going away while it is in use (they can't go
away until the cnode is locked and removed from the cspace and/or cdt).
========================================
CALL/REPLY CAPABILITIES
========================================
slot 1 = capability to lcd's endpoint for receiving replies
slot 2 = (dynamic) capability to caller's reply endpoint, during call/reply
......@@ -97,7 +97,7 @@ and do the following:
[ 1 ] aclocal -I m4 && automake --add-missing --copy && autoconf
[ 2 ] ./configure --prefix=/ --program-prefx=lcd-
[ 2 ] ./configure --prefix=/ --program-prefix=lcd-
[ 3 ] make
......
......@@ -4,108 +4,7 @@
#include <asm/vmx.h>
#include <linux/spinlock.h>
#include <linux/bitmap.h>
/* ADDRESS SPACE TYPES ---------------------------------------- */
/* XXX: Assuming host and guest run in 64-bit mode */
typedef struct { unsigned long gva; } gva_t;
typedef struct { unsigned long hva; } hva_t;
typedef struct { unsigned long gpa; } gpa_t;
typedef struct { unsigned long hpa; } hpa_t;
static inline gva_t __gva(unsigned long gva)
{
return (gva_t){ gva };
}
static inline unsigned long gva_val(gva_t gva)
{
return gva.gva;
}
static inline unsigned long * gva_ptr(gva_t * gva)
{
return &(gva->gva);
}
static inline gva_t gva_add(gva_t gva, unsigned long off)
{
return __gva(gva_val(gva) + off);
}
static inline hva_t __hva(unsigned long hva)
{
return (hva_t){ hva };
}
static inline unsigned long hva_val(hva_t hva)
{
return hva.hva;
}
static inline unsigned long * hva_ptr(hva_t * hva)
{
return &(hva->hva);
}
static inline hva_t hva_add(hva_t hva, unsigned long off)
{
return __hva(hva_val(hva) + off);
}
static inline gpa_t __gpa(unsigned long gpa)
{
return (gpa_t){ gpa };
}
static inline unsigned long gpa_val(gpa_t gpa)
{
return gpa.gpa;
}
static inline unsigned long * gpa_ptr(gpa_t * gpa)
{
return &(gpa->gpa);
}
static inline gpa_t gpa_add(gpa_t gpa, unsigned long off)
{
return __gpa(gpa_val(gpa) + off);
}
static inline hpa_t __hpa(unsigned long hpa)
{
return (hpa_t){ hpa };
}
static inline unsigned long hpa_val(hpa_t hpa)
{
return hpa.hpa;
}
static inline unsigned long * hpa_ptr(hpa_t * hpa)
{
return &(hpa->hpa);
}
static inline hpa_t hpa_add(hpa_t hpa, unsigned long off)
{
return __hpa(hpa_val(hpa) + off);
}
static inline hpa_t pa2hpa(unsigned long pa)
{
return (hpa_t){ pa };
}
static inline hpa_t va2hpa(void *va)
{
return (hpa_t){ __pa(va) };
}
static inline void * hpa2va(hpa_t hpa)
{
return __va(hpa_val(hpa));
}
static inline hva_t hpa2hva(hpa_t hpa)
{
return (hva_t){ (unsigned long)__va(hpa.hpa) };
}
static inline void * hva2va(hva_t hva)
{
return (void *)hva_val(hva);
}
static inline hva_t va2hva(void *va)
{
return __hva((unsigned long)va);
}
static inline hpa_t hva2hpa(hva_t hva)
{
return (hpa_t){ (unsigned long)__pa(hva2va(hva)) };
}
#include <lcd-domains/types.h>
/* LCD ARCH DATA STRUCTURES ---------------------------------------- */
......@@ -163,16 +62,45 @@ struct lcd_arch_tss {
u8 io_bitmap[1];
} __attribute__((packed));
struct lcd_arch_thread {
/*
* Containing lcd_arch
*/
struct lcd_arch *lcd_arch;
/*
* List of lcd_arch_thread's inside the containing lcd_arch.
* Protected by lock inside lcd_arch.
*/
struct list_head lcd_arch_threads;
/*
* Guest Physical Memory Layout & Segment Registers
* ================================================
*
* No gdt/tss/idt for now (easier), but perhaps in the future (see
* Documentation/lcd-domains/vmx.txt). We can get away with this since
* we set all of the hidden fields in the segment registers - like %fs, %gs,
* etc.
*
* See Intel SDM V3 26.3.1.2, 26.3.1.3 for register requirements
* See Intel SDM V3 3.4.2, 3.4.3 for segment register layout
* See Intel SDM V3 2.4.1 - 2.4.4 for gdtr, ldtr, idtr, tr
*
* +---------------------------+ 0xFFFF FFFF FFFF FFFF
* | |
* : :
* : Free / Unmapped :
* : :
* | |
* LCD_ARCH_TOP,---------> +---------------------------+ 0x0000 0000 0000 1000
* LCD_ARCH_BOTTOM | Reserved |
* | (not mapped, catch NULLs) | (4 KBs)
* +---------------------------+ 0x0000 0000 0000 0000
*/
#define LCD_ARCH_BOTTOM (1 << 12)
#define LCD_ARCH_TOP LCD_ARCH_BOTTOM
#define LCD_ARCH_FS_BASE __gpa(0UL)
#define LCD_ARCH_FS_LIMIT 0xFFFFFFFF
#define LCD_ARCH_GS_BASE __gpa(0UL)
#define LCD_ARCH_GS_LIMIT 0xFFFFFFFF
#define LCD_ARCH_GDTR_BASE __gpa(0UL)
#define LCD_ARCH_GDTR_LIMIT 0x0 /* no gdt */
#define LCD_ARCH_TSS_BASE __gpa(0UL)
#define LCD_ARCH_TSS_LIMIT 0x0 /* no tss */
#define LCD_ARCH_IDTR_BASE __gpa(0UL)
#define LCD_ARCH_IDTR_LIMIT 0x0 /* no idt right now */
struct lcd_arch {
/*
* CPU we're running on / vmloaded on
*/
......@@ -190,6 +118,15 @@ struct lcd_arch_thread {
* vmcs data structure; *must* be accessed using vmread / vmwrite
*/
struct lcd_arch_vmcs *vmcs;
/*
* Guest physical address space
*/
struct {
struct mutex lock;
lcd_arch_epte_t *root;
u64 vmcs_ptr; /* to be loaded in vmcs EPT_POINTER field */
bool access_dirty_enabled;
} ept;
/*
* Exit info
......@@ -227,63 +164,6 @@ struct lcd_arch_thread {
} msr_autoload;
};
/*
* Guest Physical Memory Layout & Segment Registers
* ================================================
*
* No gdt/tss/idt for now (easier), but perhaps in the future (see
* Documentation/lcd-domains/vmx.txt). We can get away with this since
* we set all of the hidden fields in the segment registers - like %fs, %gs,
* etc.
*
* See Intel SDM V3 26.3.1.2, 26.3.1.3 for register requirements
* See Intel SDM V3 3.4.2, 3.4.3 for segment register layout
* See Intel SDM V3 2.4.1 - 2.4.4 for gdtr, ldtr, idtr, tr
*
* +---------------------------+ 0xFFFF FFFF FFFF FFFF
* | |
* : :
* : Free / Unmapped :
* : :
* | |
* LCD_ARCH_TOP,---------> +---------------------------+ 0x0000 0000 0000 1000
* LCD_ARCH_BOTTOM | Reserved |
* | (not mapped, catch NULLs) | (4 KBs)
* +---------------------------+ 0x0000 0000 0000 0000
*/
#define LCD_ARCH_BOTTOM (1 << 12)
#define LCD_ARCH_TOP LCD_ARCH_BOTTOM
#define LCD_ARCH_FS_BASE __gpa(0UL)
#define LCD_ARCH_FS_LIMIT 0xFFFFFFFF
#define LCD_ARCH_GS_BASE __gpa(0UL)
#define LCD_ARCH_GS_LIMIT 0xFFFFFFFF
#define LCD_ARCH_GDTR_BASE __gpa(0UL)
#define LCD_ARCH_GDTR_LIMIT 0x0 /* no gdt */
#define LCD_ARCH_TSS_BASE __gpa(0UL)
#define LCD_ARCH_TSS_LIMIT 0x0 /* no tss */