cap.txt 16.4 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51

========================================
OVERVIEW
========================================

An LCD refers to objects managed by the microkernel - like a page of memory -
using an integer identifier, similar to a file descriptor; these are called
*capability pointers*, or cptr_t's.

The microkernel resolves these integer identifiers to the objects and the
LCD's rights using a capability space, or *cspace*. Each LCD has a capability 
space.

A cspace is a radix-tree-like data structure (but the look up is a bit 
different). Each node in the tree contains multiple slots that contain 
pointers to further nodes, or object and rights data. The
slots are called *cnodes* and the nodes in the tree are called *cnode tables*.
So, each cnode is either empty, contains a pointer to another cnode table, or
contains object and rights data.

An LCD can grant a capability to another LCD. Suppose LCD A grants a capability
to LCD B. When this happens, the cnode in LCD A's cspace becomes the parent
of the cnode in LCD B's cspace, in a *capability derivation tree*, or cdt.
If LCD B grants the rights on the same object to LCD C, the cnode in LCD C's
cspace becomes a child of LCD B's cnode in the cdt, and so on. If LCD A
decides to revoke rights, the microkernel will recursively delete the
capabilities in LCD B's and LCD C's cspaces, using the cdt.

So, cspaces are contained in an LCD, but cdts have pointers that span across 
cspaces.

========================================
OPERATIONS
========================================

cspace init/destroy
-------------------

Called by the microkernel when the LCD is created/destroyed.

insert
------

When an object is first created / introduced into the microkernel, it is
inserted into a cnode in the calling LCD's cspace, using lcd_cap_insert. The 
calling LCD should provide a cptr to indicate where the object is placed. The
microkernel will initialize a cdt with the cnode at the root.

grant
-----

52 53 54 55 56 57 58
Grant can occur at two times: when an lcd is being created, and during ipc.

If LCD A is creating LCD B, A can grant capabilities to B using lcd_cap_grant.
LCD A is responsible for notifying B where the capabilities are in B's
cspace, by some agreed upon protocol.

During ipc, LCD A can grant rights to LCD B. This is how it works:
59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155

   -- Suppose LCD A has a capability to an object already. The capability
      is stored in a cnode and referenced by cptr1.

   -- Suppose LCD B has an empty cnode slot in its cspace referred to by
      cptr2.

   -- LCD B puts cptr2 in its ipc buffer and does an ipc receive, to indicate
      it will accept a granted capability into the cnode at cptr2.

   -- LCD A puts cptr1 in its ipc buffer and does an ipc send, to indicate it
      will grant the capability that is in the cnode at cptr1.

   -- The microkernel matches A and B up, and invokes lcd_cap_grant. This will
      copy the object data from A's cspace to B's cspace, and make the cnode
      at cptr2 a child of the cnode at cptr1 in the cdt for the object.

This convoluted technique is used to ensure that LCD A and LCD B are agreeing
to share rights - LCD A is willing to grant, and LCD B is willing to receive
the grant and make room in its cspace.

revoke
------

At some point later, LCD A may want to revoke the rights it granted to
LCD B. It does so using a call into the microkernel, which, in turn, will
invoke lcd_cap_revoke. This will delete LCD B's cnode from the cdt for the
object, clear the cnode for later use, and update the state of LCD B to
reflect the change in rights (e.g., if the capability was for a page, and LCD B
had mapped the page, the page will become unmapped).

** Note: a revoke does not delete the capability itself, only its children. This
is what seL4 does, for example. **

get/put
-------

The microkernel may invoke lcd_cnode_get when LCDs invoke methods on certain
objects to confirm they have rights to do so. Invoking lcd_cnode_get will
lock the cnode, but not the underlying object. It should be matched with
lcd_cnode_put. Because we are using mutexes (see "Locking" below), interrupts
are not disabled while in a "cnode critical section" - so beware! We may
change the lock types in the future - but mutexes are easier and more forgiving
than hard spinlocks.

========================================
WARNINGS
========================================

[ 1 ]

It is possible for an LCD to acquire mutliple capabilities to the same object.
For example, an LCD could get 3 capabilities to the same page. Each time a
capability to the page is deleted, the microkernel will unmap the page if
necessary. If the page is mapped 3 times to 3 different places or the same
place, the microkernel will be try to unmap each time, but won't crash.

The microkernel is carefully designed to handle this weird case for the 
objects it currently manages, but if you add new kinds of objects, beware!

The cspace/cdt data structures can also accomodate this weird case too.

(Note: It's not possible for an object to be inserted mutliple times - and
lead to multiple cdts - because the microkernel does the insertion.)

========================================
COMPARISON TO seL4
========================================

[ 1 ]

Unlike seL4, cspaces cannot be shared, and the microkernel manages
all aspects of cspace creation. In seL4, threads can create cnodes using
untype/retype, and dynamically build a cspace, using the seL4 microkernel
interfaces.

To make things doable, we don't allow LCDs to manage microkernel objects at the
granularity that seL4 allows - seL4 allows threads to create page tables, cnodes
in cspaces, and many other kinds of microkernel objects. Why don't we allow
this? Because the threads do all of the set up, so the microkernel doesn't
know (or has to do a lot of work and maintain extra state to know) how to tear 
it all down. For example, if an LCD sets up a big guest physical paging 
hierarchy, and loses rights on one of the page directories, the microkernel 
would have to know which directories to unmap it from, etc. seL4 handles all 
of this using some tricks, but the developers list caveats and limitiations,
and in general it's hard.

Punchline: it's complicated. Maybe this will change in the future.

[ 2 ]

In seL4, threads populate cspaces using mint and copy. Because we aren't
having threads / LCDs create and set up cspaces to the extent that seL4
does, we do the copying on behalf of the LCDs when rights are granted. (As
mentioned in "Warnings", it is still possible for an LCD to have multiple
capabilities to the same object.)

156 157 158 159 160
[ 3 ]

In our microkernel, an LCD can grant capabilities to an LCD it is creating,
using lcd_cap_grant.

161 162 163 164
========================================
CPTR CACHE
========================================

165 166 167 168 169 170 171
kliblcd (and soon, liblcd) contain cptr cache implementations, so that the 
other code inside an LCD doesn't need to track which cnodes are used. Not every
integer is a valid cptr, so the allocation is a bit complicated. See the
"Capability Space Radix Tree" below for an overview of cptr indexing.

The cache contains a bitmap for each level of the cspace tree. This makes it
easy to generate valid cptr's and track allocation/free's.
172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198

========================================
CAPABILITY SPACE RADIX TREE
========================================

The cspace is for resolving a cptr, or index, to a capability / cnode.

A single lock in the struct cspace protects "radix tree traversal", and
individual locks for each cnode (slots in the cnode table) protect each
cnode.

Each node is a *cnode table* of slots with the following layout:

     ______ cap slots _______   ___ table id slots _____
    /                        \ /                        \
    +---+---+-- ... --+---+---+---+---+-- ... --+---+---+
    |   |   |         |   |   |   |   |         |   |   |
    +---+---+-- ... --+---+---+---+---+-- ... --+---+---+

There are LCD_CNODE_TABLE_NUM_SLOTS slots - half are slots for capabilities,
the other half for further pointers to more tables (so the number of slots
should be a power of 2). The cspace contains a root cnode table to start it off.

The cspace is built dynamically by the microkernel as slots are referenced by 
an LCD. (This is different from seL4 - threads in seL4 are responsible for
building the cspace using the interface provided by the microkernel.)

199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
[See notes at the end if you're wondering why this is so complicated.]

The shape of the cspace is controlled by three parameters (in internal.h):

      LCD_CPTR_DEPTH_BITS   -   controls number of levels in cspace 

                             0 bits :   only root table (2^0 levels)
                             1 bit  :   root and one more level (2^1 levels)
                             2 bits :   root and three more levels (2^2 levels)

      LCD_CPTR_FANOUT_BITS  -   controls how many table slots are in the
                                cnode tables - in other words, the fanout

                             0 bits :   one table slot (2^0)
                             1 bit  :   two table slots (2^1)
                             2 bits :   four table slots (2^2)

      LCD_CPTR_SLOT_BITS    -   controls how many cap slots are in the
                                cnode tables; similar to table slots above

An index, or cptr, encodes the location of a cnode in the cspace. The encoding
includes the level of the table that contains the slot; the fanout "path" to
get to that table; and the slot index inside the table.
222

223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290
The lookup is just like a radix tree lookup (starting from LSB instead of MSB), 
but the level bits tell us how far to go / how many fanout bits are meaningful.

If all of the parameters are 2 (2 bits each), the cptr bit layout is the 
following:
                                        ____ cap slot index bits
                                       /
                                      /
                         00 00 00 00 00
                         /   |  |  |
                        /    |  |  |
            level -----'     fanout path (like a radix tree path)

So, in LSB order, the slot index bits come first, then the fanout path, then
the level. The level and slot index bits are always in the same position. The
interpretation of the fanout path bits depends on the level.

For example, if the cptr is

                         11 11 00 10 01

the slot index is 01 = 1, and the level of the table is 11 = 3. All three
pairs of fanout bits are used to traverse from the root cnode table to the
final cnode table:

         [1] From the root cnode table, we look at the first pair of fanout
             bits (in LSB order) = 10 = 2; so we follow the 2nd table slot
             (zero indexed) in the root cnode table to level 1

         [2] From the level 1 cnode table, we look at the next pair of fanout
             bits = 00 = 0; so we follow the 0th table slot in this cnode
             table to level 2

         [3] Finally, from level 2, we look at the next pair of fanout
             bits = 11 = 3; so we follow the 3rd table slot to arrive
             at the cnode table in level 3

We now use the cap lot index bits = 01 = 1 to look up the capability.

Another example: if the cptr is

                         01 00 00 01 11

the level is 1, and only the first pair (01) of fanout bits are meaningful.
Starting in the root cnode table,

    [1] we see that the level = 01 = 1 > 0, so we look at the first pair of
        fanout bits = 01 = 1; we follow the pointer in the 1st table slot
        to level

    [2] we now see that the level of the cptr = the level we are at, 1, so
        we now use the slot bits to look up the slot in the table (11 = 3).

Notes
-----

It looks more complicated than it is. It's just radix tree traversal with
a depth check.

We tried a similar technique, but without using the level bits. This lead to
problems: since a cptr may refer to different levels, it became hard to
know when to stop the radix tree-like traversal.

As an alternative, we tried taking triples of bits from LSB to MSB: If the
high bit was set in a triple, this indicated to keep traversing and look at the
next triple; otherwise, stop. But it's hard to convert a number like 15 to an 
index of this type. And this makes alloc/free of cptrs hard (talk to Charlie 
for more details).
291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383

========================================
CAPABILITY DERIVATION TREE
========================================

When an object is created and inserted into the LCD A's cspace, a new cdt is
created, and the first cnode is made the root. If LCD A grants rights to
LCD B, the cnode in LCD B's cspace is made a child of the cnode in A's
cspace. And so on. The cdt can look like this:


   - LCD A's cspace -         
   |                |           - LCD B's cspace -
   |     .          |           |                |
   |    .           |           |                |
   |   /            |           |                |
   | +-+            |           |     .          |
   | +-+            |           |      .         |
   `- \ ------------'           |       \        |
       `------------------------------> +-+      |
       |                        |       +-+      |
       |                         `-------|-------'
       |      - LCD C's cspace -         |
       |      |                |         |
       |      |   .            |         |    - LCD D's cspace -
       |      |    \           |         |    |                |
       `---------> +-+         |         |    |        .       |
              |    +-+         |         |    |       .        |
              `----------------'         |    |      /         |
                                         |    |     +-+        |
                                          `-------> +-+        |
                                              `----------------'

If LCD A does a revoke, it will delete the capabilities (clear the cnodes)
in B, C, and D's cspaces.

Each populated cnode that contains a capability also contains a pointer to
the cdt (so if the cnode is cleared, it can be removed from the cdt, etc.).
The cdt is protected by a lock.

========================================
LOCKING
========================================

We use mutexes with locking interruptible throughout - for now - so that we 
don't get into nasty deadlocks that require a reboot.

There are three primary locks - one lock per cspace, one lock per cdt, and
one lock per cnode.

cspace locks
------------

The lock on the cspace provides mutual exclusion on cspace radix tree traversal
and modifications. For example, an insert and grant cannot concurrently 
traverse the cspace, because an insert may create cnode tables on the fly, and 
the grant may incorrectly traverse those new tables. (It may not be possible
for a call to insert and grant that involve the same thread to happen
concurrently. The grant would require an lcd to be in an endpoint queue, so
unless we had asynchronous endpoints, the insert couldn't be called
concurrently. But the next example shows that the lock is necssary.)

As another example, an lcd may be in an endpoint queue while it is being
torn down. Suppose it gets dequeued, and capabilities are being transferred.
This requires lookups in the lcd's cspace. These lookups should fail.
Eventually, once the tear down process reaches the cnode that refers to the
endpoint, the code will ensure the lcd is not in the queue; but until then,
all of this can happen. This is why we mark the cspace as invalid and release
the lock, so that other threads like this can make progress but see that the
lcd is going away.

cdt locks
---------

The lock on a cdt provides mutual exclusion on cdt traversal and updates. For
example, lcd 1 may be granting rights to lcd 2, but lcd 3 might be revoking
rights to lcd 1; these operations need to be serialized (we only want to see
two possible outcomes: lcd 1 grant succeeds, and then lcd 3 revokes all rights;
or lcd 3 revoke succeeds, and then lcd 1 grant fails).

During a revoke, when cnodes are removed from a cdt, the microkernel state is
updated to reflect that change in rights. ** It must be possible to update the
microkernel state while the cdt is locked. ** For the current objects, this is
ok, but beware.

cnode locks
-----------

The lock on a cnode provides mutual exclusion on it, and also prevents the
containing cdt and cspace from going away while it is in use (they can't go
away until the cnode is locked and removed from the cspace and/or cdt).

========================================
384
SPECIAL CPTRS / CAPABILITIES
385 386
========================================

387 388 389 390 391
To be implemented soon:

cptr 0 = null, always invalid
cptr 1 = capability to lcd's endpoint for receiving replies
cptr 2 = (dynamic) capability to caller's reply endpoint, during call/reply