NEW FEATURES:

1. Accommodate new usage.

   There are a couple of other places where we could make good use of
   frisbee technology.

   One is in the Emulab infrastructure for using frisbee itself!
   Our current technique for running frisbee involves booting a FreeBSD
   system out of a memory filesystem.  So each node downloads not only a
   FreeBSD kernel but a multi-megabyte MFS image as well.  We could
   modify the FreeBSD bootloader (from which our initial pxeboot program
   is derived) to support a frisbee client.  That way, multiple nodes
   could download their kernels and MFSs more efficiently and we could
   scale better.  The big problem here is that the bootloader uses the PXE
   API for its network support and it isn't clear that the UDP-level
   services it provides support multicast.  The lower-level (UNDI) services
   do, so that might be a possibility.  Another possibility is to use
   stub "real" drivers, say from NetBSD.

   Another use is for distributing arbitrary files.  In the Emulab
   context this would be useful for implementing the features allowing
   installation of tar and rpm files.  We could distribute the files via
   multicast to all nodes in an experiment at once.  A couple of things
   are needed for this.  One is a discovery protocol in frisbee itself.
   A frisbee client should be able to contact a known address and ask
   for an addr/port to use to download a specific file.  While we can
   get this info "out of band" (like we do now), it would be generally
   useful to include this in frisbee itself.  The other needed feature
   is the ability for the frisbee server to generate images on the fly.
   I would propose that it not bother with compression when creating the
   image, it would only do the chunking.  Three reasons for this: 1) it is
   simpler and more efficient for the server, 2) the types of files I
   envision us really wanting to distribute this way are already
   compressed (tgz and rpm files), and 3) it makes out-of-order delivery
   of blocks much easier.  To elaborate on the final point, one of the
   nice things about frisbee is that clients can request blocks in any
   order at any time.  For static images, the frisbee server can
   handle this easily: every chunk is at a fixed offset in the image
   file.  If we were to dynamically generate compressed images in the
   server, and the first request it got was for the final chunk of the
   image, the server would have to compress/chunk the entire source
   file before being able to present that chunk.  By not compressing,
   we know in advance, for any chunk, how much and which data will go
   into that chunk.  Note that we could just say that exactly 1MB of
   data will go into every chunk and still compress that, but there
   is no point since we pad out every chunk to 1MB.

   [ The discovery protocol is now done.  On-the-fly image creation has
     not been done; though we can distribute arbitrary files, we do so
     without creating imagezip-format images. ]

   Now one could imagine a super, caching frisbee server that creates
   compressed images on the fly and caches them for later use.  Perhaps
   it would become more of a chunk server where it caches every chunk
   it has ever served up, along with a hash of the chunk.  Here when a client
   requests a specific file, we send back not only the addr/port to use
   for the request, we also send back a list of chunk hashes for every
   chunk in the file.  This is very like the HCP work at Stanford.

   [ 10/22/09: a quick investigation of 171 unique images from our standard
     collection (/usr/testbed/images) shows very little commonality (3068 of
     94379 chunks, 3%, were repeated).  This is a bit surprising for images
     that are derived from each other. ]
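
   A very rough sketch of what such a chunk cache might look like
   (hypothetical structures and names, not existing frisbee code): chunks
   are indexed by a digest of their contents, so the reply to a file
   request can carry the list of hashes and any cached chunk can be
   reused across files:

        #include <stdlib.h>
        #include <string.h>

        #define HASH_LEN  20                   /* e.g. a SHA-1 digest */

        struct cached_chunk {
            unsigned char        hash[HASH_LEN];
            char                *data;         /* one chunk's worth of bytes */
            struct cached_chunk *next;
        };

        static struct cached_chunk *cache_head;

        /* Return cached data for a hash, or NULL if we must generate it. */
        char *
        chunk_lookup(const unsigned char hash[HASH_LEN])
        {
            struct cached_chunk *cc;

            for (cc = cache_head; cc != NULL; cc = cc->next)
                if (memcmp(cc->hash, hash, HASH_LEN) == 0)
                    return cc->data;
            return NULL;
        }

        void
        chunk_insert(const unsigned char hash[HASH_LEN], char *data)
        {
            struct cached_chunk *cc = malloc(sizeof(*cc));

            if (cc == NULL)
                return;                        /* sketch: just skip caching */
            memcpy(cc->hash, hash, HASH_LEN);
            cc->data = data;
            cc->next = cache_head;
            cache_head = cc;
        }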

ENHANCEMENTS:

1. Better throttling of block re-requests at the client.

   Currently the client has a variable, redodelay, a constant value that
   is supposed to estimate how long a block request might legitimately be
   outstanding.  Ideally, this value depends on the current length of the
   server's request queue; i.e., we should wait long enough for every
   request already at the server to be processed before it gets to ours.

   Since we are tracking requests and replies (blocks), estimate the length
   of the server request queue.  First approximation: chunk granularity.
   When we see any (full or partial) request for a chunk, mark the chunk
   outstanding.  When we see any block for a chunk, mark it no longer
   outstanding.  The number of outstanding chunks at any time is an
   estimate of the queue size.  Good news: return blocks lost due to
   congestion will keep the queue size high, reducing the request rate.
   Bad news: we mark a chunk no longer outstanding after the first block
   we receive for that chunk, decrementing the queue size even though much
   of the chunk has not yet been sent.  In fact, if we lose all but one
   returned block from a chunk, the chunk is still marked as no longer
   outstanding.  Better approximation: block granularity.  Keep an accurate
   map of outstanding blocks.  When we see a request, record the blocks.
   When we see a block come in, remove it.  This function could be combined
   with the current request aging mechanism.  Good news: offers a
   conservative estimate of server queue size, and lost reply blocks keep our
   queue-size estimate high, reducing congestion.  Bad news: it requires space
   proportional to the compressed image size to track; for each 1KB block of
   data we might have as much as 8 bytes of tracking info, only two orders of
   magnitude less than the data itself, e.g. a 1GB image requiring roughly
   10MB of tracking info.

   Alternative: measure the rate of partial chunk requests based on observed
   requests, similar to what the server does.  We back off (increase redodelay)
   when the rate is too high.  Good news: it is symmetric with what the server
   currently does.  Bad news: harder to map this rate to an adjustment than
   it is with the queue-size-estimate method.
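
   A minimal sketch of the block-granularity bookkeeping, assuming one bit
   per block and made-up limits (MAXCHUNKS, BLOCKSPERCHUNK); the count of
   set bits is the server queue estimate that redodelay could scale with.
   The per-request timestamps needed for the existing aging mechanism are
   not shown and would add to the space cost:

        #define MAXCHUNKS      1024            /* made-up image size limit */
        #define BLOCKSPERCHUNK 1024

        static unsigned char outstanding[MAXCHUNKS][BLOCKSPERCHUNK / 8];
        static int outstanding_blocks;         /* our server queue estimate */

        static void
        request_seen(int chunk, int block)     /* any client requested this block */
        {
            unsigned char bit = 1 << (block % 8);

            if ((outstanding[chunk][block / 8] & bit) == 0) {
                outstanding[chunk][block / 8] |= bit;
                outstanding_blocks++;
            }
        }

        static void
        block_seen(int chunk, int block)       /* the block arrived */
        {
            unsigned char bit = 1 << (block % 8);

            if (outstanding[chunk][block / 8] & bit) {
                outstanding[chunk][block / 8] &= ~bit;
                outstanding_blocks--;
            }
        }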

2. Auto-adjust readahead on the client.

   Similar to #1, the client should track the level of activity on the
   server and increase its readahead accordingly.  For example, if we are
   the only client, we could increase our readahead.
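
   For example (purely illustrative thresholds; maxreadahead and the queue
   estimate from #1 are assumed to exist):

        extern int maxreadahead;               /* current readahead limit */
        extern int outstanding_blocks;         /* server queue estimate from #1 */

        static void
        adjust_readahead(void)
        {
            if (outstanding_blocks < 256) {    /* queue looks idle */
                if (maxreadahead < 8)
                    maxreadahead++;            /* we can be greedier */
            } else if (maxreadahead > 2)
                maxreadahead--;                /* others are active, back off */
        }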

3. Eliminate client-side copy of compressed data.

   Right now we read packets into a local packet buffer and then, for
   BLOCK messages, copy the data out to the chunk buffers.  This results
   in a complete copy of the compressed data.  If we make a chunk buffer
   into an array of pointers to data buffers, we can read packets into
   these data buffers and link them straight into the chunk buffers.
   The downside is that we must modify the already gruesome decompression
   loop to deal with input buffer boundaries in addition to region
   and writer buffer boundaries.
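
   Sketched out, the change would look something like this (names here are
   illustrative, not the actual client structures):

        #define BLOCKSIZE      1024
        #define BLOCKSPERCHUNK 1024

        struct chunkbuf {
            int   thischunk;                   /* chunk number being assembled */
            int   blkcount;                    /* blocks still missing */
            char *blocks[BLOCKSPERCHUNK];      /* point into received packets */
        };

        /*
         * On a BLOCK message, instead of memcpy()ing the packet data into
         * a flat 1MB buffer, link the packet's data buffer in place and
         * allocate a fresh packet buffer for the next receive:
         *
         *     cbuf->blocks[blocknum] = pktdata;
         *
         * The decompression loop must then walk blocks[] and handle input
         * that is no longer contiguous across block boundaries.
         */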

4. Multi-thread the frisbee server.

   We can make our network output intervals more consistent if we
   separate the disk reader from the network writer.  This would also have
   a performance benefit for the imageunzip program, which currently
   combines the reader and decompressor and has only a separate writer
   thread.
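
   A bare-bones sketch of the split, assuming a small producer/consumer
   queue of chunk-sized buffers between a disk-reader thread and the
   network-writer thread (all names invented):

        #include <pthread.h>

        #define QSIZE 4

        static struct workq {
            char           *bufs[QSIZE];       /* chunks ready to send */
            int             head, tail, count;
            pthread_mutex_t lock;
            pthread_cond_t  notempty, notfull;
        } q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                .notempty = PTHREAD_COND_INITIALIZER,
                .notfull = PTHREAD_COND_INITIALIZER };

        void
        enqueue(char *buf)                     /* called by the disk reader */
        {
            pthread_mutex_lock(&q.lock);
            while (q.count == QSIZE)
                pthread_cond_wait(&q.notfull, &q.lock);
            q.bufs[q.tail] = buf;
            q.tail = (q.tail + 1) % QSIZE;
            q.count++;
            pthread_cond_signal(&q.notempty);
            pthread_mutex_unlock(&q.lock);
        }

        char *
        dequeue(void)                          /* called by the network writer */
        {
            char *buf;

            pthread_mutex_lock(&q.lock);
            while (q.count == 0)
                pthread_cond_wait(&q.notempty, &q.lock);
            buf = q.bufs[q.head];
            q.head = (q.head + 1) % QSIZE;
            q.count--;
            pthread_cond_signal(&q.notfull);
            pthread_mutex_unlock(&q.lock);
            return buf;
        }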

5. Investigate large block/chunk sizes.

   Most important would be to increase the block size from 1024 to something
   approaching the 1448-byte max (given the current packet format).
   Constraint: the number of blocks in a chunk should be a multiple of 8
   since we use a bitmap to track blocks.  This is not strictly necessary,
   it would just be nice, and the BlockMap routines might require a little
   tweaking otherwise.  Maybe it should be a multiple of 32 to ensure the
   bitmap is a multiple of 4 in size.  Large chunk issues: 1) more potential
   wasted space per chunk, though mostly only in the last chunk, 2) it takes
   longer to accumulate
   chunks at the client, potentially idling the decompressor and writer,
   3) it takes more space to accumulate chunks, allowing for fewer in-progress
   chunks.  So maybe 1448B/blk * 768 blks/chunk == 1.06MB/chunk.  PREQUEST
   BlockMaps come down from 128 bytes to 96.

   [ Support for jumbo packets has been done.  This increases the block
     size to 8192 and reduces the blocks/chunk to 128 to keep the chunk
     size constant (so that we can continue to distribute our existing
     images).  Currently this requires static re-compilation of both the
     client and server, though some support has been put in for negotiating
     the blocksize (in join v2 messages). ]
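
   For reference, the arithmetic for the 1448-byte-block case works out as
   follows (candidate values only, not committed constants):

        #define BLOCKSIZE      1448                         /* near the packet max */
        #define BLOCKSPERCHUNK 768                          /* a multiple of 32 */
        #define CHUNKSIZE      (BLOCKSIZE * BLOCKSPERCHUNK) /* 1,112,064 bytes, ~1.06MB */
        #define BITMAPSIZE     (BLOCKSPERCHUNK / 8)         /* 96 bytes vs. 128 today */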

6. Dynamic rate pacing in the server.

   Our attempts to date have been pretty feeble.  I think we have a
   reasonable loss metric now; we just need a smooth weighted decay formula
   we can use.  Look at the proposed standard TCP-friendly rate equation.
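
   For reference, the TCP-friendly rate equation from RFC 3448, as a sketch
   of what the pacing code might compute (mapping our loss metric onto the
   loss event rate p is the open question):

        #include <math.h>

        /*
         * s = packet size in bytes, rtt = round-trip time in seconds,
         * p = loss event rate (0 < p <= 1), b = packets acked per ACK (1),
         * t_rto ~= 4 * rtt.  Returns an allowed send rate in bytes/second.
         */
        double
        tfrc_rate(double s, double rtt, double p, double b)
        {
            double t_rto = 4.0 * rtt;

            return s / (rtt * sqrt(2.0 * b * p / 3.0) +
                        t_rto * (3.0 * sqrt(3.0 * b * p / 8.0)) *
                        p * (1.0 + 32.0 * p * p));
        }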

7. Partial last chunks.

   This is way-nit-like compared to the others.  The "must be a multiple
   of 1MB (chunksize)" requirement for images can be onerous for the
   so-called delta images which can be only a couple hundred KB.  By allowing
   the last chunk in the image file to be partial, and having imageunzip
   and frisbeed pad it out, we can save a lot of disk space for these
   small images.

   [ DONE ]
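
   The rounding involved is trivial; something like this (illustrative
   only, not the actual imageunzip/frisbeed code):

        #include <sys/types.h>

        #define CHUNKSIZE (1024 * 1024)

        /* size the image occupies once the partial last chunk is padded out */
        static off_t
        padded_size(off_t filesize)
        {
            return ((filesize + CHUNKSIZE - 1) / CHUNKSIZE) * CHUNKSIZE;
        }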

8. Allow a server to serve multiple unicast clients.

   Right now an instance of the server not only serves just a single
   image, it serves it to only a single destination address.  This is
   reasonable for broadcast/multicast but is overly restrictive for unicast.
   Changing this should be minor: we just need to keep track of destinations
   (addr/port) in the queue along with block ranges to send out.  We would
   need to back off the queue optimizations where we combine incoming
   requests with those already in the queue (i.e., now we would also have
   to make sure that they are for the same destination before we combine).
   Minor changes would be needed to PacketSend/Recv to track the client
   IP/port rather than just assuming/using the global mcastaddr/portnum.

   One nice side-effect of this is that the client would no longer have
   to bind to a specific port in the unicast case, since each reply from
   the server could be tagged with the proper destination port.
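
   The queue entry might grow along these lines (a sketch with invented
   field names, not the server's real structures):

        #include <netinet/in.h>

        struct queue_entry {
            struct in_addr      dest_addr;     /* client address */
            in_port_t           dest_port;     /* client port */
            int                 chunk;
            int                 block;         /* first block requested */
            int                 count;         /* number of blocks */
            struct queue_entry *next;
        };

        /*
         * Merge an incoming request with a queued one only if it is for
         * the same destination (and chunk), e.g.:
         *
         *     if (qe->dest_addr.s_addr == new->dest_addr.s_addr &&
         *         qe->dest_port == new->dest_port && qe->chunk == new->chunk)
         *             ... combine block ranges as we do today ...
         */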

9. Allow the frisbee client to be used in a pipe.

   If we could pipe the output of frisbee into another utility, this would
   make it more useful for arbitrary file distribution.  For example:
	frisbee -m <addr> -p <port> - | tar xzf -
   to download and unpack a tarfile.  The problem is out-of-order processing
   of chunks and there are a couple of ways around it.  Frisbee can already
   request chunks in-order, but it is also opportunistic and saves other
   chunk data it needs that other clients have requested.  We could just
   ignore that data and keep re-requesting blocks in order as we need them,
   or we could do some limited memory caching of incoming data; i.e., save
   but don't decompress chunks until we need them.  We could cache to disk
   as well, but then we don't really save anything over just frisbeeing into
   a tmp file and giving that to the next util in the pipeline.
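
   The limited-memory-caching variant could be as simple as this sketch
   (hypothetical names; chunks dropped for lack of a cache slot fall back
   on the normal re-request path):

        #include <stddef.h>

        #define NCACHED 8

        extern void decompress_to_pipe(char *data);    /* assumed consumer */

        static struct { int chunkno; char *data; } cache[NCACHED];
        static int nextchunk;                  /* next chunk the pipe needs */

        /* Called with each fully received chunk. */
        void
        chunk_arrived(int chunkno, char *data)
        {
            int i;

            if (chunkno != nextchunk) {
                for (i = 0; i < NCACHED; i++)
                    if (cache[i].data == NULL) {
                        cache[i].chunkno = chunkno;
                        cache[i].data = data;
                        return;
                    }
                return;                        /* no room: drop, re-request later */
            }

            decompress_to_pipe(data);
            nextchunk++;

            /* drain any cached chunks that are now in order */
            for (i = 0; i < NCACHED; i++)
                if (cache[i].data != NULL && cache[i].chunkno == nextchunk) {
                    decompress_to_pipe(cache[i].data);
                    cache[i].data = NULL;
                    nextchunk++;
                    i = -1;                    /* rescan from the start */
                }
        }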

PROBLEMS:

1. Have seen the clients run out of socket buffer space causing them
   to lose packets when still well short of the network bandwidth (at
   ~70Mb/sec).  Not sure why.  One thing we know is that the decompress
   thread will almost certainly run for a full scheduling interval (1ms)
   every time.  Thus we have to have enough buffering in the card and socket
   buffers to handle 1ms of data.  With the default params, we are only
   putting out 8 packets every 1ms, so that shouldn't be an issue.
   Assuming that we are getting it off the card in time, that means the
   network thread is either not running frequently enough, or it is spending
   too much time doing other things (like copying packet data, see #3 above).
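
   One thing worth trying is simply growing the socket receive buffer to
   cover a few scheduling intervals at the target bandwidth; a sketch (the
   sizing heuristic here is a guess, not a measured fix):

        #include <sys/types.h>
        #include <sys/socket.h>

        /* bump SO_RCVBUF to hold ~4ms of data at the target bandwidth */
        static int
        grow_rcvbuf(int sock, int target_mbps)
        {
            int bytes = (target_mbps * 1000000 / 8 / 1000) * 4;

            return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
        }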