NEW FEATURES:

1. Accommodate new usage.

There are a couple of other places where we could make good use of frisbee technology. One is in the Emulab infrastructure for running frisbee itself! Our current technique involves booting a memory-filesystem (MFS) based FreeBSD system, so each node downloads not only a FreeBSD kernel but a multi-megabyte MFS image as well. We could modify the FreeBSD bootloader (from which our initial pxeboot program is derived) to support a frisbee client. In this way, multiple nodes could download their kernels and MFSes more efficiently and we could scale better. The big problem here is that the bootloader uses the PXE API for its network support, and it isn't clear that the UDP-level services it provides support multicast. The lower-level (UNDI) services do, so that might be a possibility. Another possibility is to use stub "real" drivers, say from NetBSD.

Another use is for distributing arbitrary files. In the Emulab context this would be useful for implementing the features that allow installation of tar and rpm files: we could distribute the files by multicast to all nodes in an experiment at once. A couple of things are needed for this. One is a discovery protocol in frisbee itself: a frisbee client should be able to contact a known address and ask for an addr/port to use to download a specific file. While we can get this info "out of band" (as we do now), it would be generally useful to include it in frisbee itself. (A rough sketch of what such request/reply messages might look like appears at the end of this section.)

The other needed feature is the ability for the frisbee server to generate images on the fly. I would propose that it not bother with compression when creating the image; it would only do the chunking. Three reasons for this: 1) it is simpler and more efficient for the server, 2) the types of files I envision us really wanting to distribute this way are already compressed (tgz and rpm files), and 3) it makes out-of-order delivery of blocks much easier. To elaborate on the final point, one of the nice things about frisbee is that clients can request blocks in any order at any time. For static images the frisbee server can handle this easily, since every chunk is at a fixed offset in the image file. If we were to dynamically generate compressed images in the server and the first request received was for the final chunk of the image, the server would have to compress/chunk the entire source file before being able to produce that chunk. By not compressing, we know in advance, for any chunk, how much data will go into it and which data that is (see the chunk-offset sketch at the end of this section). Note that we could just say that exactly 1MB of data goes into every chunk and still compress that data, but there is no point since we pad every chunk out to 1MB anyway.

[ The discovery protocol is now done. On-the-fly image creation has not been done; though we can distribute arbitrary files, we do so without creating imagezip-format images. ]

Now one could imagine a super, caching frisbee server that creates compressed images on the fly and caches them for later use. Perhaps it would become more of a chunk server that caches every chunk it has ever served up along with a hash of the chunk. Here, when a client requests a specific file, we send back not only the addr/port to use for the request but also a list of chunk hashes for every chunk in the file. This is very much like the HCP work at Stanford.

[ 10/22/09: a quick investigation of 171 unique images from our standard collection (/usr/testbed/images) shows very little commonality (3068 of 94379 chunks, 3%, were repeated). This is a bit surprising for images that are derived from each other. ]
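As a concrete illustration of the discovery idea above, here is a minimal sketch of what the request/reply messages might look like. This is only an assumption for illustration; the port number, field names, and layout are made up and are not the protocol that was eventually implemented.

/*
 * Hypothetical discovery messages -- NOT the real wire format, just
 * an illustration of the idea: the client sends the pathname it
 * wants, the master server replies with the multicast (or unicast)
 * addr/port on which that file is (or will be) served.
 */
#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>

#define FRISBEE_DISCOVER_PORT 3494   /* hypothetical well-known port */
#define MAXIMAGEPATH          256

struct discover_request {
	uint32_t version;            /* protocol version */
	uint32_t flags;              /* e.g., "unicast is OK" */
	char     path[MAXIMAGEPATH]; /* file the client wants */
};

struct discover_reply {
	uint32_t version;
	int32_t  error;              /* 0 == OK, else errno-style code */
	struct in_addr addr;         /* address to join/send requests to */
	uint16_t port;               /* port for that address */
	uint16_t pad;
	uint32_t nchunks;            /* size of the image in chunks */
};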
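And a minimal sketch of why uncompressed, chunk-only images make out-of-order chunk requests cheap: the file range backing any chunk is a fixed computation. The 1KB per-chunk header used here is only an assumption for illustration; the real image format's header size may differ, and the function name is made up.

#include <sys/types.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNKSIZE    (1024 * 1024)          /* 1MB chunks */
#define CHUNKHDRSIZE 1024                   /* assumed header+padding per chunk */
#define CHUNKDATA    (CHUNKSIZE - CHUNKHDRSIZE)

/*
 * Given a chunk number and the total source file size, return the
 * file offset of the data for that chunk and how many bytes of real
 * data it holds (the last chunk may be partially filled and padded).
 */
static int
chunk_file_range(uint32_t chunkno, off_t filesize, off_t *offp, size_t *lenp)
{
	off_t off = (off_t)chunkno * CHUNKDATA;

	if (off >= filesize)
		return -1;                  /* no such chunk */
	*offp = off;
	*lenp = (filesize - off) >= CHUNKDATA ?
		CHUNKDATA : (size_t)(filesize - off);
	return 0;
}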
ENHANCEMENTS:

1. Better throttling of block re-requests at the client.

Currently the client has a variable, redodelay, a constant value which is supposed to estimate how long a block request might legitimately be outstanding. Ideally this value depends on the current length of the server's request queue; i.e., we should wait long enough for every request currently at the server to be processed before ours would be. Since we are already tracking requests and replies (blocks), we can estimate the length of the server request queue.

First approximation: chunk granularity. When we see any (full or partial) request for a chunk, mark the chunk outstanding. When we see any block for a chunk, mark it no longer outstanding. The number of outstanding chunks at any time is an estimate of the queue size. Good news: reply blocks lost due to congestion will keep the queue-size estimate high, reducing the request rate. Bad news: we mark a chunk no longer outstanding after the first block we receive for that chunk, decrementing the queue size even though much of the chunk has not yet been sent. In fact, if we lose all but one of the returned blocks for a chunk, the chunk is still marked as no longer outstanding.

Better approximation: block granularity. Keep an accurate map of outstanding blocks: when we see a request, record the blocks; when we see a block come in, remove it. This function could be combined with the current request-aging mechanism. (A rough sketch appears below, after item 4.) Good news: it offers a conservative estimate of the server queue size, and lost reply blocks keep our estimate high, reducing congestion. Bad news: it requires space proportional to the compressed image size; for each 1KB block of data we might have as much as 8 bytes of tracking info, only about two orders of magnitude smaller, e.g. roughly 8MB of tracking info for a 1GB image.

Alternative: measure the rate of partial-chunk requests based on observed requests, similar to what the server does, and back off (increase redodelay) when the rate is too high. Good news: it is symmetric with what the server currently does. Bad news: it is harder to map this rate to an adjustment than with the queue-size-estimate method.

2. Auto-adjust readahead on the client.

Similar to #1, the client should track the level of activity on the server and adjust its readahead accordingly. For example, if we are the only client, we could increase our readahead.

3. Eliminate the client-side copy of compressed data.

Right now we read packets into a local packet buffer and then, for BLOCK messages, copy the data out to the chunk buffers. This results in a complete copy of the compressed data. If we make a chunk buffer an array of pointers to data buffers, we can read packets directly into those data buffers and link them straight into the chunk buffers. The downside is that we must modify the already gruesome decompression loop to deal with input-buffer boundaries in addition to region and writer-buffer boundaries.

4. Multi-thread the frisbee server.

We can make our network output intervals more consistent if we separate the disk reader from the network writer (sketched below). This would also have a performance benefit for the imageunzip program, which currently combines the reader and decompressor and has only a separate writer thread.
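A rough sketch of the block-granularity tracking described in #1, assuming the current 1KB blocks and 1MB chunks. One bit per block is set when any request for it is seen and cleared when the block arrives; the running count of set bits is the queue-size estimate. Names and layout are illustrative only, not the actual client code.

#include <stddef.h>
#include <stdint.h>

#define BLOCKSIZE      1024
#define CHUNKSIZE      (1024 * 1024)
#define BLOCKSPERCHUNK (CHUNKSIZE / BLOCKSIZE)   /* 1024 blocks, 128-byte map */

struct outstanding {
	uint32_t nchunks;
	uint32_t nblocksout;    /* current queue-size estimate */
	uint8_t *bits;          /* nchunks * (BLOCKSPERCHUNK/8) bytes */
};

static inline int  testbit(uint8_t *m, uint32_t b) { return m[b >> 3] & (1 << (b & 7)); }
static inline void setbit(uint8_t *m, uint32_t b)  { m[b >> 3] |= (1 << (b & 7)); }
static inline void clrbit(uint8_t *m, uint32_t b)  { m[b >> 3] &= ~(1 << (b & 7)); }

/* Called when we see any request (ours or another client's). */
void
outstanding_request(struct outstanding *o, uint32_t chunk,
		    uint32_t block, uint32_t count)
{
	uint8_t *map = o->bits + (size_t)chunk * (BLOCKSPERCHUNK / 8);

	while (count-- > 0) {
		if (!testbit(map, block)) {
			setbit(map, block);
			o->nblocksout++;
		}
		block++;
	}
}

/* Called when a BLOCK message for (chunk, block) arrives. */
void
outstanding_reply(struct outstanding *o, uint32_t chunk, uint32_t block)
{
	uint8_t *map = o->bits + (size_t)chunk * (BLOCKSPERCHUNK / 8);

	if (testbit(map, block)) {
		clrbit(map, block);
		o->nblocksout--;
	}
}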
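And a minimal sketch of the reader/writer split from #4: a disk-reader thread and a network-writer thread joined by a small bounded queue, so disk latency does not perturb the send pacing. readchunk() and sendchunk() are stand-ins for the real disk and network paths, not actual frisbeed functions.

#include <stddef.h>
#include <pthread.h>

#define QSIZE 4

struct chunkbuf { char data[1024 * 1024]; };

static struct chunkbuf *queue[QSIZE];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qnotempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  qnotfull  = PTHREAD_COND_INITIALIZER;

/* Stand-ins for the real disk read and paced network send paths. */
extern struct chunkbuf *readchunk(void);        /* NULL when no work */
extern void             sendchunk(struct chunkbuf *cb);

static void *
disk_reader(void *arg)
{
	struct chunkbuf *cb;

	(void)arg;
	while ((cb = readchunk()) != NULL) {
		pthread_mutex_lock(&qlock);
		while (qcount == QSIZE)
			pthread_cond_wait(&qnotfull, &qlock);
		queue[qtail] = cb;
		qtail = (qtail + 1) % QSIZE;
		qcount++;
		pthread_cond_signal(&qnotempty);
		pthread_mutex_unlock(&qlock);
	}
	return NULL;
}

static void *
net_writer(void *arg)
{
	struct chunkbuf *cb;

	(void)arg;
	for (;;) {
		pthread_mutex_lock(&qlock);
		while (qcount == 0)
			pthread_cond_wait(&qnotempty, &qlock);
		cb = queue[qhead];
		qhead = (qhead + 1) % QSIZE;
		qcount--;
		pthread_cond_signal(&qnotfull);
		pthread_mutex_unlock(&qlock);

		sendchunk(cb);          /* paced send, never blocked on disk */
	}
	return NULL;
}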
5. Investigate larger block/chunk sizes.

Most important would be to increase the block size from 1024 to something approaching the 1448-byte maximum (given the current packet format). Constraint: the number of blocks in a chunk should be a multiple of 8, since we use a bitmap to track blocks. This is not strictly necessary, it would just be nice, and the BlockMap routines might require a little tweaking otherwise. Maybe it should be a multiple of 32 to ensure the bitmap is a multiple of 4 bytes in size. Larger chunk issues: 1) more potential wasted space per chunk, though mostly only in the last chunk, 2) it takes longer to accumulate a chunk at the client, potentially idling the decompressor and writer, 3) it takes more space to accumulate chunks, allowing for fewer in-progress chunks. So maybe 1448 bytes/block * 768 blocks/chunk == 1.06MB/chunk; PREQUEST BlockMaps come down from 128 bytes to 96.

[ Support for jumbo packets has been done. This increases the block size to 8192 and reduces the blocks/chunk to 128 to keep the chunk size constant (so that we can continue to distribute our existing images). Currently this requires static re-compilation of both the client and server, though some support has been put in for negotiating the blocksize (in join v2 messages). ]

6. Dynamic rate pacing in the server.

Our attempts to date have been pretty feeble. I think we have a reasonable loss metric now; we just need a smooth, weighted decay formula we can use. Look at the proposed-standard TCP-friendly rate equation (see the sketch at the end of this section).

7. Partial last chunks.

This is way-nit-like compared to the others. The "must be a multiple of 1MB (the chunk size)" requirement for images can be onerous for the so-called delta images, which can be only a couple hundred KB. By allowing the last chunk in the image file to be partial, and having imageunzip and frisbeed pad it out, we can save a lot of disk space for these small images.

[ DONE ]

8. Allow a server to serve multiple unicast clients.

Right now an instance of the server not only serves just a single image, it serves it to only a single destination address. This is reasonable for broadcast/multicast but is overly restrictive for unicast. Changing this should be minor: we just need to keep track of destinations (addr/port) in the queue along with the block ranges to send out (see the sketch at the end of this section). We would need to back off the queue optimizations where we combine incoming requests with those already in the queue (i.e., we would now also have to make sure they are for the same destination before combining). Minor changes would be needed to PacketSend/Recv to track the client IP/port rather than just assuming/using the global mcastaddr/portnum. One nice side effect is that the client would no longer have to bind to a specific port in the unicast case, since each reply from the server could be tagged with the proper destination port.

9. Allow the frisbee client to be used in a pipe.

If we could pipe the output of frisbee into another utility, it would be more useful for arbitrary file distribution. For example: frisbee -m -p - | tar xzf - to download and unpack a tarfile. The problem is out-of-order processing of chunks, and there are a couple of ways around it. Frisbee can already request chunks in order, but it is also opportunistic and saves other chunk data it needs that other clients have requested. We could just ignore that data and keep re-requesting blocks in order as we need them, or we could do some limited in-memory caching of incoming data; i.e., save but don't decompress chunks until we need them. We could cache to disk as well, but then we don't really save anything over just frisbeeing into a tmp file and handing that to the next utility in the pipeline.
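For #6, the "TCP-friendly rate equation" is presumably the TFRC throughput equation (RFC 3448/5348). A sketch of computing a target send rate from it, assuming the loss-event rate and RTT come from whatever loss metric and timing the server already keeps:

#include <math.h>

/*
 * TFRC throughput equation (RFC 3448/5348):
 *   s = packet size in bytes, R = round-trip time in seconds,
 *   p = loss event rate (0 < p <= 1).
 * Returns the target send rate in bytes/sec, using the usual
 * simplifications b = 1 (packets per ACK) and t_RTO = 4 * R.
 */
double
tfrc_rate(double s, double R, double p)
{
	double t_rto = 4.0 * R;
	double denom = R * sqrt(2.0 * p / 3.0) +
		       t_rto * (3.0 * sqrt(3.0 * p / 8.0)) *
		       p * (1.0 + 32.0 * p * p);

	return s / denom;
}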
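For #8, a minimal sketch of what per-destination queue entries might look like, with the merge check restricted to matching destinations. Structure and function names are made up, not the actual frisbeed work-queue code.

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>

struct wqelem {
	struct in_addr dstaddr;  /* where to send the blocks (mcast or unicast) */
	uint16_t       dstport;
	uint32_t       chunk;
	uint32_t       block;    /* first block in range */
	uint32_t       count;    /* number of blocks */
};

/* Only combine a new request with an existing entry for the same client. */
static int
can_merge(const struct wqelem *q, const struct wqelem *req)
{
	return (q->dstaddr.s_addr == req->dstaddr.s_addr &&
		q->dstport == req->dstport &&
		q->chunk == req->chunk);
}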
PROBLEMS:

1. Have seen the clients run out of socket buffer space, causing them to lose packets while still well short of the network bandwidth (at ~70Mb/sec). Not sure why. One thing we know is that the decompress thread will almost certainly run for a full scheduling interval (1ms) every time, so we have to have enough buffering in the card and socket buffers to handle 1ms worth of data. With the default params we are only putting out 8 packets every 1ms, so that shouldn't be an issue (see the arithmetic sketch below). Assuming that we are getting the data off the card in time, that means the network thread is either not running frequently enough, or it is spending too much time doing other things (like copying packet data, see #3 above).
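A back-of-the-envelope check of the buffering claim above, with assumed packet sizes (roughly 1KB of block data plus a small header per packet):

#include <stdio.h>

int
main(void)
{
	double rate_bps  = 70e6;       /* observed loss point, ~70Mb/sec */
	double pkt_bytes = 1024 + 24;  /* 1KB block + approximate header */
	double stall_sec = 0.001;      /* one 1ms scheduling interval */

	double bytes = rate_bps / 8.0 * stall_sec;
	double pkts  = bytes / pkt_bytes;

	printf("%.0f bytes (~%.1f packets) arrive during a %.0f us stall\n",
	       bytes, pkts, stall_sec * 1e6);
	/* ~8750 bytes, ~8 packets: well within a default socket buffer. */
	return 0;
}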