Commit 86efdd9e authored by Leigh B. Stoller

Checkpoint first working version of Frisbee Redux. This version
requires the linux threads package to give us kernel level pthreads.

From: Leigh Stoller <stoller@fast.cs.utah.edu>
To: Testbed Operations <testbed-ops@fast.cs.utah.edu>
Cc: Jay Lepreau <lepreau@cs.utah.edu>
Subject: Frisbee Redux
Date: Mon, 7 Jan 2002 12:03:56 -0800

Server:
The server is multithreaded. One thread takes in requests from the
clients and adds each request to a work queue. The other thread processes
the work queue in FIFO order, spitting out the desired block ranges. A
request is a chunk/block/blockcount tuple, and most of the time the clients
are requesting complete 1MB chunks. The exception, of course, is when
individual blocks are lost, in which case the clients request just those
subranges. The server is totally asynchronous; it maintains a list of who
is "connected", but that's just to make sure we can time the server out
after a suitable inactive time. The server really only cares about the work
queue; as long as the queue is non-empty, it spits out data.
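
A minimal sketch of the FIFO work queue described above, assuming a
fixed-size ring buffer. The names WorkItem, WorkQueue, wq_put, wq_get and
the WQ_SIZE capacity are illustrative and not taken from the server source,
and the locking needed between the two threads is left out.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_SIZE 64	/* hypothetical capacity, not from the source */

/* One request: "send `count` blocks of `chunk`, starting at `block`". */
typedef struct {
	int chunk;
	int block;
	int count;
} WorkItem;

typedef struct {
	WorkItem items[WQ_SIZE];
	size_t	 head;	/* index of the oldest queued item */
	size_t	 len;	/* number of queued items */
} WorkQueue;

/* Append a request at the tail. Returns 0 on success, -1 if full. */
int
wq_put(WorkQueue *q, int chunk, int block, int count)
{
	WorkItem *w;

	assert(q != NULL && count > 0);
	if (q->len == WQ_SIZE)
		return -1;
	w = &q->items[(q->head + q->len) % WQ_SIZE];
	w->chunk = chunk;
	w->block = block;
	w->count = count;
	q->len++;
	return 0;
}

/* Remove the oldest request into *out. Returns 0, or -1 if empty. */
int
wq_get(WorkQueue *q, WorkItem *out)
{
	assert(q != NULL && out != NULL);
	if (q->len == 0)
		return -1;
	*out = q->items[q->head];
	q->head = (q->head + 1) % WQ_SIZE;
	q->len--;
	return 0;
}
```

In this sketch the request thread would call wq_put() for each incoming
tuple and the sender thread would loop on wq_get(), sending data whenever it
returns 0.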

Client:
The client is also multithreaded. One thread receives data packets and
stuffs them in a chunkbuffer data structure. This thread also requests more
data, either to complete chunks with missing blocks or to request new
chunks. Each client can read ahead up to 2 chunks, although with multiple
clients it might actually be much further ahead, as it also receives chunks
that other clients requested. I set the number of chunk buffers to 16,
although this is probably unnecessary, as I will explain below. The other
(main) thread waits for chunkbuffers to be marked complete, and then invokes
the imageunzip code on that chunk. Meanwhile, the receive thread is busily
getting more data and requesting/reading ahead, so that by the time the
unzip is done, there is another chunk to unzip. In practice, the main thread
never goes idle after the first chunk is received; there is always a ready
chunk for it. Perfect overlap of I/O! In order to prevent the clients from
getting overly synchronized (and causing all the clients to wait until the
last client is done!), each client randomizes its block request order. This
is why we can retain the original frisbee name; each client ends up catching
random blocks flung out from the server until it has all the blocks.
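
One way to randomize the request order is a Fisher-Yates shuffle over the
indices to be requested. This is only a sketch: randomize_order is a
hypothetical name, the client source is not shown here, and the use of
rand() is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Fill order[0..n-1] with 0..n-1 and shuffle it in place
 * (Fisher-Yates). The caller seeds rand() beforehand with srand(),
 * so that different clients end up with different orders.
 */
void
randomize_order(int *order, int n)
{
	int i, j, tmp;

	assert(order != NULL && n >= 0);
	for (i = 0; i < n; i++)
		order[i] = i;
	for (i = n - 1; i > 0; i--) {
		j = rand() % (i + 1);	/* 0 <= j <= i */
		tmp = order[i];
		order[i] = order[j];
		order[j] = tmp;
	}
}
```

A client would then walk order[] from the front and request each entry it
has not already received on behalf of some other client.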

Performance:
The single node speed is about 180 seconds for our current full image.
Frisbee V1 compares at about 210 seconds. The two node speed was 181 and
174 seconds. The amount of CPU used for the two node run ranged from 1% to
4%, typically averaging about 2% while I watched it with "top."

The main problem on the server side is how to keep boss (1GHz with Gbit
ethernet) from spitting out packets so fast that half of them get dropped. I
eventually settled on a static 1ms delay every 64K of packets sent. Nothing
to be proud of, but it works.
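
The static delay can be sketched as a byte counter that triggers a short
sleep once per burst. This assumes "64K" means 64 KB of data sent; Pacer,
pacer_account and pacer_after_send are hypothetical names, and the real
server may do the accounting differently.

```c
#define _POSIX_C_SOURCE 199309L	/* for nanosleep() */
#include <assert.h>
#include <time.h>

#define BURST_BYTES	(64 * 1024)	/* assumed: 64K means 64 KB sent */
#define BURST_DELAY_NS	1000000L	/* 1 ms */

/* Bytes sent since the last pause. */
typedef struct {
	long sent;
} Pacer;

/*
 * Account for nbytes just sent. Returns 1 when a full burst has
 * gone out and the caller should pause, 0 otherwise.
 */
int
pacer_account(Pacer *p, long nbytes)
{
	assert(p != NULL && nbytes >= 0);
	p->sent += nbytes;
	if (p->sent < BURST_BYTES)
		return 0;
	p->sent = 0;
	return 1;
}

/* Call after each send: sleeps about 1 ms once per burst. */
void
pacer_after_send(Pacer *p, long nbytes)
{
	struct timespec ts = { 0, BURST_DELAY_NS };

	if (pacer_account(p, nbytes))
		nanosleep(&ts, NULL);
}
```

With 1 KB blocks, this pauses once every 64 blocks sent.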

As mentioned above, the number of chunk buffers is 16, although only a few
of them are used in practice. The reason is that the network transfer speed
is perhaps 10 times faster than the decompression and raw device write
speed. To know for sure, I would have to compare the per-byte transfer
rate for 350MB via the network against the time to decompress and write the
1.2GB of data to the raw disk. With such a big difference, it's only
necessary to ensure that you stay 1 or 2 chunks ahead, since you can
request 10 chunks in the time it takes to write one of them.
parent d0b9f55f
@@ -1049,6 +1049,7 @@ outfiles="$outfiles Makeconf GNUmakefile \
ipod/GNUmakefile \
lib/GNUmakefile \
os/GNUmakefile os/split-image.sh os/imagezip/GNUmakefile \
os/frisbee.redux/GNUmakefile \
pxe/GNUmakefile pxe/proxydhcp.restart pxe/bootinfo.restart \
security/GNUmakefile security/paperbag security/lastlog_daemon \
tbsetup/GNUmakefile tbsetup/console_setup \
......
@@ -164,6 +164,7 @@ outfiles="$outfiles Makeconf GNUmakefile \
ipod/GNUmakefile \
lib/GNUmakefile \
os/GNUmakefile os/split-image.sh os/imagezip/GNUmakefile \
os/frisbee.redux/GNUmakefile \
pxe/GNUmakefile pxe/proxydhcp.restart pxe/bootinfo.restart \
security/GNUmakefile security/paperbag security/lastlog_daemon \
tbsetup/GNUmakefile tbsetup/console_setup \
......
SRCDIR = @srcdir@
TESTBED_SRCDIR = @top_srcdir@
OBJDIR = ../..
SUBDIR = os/frisbee.new

include $(OBJDIR)/Makeconf

all: frisbee frisbeed

include $(TESTBED_SRCDIR)/GNUmakerules

SHAREDOBJS = log.o network.o utils.o

PTHREADCFLAGS = -D_THREAD_SAFE \
	-I/usr/local/include/pthread/linuxthreads
PTHREADLIBS = -L/usr/local/lib -llthread -llgcc_r

CLIENTFLAGS = $(CFLAGS)
CLIENTLIBS = ../imagezip/frisbee.o -lz $(PTHREADLIBS)
CLIENTOBJS = client.o $(SHAREDOBJS)

SERVERFLAGS = $(CFLAGS)
SERVERLIBS = $(PTHREADLIBS)
SERVEROBJS = server.o $(SHAREDOBJS)

CFLAGS = -O2 -g -Wall -static $(PTHREADCFLAGS)
LDFLAGS = -static

frisbee: $(CLIENTOBJS) ../imagezip/frisbee.o
	$(CC) $(LDFLAGS) $(CLIENTFLAGS) $(CLIENTOBJS) $(CLIENTLIBS) -o frisbee
	cp frisbee frisbee.debug
	strip frisbee

frisbeed: $(SERVEROBJS)
	$(CC) $(LDFLAGS) $(SERVERFLAGS) $(SERVEROBJS) $(SERVERLIBS) -o frisbeed
	cp frisbeed frisbeed.debug
	strip frisbeed

client.o: decls.h log.h
server.o: decls.h log.h
log.o: decls.h log.h

clean:
	/bin/rm -f *.o *.a frisbee frisbeed frisbee.debug frisbeed.debug
/*
* Shared definitions for the frisbee client/server code.
*/
#include "log.h"
/*
* Max number of clients we can support at once. Not likely to be an issue
* since the amount of per client state is very little.
*/
#define MAXCLIENTS 1000
/*
* We operate in terms of this blocksize (in bytes).
*/
#define BLOCKSIZE 1024
/*
* Each chunk is this many blocks.
*/
#define CHUNKSIZE 1024
/*
* The number of chunk buffers in the client.
*/
#define MAXCHUNKBUFS 16
/*
* The number of read-ahead chunks that the client will request
* at a time. No point in requesting too far ahead either, since they
* are uncompressed/written at a fraction of the network transfer speed.
* Also, with multiple clients at different stages, each requesting blocks,
* it is likely that there will be plenty more chunks ready or in progress.
*/
#define MAXREADAHEAD 2
#define MAXINPROGRESS 4
/*
* Timeout (in usecs) for packet receive. The idletimer number is how
* many PKT timeouts we allow before requesting more data from the server.
* That is, if we go TIMEOUT usecs without getting a packet, then ask for
* more (or on the server side, poll the clients). On the server side,
* use a timeout to check for dead clients. We want that to be longish
* (300 seconds with the values below).
*/
#define PKTRCV_TIMEOUT 30000
#define CLIENT_IDLETIMER_COUNT 1
#define SERVER_IDLETIMER_COUNT ((300 * 1000000) / PKTRCV_TIMEOUT)
/*
* Timeout (in seconds!) server will hang around with no active clients.
*/
#define SERVER_INACTIVE_SECONDS 30
/*
* The number of disk read blocks in a single read on the server.
* Must be an even divisor of CHUNKSIZE.
*/
#define MAXREADBLOCKS 32
/*
* Packet defs.
*/
typedef struct {
	struct {
		int		type;
		int		subtype;
		int		datalen; /* Useful amount of data in packet */
		unsigned int	srcip;   /* Filled in by network level. */
	} hdr;
	union {
		/*
		 * Join/leave the Team. Send a randomized ID, and receive
		 * the number of blocks in the file.
		 */
		union {
			unsigned int	clientid;
			int		blockcount;
		} join;
		/*
		 * A data block, indexed by chunk,block.
		 */
		struct {
			int		chunk;
			int		block;
			char		buf[BLOCKSIZE];
		} block;
		/*
		 * A request for a data block, indexed by chunk,block.
		 */
		struct {
			int		chunk;
			int		block;
			int		count;	/* Number of blocks */
		} request;
	} msg;
} Packet_t;
#define PKTTYPE_REQUEST 1
#define PKTTYPE_REPLY 2
#define PKTSUBTYPE_JOIN 1
#define PKTSUBTYPE_LEAVE 2
#define PKTSUBTYPE_BLOCK 3
#define PKTSUBTYPE_REQUEST 4
/*
* Protos.
*/
int ClientNetInit(void);
int ServerNetInit(void);
int PacketReceive(Packet_t *p);
int PacketSend(Packet_t *p);
int PacketReply(Packet_t *p);
char *CurrentTimeString(void);
int fsleep(unsigned int usecs);
/*
* Globals
*/
extern int debug;
extern int portnum;
extern struct in_addr mcastaddr;
extern struct in_addr mcastif;
extern char *filename;
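
To illustrate how the packet layout above is meant to be used, here is a
small sketch that fills in a block request and computes the length that
would go on the wire (header plus hdr.datalen, as in PacketSend()). The
helper make_request is hypothetical and does not appear in the source; the
definitions are repeated locally only so that the example compiles alone.

```c
#include <assert.h>
#include <string.h>

/* Local copies of the decls.h definitions, so this compiles alone. */
#define BLOCKSIZE		1024
#define CHUNKSIZE		1024
#define PKTTYPE_REQUEST		1
#define PKTSUBTYPE_REQUEST	4

typedef struct {
	struct {
		int		type;
		int		subtype;
		int		datalen;
		unsigned int	srcip;
	} hdr;
	union {
		union {
			unsigned int	clientid;
			int		blockcount;
		} join;
		struct {
			int	chunk;
			int	block;
			char	buf[BLOCKSIZE];
		} block;
		struct {
			int	chunk;
			int	block;
			int	count;
		} request;
	} msg;
} Packet_t;

/*
 * Fill in a request for `count` blocks of `chunk`, starting at
 * `block`. Returns the number of bytes that would be sent: the
 * header plus hdr.datalen. srcip is left for the network layer.
 */
int
make_request(Packet_t *p, int chunk, int block, int count)
{
	assert(p != NULL);
	assert(block >= 0 && count > 0 && block + count <= CHUNKSIZE);

	memset(p, 0, sizeof(*p));
	p->hdr.type    = PKTTYPE_REQUEST;
	p->hdr.subtype = PKTSUBTYPE_REQUEST;
	p->hdr.datalen = sizeof(p->msg.request);
	p->msg.request.chunk = chunk;
	p->msg.request.block = block;
	p->msg.request.count = count;
	return (int)sizeof(p->hdr) + p->hdr.datalen;
}
```

A request for a complete chunk is make_request(&p, chunk, 0, CHUNKSIZE),
which covers BLOCKSIZE * CHUNKSIZE = 1MB of data; a request for lost blocks
passes the starting block and a smaller count.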
/*
* Logging and debug routines.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <syslog.h>
#include <assert.h>
#include <errno.h>
#include "decls.h"
static int usesyslog = 1;
/*
* There is really no point in the client using syslog, but it's nice
* to use the same log functions either way.
*/
int
ClientLogInit(void)
{
usesyslog = 0;
return 0;
}
int
ServerLogInit(void)
{
if (debug) {
usesyslog = 0;
return 1;
}
openlog("frisbee", LOG_PID, LOG_USER);
return 0;
}
void
log(const char *fmt, ...)
{
va_list args;
va_start(args, fmt);
if (!usesyslog) {
vfprintf(stderr, fmt, args);
fputc('\n', stderr);
fflush(stderr);
}
else
vsyslog(LOG_INFO, fmt, args);
va_end(args);
}
void
warning(const char *fmt, ...)
{
va_list args;
va_start(args, fmt);
if (!usesyslog) {
vfprintf(stderr, fmt, args);
fputc('\n', stderr);
fflush(stderr);
}
else
vsyslog(LOG_WARNING, fmt, args);
va_end(args);
}
void
fatal(const char *fmt, ...)
{
va_list args;
va_start(args, fmt);
if (!usesyslog) {
vfprintf(stderr, fmt, args);
fputc('\n', stderr);
fflush(stderr);
}
else
vsyslog(LOG_ERR, fmt, args);
va_end(args);
exit(-1);
}
void
pwarning(const char *fmt, ...)
{
va_list args;
char buf[BUFSIZ];
va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
warning("%s : %s", buf, strerror(errno));
}
void
pfatal(const char *fmt, ...)
{
va_list args;
char buf[BUFSIZ];
va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
fatal("%s : %s", buf, strerror(errno));
}
/*
* Log defs.
*/
#include <stdarg.h>
int ClientLogInit(void);
int ServerLogInit(void);
void log(const char *fmt, ...);
void warning(const char *fmt, ...);
void fatal(const char *fmt, ...);
void pwarning(const char *fmt, ...);
void pfatal(const char *fmt, ...);
/*
* Network routines.
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include "decls.h"
/* Max number of times to attempt bind to port before failing. */
#define MAXBINDATTEMPTS 10
/* Max number of multicast hops. */
#define MCAST_TTL 5
static int sock;
static struct in_addr ipaddr;
static void
CommonInit(void)
{
struct sockaddr_in name;
struct timeval timeout;
int i;
char buf[BUFSIZ];
struct hostent *he;
if ((sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP)) < 0)
pfatal("Could not allocate a socket");
i = (128 * 1024);
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &i, sizeof(i));
i = (128 * 1024);
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &i, sizeof(i));
name.sin_family = AF_INET;
name.sin_port = htons(portnum);
name.sin_addr.s_addr = htonl(INADDR_ANY);
i = MAXBINDATTEMPTS;
while (i) {
if (bind(sock, (struct sockaddr *)&name, sizeof(name)) == 0)
break;
i--;
pwarning("Bind to port %d failed. Will try %d more times!",
portnum, i);
if (i)
sleep(5);
}
/* Do not carry on with an unbound socket. */
if (i == 0)
fatal("Could not bind to port %d", portnum);
log("Bound to port %d", portnum);
/*
* At present, we use a multicast address in both directions.
*/
if ((ntohl(mcastaddr.s_addr) >> 28) == 14) {
unsigned int loop = 0, ttl = MCAST_TTL;
struct ip_mreq mreq;
log("Using Multicast");
mreq.imr_multiaddr.s_addr = mcastaddr.s_addr;
if (mcastif.s_addr)
mreq.imr_interface.s_addr = mcastif.s_addr;
else
mreq.imr_interface.s_addr = htonl(INADDR_ANY);
if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
&mreq, sizeof(mreq)) < 0)
pfatal("setsockopt(IPPROTO_IP, IP_ADD_MEMBERSHIP)");
if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_TTL,
&ttl, sizeof(ttl)) < 0)
pfatal("setsockopt(IPPROTO_IP, IP_MULTICAST_TTL)");
/* Disable local echo */
if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP,
&loop, sizeof(loop)) < 0)
pfatal("setsockopt(IPPROTO_IP, IP_MULTICAST_LOOP)");
if (mcastif.s_addr &&
setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
&mcastif, sizeof(mcastif)) < 0) {
pfatal("setsockopt(IPPROTO_IP, IP_MULTICAST_IF)");
}
}
else {
/*
* Otherwise, we use a broadcast addr.
*/
i = 1;
if (setsockopt(sock, SOL_SOCKET, SO_BROADCAST,
&i, sizeof(i)) < 0)
pfatal("setsockopt(SOL_SOCKET, SO_BROADCAST)");
}
/*
* We use a socket level timeout instead of polling for data.
*/
timeout.tv_sec = 0;
timeout.tv_usec = PKTRCV_TIMEOUT;
if (setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO,
&timeout, sizeof(timeout)) < 0)
pfatal("setsockopt(SOL_SOCKET, SO_RCVTIMEO)");
/*
* We add our (unicast) IP addr to every outgoing message.
* This is going to be used to return replies to the sender,
* where appropriate.
*/
if (gethostname(buf, sizeof(buf)) < 0)
pfatal("gethostname failed");
if ((he = gethostbyname(buf)) == 0)
fatal("gethostbyname: %s", hstrerror(h_errno));
memcpy((char *)&ipaddr, he->h_addr, sizeof(ipaddr));
}
int
ClientNetInit(void)
{
CommonInit();
return 1;
}
int
ServerNetInit(void)
{
CommonInit();
return 1;
}
/*
* Look for a packet on the socket. Propagate the errors back to the caller
* exactly as the system call does. Remember that we set up a socket timeout
* above, so we will get EWOULDBLOCK errors when no data is available.
*
* The amount of data received is determined from the datalen of the hdr.
* All packets are actually the same size/structure.
*/
int
PacketReceive(Packet_t *p)
{
	struct sockaddr_in from;
	int		   mlen;
	socklen_t	   alen = sizeof(from);

	bzero(&from, sizeof(from));

	if ((mlen = recvfrom(sock, p, sizeof(*p), 0,
			     (struct sockaddr *)&from, &alen)) < 0) {
		if (errno == EWOULDBLOCK)
			return -1;
		pfatal("PacketReceive(recvfrom)");
	}

	/* Do not trust hdr.datalen until a full header has arrived. */
	if (mlen < (int)sizeof(p->hdr))
		fatal("PacketReceive: Short packet, %d bytes", mlen);
	if (mlen != (int)sizeof(p->hdr) + p->hdr.datalen)
		fatal("PacketReceive: Bad message length %d!=%d",
		      mlen, (int)sizeof(p->hdr) + p->hdr.datalen);

	return p->hdr.datalen;
}
/*
* We use blocking sends since there is no point in giving up. All packets
* go to the same place, whether client or server.
*
* The amount of data sent is determined from the datalen of the packet hdr.
* All packets are actually the same size/structure.
*/
int
PacketSend(Packet_t *p)
{
struct sockaddr_in to;
int len;
len = sizeof(p->hdr) + p->hdr.datalen;
p->hdr.srcip = ipaddr.s_addr;
to.sin_family = AF_INET;
to.sin_port = htons(portnum);
to.sin_addr.s_addr = mcastaddr.s_addr;
while (sendto(sock, (void *)p, len, 0,
(struct sockaddr *)&to, sizeof(to)) < 0) {
if (errno != ENOBUFS)
pfatal("PacketSend(sendto)");
/*
* ENOBUFS means we ran out of mbufs. Okay to sleep a bit
* to let things drain.
*/
fsleep(10000);
}
return p->hdr.datalen;
}
/*
* Basically the same as above, but instead of sending to the multicast
* group, send to the (unicast) IP in the packet header. This simplifies
* the logic in a number of places, by avoiding having to deal with
* multicast packets that are not destined for us, but for someone else.
*/
int
PacketReply(Packet_t *p)
{
struct sockaddr_in to;
int len;
len = sizeof(p->hdr) + p->hdr.datalen;
to.sin_family = AF_INET;
to.sin_port = htons(portnum);
to.sin_addr.s_addr = p->hdr.srcip;
p->hdr.srcip = ipaddr.s_addr;
while (sendto(sock, (void *)p, len, 0,
(struct sockaddr *)&to, sizeof(to)) < 0) {
if (errno != ENOBUFS)
pfatal("PacketReply(sendto)");
/*
* ENOBUFS means we ran out of mbufs. Okay to sleep a bit
* to let things drain.
*/
fsleep(10000);
}
return p->hdr.datalen;
}
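
The multicast test in CommonInit() relies on the fact that IPv4 multicast
addresses (224.0.0.0/4) have 1110 as their top four bits, so shifting the
host-order address right by 28 yields 14. The check can be isolated as
below; is_multicast is a hypothetical helper name, not part of the source.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Return 1 if addr (network byte order) is an IPv4 multicast
 * address, i.e. its top four bits are 1110 (224.0.0.0/4).
 */
int
is_multicast(struct in_addr addr)
{
	return (ntohl(addr.s_addr) >> 28) == 14;
}
```

Any address from 224.0.0.0 through 239.255.255.255 takes the multicast
branch; everything else falls through to the broadcast setup.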
/*
* Mach Operating System
* Copyright (c) 1991,1990,1989,1988,1987 Carnegie Mellon University
* All Rights Reserved.
*
* Permission to use, copy, modify and distribute this software and its
* documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND FOR
* ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
*
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie Mellon rights
* to redistribute these changes.
*/
/*
* File: queue.h
* Author: Avadis Tevanian, Jr.
* Date: 1985
*
* Type definitions for generic queues.
*
*/
#ifndef _KERN_QUEUE_H_
#define _KERN_QUEUE_H_
/*
* Queue of abstract objects. Queue is maintained
* within that object.
*
* Supports fast removal from within the queue.
*
* How to declare a queue of elements of type "foo_t":
* In the "*foo_t" type, you must have a field of
* type "queue_chain_t" to hold together this queue.
* There may be more than one chain through a
* "foo_t", for use by different queues.
*
* Declare the queue as a "queue_t" type.
*
* Elements of the queue (of type "foo_t", that is)
* are referred to by reference, and cast to type
* "queue_entry_t" within this module.
*/
/*
* A generic doubly-linked list (queue).
*/
struct queue_entry {
struct queue_entry *next; /* next element */
struct queue_entry *prev; /* previous element */
};
typedef struct queue_entry *queue_t;
typedef struct queue_entry queue_head_t;
typedef struct queue_entry queue_chain_t;
typedef struct queue_entry *queue_entry_t;
/*
* Macro: queue_init
* Function:
* Initialize the given queue.
* Header:
* void queue_init(q)
* queue_t q; *MODIFIED*
*/
#define queue_init(q) ((q)->next = (q)->prev = (q))
/*
* Macro: queue_first
* Function:
* Returns the first entry in the queue,
* Header:
* queue_entry_t queue_first(q)
* queue_t q; *IN*
*/
#define queue_first(q) ((q)->next)
/*
* Macro: queue_next
* Function:
* Returns the entry after an item in the queue.
* Header:
* queue_entry_t queue_next(qc)
* queue_t qc;
*/