Commit 723b6ad5 authored by Jay Lepreau

Move from utah/doc to doc so it gets distributed.

It's full of good detailed info.  Here's the old log:
-------------------
revision 1.3
date: 2003/01/30 05:11:37;  author: mike;  state: Exp;  lines: +31 -1
Document NFS problems
----------------------------
revision 1.2
date: 2003/01/16 20:33:34;  author: mike;  state: Exp;  lines: +140 -3
start laying out some potential solutions to our problems
----------------------------
revision 1.1
date: 2003/01/15 17:09:40;  author: mike;  state: Exp;
New directory utah/doc for internal documentation.
Add start of "boot scaling issues" doc.
parent a78899d3
=======================
Boot sequence overview:
=======================
A. PXE BIOS interacts with DHCP to get initial boot program name:
1. DHCPDISCOVER to server, server replies with DHCP info
Up to four retries with timeouts of 4, 8, 16, 32 seconds
(total of 60 seconds to get an initial reply)
2. DHCPREQUEST to server, server replies
This is apparently necessary in the DHCP protocol even though the
client got its info from the DHCPDISCOVER reply. Its not clear what
the timeout used here is. If it doesn't get a reply, it restarts
with the discover.
3. DHCPREQUEST to the boot server (proxydhcp), server replies
This gets the boot file name. Up to four retries with 1, 2, 3, 4
second timeouts.
Notes:
- In step 2, if the client gets a BOOTP reply rather than a DHCP
reply to the first query, this step isn't needed.
- In step 3, the extremely short timeouts are why elvind/stated
bogging down at all gets us in trouble; they don't cut us a whole
lot of slack. (A worst-case timing sketch follows these notes.)
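
A rough worst-case calculation for the two retry schedules above (an
illustrative Python sketch, nothing more):

    # Worst-case wait before the PXE client gives up, per the retry
    # schedules described above.
    DISCOVER_TIMEOUTS = [4, 8, 16, 32]   # step 1: DHCPDISCOVER retries (seconds)
    PROXY_TIMEOUTS    = [1, 2, 3, 4]     # step 3: proxydhcp retries (seconds)

    def worst_case(timeouts):
        """Total time spent waiting if every try times out."""
        return sum(timeouts)

    print("DHCPDISCOVER worst case: %d seconds" % worst_case(DISCOVER_TIMEOUTS))
    print("proxydhcp    worst case: %d seconds" % worst_case(PROXY_TIMEOUTS))
    # 1+2+3+4 = 10 seconds total is all the slack the proxydhcp step gives us.
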
B. PXE BIOS requests/loads bootfile via TFTP:
1. An initial request is made for block 1 with the TSIZE=0 option,
presumably to learn the size of the file. Retries?
2. A second request is made for block 1 with the BLKSIZE=1456 option,
to set the transfer block size and begin transferring the file.
3. The remaining blocks are requested and transferred.
Notes:
- Our tftpd forks a new copy for each request on a new port.
Both steps 1 and 2 cause such a fork.
- Our tftpd doesn't recognize any options. Could be that step
2 wouldn't happen if we responded correctly to step 1.
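
For reference, the requests in steps 1 and 2 are ordinary TFTP read
requests (RRQ) carrying RFC 2347-style options (TSIZE from RFC 2349,
BLKSIZE from RFC 2348). A minimal Python sketch of what those packets
look like on the wire; the boot file path here is just an example:

    import struct

    RRQ = 1  # TFTP read-request opcode

    def build_rrq(filename, mode="octet", **options):
        """Build a TFTP RRQ with RFC 2347 option extensions appended."""
        pkt = struct.pack("!H", RRQ)
        pkt += filename.encode() + b"\0" + mode.encode() + b"\0"
        for name, value in options.items():
            pkt += name.encode() + b"\0" + str(value).encode() + b"\0"
        return pkt

    # Step 1: ask for the file size (tsize=0 means "tell me the size").
    pkt1 = build_rrq("/tftpboot/pxeboot", tsize=0)
    # Step 2: begin the real transfer with a 1456-byte block size.
    pkt2 = build_rrq("/tftpboot/pxeboot", blksize=1456)
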
C1. Normal pxeboot (emuboot) executes:
1. The FreeBSD libstand code does the DHCP DISCOVER/REQUEST dance with
the server, two messages are exchanged. Retries?
2. Emuboot sends a bootinfo request (retries?) and boots as indicated
(usually from disk).
D1. OS boots:
1. OS boots from disk, once again doing the DHCP dance (from dhclient
or pump).
2. Testbed-specific startup issues a series of TMCD commands.
From power-on of an already allocated node, I counted 26 TCP
TMCD requests in a seven-second period:
+0 reboot
+0 status
+1 ntpinfo
+3 state
+3 reboot
+3 status
+3 mounts
+4 accounts
+4 ifconfig
+5 tunnels
+5 hostnames
+5 routing
+5 status
+5 ifconfig
+5 routelist
+5 trafgens
+5 nseconfigs
+5 rpms
+6 tarballs
+6 startupcmd
+6 delay
+6 ipodinfo
+7 vnodelist
+7 isalive
+7 creator
+7 state
Notes:
- The DHCP transaction done here, at least under FreeBSD, will take
a long time (20-30 seconds, instead of 1-3) if the cisco2 control
net port is not configured properly ("set port host <mod/port>").
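
Each of the requests above is a short-lived TCP transaction against tmcd.
A hedged Python sketch of what a single tmcc-style call roughly looks
like; the server name, port number, and wire format are assumptions for
illustration, not taken from the tmcd/tmcc source:

    import socket

    BOSS_SERVER = "boss.example.net"   # hypothetical server name
    TMCD_PORT   = 7777                 # assumed port; check the real config

    def tmcc(command):
        """One tmcc-style request: connect, send the command, read the reply."""
        s = socket.create_connection((BOSS_SERVER, TMCD_PORT), timeout=10)
        try:
            s.sendall(command.encode() + b"\n")
            reply = b""
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                reply += chunk
            return reply.decode()
        finally:
            s.close()

    # The boot-time burst above is roughly 26 of these in a few seconds:
    # for cmd in ("status", "mounts", "accounts", "ifconfig"):
    #     tmcc(cmd)
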
E1. First time boot of an experiment
1. Optional download and installation of tarballs and RPM files
across NFS.
Notes:
- The loading of tarballs/RPMs across NFS has been shown to put a
hurtin' on our server with as few as 30 nodes and 5MB of RPMs.
This is also a problem when people log to files in /proj.
C2. Frisbee pxeboot (pxeboot.frisbee) executes:
1. The FreeBSD libstand code does the DHCP DISCOVER/REQUEST dance with
the server, two messages are exchanged. Retries?
2. The FreeBSD loader issues a series of requests for files via TFTP:
/tftpboot/frisbee/boot/boot.4th.gz
/tftpboot/frisbee/boot/loader.rc.gz
/tftpboot/frisbee/boot/loader.4th.gz
/tftpboot/frisbee/boot/support.4th.gz
/tftpboot/frisbee/boot/defaults/loader.conf
/tftpboot/frisbee/boot/loader.conf
/tftpboot/frisbee/boot/loader.conf.local
/tftpboot/frisbee/boot/kernel.ko.gz # check to see if it exists
/tftpboot/frisbee/boot/kernel.ko.gz # read it
/tftpboot/frisbee/boot/mfsroot.gz # check to see if it exists
/tftpboot/frisbee/boot/mfsroot.gz # read it
Notes:
- Each of the 11 file requests uses a different instance of tftpd.
- Use of .gz and .ko files ensures a minimum number of requests;
i.e., if the .gz or .ko file didn't exist it would try again
without the suffix, doubling (or more) the number of requests
(and servers).
===============
Scaling Issues:
===============
PXE BIOS interaction (Step A):
The big concern here is losing the initial (larger timeout) or later
(smaller timeout) DHCP requests. This seems to happen at about 40
machines. Not sure about all the drop scenarios, but we know that
proxydhcp will overload. Presumably dhcpd does as well.
We have a hack right now: if PXE fails, the node boots from the hard disk,
where we have a special MBR that just reboots the machine, thus trying
again. This is OK as a last-ditch effort, but we need to scale a lot
higher than 40 machines before we hit this.
Possible fixes/optimizations:
1. Have dhcpd send bootp replies in step 1, eliminating the second step.
As Leigh points out in the message below, this won't work if we
continue to have proxydhcp provide the boot file.
2. Let regular dhcpd provide the boot file (pxeboot).
I found a way to do this even with dhcpd version 2. This eliminates
proxydhcp. The downside is that the boot file must be specified in
the dhcpd.conf file. While we could dynamically generate the dhcpd.conf
file (see the sketch after this list), we're opening ourselves up to the
same headaches we have with
mountd, named and anything else that we have to kill/restart. Or we
can hack dhcpd to directly access the DB for its info. I looked at
both V2 and V3 to see if there were any existing hook mechanisms or if
there were any obvious places at which to add hooks, but it looks like
it will be a PITA.
3. We can get rid of regular dhcpd and just use proxydhcp.
This should work though I haven't tried it. This would reduce it to a
single transaction. Downside is that we lose the ability to handle
regular DHCP traffic (since proxydhcp would have to run on the standard
port). Not an issue for us, at least right now.
4. Get the PXE BIOS source or a custom version from Intel.
In this scenario we customize PXE with larger timeouts or different
failure behavior. Personally, I don't think this is practical. It
applies only to Intel NICs (granted, that is what we mostly have)
and could make us incompatible with standard PXE, a bad move if we
want other people to use our stuff.
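
To make the downside of option 2 concrete, here is a minimal sketch of
the "dynamically generate dhcpd.conf" approach. The node data, file
paths, and boot file names are all illustrative assumptions, not our
actual schema or tooling:

    # Hypothetical generator: emit per-node host entries for dhcpd.conf
    # from (node, MAC, IP, bootfile) tuples that would come from the DB.
    NODES = [
        ("pc1", "00:d0:b7:13:f0:01", "10.11.12.1", "/tftpboot/pxeboot"),
        ("pc2", "00:d0:b7:13:f0:02", "10.11.12.2", "/tftpboot/pxeboot.frisbee"),
    ]

    def host_entry(node, mac, ip, bootfile):
        return ('host %s {\n'
                '\thardware ethernet %s;\n'
                '\tfixed-address %s;\n'
                '\tfilename "%s";\n'
                '}\n' % (node, mac, ip, bootfile))

    def write_conf(path="dhcpd.conf.new"):
        with open(path, "w") as f:
            for n in NODES:
                f.write(host_entry(*n))
        # The headache alluded to above: dhcpd still has to be restarted
        # every time this file changes.

    if __name__ == "__main__":
        write_conf()
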
A thread discussing PXE/TFTP problems and potential solutions in the
cluster space starts here:
http://www.beowulf.org/pipermail/beowulf/2002-November/005063.html
See the later messages in the thread.
TFTP interaction (Steps B, C2):
The biggest single problem right now is that inetd forks a new tftpd
for every new request (on a new port). In the PXE case (Step B) this
is not so bad, but for the BSD frisbee MFS we make a new request
every time we even want to check for the existence of a file ("stat").
The BSD boot loader is oriented toward having a local disk filesystem,
where all this activity isn't an issue; over TFTP it is.
One optimization already enacted is to try to minimize the number of
files it attempts to access. At one time, when attempting to locate
a file "foo", it would try: foo.split.gz, foo.split, foo.gz, foo,
often looking in both /boot and /boot/defaults for the file. I tweaked
the loader configuration to minimize the number of configuration files
it tries to load, eliminated the splitfs support to get rid of attempts
to access .split files, and then made sure there was a .gz (or .ko.gz
for kernel modules) version of every file that did exist. This
eliminated most of the "stat" type calls.
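
To make the cost concrete, a small Python sketch (not the loader's actual
code, just the name-search pattern described above) that counts the
lookups for one file before and after those tweaks; every miss is another
tftpd fork:

    def probes(name, dirs, suffixes):
        return [d + "/" + name + s for d in dirs for s in suffixes]

    dirs = ["/boot/defaults", "/boot"]

    # Before: splitfs enabled, no guarantee a .gz exists.
    before = probes("loader.conf", dirs, [".split.gz", ".split", ".gz", ""])
    # After: splitfs removed and a .gz guaranteed to exist.
    after = probes("loader.conf", dirs, [".gz"])

    print("probes before tweaks: %d" % len(before))   # 8 candidate lookups
    print("probes after  tweaks: %d" % len(after))    # 2, and the first can hit
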
Another thing to look into is a better tftpd; specifically, we want to
minimize the number of spawns yet maintain reasonable parallelism.
Prioritizing incoming requests, as discussed in the beowulf thread
cited above, is probably less important right now. I looked at tftp-hpa
(http://freshmeat.net/projects/tftp-hpa/) some. While it doesn't have
a FreeBSD port, it does use configure and is supposed to work on BSD
(it is derived from the BSD tftp code). The server supports some of
the newer TFTP options like increased transfer blocksize (> 512 bytes)
and the ability to test for the existence of a file. It can operate
standalone (rather than under inetd) where it will just fork to handle
new requests (rather than fork/exec) and/or it can have copies of the
daemon persist rather than immediately exit. We can probably tweak-out
the BSD boot code to take advantage of the options, and hopefully
persistent and faster spawning daemons would take care of some of the
load issues.
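
The standalone fork-without-exec model described above looks roughly like
the following toy sketch (a generic UDP dispatcher in Python, not
tftp-hpa itself and not a TFTP implementation); the point is simply that
each request costs at most a fork, with no exec of a fresh binary, and
the listener never goes away:

    import os
    import signal
    import socket

    LISTEN_ADDR = ("0.0.0.0", 10069)   # arbitrary port for illustration

    def serve_request(data, client):
        # A real server would parse the RRQ and run the transfer from a
        # fresh port; we just acknowledge the request.
        reply = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        reply.sendto(b"got " + data[:32], client)
        reply.close()

    def main():
        signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # reap children
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(LISTEN_ADDR)
        while True:
            data, client = sock.recvfrom(2048)
            if os.fork() == 0:          # child: no exec of a separate binary
                sock.close()
                serve_request(data, client)
                os._exit(0)
            # parent loops back immediately for the next request

    if __name__ == "__main__":
        main()
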
TMCD interaction (Step D1):
The sheer volume of requests is the big concern here. At 26 requests
per node, per boot, it would be easy to overwhelm the server and cause
requests to time out on the clients.
An obvious boot-time-only solution is to have a meta-request which
will get all the info for a host in one request at boot time. This
could take the form of a script which makes the call and records the
data into a file for all other scripts to use, or it could be a
proxy tmcd which starts up first and services all the usual tmcc calls.
A more general solution is a caching proxy which runs all the time.
This would presumably be an extension of the current watchdog using
the keep-alive packets to find out if cached data needs updating.
For jails, we would presumably run only a single proxy outside the
jails since I doubt we will ever support enough jails per node to
require a proxy in each.
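
A minimal sketch of the boot-time meta-request idea: hit the server once,
cache the answers in files, and have the existing scripts read the cache
instead of going back to tmcd. The cache location and the tmcc()
placeholder are assumptions for illustration, not existing features:

    import os

    CACHE_DIR = "/var/emulab/boot/tmcc-cache"   # hypothetical location
    BOOT_COMMANDS = ["status", "mounts", "accounts", "ifconfig",
                     "hostnames", "routing", "rpms", "tarballs", "startupcmd"]

    def tmcc(command):
        """Stand-in for the real tmcc call (see the sketch under step D1)."""
        return "result of %s\n" % command

    def fill_cache():
        """One pass over the server; everything after this is local reads."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        for cmd in BOOT_COMMANDS:
            with open(os.path.join(CACHE_DIR, cmd), "w") as f:
                f.write(tmcc(cmd))

    def cached(command):
        """What the per-subsystem scripts would call instead of tmcc."""
        with open(os.path.join(CACHE_DIR, command)) as f:
            return f.read()

    if __name__ == "__main__":
        fill_cache()
        print(cached("ifconfig"))
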
NFS traffic (Step E1):
When lots of nodes attempt to use NFS, we get in trouble in a hurry.
The NFS traffic, being UDP, interferes with most other boot-time or
run-time traffic, causing timeouts and lost packets.
One thing we could do is switch to using TCP-based NFS, where we would
get some congestion control. FreeBSD TCP NFS works fine; we would need
to be sure of Linux NFS. We should also make sure we are using NFS v3,
which reduces the amount of control traffic somewhat. We could also get
a serious file server to handle the load.
Alternatives to NFS? We can use ssh to download tarballs and RPMs.
We could provide a logging facility that uses something other than
NFS. But in general, if we want to continue to present a filesystem
interface to shared space, there are not many good options.
======================
Related Mail messages:
======================
Date: Wed, 18 Dec 2002 10:50:52 -0700 (MST)
From: Mike Hibler <mike@flux.utah.edu>
Message-Id: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>
To: testbed-ops@flux.utah.edu
Subject: PXE and DHCP and TFTP
Was looking into this a bit yesterday.
Summary: we can probably eliminate 1 or 2 of the 3 PXE/DHCP transactions
at boot and there is a potentially better TFTP daemon out there.
PXE/DHCP: the normal procedure is three transactions:
1. DHCPDISCOVER to server, server replies with DHCP info
Up to four retries with timeouts of 4, 8, 16, 32 seconds
(total of 60 seconds to get an initial reply)
2. DHCPREQUEST to server, server replies
This is apparently necessary in the DHCP protocol even though the
client got its info from the DHCPDISCOVER reply. It's not clear what
timeout is used here. If it doesn't get a reply, it restarts
with the discover. One note: if the client gets a BOOTP reply rather
than a DHCP reply to the first query, this step isn't needed.
3. DHCPREQUEST to the boot server (proxydhcp), server replies
This gets the boot file name. Up to four retries with 1, 2, 3, 4
second timeouts. This is why elvind/stated bogging down at all
gets us in trouble, they don't cut us a whole lot of slack.
Possible optimizations:
1. Have dhcpd send bootp replies in step 1, eliminating the second
step. Not sure this is possible.
2. Let regular dhcpd provide the boot file (pxeboot). I found a way to do
this even with dhcpd version 2. This eliminates proxydhcp. The downside
is that the boot file must be specified in the dhcpd.conf file. While we
could dynamically generate the dhcpd.conf file, we're opening ourselves
up to the same headaches we have with mountd, named and anything else
that we have to kill/restart. Or we can hack dhcpd to directly access
the DB for its info. I looked at both V2 and V3 to see if there were
any existing hook mechanisms or if there were any obvious places at
which to add hooks, but it looks like it will be a PITA.
3. We can get rid of regular dhcpd and just use proxydhcp. This should
work though I haven't tried it. This would reduce it to a single
transaction. Downside is that we lose the ability to handle regular
DHCP traffic (since proxydhcp would have to run on the standard port).
Not an issue for us, at least right now.
TFTP: I looked at tftp-hpa (http://freshmeat.net/projects/tftp-hpa/) some.
While it doesn't have a FreeBSD port, it does use configure and is supposed
to work on BSD (it is derived from the BSD tftp code). The server supports
some of the newer TFTP options like increased transfer blocksize (> 512
bytes) and the ability to test for the existence of a file. It can operate
standalone (rather than under inetd) where it will just fork to handle new
requests (rather than fork/exec) and/or it can have copies of the daemon
persist rather than immediately exit. We can probably tweak-out the BSD
boot code to take advantage of the options, and hopefully persistent and
faster spawning daemons would take care of some of the load issues.
Date: Wed, 18 Dec 2002 09:58:41 -0800
From: Leigh Stoller <stoller@flux.utah.edu>
To: mike@cs.utah.edu
Cc: testbed-ops@flux.utah.edu
Subject: Re: PXE and DHCP and TFTP
In-Reply-To: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>
References: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>
> From: Mike Hibler <mike@flux.utah.edu>
> Subject: PXE and DHCP and TFTP
> Date: Wed, 18 Dec 2002 10:50:52 -0700 (MST)
>
> 1. Have dhcpd send bootp replies in step 1, eliminating the second
> step. Not sure this is possible.
No, it's not. The client won't go to the proxydhcp.
> 2. Let regular dhcpd provide the boot file (pxeboot). I found a way to do
> this even with dhcpd version 2. This eliminates proxydhcp. The downside
> is that the boot file must be specified in the dhcpd.conf file. While we
> could dynamically generate the dhcpd.conf file, we're opening ourselves
> up to the same headaches we have with mountd, named and anything else
> that we have to kill/restart. Or we can hack dhcpd to directly access
> the DB for its info. I looked at both V2 and V3 to see if there were
> any existing hook mechanisms or if there were any obvious places at
> which to add hooks, but it looks like it will be a PITA.
We could hardwire the path if the bsd based pxeboot you put together could
handle everything after that. That is, we run the same pxeboot all the
time, and have it contact bootinfo to see if it should run the frisbee MFS,
or the freebsd MFS, or whatever, and then hand off to that. That gives us
lots of control over timeouts and retries.
Don't know if that's possible, though. It would require that pxeboot be able
to load and run another pxeboot (say, the one in the frisbee directory).