Commit c59b861f authored by Mike Hibler's avatar Mike Hibler
Browse files

Some notes on an obscure feature of imagezip, the -F option.

I got side-tracked on this while looking for reasons why a simple delta image
is so large.  So I thought I would reevaluate our choice of 64 as the default
value.  Bottom line is that it is a pretty good choice.
parent 189e10ca
One of the undocumented behaviors of imagezip is that it will ignore free
ranges it discovered that are less than a certain size. That size can be
set with the 'F' option and defaults to 64 (sectors, aka 32KB). By ignoring
free ranges we effectively make the amount of allocated data larger.
Why do we do this? There are two reasons, best illustrated by an example.
Assume we have a 64 sector disk where every other sector is allocated.
In the case where we do not ignore any free ranges (-F 0), the resulting
image would have 32 allocated ranges, each one sector long. With the
default frange setting, we would instead have 1 allocated range of 64 sectors
because we would ignore (i.e., treat as allocated) all of the 1 sector
free ranges. This means that the resulting restore of the image would
require only a single large write instead of 32 small writes and the amount
of metadata (the "allocated range" descriptors) is decreased from 32 to 1.
So we do this both to reduce the amount of metadata that is passed in the
image and also to increase the size of write operations done on the disk,
since large writes are more efficient than small ones (primarily due to
fewer seeks). We do this at the expense of some image size, since more
data is listed as allocated.
Here are some numbers that bear this out. This is for our standard disk
image which includes a 3GB FreeBSD 4.10 partition and a 3GB Redhat 9 Linux
partition. Combined there is about 2GB of allocated data. In the following,
"F value" is the setting of the imagezip 'F' option (in sectors), "# allocated
ranges" is the number of metadata entries, "allocated size" is the amount of
data declared allocated that goes into the image, "image size" is the size
of the resulting image, "image save" the time required to run imagezip and
produce the image file (on a pc3000), "local load" is the time to load to
a local disk using imageunzip (on a pc3000), and "pc850" and "pc3000" are
the time to run frisbee to reload the image on a pc850 and pc3000 respectively.
The last two are broken into "r" for randomized order of blocks requested
and "not" for sequential block requests.
A pc850 is an old 850Mhz PIII (BX chipset) with 512MB of memory and a 7200 rpm
(ATA33) IDE disk. A pc3000 is a 3GHz P4 (E7520 chipset) with 2GB of memory
and a 10K rpm SCSI disk (write-caching enabled).
image local frisbee load
'F' #alloc alloc image save load pc850 pc3000
value ranges size(MB) size(MB) (sec) (sec) (r/not) (r/not)
0 14175 1926.6 595.6 189 55 109/109 56.8/55.7
8 10840 1933.4 596.6 177 51 109/108 54.4/53.9
16 8047 1947.9 600.8 171 46 109/108 52.7/52.7
32 5980 1968.4 607.1 167 39 109/110 53.4/53.3
64 4482 2001.8 620.8 167 37 111/110 54.5/54.4
128 3224 2059.3 634.4 168 36
The effect on sizes is pretty obvious, the smaller the free ranges you
allow, the more metadata but less actual data in the image.
The fact that it takes more time to create images with more metadata is
partially a side-effect of the larger metadata list (which is implemented
as a simple singly-linked list) and partially due to making fewer, smaller
disk reads.
For loading a local disk via imageunzip, we see the effect which was
postulated, that fewer and larger writes will be more efficient.
Interestingly we do not see this when using frisbee. One could speculate
that frisbee's default behavior of randomizing the order of the blocks it
requests, is causing lots of disk seeks anyway and that that dwarfs any
time we save by doing large writes. However, even when frisbee is told not
to randomize the order (the "not" in "r/not"), the time is largely uneffected.
For the pc3000, it is the case that we simply cannot feed it image data
fast enough with a 100Mb link. By switching to a Gb link and cranking up
the bandwidth on the server, we are able to obtain times, for non-randomized
ordering, matching those for local load. For random order, times were 4-5%
worse. By observing with lstop, it looks like we require about 120-130Mb/sec,
so we don't miss by much.
Quick experiment: create a "-z 9" compressed image to see what that does:
'F' #alloc alloc image save load pc850 pc3000
value ranges size(MB) size(MB) (sec) (sec) (r/not) (r/not)
64 4482 2001.8 620.8 166 37 111/110 54.5/54.4
64 (z9) 597.7 808 37 /52.4
Thus we can improve run time by ~4% (and save ~3% in image size) at the expense
of taking 5x time to compress. So if we want to get faster on pc3000s using
100Mb, we will have to use other tricks.
For the pc850, even pushing the bandwidth to the full 100Mb (default is ~70Mb),
we see 107-109 seconds. Here we are very much disk bound with the decompresser
thread often blocked waiting for a disk buffer to fill. Note that a straight
dd to disk shows the expected behavior due to large writes: 8KB writes achieve
about 17MB/sec while 64KB writes get around 21MB/sec, but we do not see that
in frisbee.
So what about Windows you say? I'm glad you asked. Using our up-to-date
XP SP2+ image, which is a 4GB partition with just over 2GB allocated.
image local frisbee load
'F' #alloc alloc image save load pc850 pc3000
value ranges size(MB) size(MB) (sec) (sec) (r/not) (r/not)
0 16279 1956.6 1051.7 242 64
16 13546 1967.9 1057.0 236 57
32 9359 2006.4 1073.7 230 43
64 6359 2072.0 1103.1 231 40
128 4244 2165.3 1112.5 228 38
64 (z9) 6338 2072.0 1086.3 566 39
So, is there a bottom line in all this? No, not really. If we are concerned
with space (~2-3%) we could probably go to F=32 without affecting client
performance too much (~5%). Or, if we want to pick up the client speed a
bit (~3-5%), we could go to F=128 with a slight loss in space (~1-2%).
The case for switching to "completely accurate" (F=0) images would be harder
to make: save ~4-5% in space at the expense of 50-60% in best-case load time.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment