Commit 5db53f3e authored by Joern Engel's avatar Joern Engel

[LogFS] add new flash file system

This is a new flash file system. See
Documentation/filesystems/logfs.txt
Signed-off-by: default avatarJoern Engel <joern@logfs.org>
parent 66b00a7c
......@@ -62,6 +62,8 @@ jfs.txt
- info and mount options for the JFS filesystem.
locks.txt
- info on file locking implementations, flock() vs. fcntl(), etc.
logfs.txt
- info on the LogFS flash filesystem.
mandatory-locking.txt
- info on the Linux implementation of Sys V mandatory file locking.
ncpfs.txt
......
The LogFS Flash Filesystem
==========================
Specification
=============
Superblocks
-----------
Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.
Superblock locations may differ for MTD and block devices. On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.
For the most part, the superblocks can be considered read-only. They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.
Segments
--------
The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS. Within each segments,
writes happen from front (low addresses) to back (high addresses. If
only a partial segment has been written, the segment number, the
current position within and optionally a write buffer are stored in
the journal.
Segments are erased as a whole. Therefore Garbage Collection may be
required to completely free a segment before doing so.
Journal
--------
The journal contains all global information about the filesystem that
is subject to frequent change. At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.
Object Store
------------
All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.
Levels
------
Garbage collection (GC) may fail if all data is written
indiscriminately. One requirement of GC is that data is seperated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
respectively. Inode file data is on level 6 for the inodes and 7-11
for indirect blocks.
Each segment contains objects of a single level only. As a result,
each level requires its own seperate segment to be open for writing.
Inode File
----------
All inodes are stored in a special file, the inode file. Single
exception is the inode file's inode (master inode) which for obvious
reasons is stored in the journal instead. Instead of data blocks, the
leaf nodes of the inode files are inodes.
Aliases
-------
Writes in LogFS are done by means of a wandering tree. A naïve
implementation would require that for each write or a block, all
parent blocks are written as well, since the block pointers have
changed. Such an implementation would not be very efficient.
In LogFS, the block pointer changes are cached in the journal by means
of alias entries. Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data. Any 8-byte word can be changes in this manner.
Currently aliases are used for block pointers, file size, file used
bytes and the height of an inodes indirect tree.
Segment Aliases
---------------
Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tupel. GC will later empty this
segment and the alias can be removed again. This is used on MTD only.
Vim
---
By cleverly predicting the life time of data, it is possible to
seperate long-living data from short-living data and thereby reduce
the GC overhead later. Each type of distinc life expectency (vim) can
have a seperate segment open for writing. Each (level, vim) tupel can
be open just once. If an open segment with unknown vim is encountered
at mount time, it is closed and ignored henceforth.
Indirect Tree
-------------
Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers. One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.
Another difference is the addressing of indirect blocks. In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode. In ext2 (maybe
others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 is using arithmetic to better utilize space, LogFS keeps
arithmetic simple and uses compression to save space.
Compression
-----------
Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing
so has no effect on existing data, but new data will be stored
accordingly. New inodes will inherit the compression flag of the
parent directory.
Metadata is always compressed. However, the space accounting ignores
this and charges for the uncompressed size. Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously. Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.
However, they are not lost space to the filesystem internals. By
cheating the user for those bytes, the filesystem gained some slack
space and GC will run less often and faster.
Garbage Collection and Wear Leveling
------------------------------------
Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based
on the least amount of valid data contained in the segment. All
remaining valid data is copied elsewhere, thereby invalidating it.
The GC code also checks for aliases and writes then back if their
number gets too large.
Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection. If a stale segments erase count is significantly
lower than the active segments' erase counts, it will be picked. Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.
Values for "occasionally", "significantly lower" are compile time
constants.
Hashed directories
------------------
To satisfy efficient lookup(), directory entries are hashed and
located based on the hash. In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used. For each table, the hash value
modulo the table size gives the table index.
Tables sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively. So the first
table contains 16 entries, the second 512-16, etc.
The last table is special in several ways. First its size depends on
the effective 32bit limit on telldir/seekdir cookies. Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31. Secondly the table contains hash buckets with 16
entries each.
Using single-entry buckets would result in birthday "attacks". At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries. So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access
to a directory.
Open For Discussion
===================
Device Address Space
--------------------
A device address space is used for caching. Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.
Meta Inodes
-----------
Inodes are stored in the inode file, which is just a regular file for
most purposes. At umount time, however, the inode file needs to
remain open until all dirty inodes are written. So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either. Same
goes for mapping inode of the device address space.
Currently logfs uses a hack that essentially copies part of fs/inode.c
code over. A general solution would be preferred.
Indirect block mapping
----------------------
With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks. Some other place is required. Currently
logfs uses the top half of each inode's address space. The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB. But files >16TB would cause problems anyway, so
only the limit has changed.
......@@ -177,6 +177,7 @@ source "fs/efs/Kconfig"
source "fs/jffs2/Kconfig"
# UBIFS File system configuration
source "fs/ubifs/Kconfig"
source "fs/logfs/Kconfig"
source "fs/cramfs/Kconfig"
source "fs/squashfs/Kconfig"
source "fs/freevxfs/Kconfig"
......
......@@ -99,6 +99,7 @@ obj-$(CONFIG_NTFS_FS) += ntfs/
obj-$(CONFIG_UFS_FS) += ufs/
obj-$(CONFIG_EFS_FS) += efs/
obj-$(CONFIG_JFFS2_FS) += jffs2/
obj-$(CONFIG_LOGFS) += logfs/
obj-$(CONFIG_UBIFS_FS) += ubifs/
obj-$(CONFIG_AFFS_FS) += affs/
obj-$(CONFIG_ROMFS_FS) += romfs/
......
config LOGFS
tristate "LogFS file system (EXPERIMENTAL)"
depends on (MTD || BLOCK) && EXPERIMENTAL
select ZLIB_INFLATE
select ZLIB_DEFLATE
select CRC32
select BTREE
help
Flash filesystem aimed to scale efficiently to large devices.
In comparison to JFFS2 it offers significantly faster mount
times and potentially less RAM usage, although the latter has
not been measured yet.
In its current state it is still very experimental and should
not be used for other than testing purposes.
If unsure, say N.
obj-$(CONFIG_LOGFS) += logfs.o
logfs-y += compr.o
logfs-y += dir.o
logfs-y += file.o
logfs-y += gc.o
logfs-y += inode.o
logfs-y += journal.o
logfs-y += readwrite.o
logfs-y += segment.o
logfs-y += super.o
logfs-$(CONFIG_BLOCK) += dev_bdev.o
logfs-$(CONFIG_MTD) += dev_mtd.o
/*
* fs/logfs/compr.c - compression routines
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/vmalloc.h>
#include <linux/zlib.h>
#define COMPR_LEVEL 3
static DEFINE_MUTEX(compr_mutex);
static struct z_stream_s stream;
int logfs_compress(void *in, void *out, size_t inlen, size_t outlen)
{
int err, ret;
ret = -EIO;
mutex_lock(&compr_mutex);
err = zlib_deflateInit(&stream, COMPR_LEVEL);
if (err != Z_OK)
goto error;
stream.next_in = in;
stream.avail_in = inlen;
stream.total_in = 0;
stream.next_out = out;
stream.avail_out = outlen;
stream.total_out = 0;
err = zlib_deflate(&stream, Z_FINISH);
if (err != Z_STREAM_END)
goto error;
err = zlib_deflateEnd(&stream);
if (err != Z_OK)
goto error;
if (stream.total_out >= stream.total_in)
goto error;
ret = stream.total_out;
error:
mutex_unlock(&compr_mutex);
return ret;
}
int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen)
{
int err, ret;
ret = -EIO;
mutex_lock(&compr_mutex);
err = zlib_inflateInit(&stream);
if (err != Z_OK)
goto error;
stream.next_in = in;
stream.avail_in = inlen;
stream.total_in = 0;
stream.next_out = out;
stream.avail_out = outlen;
stream.total_out = 0;
err = zlib_inflate(&stream, Z_FINISH);
if (err != Z_STREAM_END)
goto error;
err = zlib_inflateEnd(&stream);
if (err != Z_OK)
goto error;
ret = 0;
error:
mutex_unlock(&compr_mutex);
return ret;
}
int __init logfs_compr_init(void)
{
size_t size = max(zlib_deflate_workspacesize(),
zlib_inflate_workspacesize());
stream.workspace = vmalloc(size);
if (!stream.workspace)
return -ENOMEM;
return 0;
}
void logfs_compr_exit(void)
{
vfree(stream.workspace);
}
/*
* fs/logfs/dev_bdev.c - Device access methods for block devices
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))
static void request_complete(struct bio *bio, int err)
{
complete((struct completion *)bio->bi_private);
}
static int sync_request(struct page *page, struct block_device *bdev, int rw)
{
struct bio bio;
struct bio_vec bio_vec;
struct completion complete;
bio_init(&bio);
bio.bi_io_vec = &bio_vec;
bio_vec.bv_page = page;
bio_vec.bv_len = PAGE_SIZE;
bio_vec.bv_offset = 0;
bio.bi_vcnt = 1;
bio.bi_idx = 0;
bio.bi_size = PAGE_SIZE;
bio.bi_bdev = bdev;
bio.bi_sector = page->index * (PAGE_SIZE >> 9);
init_completion(&complete);
bio.bi_private = &complete;
bio.bi_end_io = request_complete;
submit_bio(rw, &bio);
generic_unplug_device(bdev_get_queue(bdev));
wait_for_completion(&complete);
return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
}
static int bdev_readpage(void *_sb, struct page *page)
{
struct super_block *sb = _sb;
struct block_device *bdev = logfs_super(sb)->s_bdev;
int err;
err = sync_request(page, bdev, READ);
if (err) {
ClearPageUptodate(page);
SetPageError(page);
} else {
SetPageUptodate(page);
ClearPageError(page);
}
unlock_page(page);
return err;
}
static DECLARE_WAIT_QUEUE_HEAD(wq);
static void writeseg_end_io(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct super_block *sb = bio->bi_private;
struct logfs_super *super = logfs_super(sb);
struct page *page;
BUG_ON(!uptodate); /* FIXME: Retry io or write elsewhere */
BUG_ON(err);
BUG_ON(bio->bi_vcnt == 0);
do {
page = bvec->bv_page;
if (--bvec >= bio->bi_io_vec)
prefetchw(&bvec->bv_page->flags);
end_page_writeback(page);
} while (bvec >= bio->bi_io_vec);
bio_put(bio);
if (atomic_dec_and_test(&super->s_pending_writes))
wake_up(&wq);
}
static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
size_t nr_pages)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct bio *bio;
struct page *page;
struct request_queue *q = bdev_get_queue(sb->s_bdev);
unsigned int max_pages = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
int i;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio); /* FIXME: handle this */
for (i = 0; i < nr_pages; i++) {
if (i >= max_pages) {
/* Block layer cannot split bios :( */
bio->bi_vcnt = i;
bio->bi_idx = 0;
bio->bi_size = i * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = writeseg_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
ofs += i * PAGE_SIZE;
index += i;
nr_pages -= i;
i = 0;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
}
page = find_lock_page(mapping, index + i);
BUG_ON(!page);
bio->bi_io_vec[i].bv_page = page;
bio->bi_io_vec[i].bv_len = PAGE_SIZE;
bio->bi_io_vec[i].bv_offset = 0;
BUG_ON(PageWriteback(page));
set_page_writeback(page);
unlock_page(page);
}
bio->bi_vcnt = nr_pages;
bio->bi_idx = 0;
bio->bi_size = nr_pages * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = writeseg_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
return 0;
}
static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
{
struct logfs_super *super = logfs_super(sb);
int head;
BUG_ON(super->s_flags & LOGFS_SB_FLAG_RO);
if (len == 0) {
/* This can happen when the object fit perfectly into a
* segment, the segment gets written per sync and subsequently
* closed.
*/
return;
}
head = ofs & (PAGE_SIZE - 1);
if (head) {
ofs -= head;
len += head;
}
len = PAGE_ALIGN(len);
__bdev_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
generic_unplug_device(bdev_get_queue(logfs_super(sb)->s_bdev));
}
static int bdev_erase(struct super_block *sb, loff_t to, size_t len)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct page *page;
pgoff_t index = to >> PAGE_SHIFT;
int i, nr_pages = len >> PAGE_SHIFT;
BUG_ON(to & (PAGE_SIZE - 1));
BUG_ON(len & (PAGE_SIZE - 1));
if (logfs_super(sb)->s_flags & LOGFS_SB_FLAG_RO)
return -EROFS;
for (i = 0; i < nr_pages; i++) {
page = find_get_page(mapping, index + i);
if (page) {
memset(page_address(page), 0xFF, PAGE_SIZE);
page_cache_release(page);
}
}
return 0;
}
static void bdev_sync(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
wait_event(wq, atomic_read(&super->s_pending_writes) == 0);
}
static struct page *bdev_find_first_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = bdev_readpage;
*ofs = 0;
return read_cache_page(mapping, 0, filler, sb);
}
static struct page *bdev_find_last_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = bdev_readpage;
u64 pos = (super->s_bdev->bd_inode->i_size & ~0xfffULL) - 0x1000;
pgoff_t index = pos >> PAGE_SHIFT;
*ofs = pos;
return read_cache_page(mapping, index, filler, sb);
}
static int bdev_write_sb(struct super_block *sb, struct page *page)
{
struct block_device *bdev = logfs_super(sb)->s_bdev;
/* Nothing special to do for block devices. */
return sync_request(page, bdev, WRITE);
}
static void bdev_put_device(struct super_block *sb)
{
close_bdev_exclusive(logfs_super(sb)->s_bdev, FMODE_READ|FMODE_WRITE);
}
static const struct logfs_device_ops bd_devops = {
.find_first_sb = bdev_find_first_sb,
.find_last_sb = bdev_find_last_sb,
.write_sb = bdev_write_sb,
.readpage = bdev_readpage,
.writeseg = bdev_writeseg,
.erase = bdev_erase,
.sync = bdev_sync,
.put_device = bdev_put_device,
};
int logfs_get_sb_bdev(struct file_system_type *type, int flags,
const char *devname, struct vfsmount *mnt)
{
struct block_device *bdev;
bdev = open_bdev_exclusive(devname, FMODE_READ|FMODE_WRITE, type);
if (IS_ERR(bdev))
return PTR_ERR(bdev);
if (MAJOR(bdev->bd_dev) == MTD_BLOCK_MAJOR) {
int mtdnr = MINOR(bdev->bd_dev);
close_bdev_exclusive(bdev, FMODE_READ|FMODE_WRITE);
return logfs_get_sb_mtd(type, flags, mtdnr, mnt);
}
return logfs_get_sb_device(type, flags, NULL, bdev, &bd_devops, mnt);
}
/*
* fs/logfs/dev_mtd.c - Device access methods for MTD
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/completion.h>
#include <linux/mount.h>
#include <linux/sched.h>
#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))
static int mtd_read(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
struct mtd_info *mtd = logfs_super(sb)->s_mtd;
size_t retlen;
int ret;
ret = mtd->read(mtd, ofs, len, &retlen, buf);
BUG_ON(ret == -EINVAL);
if (ret)
return ret;
/* Not sure if we should loop instead. */
if (retlen != len)
return -EIO;
return 0;
}
static int mtd_write(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
struct logfs_super *super = logfs_super(sb);
struct mtd_info *mtd = super->s_mtd;
size_t retlen;
loff_t page_start, page_end;
int ret;
if (super->s_flags & LOGFS_SB_FLAG_RO)
return -EROFS;
BUG_ON((ofs >= mtd->size) || (len > mtd->size - ofs));
BUG_ON(ofs != (ofs >> super->s_writeshift) << super->s_writeshift);
BUG_ON(len > PAGE_CACHE_SIZE);
page_start = ofs & PAGE_CACHE_MASK;
page_end = PAGE_CACHE_ALIGN(ofs + len) - 1;
ret = mtd->write(mtd, ofs, len, &retlen, buf);
if (ret || (retlen != len))
return -EIO;
return 0;
}
/*
* For as long as I can remember (since about 2001) mtd->erase has been an
* asynchronous interface lacking the first driver to actually use the
* asynchronous properties. So just to prevent the first implementor of such
* a thing from breaking logfs in 2350, we do the usual pointless dance to
* declare a completion variable and wait for completion before returning
* from mtd_erase(). What an excercise in futility!
*/
static void logfs_erase_callback(struct erase_info *ei)
{
complete((struct completion *)ei->priv);
}
static int mtd_erase_mapping(struct super_block *sb, loff_t ofs, size_t len)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct page *page;
pgoff_t index = ofs >> PAGE_SHIFT;
for (index = ofs >> PAGE_SHIFT; index < (ofs + len) >> PAGE_SHIFT; index++) {
page = find_get_page(mapping, index);
if (!page)
continue;
memset(page_address(page), 0xFF, PAGE_SIZE);
page_cache_release(page);