Skip to content
  • Christoph Hellwig's avatar
    fsync: wait for data writeout completion before calling ->fsync · 2daea67e
    Christoph Hellwig authored
    
    
    Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
    the data, the calls into ->fsync to write out the metadata and then finally
    calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
    like a clever micro-optimization actually is nast trap for many filesystems.
    
    For many modern filesystems i_size or other inode information is only
    updated on I/O completion and we need to wait for I/O to finish before
    we can write out the metadata.  For old fashionen filesystems that
    instanciate blocks during the actual write and also update the metadata
    at that point it opens up a large window were we could expose uninitialized
    blocks after a crash.  While a few filesystems that need it already wait
    for the I/O to finish inside their ->fsync methods it is rather suboptimal
    as it is done under the i_mutex and also always for the whole file instead
    of just a part as we could do for O_SYNC handling.
    
    Here is a small audit of all fsync instances in the tree:
    
     - spufs_mfc_fsync:
     - ps3flash_fsync:
     - vol_cdev_fsync:
     - printer_fsync:
     - fb_deferred_io_fsync:
     - bad_file_fsync:
     - simple_sync_file:
    
    	don't care - filesystems/drivers do't use the page cache or are
    	purely in-memory.
    
     - simple_fsync:
     - file_fsync:
     - affs_file_fsync:
     - fat_file_fsync:
     - jfs_fsync:
     - ubifs_fsync:
     - reiserfs_dir_fsync:
     - reiserfs_sync_file:
    
    	never touch pagecache themselves.  We need to wait before if we do
    	not want to expose stale data after an allocation.
    
     - afs_fsync:
     - fuse_fsync_common:
    
    	do the waiting writeback itself in awkward ways, would benefit from
    	proper semantics
    
     - block_fsync:
    
    	Does a filemap_write_and_wait on the block device inode.  Because we
    	now have f_mapping that is the same inode we call it on in vfs_fsync.
    	So just removing it and letting the VFS do the work in one go would
    	be an improvement.
    
     - btrfs_sync_file:
     - cifs_fsync:
     - xfs_file_fsync:
    
    	need the wait first and currently do it themselves. would benefit from
    	doing it outside i_mutex.
    
     - coda_fsync:
     - ecryptfs_fsync:
     - exofs_file_fsync:
     - shm_fsync:
    
    	only passes the fsync through to the lower layer
    
     - ext3_sync_file:
    
    	doesn't seem to care, comments are confusing.
    
     - ext4_sync_file:
    
    	would need the wait to work correctly for delalloc mode with late
    	i_size updates.  Otherwise the ext3 comment applies.
    
    	currently implemens it's own writeback and wait in an odd way,
    	could benefit from doing it properly.
    
     - gfs2_fsync:
    
    	not needed for journaled data mode, but probably harmless there.
    	Currently writes back data asynchronously itself.  Needs some
    	major audit.
    
     - hostfs_fsync:
    
    	just calls fsync/datasync on the host FD.  Without the wait before
    	data might not even be inflight yet if we're unlucky.
    
     - hpfs_file_fsync:
     - ncp_fsync:
    
    	no-ops.  Dangerous before and after.
    
     - jffs2_fsync:
    
    	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.
    
     - nfs_fsync_dir:
    
    	just increments stats, claims all directory operations are synchronous
    
     - nfs_file_fsync:
    
    	only writes out data???  Looks very odd.
    
     - nilfs_sync_file:
    
    	looks like it expects all data done, but not sure from the code
    
     - ntfs_dir_fsync:
     - ntfs_file_fsync:
    
    	appear to do their own data writeback.  Very convoluted code.
    
     - ocfs2_sync_file:
    
    	does it's own data writeback, but no wait.  probably needs the wait.
    
     - smb_fsync:
    
    	according to a comment expects all pages written already, probably needs
    	the wait before.
    
    This patch only changes vfs_fsync_range, removal of the wait in the methods
    that have it is left to the filesystem maintainers.  Note that most
    filesystems really do need an audit for their fsync methods given the
    gems found in this very brief audit.
    
    Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
    Signed-off-by: default avatarJan Kara <jack@suse.cz>
    2daea67e