    xfs: rework zero range to prevent invalid i_size updates · 5d11fb4b
    Brian Foster authored
    
    
    The zero range operation is analogous to fallocate with the exception of
    converting the range to zeroes. I.e., it attempts to allocate zeroed
    blocks over the range specified by the caller. The XFS implementation
    kills all delalloc blocks currently over the aligned range, converts the
    range to allocated zero blocks (unwritten extents) and handles the
    partial pages at the ends of the range by sending writes through the
    pagecache.
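
    As a rough illustration of the split described above, the sketch below
    divides a byte range into an aligned interior and the partial pieces at
    either end. The struct, the function name and the 4096-byte granularity
    used in the usage note are hypothetical and are not taken from the XFS
    sources.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of XFS. */
struct zero_range_split {
    uint64_t head_start, head_len; /* partial piece before the aligned range */
    uint64_t mid_start, mid_len;   /* aligned interior (whole units only) */
    uint64_t tail_start, tail_len; /* partial piece after the aligned range */
};

/*
 * Split [offset, offset + len) at multiples of 'granularity', which is
 * assumed to be a power of two (e.g. the page or block size).
 */
static struct zero_range_split
split_zero_range(uint64_t offset, uint64_t len, uint64_t granularity)
{
    struct zero_range_split s = {0};
    uint64_t mask = granularity - 1;
    uint64_t end = offset + len;
    uint64_t mid_start = (offset + mask) & ~mask; /* round offset up */
    uint64_t mid_end = end & ~mask;               /* round end down */

    if (mid_start > mid_end) {
        /* The range lies strictly inside one unit: it is all "head". */
        s.head_start = offset;
        s.head_len = len;
        return s;
    }

    s.head_start = offset;
    s.head_len = mid_start - offset;
    s.mid_start = mid_start;
    s.mid_len = mid_end - mid_start;
    s.tail_start = mid_end;
    s.tail_len = end - mid_end;
    return s;
}
```

    With an assumed granularity of 4096 bytes, a request at offset 1m for
    54321 bytes has no head, an aligned interior of 53248 bytes and a
    partial tail of 1073 bytes.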
    
    The current implementation suffers from several problems associated with
    inode size. If the aligned range covers an extending I/O, said I/O is
    discarded and an inode size update from a previous write never makes it
    to disk. Further, if an unaligned zero range extends beyond eof, the
    page write induced for the partial end page can itself increase the
    inode size, even if the zero range request is not supposed to update
    i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
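
    The size rule a caller should be able to rely on can be written down as
    a small check. This is only a sketch of the expected semantics, assuming
    zero range follows the usual fallocate convention; the function name is
    hypothetical and does not exist in XFS.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Expected inode size after a zero range request (illustrative only).
 * With KEEP_SIZE the size must never change; without it, the size grows
 * to the end of the request only if the request extends past EOF.
 */
static uint64_t
expected_isize_after_zero_range(uint64_t old_size, uint64_t offset,
                                uint64_t len, bool keep_size)
{
    uint64_t end = offset + len;

    if (keep_size || end <= old_size)
        return old_size;
    return end;
}
```

    For a 128k file and a KEEP_SIZE zero range of 54321 bytes at offset 1m,
    the expected size stays at 128k. Any larger result is the invalid i_size
    update described above.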
    
    The latter behavior not only incorrectly increases the inode size, but
    can lead to stray delalloc blocks on the inode. Typically, post-eof
    preallocation blocks are either truncated on release or inode eviction
    or explicitly written to by xfs_zero_eof() on natural file size
    extension. If the inode size increases due to zero range, however,
    associated blocks leak into the address space having never been
    converted or mapped to pagecache pages. A direct I/O to such an
    uncovered range cannot convert the extent via writeback and will BUG().
    For example:
    
    $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
    ...
    $ xfs_io -d -c "pread 128k 128k" <file>
    <BUG>
    
    If the entire delalloc extent happens to not have page coverage
    whatsoever (e.g., delalloc conversion couldn't find a large enough free
    space extent), even a full file writeback won't convert what's left of
    the extent and we'll assert on inode eviction.
    
    Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
    Use the existing hole punch and prealloc mechanisms as primitives for
    zero range. This implementation is neither efficient nor ideal as we
    writeback dirty data over the range and remove existing extents rather
    than convert to unwritten. The former writeback, however, is currently
    the only mechanism available to ensure consistency between pagecache and
    extent state. Even a pagecache truncate/delalloc punch prior to hole
    punch has led to inconsistencies due to racing with writeback.
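
    For readers more familiar with the syscall interface, the same idea can
    be approximated from user space with two fallocate(2) calls. This is a
    hypothetical analogue, not the kernel code in this patch: the function
    name is made up, and it assumes a Linux filesystem that supports
    FALLOC_FL_PUNCH_HOLE.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Hypothetical user-space analogue of "hole punch + prealloc".
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
zero_range_emulated(int fd, off_t offset, off_t len, int keep_size)
{
    /*
     * Step 1: punch out the range. Hole punch must be combined with
     * KEEP_SIZE and never changes the file size.
     */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) < 0)
        return -1;

    /*
     * Step 2: preallocate the same range. Preallocated blocks read back
     * as zeroes; the size only grows if the caller did not ask to keep it.
     */
    return fallocate(fd, keep_size ? FALLOC_FL_KEEP_SIZE : 0, offset, len);
}
```

    Neither step sends data through the pagecache, which is the property
    the rework relies on to keep i_size untouched in the KEEP_SIZE case.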
    
    This provides a consistent, correct implementation of zero range that
    survives fsstress/fsx testing without assert failures. The
    implementation can be optimized from this point forward once the
    fundamental issue of pagecache and delalloc extent state consistency is
    addressed.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
    