1. 16 Sep, 2009 5 commits
  2. 14 Sep, 2009 1 commit
  3. 11 Sep, 2009 5 commits
    • Jens Axboe's avatar
      writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Jens Axboe authored
      
      
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      500b067c
    • Jens Axboe's avatar
      writeback: get rid of pdflush completely · d0bceac7
      Jens Axboe authored
      
      
      It is now unused, so kill it off.
      
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      d0bceac7
    • Jens Axboe's avatar
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe authored
      
      
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • Jens Axboe's avatar
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe authored
      
      
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
    • Jens Axboe's avatar
      writeback: get rid of generic_sync_sb_inodes() export · d8a8559c
      Jens Axboe authored
      
      
      This adds two new exported functions:
      
      - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
        this super_block, for WB_SYNC_NONE writeout.
      - sync_inodes_sb(), which writes out all dirty inodes on this super_block
        and also waits for the IO to complete.
      
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      d8a8559c
  4. 24 Jun, 2009 1 commit
    • Christoph Hellwig's avatar
      cleanup __writeback_single_inode · 01c03194
      Christoph Hellwig authored
      
      
      There is no reason to for the split between __writeback_single_inode and
      __sync_single_inode, the former just does a couple of checks before
      tail-calling the latter.  So merge the two, and while we're at it split
      out the I_SYNC waiting case for data integrity writers, as it's
      logically separate function.  Finally rename __writeback_single_inode to
      writeback_single_inode.
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      01c03194
  5. 16 Jun, 2009 1 commit
    • Wu Fengguang's avatar
      writeback: skip new or to-be-freed inodes · 84a89245
      Wu Fengguang authored
      
      
      1) I_FREEING tests should be coupled with I_CLEAR
      
      The two I_FREEING tests are racy because clear_inode() can set i_state to
      I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
      
      2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
         races with generic_forget_inode()
      
      generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
      generic_sync_sb_inodes() shall not try to step in and create possible races:
      
        generic_forget_inode
          inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode_lock);
                                             generic_sync_sb_inodes()
                                               spin_lock(&inode_lock);
                                               __iget(inode);
                                               __writeback_single_inode
                                                 // see non zero i_count
       may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                               spin_unlock(&inode_lock);
       may call generic_forget_inode again ==> iput(inode);
      
      The above race and warning didn't turn up because writeback_inodes() holds
      the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
      early.  But we are not sure the UBIFS calls and future callers will
      guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.
      
      Cc: Eric Sandeen <sandeen@sandeen.net>
      Acked-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84a89245
  6. 11 Jun, 2009 3 commits
    • Nick Piggin's avatar
      fs: block_dump missing dentry locking · 4195f73d
      Nick Piggin authored
      
      
      I think the block_dump output in __mark_inode_dirty is missing dentry locking.
      Surely the i_dentry list can change any time, so we may not even *get* a
      dentry there. If we do get one by chance, then it would appear to be able to
      go away or get renamed at any time...
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      4195f73d
    • Nick Piggin's avatar
      fs: remove incorrect I_NEW warnings · 545b9fd3
      Nick Piggin authored
      Some filesystems can call in to sync an inode that is still in the
      I_NEW state (eg. ext family, when mounted with -osync). This is OK
      because the filesystem has sole access to the new inode, so it can
      modify i_state without races (because no other thread should be
      modifying it, by definition of I_NEW). Ie. a false positive, so
      remove the warnings.
      
      The races are described here 7ef0d737
      
      ,
      which is also where the warnings were introduced.
      
      Reported-by: default avatarStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      545b9fd3
    • Jan Kara's avatar
      vfs: Make sys_sync() use fsync_super() (version 4) · 5cee5815
      Jan Kara authored
      
      
      It is unnecessarily fragile to have two places (fsync_super() and do_sync())
      doing data integrity sync of the filesystem. Alter __fsync_super() to
      accommodate needs of both callers and use it. So after this patch
      __fsync_super() is the only place where we gather all the calls needed to
      properly send all data on a filesystem to disk.
      
      Nice bonus is that we get a complete livelock avoidance and write_supers()
      is now only used for periodic writeback of superblocks.
      
      sync_blockdevs() introduced a couple of patches ago is gone now.
      
      [build fixes folded]
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5cee5815
  7. 02 Apr, 2009 2 commits
    • Jeff Layton's avatar
      writeback: guard against jiffies wraparound on inode->dirtied_when checks (try #3) · d2caa3c5
      Jeff Layton authored
      
      
      The dirtied_when value on an inode is supposed to represent the first time
      that an inode has one of its pages dirtied.  This value is in units of
      jiffies.  It's used in several places in the writeback code to determine
      when to write out an inode.
      
      The problem is that these checks assume that dirtied_when is updated
      periodically.  If an inode is continuously being used for I/O it can be
      persistently marked as dirty and will continue to age.  Once the time
      compared to is greater than or equal to half the maximum of the jiffies
      type, the logic of the time_*() macros inverts and the opposite of what is
      needed is returned.  On 32-bit architectures that's just under 25 days
      (assuming HZ == 1000).
      
      As the least-recently dirtied inode, it'll end up being the first one that
      pdflush will try to write out.  sync_sb_inodes does this check:
      
      	/* Was this inode dirtied after sync_sb_inodes was called? */
       	if (time_after(inode->dirtied_when, start))
       		break;
      
      ...but now dirtied_when appears to be in the future.  sync_sb_inodes bails
      out without attempting to write any dirty inodes.  When this occurs,
      pdflush will stop writing out inodes for this superblock.  Nothing can
      unwedge it until jiffies moves out of the problematic window.
      
      This patch fixes this problem by changing the checks against dirtied_when
      to also check whether it appears to be in the future.  If it does, then we
      consider the value to be far in the past.
      
      This should shrink the problematic window of time to such a small period
      (30s) as not to matter.
      
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarIan Kent <raven@themaw.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2caa3c5
    • Wu Fengguang's avatar
      vfs: skip I_CLEAR state inodes · b6fac63c
      Wu Fengguang authored
      
      
      clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
      _outside_ of inode_lock.  So any I_FREEING testing is incomplete without a
      coupled testing of I_CLEAR.
      
      So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
      add_dquot_ref().
      
      Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
      reminds fixing the other two cases.
      
      Masayoshi MIZUMA has a nice panic flow:
      
      =====================================================================
                  [process A]               |        [process B]
       |                                    |
       |    prune_icache()                  | drop_pagecache()
       |      spin_lock(&inode_lock)        |   drop_pagecache_sb()
       |      inode->i_state |= I_FREEING;  |       |
       |      spin_unlock(&inode_lock)      |       V
       |          |                         |     spin_lock(&inode_lock)
       |          V                         |         |
       |      dispose_list()                |         |
       |        list_del()                  |         |
       |        clear_inode()               |         |
       |          inode->i_state = I_CLEAR  |         |
       |            |                       |         V
       |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
       |            |                       |              continue;           <==== NOT MATCH
       |            |                       |
       |            |                       | (DANGER from here on! Accessing disposing inode!)
       |            |                       |
       |            |                       |      __iget()
       |            |                       |        list_move() <===== PANIC on poisoned list !!
       V            V                       |
      (time)
      =====================================================================
      
      Reported-by: default avatarMasayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6fac63c
  8. 30 Mar, 2009 1 commit
  9. 12 Mar, 2009 1 commit
    • Nick Piggin's avatar
      fs: new inode i_state corruption fix · 7ef0d737
      Nick Piggin authored
      There was a report of a data corruption
      http://lkml.org/lkml/2008/11/14/121
      
      .  There is a script included to
      reproduce the problem.
      
      During testing, I encountered a number of strange things with ext3, so I
      tried ext2 to attempt to reduce complexity of the problem.  I found that
      fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
      cleared, even though instrumentation showed that unlock_new_inode had
      already been called for that inode.  This points to memory scribble, or
      synchronisation problme.
      
      i_state of I_NEW inodes is not protected by inode_lock because other
      processes are not supposed to touch them until I_LOCK (and I_NEW) is
      cleared.  Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
      i_state revealed that generic_sync_sb_inodes is picking up new inodes from
      the inode lists and passing them to __writeback_single_inode without
      waiting for I_NEW.  Subsequently modifying i_state causes corruption.  In
      my case it would look like this:
      
      CPU0                            CPU1
      unlock_new_inode()              __sync_single_inode()
       reg <- inode->i_state
       reg -> reg & ~(I_LOCK|I_NEW)   reg <- inode->i_state
       reg -> inode->i_state          reg -> reg | I_SYNC
                                      reg -> inode->i_state
      
      Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
      
      Fix for this is rather than wait for I_NEW inodes, just skip over them:
      inodes concurrently being created are not subject to data integrity
      operations, and should not significantly contribute to dirty memory
      either.
      
      After this change, I'm unable to reproduce any of the added warnings or
      hangs after ~1hour of running.  Previously, the new warnings would start
      immediately and hang would happen in under 5 minutes.
      
      I'm also testing on ext3 now, and so far no problems there either.  I
      don't know whether this fixes the problem reported above, but it fixes a
      real problem for me.
      
      Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
      Reported-by: default avatarAdrian Hunter <ext-adrian.hunter@nokia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ef0d737
  10. 06 Jan, 2009 3 commits
    • Nick Piggin's avatar
      fs: sys_sync fix · 856bf4d7
      Nick Piggin authored
      
      
      s_syncing livelock avoidance was breaking data integrity guarantee of
      sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
      if there is a concurrent sys_sync happening.
      
      This livelock avoidance is much less important now that we don't have the
      get_super_to_sync() call after every sb that we sync.  This was replaced
      by __put_super_and_need_restart.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      856bf4d7
    • Nick Piggin's avatar
      fs: sync_sb_inodes fix · 38f21977
      Nick Piggin authored
      
      
      Fix data integrity semantics required by sys_sync, by iterating over all
      inodes and waiting for any writeback pages after the initial writeout.
      Comments explain the exact problem.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38f21977
    • Nick Piggin's avatar
      fs: remove WB_SYNC_HOLD · 4f5a99d6
      Nick Piggin authored
      
      
      Remove WB_SYNC_HOLD.  The primary motiviation is the design of my
      anti-starvation code for fsync.  It requires taking an inode lock over the
      sync operation, so we could run into lock ordering problems with multiple
      inodes.  It is possible to take a single global lock to solve the ordering
      problem, but then that would prevent a future nice implementation of "sync
      multiple inodes" based on lock order via inode address.
      
      Seems like a backward step to remove this, but actually it is busted
      anyway: we can't use the inode lists for data integrity wait: an inode can
      be taken off the dirty lists but still be under writeback.  In order to
      satisfy data integrity semantics, we should wait for it to finish
      writeback, but if we only search the dirty lists, we'll miss it.
      
      It would be possible to have a "writeback" list, for sys_sync, I suppose.
      But why complicate things by prematurely optimise?  For unmounting, we
      could avoid the "livelock avoidance" code, which would be easier, but
      again premature IMO.
      
      Fixing the existing data integrity problem will come next.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f5a99d6
  11. 16 Oct, 2008 1 commit
  12. 14 Jul, 2008 2 commits
  13. 29 Apr, 2008 1 commit
  14. 19 Mar, 2008 1 commit
    • Randy Dunlap's avatar
      fs: fix kernel-doc notation warnings · a6b91919
      Randy Dunlap authored
      
      
      Fix kernel-doc notation warnings in fs/.
      
      Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
       *	mark_files_ro
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
       * lookup_one_len:  filesystem helper to lookup single pathname component
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
       * bh_uptodate_or_lock: Test whether the buffer is uptodate
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
       * bh_submit_read: Submit a locked buffer for reading
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
       * writeback_acquire: attempt to get exclusive writeback access to a device
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
       * writeback_in_progress: determine whether there is writeback in progress
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
       * writeback_release: relinquish exclusive writeback access against a device.
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
      Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
       * void journal_invalidatepage()
      
      Signed-off-by: default avatarRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6b91919
  15. 08 Feb, 2008 1 commit
  16. 06 Feb, 2008 1 commit
  17. 05 Feb, 2008 2 commits
    • Fengguang Wu's avatar
      writeback: speed up writeback of big dirty files · 8bc3be27
      Fengguang Wu authored
      
      
      After making dirty a 100M file, the normal behavior is to start the
      writeback for all data after 30s delays.  But sometimes the following
      happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analyze shows that the internal io dispatch queues goes like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      1) initial state with a 100M file and a 1K file
      
      2) 4M written, nr_to_write <= 0, so write more
      
      3) 1K written, nr_to_write > 0, no more writes(BUG)
      
      nr_to_write > 0 in (3) fools the upper layer to think that data have all
      been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, an let the loop go on: this
      might lead to live locks, and might also starve other superblocks in sync
      time(well kupdate may still starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag writeback_control.more_io to indicate that more io should
      be done.  With it the big dirty file no longer has to wait for the next
      kupdate invokation 5s later.
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids the interaction between two pdflush deamons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
      
      Tested-by: default avatarMike Snitzer <snitzer@gmail.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bc3be27
    • Qi Yong's avatar
      skip writing data pages when inode is under I_SYNC · 2d544564
      Qi Yong authored
      Since I_SYNC was split out from I_LOCK, the concern in commit
      4b89eed9
      
       ("Write back inode data pages
      even when the inode itself is locked") is not longer valid.
      
      We should revert to the original behavior: in __writeback_single_inode(),
      when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
      skip writing entirely.  Otherwise, we are double calling do_writepages()
      
      Signed-off-by: default avatarQi Yong <qiyong@fc-cn.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Joern Engel <joern@wohnheim.fh-wedel.de>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d544564
  18. 14 Jan, 2008 1 commit
  19. 19 Oct, 2007 1 commit
  20. 17 Oct, 2007 6 commits
    • Joern Engel's avatar
      introduce I_SYNC · 1c0eeaf5
      Joern Engel authored
      
      
      I_LOCK was used for several unrelated purposes, which caused deadlock
      situations in certain filesystems as a side effect.  One of the purposes
      now uses the new I_SYNC bit.
      
      Also document the various bits and change their order from historical to
      logical.
      
      [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
      Signed-off-by: default avatarJoern Engel <joern@wohnheim.fh-wedel.de>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Anton Altaparmakov <aia21@cam.ac.uk>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c0eeaf5
    • Fengguang Wu's avatar
      writeback: introduce writeback_control.more_io to indicate more io · 2e6883bd
      Fengguang Wu authored
      
      
      After making dirty a 100M file, the normal behavior is to start the writeback
      for all data after 30s delays.  But sometimes the following happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analyze shows that the internal io dispatch queues goes like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      
      1) initial state with a 100M file and a 1K file
      2) 4M written, nr_to_write <= 0, so write more
      3) 1K written, nr_to_write > 0, no more writes(BUG)
      
      nr_to_write > 0 in (3) fools the upper layer to think that data have all been
      written out.  The big dirty file is actually still sitting in s_more_io.  We
      cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
      let the loop in generic_sync_sb_inodes() continue: this may starve newly
      expired inodes in s_dirty.  It is also not an option to draw inodes from both
      s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
      and might also starve other superblocks in sync time(well kupdate may still
      starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
      not necessarily mean that "all data are written".  This patch introduces a
      flag writeback_control.more_io to indicate this situation.  With it the big
      dirty file no longer has to wait for the next kupdate invocation 5s later.
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2e6883bd
    • Fengguang Wu's avatar
      writeback: fix ntfs with sb_has_dirty_inodes() · 08d8e974
      Fengguang Wu authored
      
      
      NTFS's if-condition on dirty inodes is not complete.  Fix it with
      sb_has_dirty_inodes().
      
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      08d8e974
    • Fengguang Wu's avatar
      writeback: fix time ordering of the per superblock inode lists 8 · 2c136579
      Fengguang Wu authored
      
      
      Streamline the management of dirty inode lists and fix time ordering bugs.
      
      The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
      s_io, *only to* move them back.  The move-inodes-back-and-forth thing is a
      mess, which is eliminated by this patch.
      
      The new scheme is:
      - s_dirty acts as a time ordered io delaying queue;
      - s_io/s_more_io together acts as an io dispatching queue.
      
      On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
      every full scan of s_io.  Otherwise  (i.e. for sync/throttle/background
      writeback), we always pull from s_dirty on each run (a partial scan).
      
      Note that the line
      	list_splice_init(&sb->s_more_io, &sb->s_io);
      is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
      sit in s_io for a long time, preventing new expired inodes to get in.
      
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c136579
    • Ken Chen's avatar
      writeback: fix periodic superblock dirty inode flushing · 0e0f4fc2
      Ken Chen authored
      
      
      Current -mm tree has bucketful of bug fixes in periodic writeback path.
      However, we still hit a glitch where dirty pages on a given inode aren't
      completely flushed to the disk, and system will accumulate large amount of
      dirty pages beyond what dirty_expire_interval is designed for.
      
      The problem is __sync_single_inode() will move an inode to sb->s_dirty list
      even when there are more pending dirty pages on that inode.  If there is
      another inode with a small number of dirty pages, we hit a case where the loop
      iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
      Thus leaving the inode that has large amount of dirty pages behind and it has
      to wait for another dirty_writeback_interval before we flush it again.  We
      effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
      If the rate of dirtying is sufficiently high, the system will start
      accumulate a large number of dirty pages.
      
      So fix it by having another sb->s_more_io list on which to park the inode
      while we iterate through sb->s_io and to allow each dirty inode which resides
      on that sb to have an equal chance of flushing some amount of dirty pages.
      
      Signed-off-by: default avatarKen Chen <kenchen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e0f4fc2
    • Andrew Morton's avatar
      writeback: fix time ordering of the per superblock dirty inode lists 7 · 670e4def
      Andrew Morton authored
      
      
      This one fixes four bugs.
      
      There are a few situation in there where writeback decides it is going to skip
      over a blockdev inode on the kernel-internal blockdev superblock.  It
      presently does this by moving the blockdev inode onto the tail of the blockdev
      superblock's s_dirty.  But
      
      a) this screws up s_dirty's reverse-time-orderedness and
      
      b) refiling the blockdev for writeback in another 30 second is rude.  We
         should try again sooner than that.
      
      Fix all this up by using redirty_head(): move the blockdev inode onto the head
      of the blockdev superblock's s_dirty list for prompt writeback.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      670e4def