1. 12 Oct, 2015 4 commits
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Tejun Heo authored
      For memcg domains, the amount of available memory was calculated as
      
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
      
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
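
      The difference between the two formulas can be sketched in plain C.
      This is a minimal illustration, not the kernel code: avail_old(),
      avail_new(), min_ul() and their parameters are hypothetical names,
      and all quantities are assumed to be page counts.

```c
/* Smaller of two page counts. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Old behaviour: the sum of in-use memory and headroom is capped by
 * the total amount of clean memory.
 */
static unsigned long avail_old(unsigned long in_use, unsigned long headroom,
			       unsigned long total_clean)
{
	return min_ul(in_use + headroom, total_clean);
}

/*
 * Fixed behaviour: only the headroom is capped, and the cap is the
 * clean memory that is not already assigned to this memcg.
 */
static unsigned long avail_new(unsigned long in_use, unsigned long headroom,
			       unsigned long clean_elsewhere)
{
	return in_use + min_ul(headroom, clean_elsewhere);
}
```

      With 1000 pages in use, 500 pages of headroom and 800 clean pages in
      total, avail_old() yields 800, which is below the amount already in
      use.  avail_new() can never drop below the in-use figure.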
      
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions · d60d1bdd
      Tejun Heo authored
      MDTC_INIT() is used to initialize dirty_throttle_control for memcg
      domains.  It used DTC_INIT_COMMON() to initialize mdtc->wb and
      ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
      latter to wb->completions instead of wb->memcg_completions.  This can
      lead to wildly incorrect results when calculating the proportion of
      dirty memory the memcg domain should get.
      
      Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
      mdtc->wb_completions to wb->memcg_completions.
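
      The shape of the fix can be sketched with designated-initializer
      macros.  The structures below are simplified stand-ins invented for
      this illustration (struct completions, struct wb and struct dtc are
      not the kernel definitions), and the macros carry only the fields
      needed to show which completions counter each domain points at.

```c
/* Simplified stand-ins; the real kernel structures carry many more fields. */
struct completions {
	unsigned long events;
};

struct wb {
	struct completions completions;		/* measured in the global domain */
	struct completions memcg_completions;	/* measured in the memcg domain */
};

struct dtc {
	struct wb *wb;
	struct completions *wb_completions;
	struct dtc *gdtc;	/* global counterpart, unset for the global dtc */
};

/* Global domain: use the wb's global completions counter. */
#define GDTC_INIT(__wb) \
	.wb = (__wb), .wb_completions = &(__wb)->completions

/* memcg domain: must use memcg_completions, not completions. */
#define MDTC_INIT(__wb, __gdtc) \
	.wb = (__wb), .wb_completions = &(__wb)->memcg_completions, \
	.gdtc = (__gdtc)
```

      A shared helper that always assigns &wb->completions would give both
      domains the same counter, which is the bug described above.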
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo authored
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds an RCU-protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration · 9ad18ab9
      Tejun Heo authored
      laptop_mode_timer_fn() was using bdi_for_each_wb() without the
      required RCU locking leading to the following warning.
      
       WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
       ...
       Call Trace:
        <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
        [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
        [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
        [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
        [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
        [<ffffffff81056854>] __do_softirq+0xd4/0x460
        [<ffffffff81056d69>] irq_exit+0x89/0xa0
        [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
       ...
      
      Fix it by adding rcu_read_lock() around the iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a06fd6b1 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 18 Aug, 2015 1 commit
    • writeback: update writeback tracepoints to report cgroup · 5634cc2a
      Tejun Heo authored
      The following tracepoints are updated to report the cgroup used during
      cgroup writeback.
      
      * writeback_write_inode[_start]
      * writeback_queue
      * writeback_exec
      * writeback_start
      * writeback_written
      * writeback_wait
      * writeback_nowork
      * writeback_wake_background
      * wbc_writepage
      * writeback_queue_io
      * bdi_dirty_ratelimit
      * balance_dirty_pages
      * writeback_sb_inodes_requeue
      * writeback_single_inode[_start]
      
      Note that writeback_bdi_register is separated out from writeback_class
      as reporting cgroup doesn't make sense to it.  Tracepoints which take
      bdi are updated to take bdi_writeback instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 06 Aug, 2015 1 commit
  4. 02 Jun, 2015 32 commits
    • writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo authored
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch builds the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a given inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat update is wrapped in a transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement memcg writeback domain based throttling · c2aa723a
      Tejun Heo authored
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      
      Processes belonging to a memcg may have access to only a subset of total
      memory available in the system and not factoring this into dirty
      throttling rendered it completely ineffective for processes under
      memcg limits and memcg ended up building a separate ad-hoc degenerate
      mechanism directly into vmscan code to limit page dirtying.
      
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
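
      The "either domain" rule can be sketched as follows.  This is only
      an illustration of the domain-level comparison described above:
      struct domain_state and over_bg_thresh() are hypothetical names, and
      the kernel function may perform additional checks that are not
      modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-domain snapshot, in pages. */
struct domain_state {
	unsigned long dirty;		/* dirty pages in the domain */
	unsigned long bg_thresh;	/* background writeback threshold */
};

/*
 * Returns true while background writeback should keep going, i.e. while
 * dirty exceeds bg_thresh in the global domain or in the memcg domain.
 * @memcg may be NULL when there is no separate memcg domain.
 */
static bool over_bg_thresh(const struct domain_state *global,
			   const struct domain_state *memcg)
{
	if (global->dirty > global->bg_thresh)
		return true;
	if (memcg && memcg->dirty > memcg->bg_thresh)
		return true;
	return false;
}
```

      A memcg that is over its own background threshold keeps writeback
      running even when the system as a whole is under the global one.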
      
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement memcg wb_domain · 841710aa
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: update wb_over_bg_thresh() to use wb_domain aware operations · 947e9762
      Tejun Heo authored
      wb_over_bg_thresh() currently uses global_dirty_limits() and
      wb_dirty_limit() both of which are wrappers around operations which
      take dirty_throttle_control.  For cgroup writeback support, the
      function will be updated to also consider memcg wb_domains which
      requires the context information carried in dirty_throttle_control.
      
      This patch updates wb_over_bg_thresh() so that it uses the underlying
      wb_domain aware operations directly and builds the global
      dirty_throttle_control in the process.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move over_bground_thresh() to mm/page-writeback.c · aa661bbe
      Tejun Heo authored
      and rename it to wb_over_bg_thresh().  The function is closely tied to
      the dirty throttling mechanism implemented in page-writeback.c.  This
      relocation will allow future updates necessary for cgroup writeback
      support.
      
      While at it, add function comment.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: separate out domain_dirty_limits() · 9fc3a43e
      Tejun Heo authored
      global_dirty_limits() calculates thresh and bg_thresh (confusingly
      called *pdirty and *pbackground in the function) assuming
      global_wb_domain; however, cgroup writeback support requires
      considering per-memcg wb_domain too.
      
      This patch separates out domain_dirty_limits() which takes
      dirty_throttle_control out of global_dirty_limits().  As thresh and
      bg_thresh calculation needs the amount of dirtyable memory in the
      domain, dirty_throttle_control->avail is added.  The new function
      calculates the two thresholds and stores them directly in the
      dirty_throttle_control.
      
      Also, memcg domains can't follow vm_dirty_bytes and
      dirty_background_bytes settings directly.  If those are set and
      domain_dirty_limits() is invoked for a !global domain, the settings
      are translated to ratios by scaling them against globally available
      memory.  dirty_throttle_control->gdtc is added to enable this when
      CONFIG_CGROUP_WRITEBACK.
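
      The bytes-to-ratio translation can be sketched like this.  It is an
      illustration under stated assumptions rather than the kernel code:
      domain_thresh(), div_round_up() and SKETCH_PAGE_SIZE are
      hypothetical, ratios are assumed to be whole percentages, the page
      size is assumed to be 4096 bytes, and global_avail is assumed to be
      nonzero.

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Integer division rounding up. */
static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Compute a dirty threshold in pages for one domain.
 *
 * @dirty_bytes:  byte-based setting, 0 if unset
 * @ratio_pct:    percentage-based setting, used when dirty_bytes is 0
 * @domain_avail: dirtyable pages in this domain
 * @global_avail: dirtyable pages in the global domain (must be nonzero)
 * @is_global:    nonzero when computing for the global domain
 */
static unsigned long domain_thresh(unsigned long dirty_bytes,
				   unsigned long ratio_pct,
				   unsigned long domain_avail,
				   unsigned long global_avail,
				   int is_global)
{
	if (dirty_bytes && !is_global) {
		/*
		 * A non-global domain can't follow the byte setting
		 * directly: scale it against globally available memory
		 * to obtain a ratio, then fall through to the ratio path.
		 */
		unsigned long pages = div_round_up(dirty_bytes, SKETCH_PAGE_SIZE);

		ratio_pct = pages * 100 / global_avail;
		if (ratio_pct > 100)
			ratio_pct = 100;
		dirty_bytes = 0;
	}

	if (dirty_bytes)
		return div_round_up(dirty_bytes, SKETCH_PAGE_SIZE);
	return ratio_pct * domain_avail / 100;
}
```

      For instance, a byte setting worth 1000 pages on a system with
      10000 dirtyable pages becomes a 10% ratio, so a memcg domain with
      2000 dirtyable pages gets a threshold of 200 pages.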
      
      global_dirty_limits() is now a thin wrapper around
      domain_dirty_limits() and balance_dirty_pages() is updated to use the
      new function too.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domain as a parameter · c7981433
      Tejun Heo authored
      Currently __wb_writeout_inc() and hard_dirty_limit() assume
      global_wb_domain; however, cgroup writeback support requires
      considering per-memcg wb_domain too.
      
      This patch separates out domain-specific part of __wb_writeout_inc()
      into wb_domain_writeout_inc() which takes wb_domain as a parameter and
      adds the parameter to hard_dirty_limit().  This will allow these two
      functions to handle per-memcg wb_domains.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: add dirty_throttle_control->dom · e9f07dfd
      Tejun Heo authored
      Currently all dirty throttle operations use global_wb_domain; however,
      cgroup writeback support requires considering per-memcg wb_domain too.
      This patch adds dirty_throttle_control->dom and updates functions
      which are directly using global_wb_domain to use it instead.
      
      As this makes global_update_bandwidth() a misnomer, the function is
      renamed to domain_update_bandwidth().
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: add dirty_throttle_control->wb_completions · e9770b34
      Tejun Heo authored
      wb->completions measures the wb's proportional write bandwidth in
      global_wb_domain and is thus naturally tied to the wb_domain.  This patch
      adds dirty_throttle_control->wb_completions which is initialized to
      wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use
      it instead of dereferencing wb->completions directly.
      
      This will allow dirty_throttle_control to represent different
      wb_domains and the matching wb completions.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: add dirty_throttle_control->pos_ratio · daddfa3c
      Tejun Heo authored
      wb_position_ratio() is used to calculate pos_ratio, which is used for
      two purposes.  wb_update_dirty_ratelimit() uses it to adjust
      wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
      immediately adjust dirty_ratelimit right before applying it to
      determine pause duration.
      
      While wb_update_dirty_ratelimit() is separately rate limited from
      balance_dirty_pages(), on the run where the ratelimit is updated, we
      end up calculating pos_ratio twice with the same parameters.
      
      This patch adds dirty_throttle_control->pos_ratio.
      balance_dirty_pages() calculates it once per run and
      wb_update_dirty_ratelimit() uses the value stored in
      dirty_throttle_control.
      
      This removes the duplicate calculation and also will help implementing
      memcg wb_domain.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make __wb_calc_thresh() take dirty_throttle_control · b1cbc6d4
      Tejun Heo authored
      wb_calc_thresh() calculates wb_thresh by scaling thresh according to
      the wb's portion in the system-wide write bandwidth.  cgroup writeback
      support would need to calculate wb_thresh against memcg domain too.
      This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it
      take dirty_throttle_control so that the function can later be updated
      to calculate against different domains according to
      dirty_throttle_control.
      
      wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh().
      
      v2: The original version was incorrectly scaling dtc->dirty instead of
          dtc->thresh.  This was due to the extremely confusing function and
          variable names.  Added a rename patch and fixed this one.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: add dirty_throttle_control->wb_bg_thresh · 970fb01a
      Tejun Heo authored
      wb_bg_thresh is currently treated as a second-class citizen.  It's
      only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
      doesn't calculate it unless the cap is set.  When the cap is set, the
      calculated value is not passed around but instead recalculated
      whenever it's used.
      
      wb_position_ratio() calculates it by scaling wb_thresh proportional to
      bg_thresh / thresh.  wb_update_dirty_ratelimit() uses wb_dirty_limit()
      on bg_thresh, which should generally lead to a similar result as the
      proportional scaling but can also be way off in the presence of
      max/min_ratio settings.
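
      The proportional scaling used by wb_position_ratio() amounts to one
      64-bit multiplication and one division.  A minimal sketch, assuming
      all values are page counts; calc_wb_bg_thresh() is a hypothetical
      helper name, and the zero check on thresh is an assumption added to
      keep the sketch safe to call.

```c
#include <stdint.h>

/*
 * Scale wb_thresh by bg_thresh / thresh.  The product is computed in
 * 64 bits so that it cannot overflow a 32-bit unsigned long.
 */
static unsigned long calc_wb_bg_thresh(unsigned long wb_thresh,
					unsigned long bg_thresh,
					unsigned long thresh)
{
	if (!thresh)
		return 0;
	return (unsigned long)((uint64_t)wb_thresh * bg_thresh / thresh);
}
```

      With wb_thresh = 400, bg_thresh = 500 and thresh = 1000, the wb's
      background threshold comes out at 200 pages.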
      
      Avoiding wb_bg_thresh calculation saves us one u64 multiplication and
      division when BDI_CAP_STRICTLIMIT is not set.  Given that
      balance_dirty_pages() is already ratelimited, this doesn't justify the
      incurred extra complexity.
      
      This patch adds wb_bg_thresh to dirty_throttle_control and makes
      wb_dirty_limits() always calculate it and updates the users to use the
      pre-calculated value.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: consolidate dirty throttle parameters into dirty_throttle_control · 2bc00aef
      Tejun Heo authored
      Dirty throttling implemented in balance_dirty_pages() and its
      subroutines makes use of a number of parameters which are passed
      around individually.  This renders these functions somewhat unwieldy
      and makes it difficult to add or change the involved parameters.  Also
      some functions use different or conflicting naming schemes for the
      same parameters making the code confusing to follow.
      
      This patch consolidates the main parameters into struct
      dirty_throttle_control so that they can be passed around easily and
      adding new parameters isn't painful.  This also unifies how a given
      parameter is named and accessed.  The drawback of using this type of
      control structure rather than explicit parameters is that it isn't
      immediately obvious which function accesses and modifies what;
      however, it's fairly clear that the benefits outweigh the drawback in this case.
      
      GDTC_INIT() macro is provided to ease initializing
      dirty_throttle_control for the global_wb_domain and
      balance_dirty_pages() uses a separate pointer to point to its global
      dirty_throttle_control.  This is to make it uniform with memcg domain
      handling which will be added later.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move global_dirty_limit into wb_domain · dcc25ae7
      Tejun Heo authored
      This patch is a part of the series to define wb_domain which
      represents a domain that wb's (bdi_writeback's) belong to and are
      measured against each other in.  This will enable IO backpressure
      propagation for cgroup writeback.
      
      global_dirty_limit exists to regulate the global dirty threshold which
      is a property of the wb_domain.  This patch moves hard_dirty_limit,
      dirty_lock, and update_time into wb_domain.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement wb_domain · 380c27ca
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      Currently, what constitutes the global writeback domain is scattered
      across a number of global states.  This patch starts collecting them
      into struct wb_domain.
      
      * fprop_global which serves as the basis for proportional bandwidth
        measurement and its period timer are moved into struct wb_domain.
      
      * global_wb_domain hosts the states for the global domain.
      
      * While at it, flatten wb_writeout_fraction() into its callers.  This
        thin wrapper doesn't provide any actual benefits while getting in
        the way.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: reorganize [__]wb_update_bandwidth() · 8a731799
      Tejun Heo authored
      __wb_update_bandwidth() is called from two places -
      mm/page-writeback.c::balance_dirty_pages() and
      fs/fs-writeback.c::wb_writeback().  The latter updates only the
      write bandwidth while the former also deals with the dirty ratelimit.
      The two callsites are distinguished by whether @thresh parameter is
      zero or not, which is cryptic.  In addition, the two files define
      their own different versions of wb_update_bandwidth() on top of
      __wb_update_bandwidth(), which is confusing to say the least.  This
      patch cleans up [__]wb_update_bandwidth() in the following ways.
      
      * __wb_update_bandwidth() now takes explicit @update_ratelimit
        parameter to gate dirty ratelimit handling.
      
      * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
        caller - balance_dirty_pages().
      
      * fs/fs-writeback.c::wb_update_bandwidth() is moved to
        mm/page-writeback.c and __wb_update_bandwidth() is made static.
      
      * While at it, add a lockdep assertion to __wb_update_bandwidth().
      
      Except for the lockdep addition, this is pure reorganization and
      doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: clean up wb_dirty_limit() · 0d960a38
      Tejun Heo authored
      The function name wb_dirty_limit(), its argument @dirty and the local
      variable @wb_dirty are mortally confusing given that the function
      calculates per-wb threshold value not dirty pages, especially given
      that @dirty and @wb_dirty are used elsewhere for dirty pages.
      
      Let's rename the function to wb_calc_thresh() and wb_dirty to
      wb_thresh.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info · 9ecf4866
      Tejun Heo authored
      bdi_start_background_writeback() currently takes @bdi and kicks the
      root wb (bdi_writeback).  In preparation for cgroup writeback support,
      make it take wb instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info · bc05873d
      Tejun Heo authored
      writeback_in_progress() currently takes @bdi and returns whether
      writeback is in progress on its root wb (bdi_writeback).  In
      preparation for cgroup writeback support, make it take wb instead.
      While at it, make it an inline function.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's · a06fd6b1
      Tejun Heo authored
      For cgroup writeback support, all bdi-wide operations should be
      distributed to all its wb's (bdi_writeback's).
      
      This patch updates laptop_mode_timer_fn() so that it invokes
      wb_start_writeback() on all wb's rather than just the root one.  As
      the intent is writing out all dirty data, there's no reason to split
      the number of pages to write.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: remove bdi_start_writeback() · c00ddad3
      Tejun Heo authored
      bdi_start_writeback() is a thin wrapper on top of
      __wb_start_writeback() which is used only by laptop_mode_timer_fn().
      This patch removes bdi_start_writeback(), renames
      __wb_start_writeback() to wb_start_writeback() and makes
      laptop_mode_timer_fn() use it instead.
      
      This doesn't cause any functional difference and will ease making
      laptop_mode_timer_fn() cgroup writeback aware.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make bdi->min/max_ratio handling cgroup writeback aware · 693108a8
      Tejun Heo authored
      bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
      dirty limit of each bdi.  For cgroup writeback, they need to be
      further distributed across wb's (bdi_writeback's) belonging to the
      configured bdi.
      
      This patch introduces wb_min_max_ratio() which distributes
      bdi->min/max_ratio according to a wb's proportion in the total active
      bandwidth of its bdi.
      
      v2: Update wb_min_max_ratio() to fix a bug where both min and max were
          assigned the min value and avoid calculations when possible.
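
      The proportional split described above can be modelled in userspace as
      follows.  This is only a sketch: struct wb_share_input and
      wb_min_max_ratio_sketch() are hypothetical names, and skipping the
      scaling when max_ratio is already 100 is an assumed reading of "avoid
      calculations when possible", not a statement about the actual patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant bdi/wb fields; not kernel structs. */
struct wb_share_input {
	unsigned long this_bw;   /* this wb's average write bandwidth */
	unsigned long tot_bw;    /* sum over all active wb's of the bdi */
	unsigned int min_ratio;  /* bdi-wide minimum, in percent */
	unsigned int max_ratio;  /* bdi-wide maximum, in percent */
};

/*
 * Scale the bdi-wide min/max ratios by this wb's share of the total
 * active bandwidth.  64-bit intermediates avoid overflow in the multiply.
 */
static void wb_min_max_ratio_sketch(const struct wb_share_input *in,
				    unsigned long *minp, unsigned long *maxp)
{
	uint64_t min = in->min_ratio;
	uint64_t max = in->max_ratio;

	/* Scale only when this wb is a strict fraction of the total. */
	if (in->this_bw < in->tot_bw) {
		if (min)                /* 0 stays 0, nothing to compute */
			min = min * in->this_bw / in->tot_bw;
		if (max < 100)          /* assumption: 100% means "no cap" */
			max = max * in->this_bw / in->tot_bw;
	}

	*minp = (unsigned long)min;
	*maxp = (unsigned long)max;
}
```

      With min_ratio 20, max_ratio 80 and a wb contributing 25 out of 100
      units of bandwidth, the sketch yields a min of 5 and a max of 20.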
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Tejun Heo authored
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of the dirty state transition of each wb, the number of dirty wbs can
      be counted in the bdi; however, bdi is already aggregating
      wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
      there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
      dip below 1.  bdi_has_dirty_io() can simply test whether
      bdi->tot_write_bandwidth is zero or not.
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
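
      The idea of testing the aggregate bandwidth instead of counting dirty
      wb's can be illustrated with a small userspace model.  All names below
      (bdi_model, wb_model and the helpers) are hypothetical, C11 atomics
      stand in for atomic_long_t, and the clamp to 1 happens at init time
      here purely to keep the sketch short.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace models; not the kernel structures. */
struct bdi_model {
	atomic_long tot_write_bandwidth;  /* sum over wb's with dirty inodes */
};

struct wb_model {
	struct bdi_model *bdi;
	long avg_write_bandwidth;         /* kept >= 1 at all times */
	bool has_dirty_io;
};

static void wb_model_init(struct wb_model *wb, struct bdi_model *bdi, long bw)
{
	wb->bdi = bdi;
	wb->avg_write_bandwidth = bw < 1 ? 1 : bw;  /* never dips below 1 */
	wb->has_dirty_io = false;
}

/* clean -> dirty transition: add this wb's bandwidth to the bdi total */
static void wb_model_set_dirty(struct wb_model *wb)
{
	if (!wb->has_dirty_io) {
		wb->has_dirty_io = true;
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
	}
}

/* dirty -> clean transition: remove it again */
static void wb_model_clear_dirty(struct wb_model *wb)
{
	if (wb->has_dirty_io) {
		wb->has_dirty_io = false;
		atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
	}
}

/* Any dirty wb contributes at least 1, so a non-zero total means dirty. */
static bool bdi_model_has_dirty_io(struct bdi_model *bdi)
{
	return atomic_load(&bdi->tot_write_bandwidth) != 0;
}
```

      Because every wb contributes at least 1 while it has dirty inodes, a
      wb with an otherwise zero bandwidth estimate still makes the total
      non-zero.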
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement backing_dev_info->tot_write_bandwidth · 766a9d6e
      Tejun Heo authored
      cgroup writeback support needs to keep track of the sum of
      avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
      distribute write workload.  This patch adds bdi->tot_write_bandwidth
      and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
      and wb_update_write_bandwidth() to adjust it as wb's gain and lose
      dirty inodes and its avg_write_bandwidth gets updated.
      
      As the update events are not synchronized with each other,
      bdi->tot_write_bandwidth is an atomic_long_t.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback · dfb8ae56
      Tejun Heo authored
      Currently, balance_dirty_pages() always works on bdi->wb.  This patch
      updates it to work on the wb (bdi_writeback) matching memcg and blkcg
      of the current task as that's what the inode is being dirtied against.
      
      balance_dirty_pages_ratelimited() now pins the current wb and passes
      it to balance_dirty_pages().
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: attribute stats to the matching per-cgroup bdi_writeback · 91018134
      Tejun Heo authored
      Until now, all WB_* stats were accounted against the root wb
      (bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
      place, let's attribute the stats to the respective per-cgroup wb's.
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo authored
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: s/bdi/wb/ in mm/page-writeback.c · de1fff37
      Tejun Heo authored
      Writeback operations will now be per wb (bdi_writeback) instead of
      bdi.  Replace the relevant bdi references in symbol names and comments
      with wb.  This patch is purely cosmetic and doesn't make any
      functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Tejun Heo authored
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Tejun Heo authored
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
        frees stat.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen authored
      When modifying PG_dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo authored
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support require synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that's gonna be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
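
      The shape of the revived helper, a test-and-clear of the dirty bit
      followed by accounting only if the page was dirty, can be sketched in
      userspace as below.  page_model, nr_dirty_model and the *_model
      functions are hypothetical stand-ins; the real helpers operate on
      struct page and the kernel's per-zone and per-bdi counters.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for struct page and the dirty counter. */
struct page_model {
	bool dirty;
};

static long nr_dirty_model;  /* models a "number of dirty pages" statistic */

/* Models TestClearPageDirty(): clear the bit, report whether it was set. */
static bool test_clear_page_dirty_model(struct page_model *page)
{
	bool was_dirty = page->dirty;

	page->dirty = false;
	return was_dirty;
}

/* Models account_page_cleaned(): undo the accounting for one dirty page. */
static void account_page_cleaned_model(struct page_model *page)
{
	(void)page;
	nr_dirty_model--;
}

/* The restricted wrapper: account only if the page was actually dirty. */
static void cancel_dirty_page_model(struct page_model *page)
{
	if (test_clear_page_dirty_model(page))
		account_page_cleaned_model(page);
}
```

      Calling the wrapper twice on the same page adjusts the counter only
      once, which is the property that keeps the bit and the statistic in
      step.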
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 23 Apr, 2015 1 commit
    • writeback: use |1 instead of +1 to protect against div by zero · 464d1387
      Tejun Heo authored
      mm/page-writeback.c has several places where 1 is added to the divisor
      to prevent division by zero exceptions; however, if the original
      divisor is equivalent to -1, adding 1 leads to division by zero.
      
      There are three places where +1 is used for this purpose - one in
      pos_ratio_polynom() and two in bdi_position_ratio().  The second one
      in bdi_position_ratio() actually triggered div-by-zero oops on a
      machine running a 3.10 kernel.  The divisor is
      
        x_intercept - bdi_setpoint + 1 == span + 1
      
      span is confirmed to be (u32)-1.  It isn't clear how it ended up that way
      but it could be from write bandwidth calculation underflow fixed by
      c72efb65 ("writeback: fix possible underflow in write bandwidth
      calculation").
      
      At any rate, +1 isn't a proper protection against div-by-zero.  This
      patch converts all +1 protections to |1.  Note that
      bdi_update_dirty_ratelimit() was already using |1 before this patch.
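
      The difference between the two guards is easy to see with 32-bit
      unsigned arithmetic.  The helpers below are hypothetical illustrations,
      not code from mm/page-writeback.c: adding 1 wraps (u32)-1 around to 0,
      while OR-ing in 1 sets the low bit and so can never produce 0.

```c
#include <stdint.h>

/* "+1" guard: wraps to 0 when span == UINT32_MAX, i.e. (u32)-1. */
static uint32_t guard_plus_one(uint32_t span)
{
	return span + 1;
}

/* "|1" guard: the low bit is always set, so the result is never 0. */
static uint32_t guard_or_one(uint32_t span)
{
	return span | 1;
}

/* Divide using the "|1" guard; safe for every possible span value. */
static uint32_t safe_div(uint32_t numerator, uint32_t span)
{
	return numerator / guard_or_one(span);
}
```

      The cost of |1 is that even divisors are bumped to the next odd value
      while odd ones are left alone, which is harmless where the divisor
      only needs to be approximately right.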
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 15 Apr, 2015 1 commit