1. 15 Mar, 2016 1 commit
    • tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra authored
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      Which are all the result of the DEFINE_PER_CPU pattern:
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care that it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
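The null expansions can be reproduced outside ctags. The sketch below is a Python rendering of the first tags.sh pattern, with `\b` standing in for `\<` and `[A-Za-z0-9_]` for `[[:alnum:]_]`. It assumes the warnings come from declarations whose variable name is wrapped onto the following line, which would fit the 80-column discussion above; the two sample declarations are hypothetical.

```python
import re

# Python approximation of:  /\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/
DEFINE_PER_CPU_RE = re.compile(r"\bDEFINE_PER_CPU\([^,]*, *([A-Za-z0-9_]*)")


def per_cpu_tag_name(line):
    """Return the captured name, "" for a null expansion, or None if no match."""
    match = DEFINE_PER_CPU_RE.search(line)
    return match.group(1) if match else None


# Hypothetical declarations, not taken from the kernel tree.
one_line = "static DEFINE_PER_CPU(int, hypothetical_counter);"
wrapped = "static DEFINE_PER_CPU(struct hypothetical_type,"  # name on the next line

print(repr(per_cpu_tag_name(one_line)))
print(repr(per_cpu_tag_name(wrapped)))
```

Because ctags matches one line at a time, the wrapped form still matches but captures an empty name, which is the condition the warning reports.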
  2. 11 Mar, 2016 1 commit
  3. 02 Mar, 2016 1 commit
  4. 17 Feb, 2016 1 commit
  5. 10 Feb, 2016 1 commit
    • workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
      Tejun Heo authored
      When looking up the pool_workqueue to use for an unbound workqueue,
      workqueue assumes that the target CPU is always bound to a valid NUMA
      node.  However, currently, when a CPU goes offline, the mapping is
      destroyed and cpu_to_node() returns NUMA_NO_NODE.
      This has always been broken but wasn't triggered often enough before
      874bbfe6 ("workqueue: make sure delayed work run in local cpu").
      After that commit, workqueue forcibly assigns the local CPU to
      delayed work items without an explicit target CPU to fix a different
      issue.  This widens the window in which a CPU can go offline while a
      delayed work item is pending, causing delayed work items to be
      dispatched with the target CPU set to an already offlined CPU.  The
      resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
      item on a NULL pool_workqueue and thus crash.
      While 874bbfe6 has been reverted for a different reason, making the
      bug less visible again, it can still happen.  Fix it by mapping
      NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
      This is a temporary workaround.  The long term solution is keeping CPU
      -> NODE mapping stable across CPU off/online cycles which is being
      worked on.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
      Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
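The workaround described above amounts to one extra branch in the lookup. The following is a minimal user-space model, not the kernel function: the per-node table is a plain list, and the pool_workqueue objects are placeholder strings invented for illustration.

```python
NUMA_NO_NODE = -1  # sentinel for "CPU has no node mapping"


def unbound_pwq_by_node(numa_pwq_tbl, dfl_pwq, node):
    """Model of the lookup: fall back to the default pwq for NUMA_NO_NODE."""
    if node == NUMA_NO_NODE:
        # An offlined CPU lost its node mapping; use the default pwq
        # instead of indexing the per-node table with an invalid node.
        return dfl_pwq
    return numa_pwq_tbl[node]


# Placeholder objects standing in for pool_workqueues.
table = ["pwq-node0", "pwq-node1"]
print(unbound_pwq_by_node(table, "pwq-default", 1))
print(unbound_pwq_by_node(table, "pwq-default", NUMA_NO_NODE))
```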
  6. 09 Feb, 2016 3 commits
    • workqueue: implement "workqueue.debug_force_rr_cpu" debug feature · f303fccb
      Tejun Heo authored
      Workqueue used to guarantee local execution for work items queued
      without explicit target CPU.  The guarantee is gone now which can
      break some usages in subtle ways.  To flush out those cases, this
      patch implements a debug feature which forces round-robin CPU
      selection for all such work items.
      The debug feature defaults to off and can be enabled with a kernel
      parameter.  The default can be flipped with a debug config option.
      If you hit this commit during bisection, please refer to 041bd12e
      ("Revert "workqueue: make sure delayed work run in local cpu"") for
      more information and ping me.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs · ef557180
      Mike Galbraith authored
      WORK_CPU_UNBOUND work items queued to a bound workqueue always run
      locally.  This is a good thing normally, but not when the user has
      asked us to keep unbound work away from certain CPUs.  Round robin
      these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
      trumps performance.
      tj: Cosmetic and comment changes.  WARN_ON_ONCE() dropped from empty
          (wq_unbound_cpumask AND cpu_online_mask).  If we want that, it
          should be done when config changes.
      Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
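The selection policy described in this commit, together with the forced round-robin debug mode from f303fccb, can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class and method names are invented, the kernel keeps the last-used CPU per CPU while this model keeps a single cursor, and CPU masks are plain Python sets.

```python
class UnboundCpuSelector:
    """Toy model of CPU selection for WORK_CPU_UNBOUND work items."""

    def __init__(self, unbound_cpumask, force_rr=False):
        self.unbound = set(unbound_cpumask)  # models wq_unbound_cpumask
        self.force_rr = force_rr             # models the debug_force_rr_cpu switch
        self.last = -1                       # last CPU handed out by round robin

    def select(self, local_cpu, online_cpus):
        # Normal case: stay local when the local CPU is allowed.
        if not self.force_rr and local_cpu in self.unbound:
            return local_cpu
        candidates = sorted(self.unbound & set(online_cpus))
        if not candidates:
            # Empty (unbound AND online) mask: fall back to the local CPU.
            return local_cpu
        # Round robin: next allowed CPU after the last one, wrapping around.
        nxt = next((c for c in candidates if c > self.last), candidates[0])
        self.last = nxt
        return nxt


selector = UnboundCpuSelector(unbound_cpumask={2, 3})
print([selector.select(0, [0, 1, 2, 3]) for _ in range(3)])
```

With CPU 0 excluded from the mask, work queued from CPU 0 alternates between CPUs 2 and 3, while work queued from CPU 2 would still run locally.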
    • Revert "workqueue: make sure delayed work run in local cpu" · 041bd12e
      Tejun Heo authored
      This reverts commit 874bbfe6.
      Workqueue used to implicitly guarantee that work items queued without
      an explicit CPU specified are put on the local CPU.  Recent changes in
      timer broke the guarantee and led to vmstat breakage which was fixed
      by 176bed1d ("vmstat: explicitly schedule per-cpu work on the CPU
      we need it to run on").
      vmstat is the most likely to expose the issue and it's quite possible
      that there are other similar problems which are a lot more difficult
      to trigger.  As a preventive measure, 874bbfe6 ("workqueue: make
      sure delayed work run in local cpu") was applied to restore the local
      CPU guarantee.  Unfortunately, the change exposed a bug in timer code
      which got fixed by 22b886dd ("timers: Use proper base migration in
      add_timer_on()").  Due to code restructuring, the commit couldn't be
      backported beyond a certain point and stable kernels which only had
      874bbfe6 started crashing.
      The local CPU guarantee was accidental more than anything else and we
      want to get rid of it anyway.  As, with the vmstat case fixed,
      874bbfe6 is causing more problems than it's fixing, it has been
      decided to take the chance and officially break the guarantee by
      reverting the commit.  A debug feature will be added to force foreign
      CPU assignment to expose cases relying on the guarantee and fixes for
      the individual cases will be backported to stable as necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 874bbfe6 ("workqueue: make sure delayed work run in local cpu")
      Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
      Cc: stable@vger.kernel.org
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
  7. 29 Jan, 2016 1 commit
    • workqueue: skip flush dependency checks for legacy workqueues · 23d11a58
      Tejun Heo authored
      fca839c0 ("workqueue: warn if memory reclaim tries to flush
      !WQ_MEM_RECLAIM workqueue") implemented flush dependency warning which
      triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
      flush a !WQ_MEM_RECLAIM workqueue.
      This assumes that workqueues marked with WQ_MEM_RECLAIM sit in the
      memory reclaim path, and that making them depend on something which may
      need more memory to make forward progress can lead to deadlocks.
      Unfortunately, workqueues created with the legacy create*_workqueue()
      interface always have WQ_MEM_RECLAIM regardless of whether they are
      depended upon by memory reclaim or not.  These spurious WQ_MEM_RECLAIM
      markings cause spurious triggering of the flush dependency checks.
        WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
        workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
        Workqueue: deferwq deferred_probe_work_func
        [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
        [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
        [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
        [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
        [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
        [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
        [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
        [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
        [<c00f728c>] (migrate_prep) from [<c00bfbc4>] (alloc_contig_range+0xd8/0x338)
        [<c00bfbc4>] (alloc_contig_range) from [<c00f8f18>] (cma_alloc+0xe0/0x1ac)
        [<c00f8f18>] (cma_alloc) from [<c001cac4>] (__alloc_from_contiguous+0x38/0xd8)
        [<c001cac4>] (__alloc_from_contiguous) from [<c001ceb4>] (__dma_alloc+0x240/0x278)
        [<c001ceb4>] (__dma_alloc) from [<c001cf78>] (arm_dma_alloc+0x54/0x5c)
        [<c001cf78>] (arm_dma_alloc) from [<c0355ea4>] (dmam_alloc_coherent+0xc0/0xec)
        [<c0355ea4>] (dmam_alloc_coherent) from [<c039cc4c>] (ahci_port_start+0x150/0x1dc)
        [<c039cc4c>] (ahci_port_start) from [<c0384734>] (ata_host_start.part.3+0xc8/0x1c8)
        [<c0384734>] (ata_host_start.part.3) from [<c03898dc>] (ata_host_activate+0x50/0x148)
        [<c03898dc>] (ata_host_activate) from [<c039d558>] (ahci_host_activate+0x44/0x114)
        [<c039d558>] (ahci_host_activate) from [<c039f05c>] (ahci_platform_init_host+0x1d8/0x3c8)
        [<c039f05c>] (ahci_platform_init_host) from [<c039e6bc>] (tegra_ahci_probe+0x448/0x4e8)
        [<c039e6bc>] (tegra_ahci_probe) from [<c0347058>] (platform_drv_probe+0x50/0xac)
        [<c0347058>] (platform_drv_probe) from [<c03458cc>] (driver_probe_device+0x214/0x2c0)
        [<c03458cc>] (driver_probe_device) from [<c0343cc0>] (bus_for_each_drv+0x60/0x94)
        [<c0343cc0>] (bus_for_each_drv) from [<c03455d8>] (__device_attach+0xb0/0x114)
        [<c03455d8>] (__device_attach) from [<c0344ab8>] (bus_probe_device+0x84/0x8c)
        [<c0344ab8>] (bus_probe_device) from [<c0344f48>] (deferred_probe_work_func+0x68/0x98)
        [<c0344f48>] (deferred_probe_work_func) from [<c003b738>] (process_one_work+0x120/0x3f8)
        [<c003b738>] (process_one_work) from [<c003ba48>] (worker_thread+0x38/0x55c)
        [<c003ba48>] (worker_thread) from [<c0040f14>] (kthread+0xdc/0xf4)
        [<c0040f14>] (kthread) from [<c000f778>] (ret_from_fork+0x14/0x3c)
      Fix it by marking workqueues created via create*_workqueue() with
      __WQ_LEGACY and disabling flush dependency checks on them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Thierry Reding <thierry.reding@gmail.com>
      Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
      Fixes: fca839c0 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
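The rule after this fix can be written as a small predicate. The sketch below is a user-space model only: the function name and argument names are invented, the bit positions of the flags are illustrative, and `WQ_LEGACY` stands for the kernel's `__WQ_LEGACY`.

```python
WQ_MEM_RECLAIM = 1 << 3   # illustrative bit position
WQ_LEGACY = 1 << 18       # stands for __WQ_LEGACY; illustrative bit position


def flush_dependency_warns(target_flags, flusher_wq_flags=None, pf_memalloc=False):
    """Return True if flushing the target workqueue should trigger the warning.

    flusher_wq_flags is None when the flusher is an ordinary task rather than
    a work item running on some workqueue.
    """
    if target_flags & WQ_MEM_RECLAIM:
        return False  # flushing a reclaim-safe workqueue is always fine
    if pf_memalloc:
        return True   # a PF_MEMALLOC task flushing a !WQ_MEM_RECLAIM workqueue
    if flusher_wq_flags is None:
        return False
    # Warn only for genuine WQ_MEM_RECLAIM flushers; legacy workqueues carry
    # the flag unconditionally and are skipped.
    return (flusher_wq_flags & (WQ_MEM_RECLAIM | WQ_LEGACY)) == WQ_MEM_RECLAIM


print(flush_dependency_warns(0, WQ_MEM_RECLAIM))
print(flush_dependency_warns(0, WQ_MEM_RECLAIM | WQ_LEGACY))
```

The second call models the deferwq case from the trace above: a workqueue created through the legacy interface no longer trips the check.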
  8. 07 Jan, 2016 1 commit
  9. 08 Dec, 2015 2 commits
    • workqueue: implement lockup detector · 82607adc
      Tejun Heo authored
      Workqueue stalls can happen from a variety of usage bugs such as
      missing WQ_MEM_RECLAIM flag or concurrency managed work item
      indefinitely staying RUNNING.  These stalls can be extremely difficult
      to hunt down because the usual warning mechanisms can't detect
      workqueue stalls and the internal state is pretty opaque.
      To alleviate the situation, this patch implements a workqueue lockup
      detector.  It periodically monitors all worker_pools and, if any pool
      has failed to make forward progress for longer than the threshold
      duration, triggers a warning and dumps workqueue state as follows.
       BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
       Showing busy workqueues and worker pools:
       workqueue events: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
           pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
       workqueue events_power_efficient: flags=0x80
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           pending: check_lifetime, neigh_periodic_work
       workqueue cgroup_pidlist_destroy: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
           pending: cgroup_pidlist_destroy_work_fn
      The detection mechanism is controlled through the kernel parameter
      workqueue.watchdog_thresh and can be updated at runtime through the
      sysfs module parameter file.
      v2: Decoupled from softlockup control knobs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
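The periodic check can be pictured as a scan over pools that compares each pool's last progress time against the threshold. This is a toy model with invented field names and explicit timestamps in seconds; the 30-second default is an assumption chosen to match the "stuck for 31s" report above.

```python
def find_stalled_pools(pools, now, thresh=30):
    """Return (pool id, seconds stuck) for pools with pending work and no progress."""
    stalled = []
    for pool in pools:
        if not pool["pending"]:
            continue  # an idle pool cannot be stalled
        stuck_for = now - pool["last_progress"]
        if stuck_for > thresh:
            stalled.append((pool["id"], stuck_for))
    return stalled


# Hypothetical pool states.
pools = [
    {"id": 0, "pending": ["cache_reap", "vmstat_shepherd"], "last_progress": 0},
    {"id": 1, "pending": [], "last_progress": 0},
    {"id": 2, "pending": ["check_lifetime"], "last_progress": 20},
]
for pool_id, secs in find_stalled_pools(pools, now=31):
    print(f"BUG: workqueue lockup - pool {pool_id} stuck for {secs}s!")
```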
    • workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue · fca839c0
      Tejun Heo authored
      Task or work item involved in memory reclaim trying to flush a
      non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
      deadlock.  Trigger WARN_ONCE() if such conditions are detected.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
  10. 12 Oct, 2015 1 commit
  11. 30 Sep, 2015 1 commit
    • workqueue: make sure delayed work run in local cpu · 874bbfe6
      Shaohua Li authored
      My system keeps crashing with the message below. vmstat_update() schedules a
      delayed work item on the current cpu and expects the work to run on that cpu.
      schedule_delayed_work() is expected to make delayed work run on the local cpu.
      The problem is that the timer can be migrated with NO_HZ. __queue_work() queues
      the work in the timer handler, which could run on a different cpu from the one
      where the delayed work was scheduled. The end result is that the delayed work
      runs on a different cpu.
      The patch makes __queue_delayed_work() record the local cpu earlier. With the
      change, where the timer runs no longer affects where the work runs.
      [   28.010131] ------------[ cut here ]------------
      [   28.010609] kernel BUG at ../mm/vmstat.c:1392!
      [   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [   28.011860] Modules linked in:
      [   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W       4.3.0-rc3+ #634
      [   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [   28.014160] Workqueue: events vmstat_update
      [   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
      [   28.015445] RIP: 0010:[<ffffffff8115f921>]  [<ffffffff8115f921>] vmstat_update+0x31/0x80
      [   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
      [   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
      [   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
      [   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
      [   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
      [   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
      [   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
      [   28.020071] Stack:
      [   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
      [   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
      [   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
      [   28.020071] Call Trace:
      [   28.020071]  [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
      [   28.020071]  [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
      [   28.020071]  [<ffffffff8106c214>] worker_thread+0x114/0x460
      [   28.020071]  [<ffffffff8106c100>] ? process_one_work+0x540/0x540
      [   28.020071]  [<ffffffff81071bf8>] kthread+0xf8/0x110
      [   28.020071]  [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200
      [   28.020071]  [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
      [   28.020071]  [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v2.6.31+
  12. 12 Aug, 2015 1 commit
    • sched: Fix a race between __kthread_bind() and sched_setaffinity() · 25834c73
      Peter Zijlstra authored
      Because sched_setaffinity() checks p->flags & PF_NO_SETAFFINITY
      without locks, a caller might observe an old value and race with the
      set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
      it:
        __kthread_bind()
                                            sched_setaffinity()
                                              if (p->flags & PF_NO_SETAFFINITY)
          p->flags |= PF_NO_SETAFFINITY
      Fix the bug by putting everything under the regular scheduler locks.
      This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dedekind1@gmail.com
      Cc: juri.lelli@arm.com
      Cc: mgorman@suse.de
      Cc: riel@redhat.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 04 Aug, 2015 1 commit
  14. 22 Jul, 2015 1 commit
  15. 29 May, 2015 1 commit
  16. 27 May, 2015 1 commit
  17. 21 May, 2015 3 commits
  18. 19 May, 2015 2 commits
    • workqueue: ensure attrs changes are properly synchronized · d4d3e257
      Lai Jiangshan authored
      Current modification to attrs via sysfs is not fully synchronized.
      Process A (change cpumask)      | Process B (change numa affinity)
      wq_cpumask_store()              |
        wq_sysfs_prep_attrs()         |
                                      | apply_workqueue_attrs()
        apply_workqueue_attrs()       |
      As a result, Process B's operation is silently reverted without any
      notification, which is a bug.  So this patch moves
      wq_sysfs_prep_attrs() under the protection of wq_pool_mutex
      to ensure attrs changes are properly synchronized.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
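The lost update in the interleaving above can be replayed deterministically. The sketch below is a user-space model with invented helper names: attrs are a plain dict, `prep_attrs()` stands for taking a private copy of the current attrs, and a `threading.Lock` stands for wq_pool_mutex.

```python
import threading

wq_pool_mutex = threading.Lock()  # stand-in for the kernel mutex


def new_wq():
    return {"attrs": {"cpumask": {0, 1, 2, 3}, "no_numa": False}}


def prep_attrs(wq):
    """Copy the current attrs so the caller can modify one field."""
    return {"cpumask": set(wq["attrs"]["cpumask"]), "no_numa": wq["attrs"]["no_numa"]}


def racy_interleaving():
    """Replay the A/B interleaving with the copy taken outside the lock."""
    wq = new_wq()
    a_attrs = prep_attrs(wq)          # A: snapshot before B's update lands
    b_attrs = prep_attrs(wq)
    b_attrs["no_numa"] = True
    wq["attrs"] = b_attrs             # B: apply the numa affinity change
    a_attrs["cpumask"] = {0, 1}
    wq["attrs"] = a_attrs             # A: apply a stale copy, reverting B
    return wq["attrs"]


def store_locked(wq, key, value):
    """Copy, modify and apply under one lock, as the patch arranges."""
    with wq_pool_mutex:
        attrs = prep_attrs(wq)
        attrs[key] = value
        wq["attrs"] = attrs


def synchronized():
    wq = new_wq()
    store_locked(wq, "no_numa", True)
    store_locked(wq, "cpumask", {0, 1})
    return wq["attrs"]


print(racy_interleaving())
print(synchronized())
```

In the racy replay B's `no_numa` change is lost; once the copy is taken inside the locked region, both changes survive regardless of ordering.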
    • workqueue: separate out and refactor the locking of applying attrs · a0111cf6
      Lai Jiangshan authored
      Applying attrs requires two locks: get_online_cpus() and wq_pool_mutex,
      and this code is duplicated at two places (apply_workqueue_attrs() and
      workqueue_set_unbound_cpumask()).  So we separate out this locking
      code into apply_wqattrs_[un]lock() and do a minor refactor on
      The apply_wqattrs_[un]lock() helpers will also be used in a later patch to
      ensure attrs changes are properly synchronized.
      tj: minor updates to comments
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  19. 18 May, 2015 2 commits
    • workqueue: simplify wq_update_unbound_numa() · f7142ed4
      Lai Jiangshan authored
      wq_update_unbound_numa() is known to be called with wq_pool_mutex held.
      But wq_update_unbound_numa() acquires wq->mutex before reading
      wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq, even though these
      fields can now be read with only wq_pool_mutex held.  So we
      simply remove the mutex_lock(&wq->mutex).
      Without the dependence on mutex_lock(&wq->mutex), the test
      of wq->unbound_attrs->no_numa can also be moved upward.
      The old code needed a long comment to describe the stability of
      @wq->unbound_attrs, which is now also guaranteed by wq_pool_mutex,
      so we no longer need that comment.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: wq_pool_mutex protects the attrs-installation · 5b95e1af
      Lai Jiangshan authored
      Currently wq_pool_mutex doesn't protect the attrs-installation, so
      ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed
      under wq->mutex, which causes some inconvenience. For example,
      wq_update_unbound_numa() has to acquire wq->mutex before fetching
      wq->unbound_attrs->no_numa and the old_pwq.
      attrs-installation is a short operation, so this change will not cause any
      latency for other operations which also acquire wq_pool_mutex.
      The only unprotected attrs-installation code is in apply_workqueue_attrs(),
      so this patch touches less code than comments.
      It is also a preparation patch for the next several patches, which read
      wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq with
      only wq_pool_mutex held.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  20. 13 May, 2015 1 commit
  21. 11 May, 2015 1 commit
  22. 30 Apr, 2015 1 commit
    • workqueue: Allow modifying low level unbound workqueue cpumask · 042f7df1
      Lai Jiangshan authored
      Allow modifying the low-level unbound workqueue cpumask through
      sysfs. This is performed by traversing the entire workqueue list
      and calling apply_wqattrs_prepare() on the unbound workqueues
      with the new low level mask. Only after all the preparations are done
      do we commit them all together.
      Ordered workqueues are ignored by the low level unbound workqueue
      cpumask; they will be handled in the near future.
      All the (default & per-node) pwqs are mandatorily controlled by
      the low level cpumask. If the user configured cpumask doesn't overlap
      with the low level cpumask, the low level cpumask will be used for the
      wq instead.
      The comment of wq_calc_node_cpumask() is updated and explicitly
      requires that its first argument should be the attrs of the default
      pwq.
      The default wq_unbound_cpumask is cpu_possible_mask.  The workqueue
      subsystem doesn't know its best default value, so let the system manager
      or another subsystem set it when needed.
      Changed from V8:
        merge the code calculating the attrs of the default pwq together.
        minor changes to the code & comments for saving the user configured attrs.
        remove unnecessary list_del().
        minor update the comment of wq_calc_node_cpumask().
        update the comment of workqueue_set_unbound_cpumask();
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
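The masking rule described in this commit, intersection with the low level cpumask and a fallback to the low level cpumask when the two do not overlap, fits in a few lines. This is an illustrative model using Python sets for CPU masks; the function name is invented.

```python
def effective_cpumask(user_cpumask, low_level_cpumask):
    """Restrict the user-configured mask by the low level unbound cpumask."""
    effective = set(user_cpumask) & set(low_level_cpumask)
    # No overlap: the low level cpumask is used for the workqueue instead.
    return effective if effective else set(low_level_cpumask)


print(sorted(effective_cpumask({0, 1, 2, 3}, {2, 3, 4})))  # overlapping masks
print(sorted(effective_cpumask({0, 1}, {4, 5})))           # disjoint masks
```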
  23. 27 Apr, 2015 2 commits
    • workqueue: Create low-level unbound workqueues cpumask · b05a7928
      Frederic Weisbecker authored
      Create a cpumask that limits the affinity of all unbound workqueues.
      This cpumask is controlled through a file at the root of the workqueue
      sysfs directory.
      It works at a lower level than the per-workqueue WQ_SYSFS cpumask files
      such that the effective cpumask applied for a given unbound workqueue is
      the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
      the new /sys/devices/virtual/workqueue/cpumask file.
      This patch implements the basic infrastructure and the read interface.
      wq_unbound_cpumask is initially set to cpu_possible_mask.
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: split apply_workqueue_attrs() into 3 stages · 2d5f0764
      Lai Jiangshan authored
      The current apply_workqueue_attrs() includes pwqs-allocation and pwqs-installation,
      so when we batch multiple apply_workqueue_attrs() calls as a transaction, we can't
      ensure that the transaction succeeds or fails as a complete unit.
      To solve this, we split apply_workqueue_attrs() into three stages.
      The first stage does the preparation: allocating memory and pwqs.
      The second stage does the attrs-installation and pwqs-installation.
      The third stage frees the allocated memory and (old or unused) pwqs.
      As a result, batching multiple apply_workqueue_attrs() calls can
      succeed or fail as a complete unit:
        1) batch do all the first stage for all the workqueues
        2) only commit all when all the above succeed.
      This patch is a preparation for the next patch ("Allow modifying low level
      unbound workqueue cpumask") which will do multiple apply_workqueue_attrs() calls.
      The patch doesn't change functionality except for two minor adjustments:
        1) free_unbound_pwq() for the error path is removed; we use the
           heavier version put_pwq_unlocked() instead since the error path
           is rare. This adjustment simplifies the code.
        2) the memory-allocation is also moved under wq_pool_mutex.
           This is needed to avoid further splitting.
      tj: minor updates to comments.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
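The three stages form a prepare/commit/cleanup transaction. The sketch below models that shape in user space; it is not the kernel implementation. Workqueues are plain dicts, and `can_prepare` is a hypothetical hook used only to simulate an allocation failure in the first stage.

```python
def apply_attrs_batch(workqueues, new_attrs, can_prepare=lambda wq: True):
    """Apply new_attrs to every workqueue, or to none of them."""
    contexts = []

    # Stage 1: prepare a context for every workqueue; nothing is installed yet.
    for wq in workqueues:
        if not can_prepare(wq):
            contexts.clear()  # stage 3 on the error path: drop what was prepared
            return False
        contexts.append({"wq": wq, "attrs": dict(new_attrs)})

    # Stage 2: commit. This point is reached only if every preparation succeeded.
    for ctx in contexts:
        ctx["wq"]["attrs"] = ctx["attrs"]

    # Stage 3: release the contexts (in the kernel, old or unused pwqs).
    contexts.clear()
    return True


wqs = [{"name": "wq_a", "attrs": {"nice": 0}}, {"name": "wq_b", "attrs": {"nice": 0}}]
ok = apply_attrs_batch(wqs, {"nice": -5}, can_prepare=lambda wq: wq["name"] != "wq_b")
print(ok, [wq["attrs"] for wq in wqs])
```

Because installation is deferred until all preparations have succeeded, a failure for the second workqueue leaves the first one untouched.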
  24. 06 Apr, 2015 1 commit
    • workqueue: Reorder sysfs code · 6ba94429
      Frederic Weisbecker authored
      The sysfs code usually belongs at the bottom of the file since it deals
      with high level objects. In the workqueue code it's misplaced, such
      that we'd need to work around function references to allow the sysfs
      code to call APIs like apply_workqueue_attrs().
      Let's move that block further down in the file, almost to the bottom,
      and declare workqueue_sysfs_unregister() just before destroy_workqueue(),
      which references it.
      tj: Moved workqueue_sysfs_unregister() forward declaration where other
          forward declarations are.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  25. 09 Mar, 2015 3 commits
    • workqueue: dump workqueues on sysrq-t · 3494fc30
      Tejun Heo authored
      Workqueues are used extensively throughout the kernel but sometimes
      it's difficult to debug stalls involving work items because visibility
      into its inner workings is fairly limited.  Although sysrq-t task dump
      annotates each active worker task with the information on the work
      item being executed, it is challenging to find out which work items
      are pending or delayed on which queues and how pools are being
      managed.
      This patch implements show_workqueue_state() which dumps all busy
      workqueues and pools and is called from the sysrq-t handler.  At the
      end of sysrq-t dump, something like the following is printed.
       Showing busy workqueues and worker pools:
       workqueue filler_wq: flags=0x0
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 491:filler_workfn, 507:filler_workfn
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 501:filler_workfn
           pending: filler_workfn
       workqueue test_wq: flags=0x8
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
           in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
           delayed: test_workfn1 BAR(492), test_workfn2
       pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
       pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
       pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
       pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
      The above shows that test_wq is executing test_workfn() on pid 510
      which is the rescuer and also that there are two tasks 69 and 500
      waiting for the work item to finish in flush_work().  As test_wq has
      max_active of 1, there are two work items for test_workfn1() and
      test_workfn2() which are delayed till the current work item is
      finished.  In addition, pid 492 is flushing test_workfn1().
      The work item for test_workfn() is being executed on pwq of pool 2
      which is the normal priority per-cpu pool for CPU 1.  The pool has
      three workers, two of which are executing filler_workfn() for
      filler_wq and the last one is assuming the manager role trying to
      create more workers.
      This extra workqueue state dump will hopefully help chase down hangs
      involving workqueues.
      v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
      v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
          printk()'s replaced with pr_info()'s, and cpumask printing now
          uses cpulist_pr_cont().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
    • Tejun Heo's avatar
      workqueue: keep track of the flushing task and pool manager · 2607d7a6
      Tejun Heo authored
      Add wq_barrier->task and worker_pool->manager to keep track of the
      flushing task and pool manager respectively.  These are purely
      informational and will be used to implement sysrq dump of workqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      workqueue: make the workqueues list RCU walkable · e2dca7ad
      Tejun Heo authored
      The workqueues list is protected by wq_pool_mutex and a workqueue and
      its subordinate data structures are freed directly on destruction.  We
      want to add the ability to dump workqueues from a sysrq callback which
      requires walking all workqueues without grabbing wq_pool_mutex.  This
      patch makes freeing of workqueues RCU protected and makes the
      workqueues list walkable while holding RCU read lock.
      Note that pool_workqueues and pools are already sched-RCU protected.
      For consistency, workqueues are also protected with sched-RCU.
      While at it, reverse the workqueues list so that a workqueue which is
      created earlier comes first.  The order of the list isn't significant
      functionally but this makes the planned sysrq dump list system
      workqueues first.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  26. 05 Mar, 2015 1 commit
    • Tejun Heo's avatar
      workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · 8603e1b3
      Tejun Heo authored
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      try_to_grab_pending() can always grab PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
      this case, __cancel_work_timer() currently invokes flush_work().  The
      assumption is that the completion of the work item is what the other
      canceling task would be waiting for too and thus waiting for the same
      condition and retrying should allow forward progress without excessive
      busy looping.
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
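      task A			task B			worker
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}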
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
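      The difference between retrying in a loop and sleeping until
      CANCELING clears can be sketched in userspace with pthreads.  This
      is only an analogy, not the kernel code: struct work_model,
      grab_canceling() and release_canceling() are hypothetical names,
      and a condition variable stands in for the custom wake function
      and exclusive wait mentioned in the v3 note.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a work item's CANCELING state. */
struct work_model {
	pthread_mutex_t lock;
	pthread_cond_t cancel_done;	/* signalled when CANCELING clears */
	bool canceling;
};

#define WORK_MODEL_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/*
 * Become the canceling task.  If another task already holds CANCELING,
 * sleep until it is cleared instead of spinning on a retry loop, so the
 * current owner can always run and finish.
 */
static void grab_canceling(struct work_model *w)
{
	pthread_mutex_lock(&w->lock);
	while (w->canceling)
		pthread_cond_wait(&w->cancel_done, &w->lock);
	w->canceling = true;
	pthread_mutex_unlock(&w->lock);
}

/* Clear CANCELING and wake one waiter (an exclusive wakeup analogue). */
static void release_canceling(struct work_model *w)
{
	pthread_mutex_lock(&w->lock);
	assert(w->canceling);	/* only the current owner may release */
	w->canceling = false;
	pthread_cond_signal(&w->cancel_done);
	pthread_mutex_unlock(&w->lock);
}
```

      In this model task B blocks inside grab_canceling() while task A
      owns the state, so B cannot starve A even when B has a higher
      priority or preemption is unavailable.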
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to custom wake function which matches the target
          work item and exclusive wait and wakeup.
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Cc: stable@vger.kernel.org
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
  27. 13 Feb, 2015 1 commit
  28. 16 Jan, 2015 1 commit
    • Tejun Heo's avatar
      workqueue: fix subtle pool management issue which can stall whole worker_pool · 29187a9e
      Tejun Heo authored
      A worker_pool's forward progress is guaranteed by the fact that the
      last idle worker assumes the manager role to create more workers and
      summon the rescuers if creating workers doesn't succeed in a timely
      manner before proceeding to execute work items.
      This manager role is implemented in manage_workers(), which indicates
      whether the worker may proceed to work item execution with its return
      value.  This is necessary because multiple workers may contend for the
      manager role, and, if there already is a manager, others should
      proceed to work item execution.
      Unfortunately, the function also indicates that the worker may proceed
      to work item execution if need_to_create_worker() is false at the head
      of the function.  need_to_create_worker() tests the following
      conditions:
      	pending work items && !nr_running && !nr_idle
      The first and third conditions are protected by pool->lock and thus
      won't change while holding pool->lock; however, nr_running can change
      asynchronously as other workers block and resume and while it's likely
      to be zero, as someone woke this worker up in the first place, some
      other workers could have become runnable in between, making it non-zero.
      If this happens, manage_workers() could return false even with zero
      nr_idle making the worker, the last idle one, proceed to execute work
      items.  If then all workers of the pool end up blocking on a resource
      which can only be released by a work item which is pending on that
      pool, the whole pool can deadlock as there's no one to create more
      workers or summon the rescuers.
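      The race is easier to see with the predicate written out.  The
      following is a minimal userspace model, not the kernel
      implementation: struct pool_model and
      need_to_create_worker_model() are hypothetical names that only
      mirror the three conditions quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three pool counters named in the text. */
struct pool_model {
	int nr_pending;	/* pending work items, stable under pool->lock */
	int nr_running;	/* running workers, may change asynchronously */
	int nr_idle;	/* idle workers, stable under pool->lock */
};

/* Models: pending work items && !nr_running && !nr_idle */
static bool need_to_create_worker_model(const struct pool_model *p)
{
	assert(p->nr_pending >= 0 && p->nr_running >= 0 && p->nr_idle >= 0);
	return p->nr_pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}
```

      With nr_pending = 1 and nr_idle = 0, the model returns true while
      nr_running is 0 and false as soon as nr_running becomes 1.  That
      flip is the window in which the last idle worker was allowed to
      skip the manager role.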
      This patch fixes the problem by removing the early exit condition from
      maybe_create_worker() and making manage_workers() return false iff
      there's already another manager, which ensures that the last worker
      doesn't start executing work items.
      We could leave the early exit condition alone and just ignore the
      return value, but the only reason it was put there is that
      manage_workers() used to perform both creations and destructions of
      workers and thus the function could be invoked while the pool was
      trying to reduce the number of workers.  Now that manage_workers() is
      called only when more workers are needed, the early exit condition
      can trigger only in rare races, which renders it pointless.
      Tested with a simulated workload and modified workqueue code which
      triggers the pool deadlock reliably without this patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
  29. 08 Dec, 2014 2 commits
    • NeilBrown's avatar
      workqueue: allow rescuer thread to do more work. · 008847f6
      NeilBrown authored
      When there is serious memory pressure, all workers in a pool could be
      blocked, and a new thread cannot be created because it requires memory
      allocation.
      In this situation a WQ_MEM_RECLAIM workqueue will wake up the
      rescuer thread to do some work.
      The rescuer will only handle requests that are already on ->worklist.
      If max_requests is 1, that means it will handle a single request.
      The rescuer will be woken again in 100ms to handle another max_requests
      requests.
      I've seen a machine (running a 3.0 based "enterprise" kernel) with
      thousands of requests queued for xfslogd, which has a max_requests of
      1, and is needed for retiring all 'xfs' write requests.  When one of
      the worker pools gets into this state, it progresses extremely slowly
      and possibly never recovers (only waited an hour or two).
      With this patch we leave a pool_workqueue on mayday list
      until it is clearly no longer in need of assistance.  This allows
      all requests to be handled in a timely fashion.
      We keep each pool_workqueue on the mayday list until
      need_to_create_worker() is false, and no work for this workqueue is
      found in the pool.
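      A rough way to see the effect is to count how many mayday rounds
      are needed to drain a backlog.  The function below is a toy model
      under strong assumptions, not the rescuer code: regular workers
      are assumed to stay blocked for the whole time, and
      rounds_to_drain() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the mayday rounds (roughly 100ms apart, per the text) that a
 * rescuer needs to drain nr_queued requests when it handles at most
 * max_requests per pass.
 *
 * keep_on_mayday_list == false: one pass per round (old behaviour).
 * keep_on_mayday_list == true:  keep making passes within the same
 *                               round until nothing is left (new).
 */
static long rounds_to_drain(long nr_queued, long max_requests,
			    bool keep_on_mayday_list)
{
	long rounds = 0;

	assert(nr_queued >= 0 && max_requests > 0);

	while (nr_queued > 0) {
		rounds++;
		do {
			long batch = nr_queued < max_requests ?
				     nr_queued : max_requests;
			nr_queued -= batch;
		} while (keep_on_mayday_list && nr_queued > 0);
	}
	return rounds;
}
```

      For 1000 queued requests and max_requests of 1, the model needs
      1000 rounds with the old behaviour and a single round with the
      new one, which matches the "thousands of requests" scenario
      described above in spirit only.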
      I have tested this in combination with a (hackish) patch which forces
      all work items to be handled by the rescuer thread.  In that context
      it significantly improves performance.  A similar patch for a 3.0
      kernel significantly improved performance on a heavy work load.
      Thanks to Jan Kara for some design ideas, and to Dongsu Park for
      some comments and testing.
      tj: Inverted the lock order between wq_mayday_lock and pool->lock with
          a preceding patch and simplified this patch.  Added comment and
          updated changelog accordingly.  Dongsu spotted missing get_pwq()
          in the simplified code.
      Cc: Dongsu Park <dongsu.park@profitbricks.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      workqueue: invert the order between pool->lock and wq_mayday_lock · b2d82909
      Tejun Heo authored
      Currently, pool->lock nests inside wq_mayday_lock.  There's no inherent
      reason for this order.  The only place where the two locks are held
      together is pool_mayday_timeout() and it just got decided that way.
      This nesting order turns out to complicate things with the planned
      rescuer_thread() update.  Let's invert them.  This doesn't cause any
      behavior differences.
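      The nesting order after this change can be sketched in userspace
      as follows.  This is an illustration with pthread mutexes rather
      than kernel spinlocks; struct mayday_pool_model, nr_maydays_model
      and pool_mayday_timeout_model() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the global wq_mayday_lock and the state it protects. */
static pthread_mutex_t wq_mayday_lock_model = PTHREAD_MUTEX_INITIALIZER;
static int nr_maydays_model;

/* Stand-in for a worker pool with its own lock. */
struct mayday_pool_model {
	pthread_mutex_t lock;
	bool need_rescue;	/* protected by lock */
};

/* New order: pool lock outside, mayday lock nested inside it. */
static void pool_mayday_timeout_model(struct mayday_pool_model *pool)
{
	assert(pool != NULL);

	pthread_mutex_lock(&pool->lock);
	if (pool->need_rescue) {
		pthread_mutex_lock(&wq_mayday_lock_model);
		nr_maydays_model++;	/* "send mayday" */
		pthread_mutex_unlock(&wq_mayday_lock_model);
	}
	pthread_mutex_unlock(&pool->lock);
}
```

      Any other path that needs both locks has to take them in the same
      order, otherwise two tasks could deadlock against each other.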
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Dongsu Park <dongsu.park@profitbricks.com>