1. 25 Oct, 2016 5 commits
    • Charlie Jacobsen's avatar
      Basic lcd module create, run, and destroy. · e0193fa4
      Charlie Jacobsen authored
      This code is ugly, but it's working.
      
      Tested with basic module, and appears to be working
      properly. I will soon incorporate the patched
      modprobe into the kernel tree, and then this code
      will be usable by everyone.
      
      The ipc code is still unimplemented. The only
      hypercall handled is yield. Also note that other
      exit conditions (e.g. external interrupt) have not
      been fully tested.
      
      Overview:
      -- kernel code calls lcd_create_as_module with
         the module's name
      -- lcd_create_as_module loads the module using
         request_lcd_module (request_lcd_module calls
         the patched modprobe to load the module, and
         the patched modprobe calls back into the lcd
         driver via the ioctrl interface to load the
         module)
      -- lcd_create_as_module then finds the loaded
         module, spawns a kernel thread and passes off
         the module to it
      -- the kernel thread initializes the lcd and
         maps the module inside it, then suspends itself
      -- lcd_run_as_module wakes up the kernel thread
         and tells it to run
      -- lcd_delete_as_module stops the kernel thread
         and deletes the module from the host kernel
      
      File-by-file details:
      
      arch/x86/include/asm/lcd-domains-arch.h
      arch/x86/lcd-domains/lcd-domains-arch-tests.c
      arch/x86/lcd-domains/lcd-domains-arch.c
      -- lcd was not running in 64-bit mode, and my
         checks had one subtle bug
      -- fixed %cr3 load to properly load vmcs first
      -- fixed set program counter to use guest virtual
         rather than guest physical address
      
      include/linux/sched.h
      -- added struct lcd to task_struct
      
      include/linux/init_task.h
      -- lcd pointer set to null when task_struct is
         initialized
      
      include/linux/module.h
      kernel/module.c
      -- made init_module and delete_module system calls
         callable from kernel code
      -- available in module.h via do_sys_init_module and
         do_sys_delete_module
      -- simply moved the majority of the guts of the
         system calls into a non-system call, exported
         routine
      -- take an extra flag, for_lcd; when set, the init
         code skips over running (and deallocating) the
         module's init code, and the delete code skips
         over running the module exit
      -- system calls from user code set for_lcd = 0; this
         ensures existing code still works
      
      include/linux/kmod.h
      kernel/kmod.c
      kernel/sysctl.c
      -- changed __request_module to __do_request_module; takes
         one extra argument, for_lcd
      -- __request_module   ==>  __do_request_module with for_lcd = 0
      -- request_lcd_module ==>  __do_request_module with for_lcd = 1
      -- call_modprobe conditionally uses lcd_modprobe_path, the path
         to a patched modprobe accessible via sysfs
      
      include/lcd-domains/lcd-domains.h
      -- added lcd status enum; see source code doc
      -- three routines for creating/running/destroying
         lcd's that use modules; see source code doc
      
      include/uapi/linux/lcd-domains.h
      -- added interface defns for patched modprobe to call into
         lcd driver for module init; lcd driver loads
         module (via slightly refactored module.c code) on behalf
         of modprobe
      
      virt/lcd-domains/lcd-domains.c
      -- implementation of routines for modules inside lcd's
      -- implementation of module init / delete for lcd's
         (uses patched module.c code)
      
      virt/lcd-domains/Kconfig
      virt/lcd-domains/Makefile
      virt/lcd-domains/lcd-module-load-test.c
      virt/lcd-domains/lcd-tests.c
      -- added test module for lcd module code
      -- test runs automatically when lcd module is inserted
      e0193fa4
    • Charles Jacobsen's avatar
      Fixed build system for lcd, and most compiler errors. · 7c05c7a0
      Charles Jacobsen authored
      Two components to the lcd build now:
      -- arch/x86/lcd/Makefile: for building x86 lcd code
      -- virt/lcd/Makefile: for building arch-indep lcd code
      
      Modified the build system just slightly for a cleaner
      build:
      -- virt/ directory treated like ipc/, usr/, etc. directories
      -- added Kconfig and Makefile to virt/, mirroring drivers/
      -- updated top-level Makefile to include virt/ as vmlinux
         directory / dependency, so build system will recur into
         virt/
      -- updated arch/x86/Kconfig to include virt/Kconfig, so it
         will be included as a menu item
      -- updated arch/x86/Kbuild to include arch/x86/lcd/
      
      Removed old capabilities code in cap/.
      
      Removed lcd syscall.
      
      Temporarily turned off build for drivers/lcd.
      
      Fixed most bugs in lcd-vmx (still need to do lcd_vmx_exit).
      -- minor naming issues in lcd-vmx.h
      -- straight copy over of vmx_disable_intercepts_for_msr,
         but with more doc
      -- removed VMX_EPT_INDIVIDUAL_ADDR macro from vmx.h (where
         did this come from? it's not documented in the intel manual,
         nor is it used in kvm)
      7c05c7a0
    • Anton Burtsev's avatar
      ioctl interface to LCD · 552bd976
      Anton Burtsev authored
      552bd976
    • Anton Burtsev's avatar
      Move things around · 0ec257ef
      Anton Burtsev authored
      Not ideal. Some problems:
      
        - The arch dependent and independent parts are not completely
          separated.
      
        - The module builds in arch/x86/lcd and this is a bit strange
      
        - I can't build with LCD buildin, but build works if I build it
          as a module
      
        - There is still a bunch of old warrnings from Weibin's code
      0ec257ef
    • Sarah Spall's avatar
  2. 16 Oct, 2016 1 commit
    • John Stultz's avatar
      timekeeping: Fix __ktime_get_fast_ns() regression · 4e584cfe
      John Stultz authored
      commit 58bfea9532552d422bde7afa207e1a0f08dffa7d upstream.
      
      In commit 27727df2 ("Avoid taking lock in NMI path with
      CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
      the timekeeping_get_ns() function, but I forgot to include
      the unit conversion from cycles to nanoseconds, breaking the
      function's output, which impacts users like perf.
      
      This results in bogus perf timestamps like:
       swapper     0 [000]   253.427536:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426573:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426687:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426800:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426905:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427022:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427127:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427239:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427346:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427463:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   255.426572:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Instead of more reasonable expected timestamps like:
       swapper     0 [000]    39.953768:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.064839:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.175956:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.287103:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.398217:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.509324:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.620437:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.731546:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.842654:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.953772:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    41.064881:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Add the proper use of timekeeping_delta_to_ns() to convert
      the cycle delta to nanoseconds as needed.
      
      Thanks to Brendan and Alexei for finding this quickly after
      the v4.8 release. Unfortunately the problematic commit has
      landed in some -stable trees so they'll need this fix as
      well.
      
      Many apologies for this mistake. I'll be looking to add a
      perf-clock sanity test to the kselftest timers tests soon.
      
      Fixes: 27727df2 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
      Reported-by: default avatarBrendan Gregg <bgregg@netflix.com>
      Reported-by: default avatarAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Tested-and-reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e584cfe
  3. 25 Sep, 2016 2 commits
  4. 23 Sep, 2016 1 commit
    • Tejun Heo's avatar
      cgroup: fix invalid controller enable rejections with cgroup namespace · 9157056d
      Tejun Heo authored
      On the v2 hierarchy, "cgroup.subtree_control" rejects controller
      enables if the cgroup has processes in it.  The enforcement of this
      logic assumes that the cgroup wouldn't have any css_sets associated
      with it if there are no tasks in the cgroup, which is no longer true
      since a79a908f ("cgroup: introduce cgroup namespaces").
      
      When a cgroup namespace is created, it pins the css_set of the
      creating task to use it as the root css_set of the namespace.  This
      extra reference stays as long as the namespace is around and makes
      "cgroup.subtree_control" think that the namespace root cgroup is not
      empty even when it is and thus reject controller enables.
      
      Fix it by making cgroup_subtree_control() walk and test emptiness of
      each css_set instead of testing whether the list_head is empty.
      
      While at it, update the comment of cgroup_task_count() to indicate
      that the returned value may be higher than the number of tasks, which
      has always been true due to temporary references and doesn't break
      anything.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarEvgeny Vereshchagin <evvers@ya.ru>
      Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Cc: Aditya Kali <adityakali@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
      9157056d
  5. 22 Sep, 2016 1 commit
  6. 19 Sep, 2016 2 commits
  7. 16 Sep, 2016 1 commit
  8. 13 Sep, 2016 1 commit
    • Joonwoo Park's avatar
      cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Joonwoo Park authored
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      cpu_online_mask.
      
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
      
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
      
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
      
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
      
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         <snip>
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      cpu_online_mask.
      Signed-off-by: default avatarJoonwoo Park <joonwoop@codeaurora.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      28b89b9e
  9. 10 Sep, 2016 2 commits
    • Alexander Shishkin's avatar
      perf/core: Fix aux_mmap_count vs aux_refcount order · b79ccadd
      Alexander Shishkin authored
      The order of accesses to ring buffer's aux_mmap_count and aux_refcount
      has to be preserved across the users, namely perf_mmap_close() and
      perf_aux_output_begin(), otherwise the inversion can result in the latter
      holding the last reference to the aux buffer and subsequently free'ing
      it in atomic context, triggering a warning.
      
      > ------------[ cut here ]------------
      > WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
      > CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
      > Call Trace:
      >  [<ffffffff810f3e0b>] __warn+0xcb/0xf0
      >  [<ffffffff810f3f3d>] warn_slowpath_null+0x1d/0x20
      >  [<ffffffff8121182a>] __rb_free_aux+0x11a/0x130
      >  [<ffffffff812127a8>] rb_free_aux+0x18/0x20
      >  [<ffffffff81212913>] perf_aux_output_begin+0x163/0x1e0
      >  [<ffffffff8100c33a>] bts_event_start+0x3a/0xd0
      >  [<ffffffff8100c42d>] bts_event_add+0x5d/0x80
      >  [<ffffffff81203646>] event_sched_in.isra.104+0xf6/0x2f0
      >  [<ffffffff8120652e>] group_sched_in+0x6e/0x190
      >  [<ffffffff8120694e>] ctx_sched_in+0x2fe/0x5f0
      >  [<ffffffff81206ca0>] perf_event_sched_in+0x60/0x80
      >  [<ffffffff81206d1b>] ctx_resched+0x5b/0x90
      >  [<ffffffff81207281>] __perf_event_enable+0x1e1/0x240
      >  [<ffffffff81200639>] event_function+0xa9/0x180
      >  [<ffffffff81202000>] ? perf_cgroup_attach+0x70/0x70
      >  [<ffffffff8120203f>] remote_function+0x3f/0x50
      >  [<ffffffff811971f3>] flush_smp_call_function_queue+0x83/0x150
      >  [<ffffffff81197bd3>] generic_smp_call_function_single_interrupt+0x13/0x60
      >  [<ffffffff810a6477>] smp_call_function_single_interrupt+0x27/0x40
      >  [<ffffffff81a26ea9>] call_function_single_interrupt+0x89/0x90
      >  [<ffffffff81120056>] finish_task_switch+0xa6/0x210
      >  [<ffffffff81120017>] ? finish_task_switch+0x67/0x210
      >  [<ffffffff81a1e83d>] __schedule+0x3dd/0xb50
      >  [<ffffffff81a1efe5>] schedule+0x35/0x80
      >  [<ffffffff81128031>] sys_sched_yield+0x61/0x70
      >  [<ffffffff81a25be5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      > ---[ end trace 6235f556f5ea83a9 ]---
      
      This patch puts the checks in perf_aux_output_begin() in the same order
      as that of perf_mmap_close().
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b79ccadd
    • Alexander Shishkin's avatar
      perf/core: Fix a race between mmap_close() and set_output() of AUX events · 767ae086
      Alexander Shishkin authored
      In the mmap_close() path we need to stop all the AUX events that are
      writing data to the AUX area that we are unmapping, before we can
      safely free the pages. To determine if an event needs to be stopped,
      we're comparing its ->rb against the one that's getting unmapped.
      However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
      and swizzle event::rb to some other ring buffer, but the transaction
      will keep writing data to the old ring buffer until the event gets
      scheduled out. At this point, mmap_close() will skip over such an
      event and will proceed to free the AUX area, while it's still being
      used by this event, which will set off a warning in the mmap_close()
      path and cause a memory corruption.
      
      To avoid this, always stop an AUX event before its ->rb is updated;
      this will release the (potentially) last reference on the AUX area
      of the buffer. If the event gets restarted, its new ring buffer will
      be used. If another SET_OUTPUT comes and switches it back to the
      old ring buffer that's getting unmapped, it's also fine: this
      ring buffer's aux_mmap_count will be zero and AUX transactions won't
      start any more.
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      767ae086
  10. 09 Sep, 2016 1 commit
    • Dan Williams's avatar
      mm: fix cache mode of dax pmd mappings · 9049771f
      Dan Williams authored
      track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
      uncacheable rendering them impractical for application usage.  DAX-pte
      mappings are cached and the goal of establishing DAX-pmd mappings is to
      attain more performance, not dramatically less (3 orders of magnitude).
      
      track_pfn_insert() relies on a previous call to reserve_memtype() to
      establish the expected page_cache_mode for the range.  While memremap()
      arranges for reserve_memtype() to be called, devm_memremap_pages() does
      not.  So, teach track_pfn_insert() and untrack_pfn() how to handle
      tracking without a vma, and arrange for devm_memremap_pages() to
      establish the write-back-cache reservation in the memtype tree.
      
      Cc: <stable@vger.kernel.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reported-by: default avatarKai Zhang <kai.ka.zhang@oracle.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      9049771f
  11. 05 Sep, 2016 3 commits
    • Tejun Heo's avatar
      PM / QoS: avoid calling cancel_delayed_work_sync() during early boot · c86d06ba
      Tejun Heo authored
      of_clk_init() ends up calling into pm_qos_update_request() very early
      during boot where irq is expected to stay disabled.
      pm_qos_update_request() uses cancel_delayed_work_sync() which
      correctly assumes that irq is enabled on invocation and
      unconditionally disables and re-enables it.
      
      Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
      enabling irq unexpectedly during early boot.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarQiao Zhou <qiaozhou@asrmicro.com>
      Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.comSigned-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c86d06ba
    • Balbir Singh's avatar
      sched/core: Fix a race between try_to_wake_up() and a woken up task · 135e8c92
      Balbir Singh authored
      The origin of the issue I've seen is related to
      a missing memory barrier between check for task->state and
      the check for task->on_rq.
      
      The task being woken up is already awake from a schedule()
      and is doing the following:
      
      	do {
      		schedule()
      		set_current_state(TASK_(UN)INTERRUPTIBLE);
      	} while (!cond);
      
      The waker, actually gets stuck doing the following in
      try_to_wake_up():
      
      	while (p->on_cpu)
      		cpu_relax();
      
      Analysis:
      
      The instance I've seen involves the following race:
      
       CPU1					CPU2
      
       while () {
         if (cond)
           break;
         do {
           schedule();
           set_current_state(TASK_UN..)
         } while (!cond);
      					wakeup_routine()
      					  spin_lock_irqsave(wait_lock)
         raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
       }					  try_to_wake_up()
       set_current_state(TASK_RUNNING);	  ..
       list_del(&waiter.list);
      
      CPU2 wakes up CPU1, but before it can get the wait_lock and set
      current state to TASK_RUNNING the following occurs:
      
       CPU3
       wakeup_routine()
       raw_spin_lock_irqsave(wait_lock)
       if (!list_empty)
         wake_up_process()
         try_to_wake_up()
         raw_spin_lock_irqsave(p->pi_lock)
         ..
         if (p->on_rq && ttwu_wakeup())
         ..
         while (p->on_cpu)
           cpu_relax()
         ..
      
      CPU3 tries to wake up the task on CPU1 again since it finds
      it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
      after CPU2, CPU3 got it.
      
      CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
      the task is spinning on the wait_lock. Interestingly since p->on_rq
      is checked under pi_lock, I've noticed that try_to_wake_up() finds
      p->on_rq to be 0. This was the most confusing bit of the analysis,
      but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
      check is not reliable without this fix IMHO. The race is visible
      (based on the analysis) only when ttwu_queue() does a remote wakeup
      via ttwu_queue_remote. In which case the p->on_rq change is not
      done uder the pi_lock.
      
      The result is that after a while the entire system locks up on
      the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
      
      Reproduction of the issue:
      
      The issue can be reproduced after a long run on my system with 80
      threads and having to tweak available memory to very low and running
      memory stress-ng mmapfork test. It usually takes a long time to
      reproduce. I am trying to work on a test case that can reproduce
      the issue faster, but thats work in progress. I am still testing the
      changes on my still in a loop and the tests seem OK thus far.
      
      Big thanks to Benjamin and Nick for helping debug this as well.
      Ben helped catch the missing barrier, Nick caught every missing
      bit in my theory.
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      [ Updated comment to clarify matching barriers. Many
        architectures do not have a full barrier in switch_to()
        so that cannot be relied upon. ]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      135e8c92
    • Peter Zijlstra's avatar
      perf/core: Remove WARN from perf_event_read() · 58763148
      Peter Zijlstra authored
      This effectively reverts commit:
      
        71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      
      ... and puts in a comment explaining why we ignore the return value.
      Reported-by: default avatarVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      58763148
  12. 02 Sep, 2016 1 commit
    • Wanpeng Li's avatar
      tick/nohz: Fix softlockup on scheduler stalls in kvm guest · 08d07259
      Wanpeng Li authored
      tick_nohz_start_idle() is prevented to be called if the idle tick can't 
      be stopped since commit 1f3b0f82 ("tick/nohz: Optimize nohz idle 
      enter"). As a result, after suspend/resume the host machine, full dynticks 
      kvm guest will softlockup:
      
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
       Call Trace:
        default_idle+0x31/0x1a0
        arch_cpu_idle+0xf/0x20
        default_idle_call+0x2a/0x50
        cpu_startup_entry+0x39b/0x4d0
        rest_init+0x138/0x140
        ? rest_init+0x5/0x140
        start_kernel+0x4c1/0x4ce
        ? set_init_arg+0x55/0x55
        ? early_idt_handler_array+0x120/0x120
        x86_64_start_reservations+0x24/0x26
        x86_64_start_kernel+0x142/0x14f
      
      In addition, cat /proc/stat | grep cpu in guest or host:
      
      cpu  398 16 5049 15754 5490 0 1 46 0 0
      cpu0 206 5 450 0 0 0 1 14 0 0
      cpu1 81 0 3937 3149 1514 0 0 9 0 0
      cpu2 45 6 332 6052 2243 0 0 11 0 0
      cpu3 65 2 328 6552 1732 0 0 11 0 0
      
      The idle and iowait states are weird 0 for cpu0(housekeeping). 
      
      The bug is present in both guest and host kernels, and they both have 
      cpu0's idle and iowait states issue, however, host kernel's suspend/resume 
      path etc will touch watchdog to avoid the softlockup.
      
      - The watchdog will not be touched in tick_nohz_stop_idle path (need be 
        touched since the scheduler stall is expected) if idle_active flags are 
        not detected.
      - The idle and iowait states will not be accounted when exit idle loop 
        (resched or interrupt) if idle start time and idle_active flags are 
        not set. 
      
      This patch fixes it by reverting commit 1f3b0f82 since can't stop 
      idle tick doesn't mean can't be idle.
      
      Fixes: 1f3b0f82 ("tick/nohz: Optimize nohz idle enter")
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Cc: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com>
      Cc: Gaurav Jindal<gaurav.jindal@spreadtrum.com>
      Cc: stable@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      08d07259
  13. 01 Sep, 2016 5 commits
  14. 31 Aug, 2016 2 commits
  15. 30 Aug, 2016 1 commit
  16. 26 Aug, 2016 2 commits
  17. 24 Aug, 2016 3 commits
    • Will Deacon's avatar
      perf/core: Use this_cpu_ptr() when stopping AUX events · 8b6a3fe8
      Will Deacon authored
      When tearing down an AUX buf for an event via perf_mmap_close(),
      __perf_event_output_stop() is called on the event's CPU to ensure that
      trace generation is halted before the process of unmapping and
      freeing the buffer pages begins.
      
      The callback is performed via cpu_function_call(), which ensures that it
      runs with interrupts disabled and is therefore not preemptible.
      Unfortunately, the current code grabs the per-cpu context pointer using
      get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
      the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
      a BUG when freeing the AUX buffer later on:
      
        WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
        Modules linked in:
        [...]
        Call Trace:
         [<ffffffff813379dd>] dump_stack+0x4f/0x72
         [<ffffffff81059ff6>] __warn+0xc6/0xe0
         [<ffffffff8105a0c8>] warn_slowpath_null+0x18/0x20
         [<ffffffff8112761c>] __rb_free_aux+0x10c/0x120
         [<ffffffff81128163>] rb_free_aux+0x13/0x20
         [<ffffffff8112515e>] perf_mmap_close+0x29e/0x2f0
         [<ffffffff8111da30>] ? perf_iterate_ctx+0xe0/0xe0
         [<ffffffff8115f685>] remove_vma+0x25/0x60
         [<ffffffff81161796>] exit_mmap+0x106/0x140
         [<ffffffff8105725c>] mmput+0x1c/0xd0
         [<ffffffff8105cac3>] do_exit+0x253/0xbf0
         [<ffffffff8105e32e>] do_group_exit+0x3e/0xb0
         [<ffffffff81068d49>] get_signal+0x249/0x640
         [<ffffffff8101c273>] do_signal+0x23/0x640
         [<ffffffff81905f42>] ? _raw_write_unlock_irq+0x12/0x30
         [<ffffffff81905f69>] ? _raw_spin_unlock_irq+0x9/0x10
         [<ffffffff81901896>] ? __schedule+0x2c6/0x710
         [<ffffffff810022a4>] exit_to_usermode_loop+0x74/0x90
         [<ffffffff81002a56>] prepare_exit_to_usermode+0x26/0x30
         [<ffffffff81906d1b>] retint_user+0x8/0x10
      
      This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
      already disabled by the caller.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Reviewed-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 95ff4ca2 ("perf/core: Free AUX pages in unmap path")
      Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8b6a3fe8
    • John Stultz's avatar
      timekeeping: Cap array access in timekeeping_debug · a4f8f666
      John Stultz authored
      It was reported that hibernation could fail on the 2nd attempt, where the
      system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
      claim_dma_lock(), because the lock has already been taken.
      
      However there is actually no other process would like to grab this lock on
      that problematic platform.
      
      Further investigation showed that the problem is triggered by setting
      /sys/power/pm_trace to 1 before the 1st hibernation.
      
      Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
      and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
      than 1970) to the release date of that motherboard during POST stage, thus
      after resumed, it may seem that the system had a significant long sleep time
      which is a completely meaningless value.
      
      Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
      sleep time happened to be set to 1, fls() returns 32 and we add 1 to
      sleep_time_bin[32], which causes an out of bounds array access and therefor
      memory being overwritten.
      
      As depicted by System.map:
      0xffffffff81c9d080 b sleep_time_bin
      0xffffffff81c9d100 B dma_spin_lock
      the dma_spin_lock.val is set to 1, which caused this problem.
      
      This patch adds a sanity check in tk_debug_account_sleep_time()
      to ensure we don't index past the sleep_time_bin array.
      
      [jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
       issue slightly differently, but borrowed his excelent explanation of the
       issue here.]
      
      Fixes: 5c83545f "power: Add option to log time spent in suspend"
      Reported-by: default avatarJanek Kozicki <cosurgi@gmail.com>
      Reported-by: default avatarChen Yu <yu.c.chen@intel.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: linux-pm@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Xunlei Pang <xpang@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: stable <stable@vger.kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      a4f8f666
    • John Stultz's avatar
      timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING · 27727df2
      John Stultz authored
      When I added some extra sanity checking in timekeeping_get_ns() under
      CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
      method was using timekeeping_get_ns().
      
      Thus the locking added to the debug checks broke the NMI-safety of
      __ktime_get_fast_ns().
      
      This patch open-codes the timekeeping_get_ns() logic for
      __ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
      
      Fixes: 4ca22c26 "timekeeping: Add warnings when overflows or underflows are observed"
      Reported-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: stable <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      27727df2
  18. 22 Aug, 2016 2 commits
  19. 18 Aug, 2016 4 commits