1. 29 Dec, 2018 1 commit
  2. 27 Nov, 2016 1 commit
  3. 26 Oct, 2016 6 commits
    • Charles Jacobsen's avatar
      libasync-integration: async rpc working! · 71f69881
      Charles Jacobsen authored
      Fixed a nasty bug in module loader (my bad). I wasn't allocating
      sufficient memory to account for init symbols.
      Also, we don't need to do the temporary return address NULL out
      for isolated code, and it doesn't work so well anyway; so, we only
      do it for non-isolated code.
    • Charles Jacobsen's avatar
      stack-trace: All works now! · bda8dba8
      Charles Jacobsen authored
      Now seeing a full stack - including __init symbols and the top
      return address (I do a check in dump stack to see if the top of
      the stack has a return address - happens when we fault just before
      setting up the stack frame).
      Introduces another LCD-specific hack to the module loader. If we're
      loading inside an LCD, we add the __init symbols to the symbol
      and string tables that the loader builds (otherwise, the loader
      just builds a permanent home for core symbols). It's all going
      into core symtab and strtab, which is misleading since for LCDs
      those tables will have both; but as the code states, those are just
      temporary vars for the tables; eventually, symtab and strtab are
      pointed to them (after the module's init has ran).
    • Charles Jacobsen's avatar
      stack-trace: Fix module loader to not 'erase' init symbols. · eac08288
      Charles Jacobsen authored
      For normal module loads, __init code is freed, so it makes sense
      we update the symbol table to remove those symbols. But for
      LCDs, we want them to hang around.
    • Charles Jacobsen's avatar
      stack-trace: Stack trace 90 percent there. Wow, that was ugly. · d8f3540a
      Charles Jacobsen authored
      I fought this thing until late in the morning. Since we're
      working with a copy of the module image (.ko), some of the fields
      in the struct module have to be updated (symbol and string table
      fields). Then, everything works.
      Still getting weird trace symbol for return address at the
      bottom of the stack.
    • Charles Jacobsen's avatar
    • Charlie Jacobsen's avatar
      Major overhaul of build process. · 8198c2fb
      Charlie Jacobsen authored
      Full kernel build no longer required. Yay! This should
      cut down on dev time a lot.
      I moved all of the LCD source into $(kernel-src)/lcd-domains/,
      so it's all in one spot. There is now a top-level makefile in
      there that triggers building liblcd, the microkernel, and the
      examples. This is built as an *external* build now, even
      though the directory is in the kernel source. The build now takes
      under a minute to do everything LCD related.
      This should also make verification easier in the future (e.g.
      building with clang) if we aren't ensnared in the kernel
      Of course, to use the microkernel and examples, you have to
      build the patched kernel and install it. But now when you
      make a few lines of changes in e.g. an example, you don't have
      to trigger a top-level kernel build to rebuild it. Running
      the full kernel build takes on average about 3 - 4 minutes
      (some files are generated everytime, linking is done, and so
      on), and can take upwards of 30 minutes for a full build if you
      Which brings me to my other change: no more config for LCDs
      in menuconfig. If we create menu entries for every example
      and so on, we end up changing the config too often, and this
      triggers full kernel rebuilds == waste of time. We can use
      macros by setting them via compiler flags (e.g., -DSOME_FLAG).
      Furthermore, it wasn't making sense to me to do conditional
      compilation for LCD support (we always want to compile for that).
      Yes, changes aren't clearly delineated with macros, but you can
      see changes made by just doing 'git diff v3.10.14 some-file-or-dir'.
      The wiki has been fully updated with instructions for building,
      and other relevant parts (updated paths to files).
      I also took the opportunity to clean up some old stuff lying around
      that is dead (like lcdguest). I incorporated all of the documentation
      in Documentation/lcd-domains into the wiki so it's all in one
      spot now (including some helpful debug tips).
  4. 25 Oct, 2016 8 commits
    • Charlie Jacobsen's avatar
      Except IPC, kliblcd fully tested. Everything is working. · 664628a4
      Charlie Jacobsen authored
      Documentation in Documentation/lcd-domains/...
      Loading, mapping, and running a module is working correctly, using
      all of the capability code that interposes on each operation (mapping,
      freeing pages, etc.).
      cptr allocation and indexing into cspaces is working correctly.
      IPC testing and debugging is coming next.
    • Charlie Jacobsen's avatar
      Muktesh's capabilities fully incorporated. Capsicum-style enter/exit. · 6ee9a51f
      Charlie Jacobsen authored
      Builds, but not fully tested. Good tests for capability subsystem, some tests
      for kliblcd.
      Non-isolated kernel threads can "enter" the lcd system by doing
      klcd_enter / klcd_exit. They can create other lcd's, set them up, etc. They
      use the same interface that regular lcd's will use, so such code could be
      moved to an lcd, as we had planned. Will document this in Documentation folder
      tomorrow ( == today ).
      Capability system does checks now when a capability is deleted/revoked: for
      example, if it's for a page, the microkernel checks if the page is mapped, and
      unmaps it. If the last capability goes away, the page is freed. Documentation
      is in Documentation/lcd-domains/cap.txt.
      IPC code is in place, but not tested yet (pray for me).
      Debug is taking some time. Sometimes requires a power cycle which adds an
      extra 5 - 10 minutes. Build is slow the first time after reboot. Give me a user
      level program and I'll debug it in 30 seconds! argc
      Main arch-independent files:
          include/lcd-domains/kliblcd.h, types.h
             This is what non-isolated kernel code should include to use the
             kliblcd interface to the microkernel.
          virt/lcd-domains/main.c, kliblcd.c, cap.c, ipc.c, internal.h
             The microkernel, broken up into pieces.
             The tests, in progress.
      Some old files are still hanging around in virt/lcd-domains and will be
      incorprated/cleaned up soon.
      I couldn't squash over the merge from the decomposition branch, so there's a
      bunch of junk commits coming over. (I should've just copied Muktesh's files.)
      Resolved-by: Vikram Narayanan's avatarVikram Narayanan <vikram186@gmail.com>
    • Charlie Jacobsen's avatar
      Partially through starting ipc. · 9edaa6a6
      Charlie Jacobsen authored
      Will probably switch back to single lcd, multiple addr space, etc.
      Merging muktesh's caps.
    • Charlie Jacobsen's avatar
      Basic lcd module create, run, and destroy. · e0193fa4
      Charlie Jacobsen authored
      This code is ugly, but it's working.
      Tested with basic module, and appears to be working
      properly. I will soon incorporate the patched
      modprobe into the kernel tree, and then this code
      will be usable by everyone.
      The ipc code is still unimplemented. The only
      hypercall handled is yield. Also note that other
      exit conditions (e.g. external interrupt) have not
      been fully tested.
      -- kernel code calls lcd_create_as_module with
         the module's name
      -- lcd_create_as_module loads the module using
         request_lcd_module (request_lcd_module calls
         the patched modprobe to load the module, and
         the patched modprobe calls back into the lcd
         driver via the ioctrl interface to load the
      -- lcd_create_as_module then finds the loaded
         module, spawns a kernel thread and passes off
         the module to it
      -- the kernel thread initializes the lcd and
         maps the module inside it, then suspends itself
      -- lcd_run_as_module wakes up the kernel thread
         and tells it to run
      -- lcd_delete_as_module stops the kernel thread
         and deletes the module from the host kernel
      File-by-file details:
      -- lcd was not running in 64-bit mode, and my
         checks had one subtle bug
      -- fixed %cr3 load to properly load vmcs first
      -- fixed set program counter to use guest virtual
         rather than guest physical address
      -- added struct lcd to task_struct
      -- lcd pointer set to null when task_struct is
      -- made init_module and delete_module system calls
         callable from kernel code
      -- available in module.h via do_sys_init_module and
      -- simply moved the majority of the guts of the
         system calls into a non-system call, exported
      -- take an extra flag, for_lcd; when set, the init
         code skips over running (and deallocating) the
         module's init code, and the delete code skips
         over running the module exit
      -- system calls from user code set for_lcd = 0; this
         ensures existing code still works
      -- changed __request_module to __do_request_module; takes
         one extra argument, for_lcd
      -- __request_module   ==>  __do_request_module with for_lcd = 0
      -- request_lcd_module ==>  __do_request_module with for_lcd = 1
      -- call_modprobe conditionally uses lcd_modprobe_path, the path
         to a patched modprobe accessible via sysfs
      -- added lcd status enum; see source code doc
      -- three routines for creating/running/destroying
         lcd's that use modules; see source code doc
      -- added interface defns for patched modprobe to call into
         lcd driver for module init; lcd driver loads
         module (via slightly refactored module.c code) on behalf
         of modprobe
      -- implementation of routines for modules inside lcd's
      -- implementation of module init / delete for lcd's
         (uses patched module.c code)
      -- added test module for lcd module code
      -- test runs automatically when lcd module is inserted
    • Charles Jacobsen's avatar
      Fixed build system for lcd, and most compiler errors. · 7c05c7a0
      Charles Jacobsen authored
      Two components to the lcd build now:
      -- arch/x86/lcd/Makefile: for building x86 lcd code
      -- virt/lcd/Makefile: for building arch-indep lcd code
      Modified the build system just slightly for a cleaner
      -- virt/ directory treated like ipc/, usr/, etc. directories
      -- added Kconfig and Makefile to virt/, mirroring drivers/
      -- updated top-level Makefile to include virt/ as vmlinux
         directory / dependency, so build system will recur into
      -- updated arch/x86/Kconfig to include virt/Kconfig, so it
         will be included as a menu item
      -- updated arch/x86/Kbuild to include arch/x86/lcd/
      Removed old capabilities code in cap/.
      Removed lcd syscall.
      Temporarily turned off build for drivers/lcd.
      Fixed most bugs in lcd-vmx (still need to do lcd_vmx_exit).
      -- minor naming issues in lcd-vmx.h
      -- straight copy over of vmx_disable_intercepts_for_msr,
         but with more doc
      -- removed VMX_EPT_INDIVIDUAL_ADDR macro from vmx.h (where
         did this come from? it's not documented in the intel manual,
         nor is it used in kvm)
    • Anton Burtsev's avatar
      ioctl interface to LCD · 552bd976
      Anton Burtsev authored
    • Anton Burtsev's avatar
      Move things around · 0ec257ef
      Anton Burtsev authored
      Not ideal. Some problems:
        - The arch dependent and independent parts are not completely
        - The module builds in arch/x86/lcd and this is a bit strange
        - I can't build with LCD buildin, but build works if I build it
          as a module
        - There is still a bunch of old warrnings from Weibin's code
    • Sarah Spall's avatar
  5. 16 Oct, 2016 1 commit
    • John Stultz's avatar
      timekeeping: Fix __ktime_get_fast_ns() regression · 4e584cfe
      John Stultz authored
      commit 58bfea9532552d422bde7afa207e1a0f08dffa7d upstream.
      In commit 27727df2 ("Avoid taking lock in NMI path with
      CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
      the timekeeping_get_ns() function, but I forgot to include
      the unit conversion from cycles to nanoseconds, breaking the
      function's output, which impacts users like perf.
      This results in bogus perf timestamps like:
       swapper     0 [000]   253.427536:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426573:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426687:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426800:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426905:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427022:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427127:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427239:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427346:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427463:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   255.426572:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      Instead of more reasonable expected timestamps like:
       swapper     0 [000]    39.953768:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.064839:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.175956:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.287103:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.398217:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.509324:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.620437:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.731546:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.842654:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.953772:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    41.064881:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      Add the proper use of timekeeping_delta_to_ns() to convert
      the cycle delta to nanoseconds as needed.
      Thanks to Brendan and Alexei for finding this quickly after
      the v4.8 release. Unfortunately the problematic commit has
      landed in some -stable trees so they'll need this fix as
      Many apologies for this mistake. I'll be looking to add a
      perf-clock sanity test to the kselftest timers tests soon.
      Fixes: 27727df2 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
      Reported-by: default avatarBrendan Gregg <bgregg@netflix.com>
      Reported-by: default avatarAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Tested-and-reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 25 Sep, 2016 2 commits
  7. 23 Sep, 2016 1 commit
    • Tejun Heo's avatar
      cgroup: fix invalid controller enable rejections with cgroup namespace · 9157056d
      Tejun Heo authored
      On the v2 hierarchy, "cgroup.subtree_control" rejects controller
      enables if the cgroup has processes in it.  The enforcement of this
      logic assumes that the cgroup wouldn't have any css_sets associated
      with it if there are no tasks in the cgroup, which is no longer true
      since a79a908f ("cgroup: introduce cgroup namespaces").
      When a cgroup namespace is created, it pins the css_set of the
      creating task to use it as the root css_set of the namespace.  This
      extra reference stays as long as the namespace is around and makes
      "cgroup.subtree_control" think that the namespace root cgroup is not
      empty even when it is and thus reject controller enables.
      Fix it by making cgroup_subtree_control() walk and test emptiness of
      each css_set instead of testing whether the list_head is empty.
      While at it, update the comment of cgroup_task_count() to indicate
      that the returned value may be higher than the number of tasks, which
      has always been true due to temporary references and doesn't break
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarEvgeny Vereshchagin <evvers@ya.ru>
      Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Cc: Aditya Kali <adityakali@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
  8. 22 Sep, 2016 1 commit
  9. 19 Sep, 2016 2 commits
  10. 16 Sep, 2016 1 commit
  11. 13 Sep, 2016 1 commit
    • Joonwoo Park's avatar
      cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Joonwoo Park authored
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      Signed-off-by: default avatarJoonwoo Park <joonwoop@codeaurora.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  12. 10 Sep, 2016 3 commits
    • Alexander Shishkin's avatar
      perf/core: Fix aux_mmap_count vs aux_refcount order · b79ccadd
      Alexander Shishkin authored
      The order of accesses to ring buffer's aux_mmap_count and aux_refcount
      has to be preserved across the users, namely perf_mmap_close() and
      perf_aux_output_begin(), otherwise the inversion can result in the latter
      holding the last reference to the aux buffer and subsequently free'ing
      it in atomic context, triggering a warning.
      > ------------[ cut here ]------------
      > WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
      > CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
      > Call Trace:
      >  [<ffffffff810f3e0b>] __warn+0xcb/0xf0
      >  [<ffffffff810f3f3d>] warn_slowpath_null+0x1d/0x20
      >  [<ffffffff8121182a>] __rb_free_aux+0x11a/0x130
      >  [<ffffffff812127a8>] rb_free_aux+0x18/0x20
      >  [<ffffffff81212913>] perf_aux_output_begin+0x163/0x1e0
      >  [<ffffffff8100c33a>] bts_event_start+0x3a/0xd0
      >  [<ffffffff8100c42d>] bts_event_add+0x5d/0x80
      >  [<ffffffff81203646>] event_sched_in.isra.104+0xf6/0x2f0
      >  [<ffffffff8120652e>] group_sched_in+0x6e/0x190
      >  [<ffffffff8120694e>] ctx_sched_in+0x2fe/0x5f0
      >  [<ffffffff81206ca0>] perf_event_sched_in+0x60/0x80
      >  [<ffffffff81206d1b>] ctx_resched+0x5b/0x90
      >  [<ffffffff81207281>] __perf_event_enable+0x1e1/0x240
      >  [<ffffffff81200639>] event_function+0xa9/0x180
      >  [<ffffffff81202000>] ? perf_cgroup_attach+0x70/0x70
      >  [<ffffffff8120203f>] remote_function+0x3f/0x50
      >  [<ffffffff811971f3>] flush_smp_call_function_queue+0x83/0x150
      >  [<ffffffff81197bd3>] generic_smp_call_function_single_interrupt+0x13/0x60
      >  [<ffffffff810a6477>] smp_call_function_single_interrupt+0x27/0x40
      >  [<ffffffff81a26ea9>] call_function_single_interrupt+0x89/0x90
      >  [<ffffffff81120056>] finish_task_switch+0xa6/0x210
      >  [<ffffffff81120017>] ? finish_task_switch+0x67/0x210
      >  [<ffffffff81a1e83d>] __schedule+0x3dd/0xb50
      >  [<ffffffff81a1efe5>] schedule+0x35/0x80
      >  [<ffffffff81128031>] sys_sched_yield+0x61/0x70
      >  [<ffffffff81a25be5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      > ---[ end trace 6235f556f5ea83a9 ]---
      This patch puts the checks in perf_aux_output_begin() in the same order
      as that of perf_mmap_close().
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Alexander Shishkin's avatar
      perf/core: Fix a race between mmap_close() and set_output() of AUX events · 767ae086
      Alexander Shishkin authored
      In the mmap_close() path we need to stop all the AUX events that are
      writing data to the AUX area that we are unmapping, before we can
      safely free the pages. To determine if an event needs to be stopped,
      we're comparing its ->rb against the one that's getting unmapped.
      However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
      and swizzle event::rb to some other ring buffer, but the transaction
      will keep writing data to the old ring buffer until the event gets
      scheduled out. At this point, mmap_close() will skip over such an
      event and will proceed to free the AUX area, while it's still being
      used by this event, which will set off a warning in the mmap_close()
      path and cause a memory corruption.
      To avoid this, always stop an AUX event before its ->rb is updated;
      this will release the (potentially) last reference on the AUX area
      of the buffer. If the event gets restarted, its new ring buffer will
      be used. If another SET_OUTPUT comes and switches it back to the
      old ring buffer that's getting unmapped, it's also fine: this
      ring buffer's aux_mmap_count will be zero and AUX transactions won't
      start any more.
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Dan Williams's avatar
      mm: fix cache mode of dax pmd mappings · 9049771f
      Dan Williams authored
      track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
      uncacheable rendering them impractical for application usage.  DAX-pte
      mappings are cached and the goal of establishing DAX-pmd mappings is to
      attain more performance, not dramatically less (3 orders of magnitude).
      track_pfn_insert() relies on a previous call to reserve_memtype() to
      establish the expected page_cache_mode for the range.  While memremap()
      arranges for reserve_memtype() to be called, devm_memremap_pages() does
      not.  So, teach track_pfn_insert() and untrack_pfn() how to handle
      tracking without a vma, and arrange for devm_memremap_pages() to
      establish the write-back-cache reservation in the memtype tree.
      Cc: <stable@vger.kernel.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reported-by: default avatarKai Zhang <kai.ka.zhang@oracle.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  13. 05 Sep, 2016 3 commits
    • Tejun Heo's avatar
      PM / QoS: avoid calling cancel_delayed_work_sync() during early boot · c86d06ba
      Tejun Heo authored
      of_clk_init() ends up calling into pm_qos_update_request() very early
      during boot where irq is expected to stay disabled.
      pm_qos_update_request() uses cancel_delayed_work_sync() which
      correctly assumes that irq is enabled on invocation and
      unconditionally disables and re-enables it.
      Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
      enabling irq unexpectedly during early boot.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarQiao Zhou <qiaozhou@asrmicro.com>
      Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.comSigned-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
    • Balbir Singh's avatar
      sched/core: Fix a race between try_to_wake_up() and a woken up task · 135e8c92
      Balbir Singh authored
      The origin of the issue I've seen is related to
      a missing memory barrier between check for task->state and
      the check for task->on_rq.
      The task being woken up is already awake from a schedule()
      and is doing the following:
      	do {
      	} while (!cond);
      The waker, actually gets stuck doing the following in
      	while (p->on_cpu)
      The instance I've seen involves the following race:
       CPU1					CPU2
       while () {
         if (cond)
         do {
         } while (!cond);
         raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
       }					  try_to_wake_up()
       set_current_state(TASK_RUNNING);	  ..
      CPU2 wakes up CPU1, but before it can get the wait_lock and set
      current state to TASK_RUNNING the following occurs:
       if (!list_empty)
         if (p->on_rq && ttwu_wakeup())
         while (p->on_cpu)
      CPU3 tries to wake up the task on CPU1 again since it finds
      it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
      after CPU2, CPU3 got it.
      CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
      the task is spinning on the wait_lock. Interestingly since p->on_rq
      is checked under pi_lock, I've noticed that try_to_wake_up() finds
      p->on_rq to be 0. This was the most confusing bit of the analysis,
      but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
      check is not reliable without this fix IMHO. The race is visible
      (based on the analysis) only when ttwu_queue() does a remote wakeup
      via ttwu_queue_remote. In which case the p->on_rq change is not
      done uder the pi_lock.
      The result is that after a while the entire system locks up on
      the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
      Reproduction of the issue:
      The issue can be reproduced after a long run on my system with 80
      threads and having to tweak available memory to very low and running
      memory stress-ng mmapfork test. It usually takes a long time to
      reproduce. I am trying to work on a test case that can reproduce
      the issue faster, but thats work in progress. I am still testing the
      changes on my still in a loop and the tests seem OK thus far.
      Big thanks to Benjamin and Nick for helping debug this as well.
      Ben helped catch the missing barrier, Nick caught every missing
      bit in my theory.
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      [ Updated comment to clarify matching barriers. Many
        architectures do not have a full barrier in switch_to()
        so that cannot be relied upon. ]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Peter Zijlstra's avatar
      perf/core: Remove WARN from perf_event_read() · 58763148
      Peter Zijlstra authored
      This effectively reverts commit:
        71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      ... and puts in a comment explaining why we ignore the return value.
      Reported-by: default avatarVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  14. 02 Sep, 2016 6 commits
    • Wanpeng Li's avatar
      tick/nohz: Fix softlockup on scheduler stalls in kvm guest · 08d07259
      Wanpeng Li authored
      tick_nohz_start_idle() is prevented to be called if the idle tick can't 
      be stopped since commit 1f3b0f82 ("tick/nohz: Optimize nohz idle 
      enter"). As a result, after suspend/resume the host machine, full dynticks 
      kvm guest will softlockup:
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
       Call Trace:
        ? rest_init+0x5/0x140
        ? set_init_arg+0x55/0x55
        ? early_idt_handler_array+0x120/0x120
      In addition, cat /proc/stat | grep cpu in guest or host:
      cpu  398 16 5049 15754 5490 0 1 46 0 0
      cpu0 206 5 450 0 0 0 1 14 0 0
      cpu1 81 0 3937 3149 1514 0 0 9 0 0
      cpu2 45 6 332 6052 2243 0 0 11 0 0
      cpu3 65 2 328 6552 1732 0 0 11 0 0
      The idle and iowait states are weird 0 for cpu0(housekeeping). 
      The bug is present in both guest and host kernels, and they both have 
      cpu0's idle and iowait states issue, however, host kernel's suspend/resume 
      path etc will touch watchdog to avoid the softlockup.
      - The watchdog will not be touched in tick_nohz_stop_idle path (need be 
        touched since the scheduler stall is expected) if idle_active flags are 
        not detected.
      - The idle and iowait states will not be accounted when exit idle loop 
        (resched or interrupt) if idle start time and idle_active flags are 
        not set. 
      This patch fixes it by reverting commit 1f3b0f82 since can't stop 
      idle tick doesn't mean can't be idle.
      Fixes: 1f3b0f82 ("tick/nohz: Optimize nohz idle enter")
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Cc: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com>
      Cc: Gaurav Jindal<gaurav.jindal@spreadtrum.com>
      Cc: stable@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    • Michal Hocko's avatar
      kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd · 735f2770
      Michal Hocko authored
      Commit fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
      exit") has caused a subtle regression in nscd which uses
      CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
      shared databases, so that the clients are notified when nscd is
      restarted.  Now, when nscd uses a non-persistent database, clients that
      have it mapped keep thinking the database is being updated by nscd, when
      in fact nscd has created a new (anonymous) one (for non-persistent
      databases it uses an unlinked file as backend).
      The original proposal for the CLONE_CHILD_CLEARTID change claimed
      : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
      : on behalf of pthread_create() library calls.  This feature is used to
      : request that the kernel clear the thread-id in user space (at an address
      : provided in the syscall) when the thread disassociates itself from the
      : address space, which is done in mm_release().
      : Unfortunately, when a multi-threaded process incurs a core dump (such as
      : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
      : the other threads, which then proceed to clear their user-space tids
      : before synchronizing in exit_mm() with the start of core dumping.  This
      : misrepresents the state of process's address space at the time of the
      : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
      : problems (misleading him/her to conclude that the threads had gone away
      : before the fault).
      : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
      : core dump has been initiated.
      The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
      seems to have a larger scope than the original patch asked for.  It
      seems that limitting the scope of the check to core dumping should work
      for SIGSEGV issue describe above.
      [Changelog partly based on Andreas' description]
      Fixes: fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
      Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarWilliam Preston <wpreston@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Andreas Schwab <schwab@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • David Rientjes's avatar
      mm, mempolicy: task->mempolicy must be NULL before dropping final reference · c11600e4
      David Rientjes authored
      KASAN allocates memory from the page allocator as part of
      kmem_cache_free(), and that can reference current->mempolicy through any
      number of allocation functions.  It needs to be NULL'd out before the
      final reference is dropped to prevent a use-after-free bug:
      	BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
      	CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
      	Call Trace:
      		alloc_pages_current	<-- use after free
      		__mpol_put		<-- free
      This patch sets current->mempolicy to NULL before dropping the final
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky's avatar
      printk/nmi: avoid direct printk()-s from __printk_nmi_flush() · 19feeff1
      Sergey Senozhatsky authored
      __printk_nmi_flush() can be called from nmi_panic(), therefore it has to
      test whether it's executed in NMI context and thus must route the
      messages through deferred printk() or via direct printk().
      This is to avoid potential deadlocks, as described in commit
      cf9b1106 ("printk/nmi: flush NMI messages on the system panic").
      However there remain two places where __printk_nmi_flush() does
      unconditional direct printk() calls:
       - pr_err("printk_nmi_flush: internal error ...")
       - pr_cont("\n")
      Factor out print_nmi_seq_line() parts into a new printk_nmi_flush_line()
      function, which takes care of in_nmi(), and use it in
      __printk_nmi_flush() for printing and error-reporting.
      Link: http://lkml.kernel.org/r/20160830161354.581-1-sergey.senozhatsky@gmail.comSigned-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Arnd Bergmann's avatar
      kconfig: tinyconfig: provide whole choice blocks to avoid warnings · 236dec05
      Arnd Bergmann authored
      Using "make tinyconfig" produces a couple of annoying warnings that show
      up for build test machines all the time:
          .config:966:warning: override: NOHIGHMEM changes choice state
          .config:965:warning: override: SLOB changes choice state
          .config:963:warning: override: KERNEL_XZ changes choice state
          .config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
          .config:933:warning: override: SLOB changes choice state
          .config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
          .config:870:warning: override: SLOB changes choice state
          .config:868:warning: override: KERNEL_XZ changes choice state
          .config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
      I've made a previous attempt at fixing them and we discussed a number of
      I tried changing the Makefile to use "merge_config.sh -n
      $(fragment-list)" but couldn't get that to work properly.
      This is yet another approach, based on the observation that we do want
      to see a warning for conflicting 'choice' options, and that we can
      simply make them non-conflicting by listing all other options as
      disabled.  This is a trivial patch that we can apply independent of
      plans for other changes.
      Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
      Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.log
      https://patchwork.kernel.org/patch/9212749/Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Reviewed-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Thiago Jung Bauermann's avatar
      kexec: fix double-free when failing to relocate the purgatory · 070c43ee
      Thiago Jung Bauermann authored
      If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs
      and pi->purgatory_buf.  This is redundant, because in case of error
      kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which
      will also free those buffers.
      This causes two warnings like the following, one for pi->sechdrs and the
      other for pi->purgatory_buf:
        kexec-bzImage64: Loading purgatory failed
        ------------[ cut here ]------------
        WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
        Trying to vfree() nonexistent vm area (ffffc90000e91000)
        Modules linked in:
        CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          ? find_vmap_area+0x19/0x70
          ? kimage_file_post_load_cleanup+0x47/0xb0
          ? putname+0x54/0x60
          ? do_sys_open+0x190/0x1f0
        ---[ end trace 158bb74f5950ca2b ]---
      Fix by setting pi->sechdrs an pi->purgatory_buf to NULL, since vfree
      won't try to free a NULL pointer.
      Link: http://lkml.kernel.org/r/1472083546-23683-1-git-send-email-bauerman@linux.vnet.ibm.comSigned-off-by: default avatarThiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 31 Aug, 2016 2 commits
  16. 30 Aug, 2016 1 commit