1. 28 Dec, 2018 1 commit
  2. 27 Nov, 2016 1 commit
  3. 26 Oct, 2016 6 commits
    • Charles Jacobsen's avatar
      libasync-integration: async rpc working! · 71f69881
      Charles Jacobsen authored
      Fixed a nasty bug in module loader (my bad). I wasn't allocating
      sufficient memory to account for init symbols.
      
      Also, we don't need to do the temporary return address NULL out
      for isolated code, and it doesn't work so well anyway; so, we only
      do it for non-isolated code.
      71f69881
    • Charles Jacobsen's avatar
      stack-trace: All works now! · bda8dba8
      Charles Jacobsen authored
      Now seeing a full stack - including __init symbols and the top
      return address (I do a check in dump stack to see if the top of
      the stack has a return address - happens when we fault just before
      setting up the stack frame).
      
      Introduces another LCD-specific hack to the module loader. If we're
      loading inside an LCD, we add the __init symbols to the symbol
      and string tables that the loader builds (otherwise, the loader
      just builds a permanent home for core symbols). It's all going
      into core symtab and strtab, which is misleading since for LCDs
      those tables will have both; but as the code states, those are just
      temporary vars for the tables; eventually, symtab and strtab are
      pointed to them (after the module's init has ran).
      bda8dba8
    • Charles Jacobsen's avatar
      stack-trace: Fix module loader to not 'erase' init symbols. · eac08288
      Charles Jacobsen authored
      For normal module loads, __init code is freed, so it makes sense
      we update the symbol table to remove those symbols. But for
      LCDs, we want them to hang around.
      eac08288
    • Charles Jacobsen's avatar
      stack-trace: Stack trace 90 percent there. Wow, that was ugly. · d8f3540a
      Charles Jacobsen authored
      I fought this thing until late in the morning. Since we're
      working with a copy of the module image (.ko), some of the fields
      in the struct module have to be updated (symbol and string table
      fields). Then, everything works.
      
      Still getting weird trace symbol for return address at the
      bottom of the stack.
      d8f3540a
    • Charles Jacobsen's avatar
    • Charlie Jacobsen's avatar
      Major overhaul of build process. · 8198c2fb
      Charlie Jacobsen authored
      Full kernel build no longer required. Yay! This should
      cut down on dev time a lot.
      
      I moved all of the LCD source into $(kernel-src)/lcd-domains/,
      so it's all in one spot. There is now a top-level makefile in
      there that triggers building liblcd, the microkernel, and the
      examples. This is built as an *external* build now, even
      though the directory is in the kernel source. The build now takes
      under a minute to do everything LCD related.
      
      This should also make verification easier in the future (e.g.
      building with clang) if we aren't ensnared in the kernel
      source.
      
      Of course, to use the microkernel and examples, you have to
      build the patched kernel and install it. But now when you
      make a few lines of changes in e.g. an example, you don't have
      to trigger a top-level kernel build to rebuild it. Running
      the full kernel build takes on average about 3 - 4 minutes
      (some files are generated everytime, linking is done, and so
      on), and can take upwards of 30 minutes for a full build if you
      re-config'd.
      
      Which brings me to my other change: no more config for LCDs
      in menuconfig. If we create menu entries for every example
      and so on, we end up changing the config too often, and this
      triggers full kernel rebuilds == waste of time. We can use
      macros by setting them via compiler flags (e.g., -DSOME_FLAG).
      Furthermore, it wasn't making sense to me to do conditional
      compilation for LCD support (we always want to compile for that).
      Yes, changes aren't clearly delineated with macros, but you can
      see changes made by just doing 'git diff v3.10.14 some-file-or-dir'.
      
      The wiki has been fully updated with instructions for building,
      and other relevant parts (updated paths to files).
      
      I also took the opportunity to clean up some old stuff lying around
      that is dead (like lcdguest). I incorporated all of the documentation
      in Documentation/lcd-domains into the wiki so it's all in one
      spot now (including some helpful debug tips).
      8198c2fb
  4. 25 Oct, 2016 8 commits
    • Charlie Jacobsen's avatar
      Except IPC, kliblcd fully tested. Everything is working. · 664628a4
      Charlie Jacobsen authored
      Documentation in Documentation/lcd-domains/...
      
      Loading, mapping, and running a module is working correctly, using
      all of the capability code that interposes on each operation (mapping,
      freeing pages, etc.).
      
      cptr allocation and indexing into cspaces is working correctly.
      
      IPC testing and debugging is coming next.
      664628a4
    • Charlie Jacobsen's avatar
      Muktesh's capabilities fully incorporated. Capsicum-style enter/exit. · 6ee9a51f
      Charlie Jacobsen authored
      Builds, but not fully tested. Good tests for capability subsystem, some tests
      for kliblcd.
      
      Non-isolated kernel threads can "enter" the lcd system by doing
      klcd_enter / klcd_exit. They can create other lcd's, set them up, etc. They
      use the same interface that regular lcd's will use, so such code could be
      moved to an lcd, as we had planned. Will document this in Documentation folder
      tomorrow ( == today ).
      
      Capability system does checks now when a capability is deleted/revoked: for
      example, if it's for a page, the microkernel checks if the page is mapped, and
      unmaps it. If the last capability goes away, the page is freed. Documentation
      is in Documentation/lcd-domains/cap.txt.
      
      IPC code is in place, but not tested yet (pray for me).
      
      Debug is taking some time. Sometimes requires a power cycle which adds an
      extra 5 - 10 minutes. Build is slow the first time after reboot. Give me a user
      level program and I'll debug it in 30 seconds! argc
      
      Main arch-independent files:
      
          include/lcd-domains/kliblcd.h, types.h
      
             This is what non-isolated kernel code should include to use the
             kliblcd interface to the microkernel.
      
          virt/lcd-domains/main.c, kliblcd.c, cap.c, ipc.c, internal.h
      
             The microkernel, broken up into pieces.
      
          virt/lcd-domains/tests/
      
             The tests, in progress.
      
      Some old files are still hanging around in virt/lcd-domains and will be
      incorprated/cleaned up soon.
      
      I couldn't squash over the merge from the decomposition branch, so there's a
      bunch of junk commits coming over. (I should've just copied Muktesh's files.)
      
      Conflicts:
      	drivers/Kconfig
      	drivers/lcd-cspace/test.h
      	include/lcd-domains/cap.h
      	include/lcd-prototype/lcd.h
      	include/lcd/console.h
      	include/lcd/elfnote.h
      	include/linux/init_task.h
      	include/linux/module.h
      	include/linux/sched.h
      	virt/lcd-domains/cap.c
      	virt/lcd-domains/ipc.c
      	virt/lcd-domains/lcd-cspace-tests2.c
      Resolved-by: Vikram Narayanan's avatarVikram Narayanan <vikram186@gmail.com>
      6ee9a51f
    • Charlie Jacobsen's avatar
      Partially through starting ipc. · 9edaa6a6
      Charlie Jacobsen authored
      Will probably switch back to single lcd, multiple addr space, etc.
      
      Merging muktesh's caps.
      9edaa6a6
    • Charlie Jacobsen's avatar
      Basic lcd module create, run, and destroy. · e0193fa4
      Charlie Jacobsen authored
      This code is ugly, but it's working.
      
      Tested with basic module, and appears to be working
      properly. I will soon incorporate the patched
      modprobe into the kernel tree, and then this code
      will be usable by everyone.
      
      The ipc code is still unimplemented. The only
      hypercall handled is yield. Also note that other
      exit conditions (e.g. external interrupt) have not
      been fully tested.
      
      Overview:
      -- kernel code calls lcd_create_as_module with
         the module's name
      -- lcd_create_as_module loads the module using
         request_lcd_module (request_lcd_module calls
         the patched modprobe to load the module, and
         the patched modprobe calls back into the lcd
         driver via the ioctrl interface to load the
         module)
      -- lcd_create_as_module then finds the loaded
         module, spawns a kernel thread and passes off
         the module to it
      -- the kernel thread initializes the lcd and
         maps the module inside it, then suspends itself
      -- lcd_run_as_module wakes up the kernel thread
         and tells it to run
      -- lcd_delete_as_module stops the kernel thread
         and deletes the module from the host kernel
      
      File-by-file details:
      
      arch/x86/include/asm/lcd-domains-arch.h
      arch/x86/lcd-domains/lcd-domains-arch-tests.c
      arch/x86/lcd-domains/lcd-domains-arch.c
      -- lcd was not running in 64-bit mode, and my
         checks had one subtle bug
      -- fixed %cr3 load to properly load vmcs first
      -- fixed set program counter to use guest virtual
         rather than guest physical address
      
      include/linux/sched.h
      -- added struct lcd to task_struct
      
      include/linux/init_task.h
      -- lcd pointer set to null when task_struct is
         initialized
      
      include/linux/module.h
      kernel/module.c
      -- made init_module and delete_module system calls
         callable from kernel code
      -- available in module.h via do_sys_init_module and
         do_sys_delete_module
      -- simply moved the majority of the guts of the
         system calls into a non-system call, exported
         routine
      -- take an extra flag, for_lcd; when set, the init
         code skips over running (and deallocating) the
         module's init code, and the delete code skips
         over running the module exit
      -- system calls from user code set for_lcd = 0; this
         ensures existing code still works
      
      include/linux/kmod.h
      kernel/kmod.c
      kernel/sysctl.c
      -- changed __request_module to __do_request_module; takes
         one extra argument, for_lcd
      -- __request_module   ==>  __do_request_module with for_lcd = 0
      -- request_lcd_module ==>  __do_request_module with for_lcd = 1
      -- call_modprobe conditionally uses lcd_modprobe_path, the path
         to a patched modprobe accessible via sysfs
      
      include/lcd-domains/lcd-domains.h
      -- added lcd status enum; see source code doc
      -- three routines for creating/running/destroying
         lcd's that use modules; see source code doc
      
      include/uapi/linux/lcd-domains.h
      -- added interface defns for patched modprobe to call into
         lcd driver for module init; lcd driver loads
         module (via slightly refactored module.c code) on behalf
         of modprobe
      
      virt/lcd-domains/lcd-domains.c
      -- implementation of routines for modules inside lcd's
      -- implementation of module init / delete for lcd's
         (uses patched module.c code)
      
      virt/lcd-domains/Kconfig
      virt/lcd-domains/Makefile
      virt/lcd-domains/lcd-module-load-test.c
      virt/lcd-domains/lcd-tests.c
      -- added test module for lcd module code
      -- test runs automatically when lcd module is inserted
      e0193fa4
    • Charles Jacobsen's avatar
      Fixed build system for lcd, and most compiler errors. · 7c05c7a0
      Charles Jacobsen authored
      Two components to the lcd build now:
      -- arch/x86/lcd/Makefile: for building x86 lcd code
      -- virt/lcd/Makefile: for building arch-indep lcd code
      
      Modified the build system just slightly for a cleaner
      build:
      -- virt/ directory treated like ipc/, usr/, etc. directories
      -- added Kconfig and Makefile to virt/, mirroring drivers/
      -- updated top-level Makefile to include virt/ as vmlinux
         directory / dependency, so build system will recur into
         virt/
      -- updated arch/x86/Kconfig to include virt/Kconfig, so it
         will be included as a menu item
      -- updated arch/x86/Kbuild to include arch/x86/lcd/
      
      Removed old capabilities code in cap/.
      
      Removed lcd syscall.
      
      Temporarily turned off build for drivers/lcd.
      
      Fixed most bugs in lcd-vmx (still need to do lcd_vmx_exit).
      -- minor naming issues in lcd-vmx.h
      -- straight copy over of vmx_disable_intercepts_for_msr,
         but with more doc
      -- removed VMX_EPT_INDIVIDUAL_ADDR macro from vmx.h (where
         did this come from? it's not documented in the intel manual,
         nor is it used in kvm)
      7c05c7a0
    • Anton Burtsev's avatar
      ioctl interface to LCD · 552bd976
      Anton Burtsev authored
      552bd976
    • Anton Burtsev's avatar
      Move things around · 0ec257ef
      Anton Burtsev authored
      Not ideal. Some problems:
      
        - The arch dependent and independent parts are not completely
          separated.
      
        - The module builds in arch/x86/lcd and this is a bit strange
      
        - I can't build with LCD buildin, but build works if I build it
          as a module
      
        - There is still a bunch of old warrnings from Weibin's code
      0ec257ef
    • Sarah Spall's avatar
  5. 16 Oct, 2016 1 commit
    • John Stultz's avatar
      timekeeping: Fix __ktime_get_fast_ns() regression · 4e584cfe
      John Stultz authored
      commit 58bfea9532552d422bde7afa207e1a0f08dffa7d upstream.
      
      In commit 27727df2 ("Avoid taking lock in NMI path with
      CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
      the timekeeping_get_ns() function, but I forgot to include
      the unit conversion from cycles to nanoseconds, breaking the
      function's output, which impacts users like perf.
      
      This results in bogus perf timestamps like:
       swapper     0 [000]   253.427536:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426573:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426687:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426800:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426905:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427022:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427127:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427239:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427346:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427463:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   255.426572:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Instead of more reasonable expected timestamps like:
       swapper     0 [000]    39.953768:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.064839:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.175956:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.287103:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.398217:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.509324:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.620437:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.731546:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.842654:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.953772:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    41.064881:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Add the proper use of timekeeping_delta_to_ns() to convert
      the cycle delta to nanoseconds as needed.
      
      Thanks to Brendan and Alexei for finding this quickly after
      the v4.8 release. Unfortunately the problematic commit has
      landed in some -stable trees so they'll need this fix as
      well.
      
      Many apologies for this mistake. I'll be looking to add a
      perf-clock sanity test to the kselftest timers tests soon.
      
      Fixes: 27727df2 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
      Reported-by: default avatarBrendan Gregg <bgregg@netflix.com>
      Reported-by: default avatarAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Tested-and-reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e584cfe
  6. 25 Sep, 2016 2 commits
  7. 23 Sep, 2016 1 commit
    • Tejun Heo's avatar
      cgroup: fix invalid controller enable rejections with cgroup namespace · 9157056d
      Tejun Heo authored
      On the v2 hierarchy, "cgroup.subtree_control" rejects controller
      enables if the cgroup has processes in it.  The enforcement of this
      logic assumes that the cgroup wouldn't have any css_sets associated
      with it if there are no tasks in the cgroup, which is no longer true
      since a79a908f ("cgroup: introduce cgroup namespaces").
      
      When a cgroup namespace is created, it pins the css_set of the
      creating task to use it as the root css_set of the namespace.  This
      extra reference stays as long as the namespace is around and makes
      "cgroup.subtree_control" think that the namespace root cgroup is not
      empty even when it is and thus reject controller enables.
      
      Fix it by making cgroup_subtree_control() walk and test emptiness of
      each css_set instead of testing whether the list_head is empty.
      
      While at it, update the comment of cgroup_task_count() to indicate
      that the returned value may be higher than the number of tasks, which
      has always been true due to temporary references and doesn't break
      anything.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarEvgeny Vereshchagin <evvers@ya.ru>
      Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Cc: Aditya Kali <adityakali@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
      9157056d
  8. 22 Sep, 2016 1 commit
  9. 19 Sep, 2016 2 commits
  10. 16 Sep, 2016 1 commit
  11. 13 Sep, 2016 1 commit
    • Joonwoo Park's avatar
      cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Joonwoo Park authored
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      cpu_online_mask.
      
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
      
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
      
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
      
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
      
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         <snip>
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      cpu_online_mask.
      Signed-off-by: default avatarJoonwoo Park <joonwoop@codeaurora.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      28b89b9e
  12. 10 Sep, 2016 2 commits
    • Alexander Shishkin's avatar
      perf/core: Fix aux_mmap_count vs aux_refcount order · b79ccadd
      Alexander Shishkin authored
      The order of accesses to ring buffer's aux_mmap_count and aux_refcount
      has to be preserved across the users, namely perf_mmap_close() and
      perf_aux_output_begin(), otherwise the inversion can result in the latter
      holding the last reference to the aux buffer and subsequently free'ing
      it in atomic context, triggering a warning.
      
      > ------------[ cut here ]------------
      > WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
      > CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
      > Call Trace:
      >  [<ffffffff810f3e0b>] __warn+0xcb/0xf0
      >  [<ffffffff810f3f3d>] warn_slowpath_null+0x1d/0x20
      >  [<ffffffff8121182a>] __rb_free_aux+0x11a/0x130
      >  [<ffffffff812127a8>] rb_free_aux+0x18/0x20
      >  [<ffffffff81212913>] perf_aux_output_begin+0x163/0x1e0
      >  [<ffffffff8100c33a>] bts_event_start+0x3a/0xd0
      >  [<ffffffff8100c42d>] bts_event_add+0x5d/0x80
      >  [<ffffffff81203646>] event_sched_in.isra.104+0xf6/0x2f0
      >  [<ffffffff8120652e>] group_sched_in+0x6e/0x190
      >  [<ffffffff8120694e>] ctx_sched_in+0x2fe/0x5f0
      >  [<ffffffff81206ca0>] perf_event_sched_in+0x60/0x80
      >  [<ffffffff81206d1b>] ctx_resched+0x5b/0x90
      >  [<ffffffff81207281>] __perf_event_enable+0x1e1/0x240
      >  [<ffffffff81200639>] event_function+0xa9/0x180
      >  [<ffffffff81202000>] ? perf_cgroup_attach+0x70/0x70
      >  [<ffffffff8120203f>] remote_function+0x3f/0x50
      >  [<ffffffff811971f3>] flush_smp_call_function_queue+0x83/0x150
      >  [<ffffffff81197bd3>] generic_smp_call_function_single_interrupt+0x13/0x60
      >  [<ffffffff810a6477>] smp_call_function_single_interrupt+0x27/0x40
      >  [<ffffffff81a26ea9>] call_function_single_interrupt+0x89/0x90
      >  [<ffffffff81120056>] finish_task_switch+0xa6/0x210
      >  [<ffffffff81120017>] ? finish_task_switch+0x67/0x210
      >  [<ffffffff81a1e83d>] __schedule+0x3dd/0xb50
      >  [<ffffffff81a1efe5>] schedule+0x35/0x80
      >  [<ffffffff81128031>] sys_sched_yield+0x61/0x70
      >  [<ffffffff81a25be5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      > ---[ end trace 6235f556f5ea83a9 ]---
      
      This patch puts the checks in perf_aux_output_begin() in the same order
      as that of perf_mmap_close().
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b79ccadd
    • Alexander Shishkin's avatar
      perf/core: Fix a race between mmap_close() and set_output() of AUX events · 767ae086
      Alexander Shishkin authored
      In the mmap_close() path we need to stop all the AUX events that are
      writing data to the AUX area that we are unmapping, before we can
      safely free the pages. To determine if an event needs to be stopped,
      we're comparing its ->rb against the one that's getting unmapped.
      However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
      and swizzle event::rb to some other ring buffer, but the transaction
      will keep writing data to the old ring buffer until the event gets
      scheduled out. At this point, mmap_close() will skip over such an
      event and will proceed to free the AUX area, while it's still being
      used by this event, which will set off a warning in the mmap_close()
      path and cause a memory corruption.
      
      To avoid this, always stop an AUX event before its ->rb is updated;
      this will release the (potentially) last reference on the AUX area
      of the buffer. If the event gets restarted, its new ring buffer will
      be used. If another SET_OUTPUT comes and switches it back to the
      old ring buffer that's getting unmapped, it's also fine: this
      ring buffer's aux_mmap_count will be zero and AUX transactions won't
      start any more.
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      767ae086
  13. 09 Sep, 2016 1 commit
    • Dan Williams's avatar
      mm: fix cache mode of dax pmd mappings · 9049771f
      Dan Williams authored
      track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
      uncacheable rendering them impractical for application usage.  DAX-pte
      mappings are cached and the goal of establishing DAX-pmd mappings is to
      attain more performance, not dramatically less (3 orders of magnitude).
      
      track_pfn_insert() relies on a previous call to reserve_memtype() to
      establish the expected page_cache_mode for the range.  While memremap()
      arranges for reserve_memtype() to be called, devm_memremap_pages() does
      not.  So, teach track_pfn_insert() and untrack_pfn() how to handle
      tracking without a vma, and arrange for devm_memremap_pages() to
      establish the write-back-cache reservation in the memtype tree.
      
      Cc: <stable@vger.kernel.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reported-by: default avatarKai Zhang <kai.ka.zhang@oracle.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      9049771f
  14. 05 Sep, 2016 3 commits
    • Tejun Heo's avatar
      PM / QoS: avoid calling cancel_delayed_work_sync() during early boot · c86d06ba
      Tejun Heo authored
      of_clk_init() ends up calling into pm_qos_update_request() very early
      during boot where irq is expected to stay disabled.
      pm_qos_update_request() uses cancel_delayed_work_sync() which
      correctly assumes that irq is enabled on invocation and
      unconditionally disables and re-enables it.
      
      Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
      enabling irq unexpectedly during early boot.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarQiao Zhou <qiaozhou@asrmicro.com>
      Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.comSigned-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c86d06ba
    • Balbir Singh's avatar
      sched/core: Fix a race between try_to_wake_up() and a woken up task · 135e8c92
      Balbir Singh authored
      The origin of the issue I've seen is related to
      a missing memory barrier between check for task->state and
      the check for task->on_rq.
      
      The task being woken up is already awake from a schedule()
      and is doing the following:
      
      	do {
      		schedule()
      		set_current_state(TASK_(UN)INTERRUPTIBLE);
      	} while (!cond);
      
      The waker, actually gets stuck doing the following in
      try_to_wake_up():
      
      	while (p->on_cpu)
      		cpu_relax();
      
      Analysis:
      
      The instance I've seen involves the following race:
      
       CPU1					CPU2
      
       while () {
         if (cond)
           break;
         do {
           schedule();
           set_current_state(TASK_UN..)
         } while (!cond);
      					wakeup_routine()
      					  spin_lock_irqsave(wait_lock)
         raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
       }					  try_to_wake_up()
       set_current_state(TASK_RUNNING);	  ..
       list_del(&waiter.list);
      
      CPU2 wakes up CPU1, but before it can get the wait_lock and set
      current state to TASK_RUNNING the following occurs:
      
       CPU3
       wakeup_routine()
       raw_spin_lock_irqsave(wait_lock)
       if (!list_empty)
         wake_up_process()
         try_to_wake_up()
         raw_spin_lock_irqsave(p->pi_lock)
         ..
         if (p->on_rq && ttwu_wakeup())
         ..
         while (p->on_cpu)
           cpu_relax()
         ..
      
      CPU3 tries to wake up the task on CPU1 again since it finds
      it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
      after CPU2, CPU3 got it.
      
      CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
      the task is spinning on the wait_lock. Interestingly since p->on_rq
      is checked under pi_lock, I've noticed that try_to_wake_up() finds
      p->on_rq to be 0. This was the most confusing bit of the analysis,
      but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
      check is not reliable without this fix IMHO. The race is visible
      (based on the analysis) only when ttwu_queue() does a remote wakeup
      via ttwu_queue_remote. In which case the p->on_rq change is not
      done uder the pi_lock.
      
      The result is that after a while the entire system locks up on
      the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
      
      Reproduction of the issue:
      
      The issue can be reproduced after a long run on my system with 80
      threads and having to tweak available memory to very low and running
      memory stress-ng mmapfork test. It usually takes a long time to
      reproduce. I am trying to work on a test case that can reproduce
      the issue faster, but thats work in progress. I am still testing the
      changes on my still in a loop and the tests seem OK thus far.
      
      Big thanks to Benjamin and Nick for helping debug this as well.
      Ben helped catch the missing barrier, Nick caught every missing
      bit in my theory.
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      [ Updated comment to clarify matching barriers. Many
        architectures do not have a full barrier in switch_to()
        so that cannot be relied upon. ]
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      135e8c92
    • Peter Zijlstra's avatar
      perf/core: Remove WARN from perf_event_read() · 58763148
      Peter Zijlstra authored
      This effectively reverts commit:
      
        71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      
      ... and puts in a comment explaining why we ignore the return value.
      Reported-by: default avatarVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 71e7bc2b ("perf/core: Check return value of the perf_event_read() IPI")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      58763148
  15. 02 Sep, 2016 1 commit
    • Wanpeng Li's avatar
      tick/nohz: Fix softlockup on scheduler stalls in kvm guest · 08d07259
      Wanpeng Li authored
      tick_nohz_start_idle() is prevented to be called if the idle tick can't 
      be stopped since commit 1f3b0f82 ("tick/nohz: Optimize nohz idle 
      enter"). As a result, after suspend/resume the host machine, full dynticks 
      kvm guest will softlockup:
      
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
       Call Trace:
        default_idle+0x31/0x1a0
        arch_cpu_idle+0xf/0x20
        default_idle_call+0x2a/0x50
        cpu_startup_entry+0x39b/0x4d0
        rest_init+0x138/0x140
        ? rest_init+0x5/0x140
        start_kernel+0x4c1/0x4ce
        ? set_init_arg+0x55/0x55
        ? early_idt_handler_array+0x120/0x120
        x86_64_start_reservations+0x24/0x26
        x86_64_start_kernel+0x142/0x14f
      
      In addition, cat /proc/stat | grep cpu in guest or host:
      
      cpu  398 16 5049 15754 5490 0 1 46 0 0
      cpu0 206 5 450 0 0 0 1 14 0 0
      cpu1 81 0 3937 3149 1514 0 0 9 0 0
      cpu2 45 6 332 6052 2243 0 0 11 0 0
      cpu3 65 2 328 6552 1732 0 0 11 0 0
      
      The idle and iowait states are weird 0 for cpu0(housekeeping). 
      
      The bug is present in both guest and host kernels, and they both have 
      cpu0's idle and iowait states issue, however, host kernel's suspend/resume 
      path etc will touch watchdog to avoid the softlockup.
      
      - The watchdog will not be touched in tick_nohz_stop_idle path (need be 
        touched since the scheduler stall is expected) if idle_active flags are 
        not detected.
      - The idle and iowait states will not be accounted when exit idle loop 
        (resched or interrupt) if idle start time and idle_active flags are 
        not set. 
      
      This patch fixes it by reverting commit 1f3b0f82 since can't stop 
      idle tick doesn't mean can't be idle.
      
      Fixes: 1f3b0f82 ("tick/nohz: Optimize nohz idle enter")
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Cc: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com>
      Cc: Gaurav Jindal<gaurav.jindal@spreadtrum.com>
      Cc: stable@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      08d07259
  16. 01 Sep, 2016 5 commits
  17. 31 Aug, 2016 2 commits
  18. 30 Aug, 2016 1 commit