1. 25 Oct, 2016 23 commits
    • Charlie Jacobsen's avatar
      Starting code for IPC registers and access. · e51d81aa
      Charlie Jacobsen authored
      Set up new header files, under include/lcd-domains/
      -- lcd-domains.h: main include, contains struct lcd
      -- ipc.h: struct lcd_ipc_regs, for message registers
      Updated virt/lcd/lcd-domains.c to use new headers.
      Updated arch-dep code to use new struct lcd_ipc_regs.
      struct lcd_arch contains a pointer to the allocated
      page for stack / ipc registers. struct lcd (arch-indep)
      contains a pointer to the same memory (so that the
      arch-indep code can access the ipc regs directly if
      it wishes).
      Message registers should be accessed through arch-dep
      macros (to be implemented next) for portability and
      speed (some of the message registers will be
      implemented using machine regs, so the message registers
      in struct lcd_ipc_regs are `shadows').
      Message register design based on seL4. See seL4 manual,
    • Charlie Jacobsen's avatar
      Simple stack initialization code (untested). · 93be900b
      Charlie Jacobsen authored
      Stack / ipc registers buffer initialized and mapped
      in guest physical.
    • Charlie Jacobsen's avatar
      Added GDT and TSS guest physical mapping code (untested). · 543ccab0
      Charlie Jacobsen authored
      -- simple routine combining effects of ept walk and set
      -- part of arch-dep public interface
      Added mapping code to gdt init and tss init, and some
      documentation for those routines.
      Starting code for stack initialization (should be
    • Charlie Jacobsen's avatar
      GDT initialization code in place, and desc init code (untested). · fa0097b0
      Charlie Jacobsen authored
      -- load base, limit, type, etc. into a segment descriptor
      -- loads base, limit, etc. for code, data, and tss segment
         descriptors in gdt
    • Charlie Jacobsen's avatar
      Segment and desc table regs, address space layout in place (untested). · c4780c3c
      Charlie Jacobsen authored
      Address space layout includes tss, gdt, ipc registers, and small
      stack. See lcd-domains-arch.h.
      -- a tss may be required (not sure) while running in non-root,
         even though a stack switch does not occur
      -- a gdt may also be required (even though all info is written in
         the hidden part of the segment registers); again, not sure
      4 KBs is reserved for an IDT if it is needed (not mapped or
      GDT layout given in lcd-domains-arch.h. (GDT build code to
      be implemented / copied over next.)
      LDT is not used (so no need to load access rights, etc.). It
      is marked as unusable.
      Fixed segment register limit fields. These must be 32 bits and
      are always byte granularity. The granularity field in the
      access rights bits is confusing (see Intel SDM V3
    • Charlie Jacobsen's avatar
      EPT deallocation code in place (untested). · c9cb61a2
      Charlie Jacobsen authored
      -- frees all memory associated with extended
         page tables (paging structures and mapped
         physical mem)
      -- frees all memory associated with an epte
         at a level in the hierarchy
      -- uses shallow recursion to make the code
         more readable
      Simple updates to some of the EPT macros.
    • Charlie Jacobsen's avatar
      Finished arch-dep ept code. · 8f65d678
      Charlie Jacobsen authored
      lcd_arch_epte_t type for arch abstraction.
      -- simple lookup of ept entry
      -- optionally allocate ept data structures
         along the way
      -- set the host physical address in the
         (final level) ept entry, along with
         default flags
      -- returns host physical address stored in
         an ept entry
      Remaining old code will be put in arch-indep
    • Charlie Jacobsen's avatar
      Loads / stores to cr3 now handled (untested). · 40de4ae2
      Charlie Jacobsen authored
      This is necessary for e.g. the emulab machines (loads /
      stores to cr3 are not allowed in non-root, so must be
      handled by hypervisor). Code simply copies values between
      fields in lcd data structure.
    • Charlie Jacobsen's avatar
      Simple EPT fault handling code in place (untested). · c37435a0
      Charlie Jacobsen authored
      Removed the `auto' memory alloc and map from the
      original handler. The new handler is simple for now;
      it just reads the guest virtual and physical addresses
      involved in the fault. The arch-indepent code will be
      responsible for deciding what to do.
    • Charlie Jacobsen's avatar
      External interrupt code in place (untested). · 1abf5ea6
      Charlie Jacobsen authored
      -- pretty much a straight copy over of the old code,
         but with comments
      -- one big difference: interrupts are assumed to be
         enabled when this routine is called (I can't see
         how kvm is allowing the handling of external interrupts
         because it disables them when it enters vmx non-root.
         See the kvm code in x86.c:vcpu_enter_guest.)
    • Charlie Jacobsen's avatar
      Set up simple lcd run (no loop) and some handling (untested). · 070e2688
      Charlie Jacobsen authored
      -- disables kernel preemption while lcd is running
      -- simple switch on vmx exit conditions
      -- for nmi's and exceptions generated by lcd
      -- for `hardware exceptions': page faults, traps,
         machine checks
    • Charlie Jacobsen's avatar
      Simple re-naming to arch-agnostic names for arch-dep interface. · c8a88195
      Charlie Jacobsen authored
      -- Moved some vmx-specific data structures into implementation file.
      -- lcd_vmx_* => lcd_arch_*
      -- updated virt/lcd/lcd-domains.c
    • Charlie Jacobsen's avatar
    • Charlie Jacobsen's avatar
    • Charlie Jacobsen's avatar
      Added header doc to lcd-vmx.h and lcd_vmx_destroy. · 4cdb5e83
      Charlie Jacobsen authored
      lcd_destroy => lcd_vmx_destroy. Pretty much a straight
      copy over, but removed some unneeded code.
    • Charles Jacobsen's avatar
      Fixed vmcs configuration bugs (small macro bugs in vmx.h). · aad218b5
      Charles Jacobsen authored
      Debug controls macros for vm exit and vm entry were
      wrong (I wonder if the kvm guys know, it's in the stable
      linux build).
      I had to enable exiting on load / save to %cr3 for it
      to run on emulab machines.
      Tweaked vmx control debugging code, cleaned up
      logic in adjust_vmx_controls, much simpler.
    • Charles Jacobsen's avatar
      Fixed bugs in lcd_vmx_create and dependencies. Clean build. · 4d27a893
      Charles Jacobsen authored
      Conditional compilation on number of autoload msr's.
    • Charlie Jacobsen's avatar
      Finished lcd_vmx_create and its dependencies (untested). · 7607ec9a
      Charlie Jacobsen authored
      vmx_setup_vmcs ==> vmx_setup_vmcs_guest_settings and
      -- execution control (e.g., interrupt handling)
      -- ept pointer
      -- %cr0 and %cr4 access control
      -- initial %cr0, %cr4
      -- segmentation--bases, limits, selectors
      -- guest EFER (long mode enabled, no syscall/sysret)
      -- initial activity and interrupt state
      -- control and segmentation regs
      -- host EFER
      -- no saving of syscall/sysret msrs since these are
         disabled in guest
      -- no page attribute table
    • Charlie Jacobsen's avatar
      About half way done with vmcs initialization code. · acbb9533
      Charlie Jacobsen authored
      lcd_create --> lcd_vmx_create, with a lot of code
      re-factored or removed for now, to keep it simple
      (no gdt, idt, isr, paging bitmap, address space
      init, etc.).
      -- basic ept initialization
      -- vmcs loading on a cpu
         -- re-factored __vmx_setup_cpu to use built-in
            segment descriptor access functions in desc.h
         -- removed host sys_enter storage, since this
            msr is disabled right now anyway
         -- more doc to understand vmcs load process
    • Charles Jacobsen's avatar
      Successful build with lcd_vmx_init and lcd_vmx_exit. · 62e4ac37
      Charles Jacobsen authored
      Added straight copy from old code of lcd_vmx_exit.
      Shifted lcd_vmx_init and lcd_vmx_exit to
      arch/x86/include/asm/lcd-vmx.h. Ideally, if we want
      this to be arch-independent, probably want to change
      header to asm/lcd.h, and routines to lcd_arch_init
      and lcd_arch_exit, or something similar.
    • Charles Jacobsen's avatar
      Fixed build system for lcd, and most compiler errors. · 7c05c7a0
      Charles Jacobsen authored
      Two components to the lcd build now:
      -- arch/x86/lcd/Makefile: for building x86 lcd code
      -- virt/lcd/Makefile: for building arch-indep lcd code
      Modified the build system just slightly for a cleaner
      -- virt/ directory treated like ipc/, usr/, etc. directories
      -- added Kconfig and Makefile to virt/, mirroring drivers/
      -- updated top-level Makefile to include virt/ as vmlinux
         directory / dependency, so build system will recur into
      -- updated arch/x86/Kconfig to include virt/Kconfig, so it
         will be included as a menu item
      -- updated arch/x86/Kbuild to include arch/x86/lcd/
      Removed old capabilities code in cap/.
      Removed lcd syscall.
      Temporarily turned off build for drivers/lcd.
      Fixed most bugs in lcd-vmx (still need to do lcd_vmx_exit).
      -- minor naming issues in lcd-vmx.h
      -- straight copy over of vmx_disable_intercepts_for_msr,
         but with more doc
      -- removed VMX_EPT_INDIVIDUAL_ADDR macro from vmx.h (where
         did this come from? it's not documented in the intel manual,
         nor is it used in kvm)
    • Charlie Jacobsen's avatar
      Finished lcd_vmx_init and its dependencies. · 18122896
      Charlie Jacobsen authored
      Added a few missing macros to arch/x86/include/vmx.h,
      and RESERVED masks for easily determining which bits
      in a vmx control are reserved (needed in adjust_vmx_controls).
      Re-factored setup_vmcs_config and adjust_vmx_controls.
      setup_vmcs_config does pretty much the same thing, but it
      fails immediately if a control isn't available --
      adjust_vmx_controls confirms that the exact desired
      controls are available, and sets the reserved bits to
      1 or 0 as needed. Cleaner comments and organization.
      Re-factored the vmx basic settings to
      Removed some of the vmx feature check code that was in
      the original lcd_vmx_init, as setup_vmcs_config now does
      Essentially a straight copy over of:
      -- __vmx_enable
      -- vmx_enable
      -- vmx_disable
      -- vmx_free_vmxon_areas
      -- __vmxon
      -- __vmxoff
      The only difference is I shifted tbl and cache
      invalidation to vmx_enable (originally in __vmx_enable)
      and added some comments.
      Straight copy over of
      -- vmx_alloc_vmcs
      -- vmx_free_vmcs
      -- invvpid, invept code, with slight renaming
    • Charlie Jacobsen's avatar
      Starting a fresh lcd-vmx arch-dependent interface. · 8a6ad472
      Charlie Jacobsen authored
      Arch-dependent code will go in arch/x86/lcd, and the
      header(s) will reside in arch/x86/include/asm.
      For now, I have only moved some of the arch-dependent
      junk that was in include/lcd/lcd.h into
  2. 16 Oct, 2016 3 commits
  3. 29 Sep, 2016 1 commit
  4. 15 Sep, 2016 1 commit
  5. 06 Sep, 2016 1 commit
    • Kees Cook's avatar
      x86/uaccess: force copy_*_user() to be inlined · e6971009
      Kees Cook authored
      As already done with __copy_*_user(), mark copy_*_user() as __always_inline.
      Without this, the checks for things like __builtin_const_p() won't work
      consistently in either hardened usercopy nor the recent adjustments for
      detecting usercopy overflows at compile time.
      The change in kernel text size is detectable, but very small:
       text      data     bss     dec      hex     filename
      12118735  5768608 14229504 32116847 1ea106f vmlinux.before
      12120207  5768608 14229504 32118319 1ea162f vmlinux.after
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  6. 30 Aug, 2016 1 commit
    • Josh Poimboeuf's avatar
      mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS · 0d025d27
      Josh Poimboeuf authored
      There are three usercopy warnings which are currently being silenced for
      gcc 4.6 and newer:
      1) "copy_from_user() buffer size is too small" compile warning/error
         This is a static warning which happens when object size and copy size
         are both const, and copy size > object size.  I didn't see any false
         positives for this one.  So the function warning attribute seems to
         be working fine here.
         Note this scenario is always a bug and so I think it should be
         changed to *always* be an error, regardless of
      2) "copy_from_user() buffer size is not provably correct" compile warning
         This is another static warning which happens when I enable
         __compiletime_object_size() for new compilers (and
         CONFIG_DEBUG_STRICT_USER_COPY_CHECKS).  It happens when object size
         is const, but copy size is *not*.  In this case there's no way to
         compare the two at build time, so it gives the warning.  (Note the
         warning is a byproduct of the fact that gcc has no way of knowing
         whether the overflow function will be called, so the call isn't dead
         code and the warning attribute is activated.)
         So this warning seems to only indicate "this is an unusual pattern,
         maybe you should check it out" rather than "this is a bug".
         I get 102(!) of these warnings with allyesconfig and the
         __compiletime_object_size() gcc check removed.  I don't know if there
         are any real bugs hiding in there, but from looking at a small
         sample, I didn't see any.  According to Kees, it does sometimes find
         real bugs.  But the false positive rate seems high.
      3) "Buffer overflow detected" runtime warning
         This is a runtime warning where object size is const, and copy size >
         object size.
      All three warnings (both static and runtime) were completely disabled
      for gcc 4.6 with the following commit:
        2fb0815c ("gcc4: disable __compiletime_object_size for GCC 4.6+")
      That commit mistakenly assumed that the false positives were caused by a
      gcc bug in __compiletime_object_size().  But in fact,
      __compiletime_object_size() seems to be working fine.  The false
      positives were instead triggered by #2 above.  (Though I don't have an
      explanation for why the warnings supposedly only started showing up in
      gcc 4.6.)
      So remove warning #2 to get rid of all the false positives, and re-enable
      warnings #1 and #3 by reverting the above commit.
      Furthermore, since #1 is a real bug which is detected at compile time,
      upgrade it to always be an error.
      Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 11 Aug, 2016 4 commits
    • Andy Lutomirski's avatar
      x86/boot: Rework reserve_real_mode() to allow multiple tries · 5ff3e2c3
      Andy Lutomirski authored
      If reserve_real_mode() fails, panicing immediately means we're
      doomed.  Make it safe to try more than once to allocate the
       - Degrade a failure from panic() to pr_info().  (If we make it to
         setup_real_mode() without reserving the trampoline, we'll panic
       - Factor out helpers so that platform code can supply a specific
         address to try.
       - Warn if reserve_real_mode() is called after we're done with the
         memblock allocator.  If that were to happen, we would behave
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/876e383038f3e9971aa72fd20a4f5da05f9d193d.1470821230.git.luto@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Andy Lutomirski's avatar
      x86/boot: Defer setup_real_mode() to early_initcall time · d0de0f68
      Andy Lutomirski authored
      There's no need to run setup_real_mode() as early as we run it.
      Defer it to the same early_initcall that sets up the page
      permissions for the real mode code.
      This should be a code size reduction.  More importantly, it give us
      a longer window in which we can allocate the real mode trampoline.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Aaron Lu's avatar
      x86/irq: Do not substract irq_tlb_count from irq_call_count · 82ba4fac
      Aaron Lu authored
      Since commit:
        52aec330 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
      the TLB remote shootdown is done through call function vector. That
      commit didn't take care of irq_tlb_count, which a later commit:
        fd0f5869 ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
      ... tried to fix.
      The fix assumes every increase of irq_tlb_count has a corresponding
      increase of irq_call_count. So the irq_call_count is always bigger than
      irq_tlb_count and we could substract irq_tlb_count from irq_call_count.
      Unfortunately this is not true for the smp_call_function_single() case.
      The IPI is only sent if the target CPU's call_single_queue is empty when
      adding a csd into it in generic_exec_single. That means if two threads
      are both adding flush tlb csds to the same CPU's call_single_queue, only
      one IPI is sent. In other words, the irq_call_count is incremented by 1
      but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
      bigger than irq_call_count and the substract will produce a very large
      irq_call_count value due to overflow.
      Considering that:
        1) it's not worth to send more IPIs for the sake of accurate counting of
           irq_call_count in generic_exec_single();
        2) it's not easy to tell if the call function interrupt is for TLB
           shootdown in __smp_call_function_single_interrupt().
      Not to exclude TLB shootdown from call function count seems to be the
      simplest fix and this patch just does that.
      This bug was found by LKP's cyclic performance regression tracking recently
      with the vm-scalability test suite. I have bisected to commit:
        3dec0ba0 ("mm/rmap: share the i_mmap_rwsem")
      This commit didn't do anything wrong but revealed the irq_call_count
      problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
      concurrent with multiple threads.  When remap_one is try_to_unmap_one(),
      then multiple threads could queue flush TLB to the same CPU but only
      one IPI will be sent.
      Since the commit was added in Linux v3.19, the counting problem only
      shows up from v3.19 onwards.
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
      Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Dave Hansen's avatar
      x86/mm: Fix swap entry comment and macro · ace7fab7
      Dave Hansen authored
      A recent patch changed the format of a swap PTE.
      The comment explaining the format of the swap PTE is wrong about
      the bits used for the swap type field.  Amusingly, the ASCII art
      and the patch description are correct, but the comment itself
      is wrong.
      As I was looking at this, I also noticed that the
      SWP_OFFSET_FIRST_BIT has an off-by-one error.  This does not
      really hurt anything.  It just wasted a bit of space in the PTE,
      giving us 2^59 bytes of addressable space in our swapfiles
      instead of 2^60.  But, it doesn't match with the comments, and it
      wastes a bit of space, so fix it.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Fixes: 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work around erratum")
      Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
  8. 10 Aug, 2016 3 commits
    • Mike Travis's avatar
      x86/platform/UV: Fix problem with UV4 BIOS providing incorrect PXM values · 22ac2bca
      Mike Travis authored
      There are some circumstances where the UV4 BIOS cannot provide the
      correct Proximity Node values to associate with specific Sockets and
      Physical Nodes.  The decision was made to remove these values from BIOS
      and for the kernel to get these values from the standard ACPI tables.
      Tested-by: default avatarFrank Ramsay <framsay@sgi.com>
      Tested-by: default avatarJohn Estabrook <estabrook@sgi.com>
      Signed-off-by: default avatarMike Travis <travis@sgi.com>
      Reviewed-by: default avatarDimitri Sivanich <sivanich@sgi.com>
      Reviewed-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Sebastian Andrzej Siewior's avatar
      x86/mm: Disable preemption during CR3 read+write · 5cf0791d
      Sebastian Andrzej Siewior authored
      There's a subtle preemption race on UP kernels:
      Usually current->mm (and therefore mm->pgd) stays the same during the
      lifetime of a task so it does not matter if a task gets preempted during
      the read and write of the CR3.
      But then, there is this scenario on x86-UP:
      TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:
       -> mmput()
       -> exit_mmap()
       -> tlb_finish_mmu()
       -> tlb_flush_mmu()
       -> tlb_flush_mmu_tlbonly()
       -> tlb_flush()
       -> flush_tlb_mm_range()
       -> __flush_tlb_up()
       -> __flush_tlb()
       ->  __native_flush_tlb()
      At this point current->mm is NULL but current->active_mm still points to
      the "old" mm.
      Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
      own mm so CR3 has changed.
      Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
      mm and so CR3 remains unchanged. Once taskA gets active it continues
      where it was interrupted and that means it writes its old CR3 value
      back. Everything is fine because userland won't need its memory
      Now the fun part:
      Let's preempt taskA one more time and get back to taskB. This
      time switch_mm() won't do a thing because oldmm (->active_mm)
      is the same as mm (as per context_switch()). So we remain
      with a bad CR3 / PGD and return to userland.
      The next thing that happens is handle_mm_fault() with an address for
      the execution of its code in userland. handle_mm_fault() realizes that
      it has a PTE with proper rights so it returns doing nothing. But the
      CPU looks at the wrong PGD and insists that something is wrong and
      faults again. And again. And one more time…
      This pagefault circle continues until the scheduler gets tired of it and
      puts another task on the CPU. It gets little difficult if the task is a
      RT task with a high priority. The system will either freeze or it gets
      fixed by the software watchdog thread which usually runs at RT-max prio.
      But waiting for the watchdog will increase the latency of the RT task
      which is no good.
      Fix this by disabling preemption across the critical code section.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
      [ Prettified the changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Nicolai Stange's avatar
      x86/timers/apic: Inform TSC deadline clockevent device about recalibration · 6731b0d6
      Nicolai Stange authored
      This patch eliminates a source of imprecise APIC timer interrupts,
      which imprecision may result in double interrupts or even late
      The TSC deadline clockevent devices' configuration and registration
      happens before the TSC frequency calibration is refined in
      This results in the TSC clocksource and the TSC deadline clockevent
      devices being configured with slightly different frequencies: the former
      gets the refined one and the latter are configured with the inaccurate
      frequency detected earlier by means of the "Fast TSC calibration using PIT".
      Within the APIC code, introduce the notifier function
      lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline
      clockevent devices with the current tsc_khz.
      Call it from the TSC code after TSC calibration refinement has happened.
      Signed-off-by: default avatarNicolai Stange <nicstange@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christopher S. Hall <christopher.s.hall@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20160714152255.18295-3-nicstange@gmail.com
      [ Pushed #ifdef CONFIG_X86_LOCAL_APIC into header, improved changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  9. 08 Aug, 2016 2 commits
    • Rafael J. Wysocki's avatar
      x86/power/64: Always create temporary identity mapping correctly · e4630fdd
      Rafael J. Wysocki authored
      The low-level resume-from-hibernation code on x86-64 uses
      kernel_ident_mapping_init() to create the temoprary identity mapping,
      but that function assumes that the offset between kernel virtual
      addresses and physical addresses is aligned on the PGD level.
      However, with a randomized identity mapping base, it may be aligned
      on the PUD level and if that happens, the temporary identity mapping
      created by set_up_temporary_mappings() will not reflect the actual
      kernel identity mapping and the image restoration will fail as a
      result (leading to a kernel panic most of the time).
      To fix this problem, rework kernel_ident_mapping_init() to support
      unaligned offsets between KVA and PA up to the PMD level and make
      set_up_temporary_mappings() use it as approprtiate.
      Reported-and-tested-by: default avatarThomas Garnier <thgarnie@google.com>
      Reported-by: default avatarBorislav Petkov <bp@suse.de>
      Suggested-by: default avatarYinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: default avatarYinghai Lu <yinghai@kernel.org>
    • Linus Torvalds's avatar
      unsafe_[get|put]_user: change interface to use a error target label · 1bd4403d
      Linus Torvalds authored
      When I initially added the unsafe_[get|put]_user() helpers in commit
      5b24a7a2 ("Add 'unsafe' user access functions for batched
      accesses"), I made the mistake of modeling the interface on our
      traditional __[get|put]_user() functions, which return zero on success,
      or -EFAULT on failure.
      That interface is fairly easy to use, but it's actually fairly nasty for
      good code generation, since it essentially forces the caller to check
      the error value for each access.
      In particular, since the error handling is already internally
      implemented with an exception handler, and we already use "asm goto" for
      various other things, we could fairly easily make the error cases just
      jump directly to an error label instead, and avoid the need for explicit
      checking after each operation.
      So switch the interface to pass in an error label, rather than checking
      the error value in the caller.  Best do it now before we start growing
      more users (the signal handling code in particular would be a good place
      to use the new interface).
      So rather than
      	if (unsafe_get_user(x, ptr))
      		... handle error ..
      the interface is now
      	unsafe_get_user(x, ptr, label);
      where an error during the user mode fetch will now just cause a jump to
      'label' in the caller.
      Right now the actual _implementation_ of this all still ends up being a
      "if (err) goto label", and does not take advantage of any exception
      label tricks, but for "unsafe_put_user()" in particular it should be
      fairly straightforward to convert to using the exception table model.
      Note that "unsafe_get_user()" is much harder to convert to a clever
      exception table model, because current versions of gcc do not allow the
      use of "asm goto" (for the exception) with output values (for the actual
      value to be fetched).  But that is hopefully not a limitation in the
      long term.
      [ Also note that it might be a good idea to switch unsafe_get_user() to
        actually _return_ the value it fetches from user space, but this
        commit only changes the error handling semantics ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 04 Aug, 2016 1 commit
    • Krzysztof Kozlowski's avatar
      dma-mapping: use unsigned long for dma_attrs · 00085f1e
      Krzysztof Kozlowski authored
      The dma-mapping core and the implementations do not change the DMA
      attributes passed by pointer.  Thus the pointer can point to const data.
      However the attributes do not have to be a bitfield.  Instead unsigned
      long will do fine:
      1. This is just simpler.  Both in terms of reading the code and setting
         attributes.  Instead of initializing local attributes on the stack
         and passing pointer to it to dma_set_attr(), just set the bits.
      2. It brings safeness and checking for const correctness because the
         attributes are passed by value.
      Semantic patches for this change (at least most of them):
          virtual patch
          virtual context
          identifier f, attrs;
          - struct dma_attrs *attrs
          + unsigned long attrs
          , ...)
          identifier r.f;
          - NULL
          + 0
          // Options: --all-includes
          virtual patch
          virtual context
          identifier f, attrs;
          type t;
          t f(..., struct dma_attrs *attrs);
          identifier r.f;
          - NULL
          + 0
      Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.comSigned-off-by: default avatarKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Acked-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Acked-by: default avatarHans-Christian Noren Egtvedt <egtvedt@samfundet.no>
      Acked-by: Mark Salter <msalter@redhat.com> [c6x]
      Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
      Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
      Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
      Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
      Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
      Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
      Acked-by: default avatarBjorn Andersson <bjorn.andersson@linaro.org>
      Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
      Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
      Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>