1. 20 Mar, 2015 1 commit
    • arm64: efi: don't restore TTBR0 if active_mm points at init_mm · 130c93fd
      Will Deacon authored
      init_mm isn't a normal mm: it has swapper_pg_dir as its pgd (which
      contains kernel mappings) and is used as the active_mm for the idle
      thread.
      When restoring the pgd after an EFI call, we write current->active_mm
      into TTBR0. If the current task is actually the idle thread (e.g. when
      initialising the EFI RTC before entering userspace), then the TLB can
      erroneously populate itself with junk global entries as a result of
      speculative table walks.
      When we do eventually return to userspace, the task can end up hitting
      these junk mappings leading to lockups, corruption or crashes.
      This patch fixes the problem in the same way as the CPU suspend code by
      ensuring that we never switch to the init_mm in efi_set_pgd and instead
      point TTBR0 at the zero page. A check is also added to cpu_switch_mm to
      BUG if we get passed swapper_pg_dir.
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Fixes: f3cdfd23 ("arm64/efi: move SetVirtualAddressMap() to UEFI stub")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
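      A minimal sketch of the resulting efi_set_pgd() logic (condensed from
      the description above; cpu_set_reserved_ttbr0() and cpu_switch_mm()
      are existing arm64 helpers, but the body is illustrative rather than
      the verbatim patch):

        static void efi_set_pgd(struct mm_struct *mm)
        {
                if (mm == &init_mm)
                        /* never load swapper_pg_dir into TTBR0; point it
                         * at the zero page so speculative walks find no
                         * global entries to cache */
                        cpu_set_reserved_ttbr0();
                else
                        cpu_switch_mm(mm->pgd, mm);
        }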
  2. 14 Mar, 2015 3 commits
  3. 06 Mar, 2015 1 commit
    • arm64: Don't use is_module_addr in setting page attributes · 8b5f5a07
      Laura Abbott authored
      The set_memory_* functions currently only support module
      addresses. The addresses are validated using is_module_addr.
      That function is special though and relies on internal state
      in the module subsystem to work properly. At the time of
      module initialization and calling set_memory_*, it's too early
      for is_module_addr to work properly so it always returns
      false. Rather than be subject to the whims of the module state,
      just bounds check against the module virtual address range.
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
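      Roughly, the validation becomes a plain range check (a sketch using
      the arm64 MODULES_VADDR/MODULES_END constants; names and error
      handling are illustrative):

        static int change_memory_common(unsigned long addr, int numpages,
                                        pgprot_t set_mask, pgprot_t clear_mask)
        {
                unsigned long start = addr;
                unsigned long end = start + PAGE_SIZE * numpages;

                /* bounds check against the module VA range instead of
                 * asking is_module_address(), whose internal state is
                 * not ready this early in module initialisation */
                if (start < MODULES_VADDR || end > MODULES_END)
                        return -EINVAL;

                /* ... apply set_mask/clear_mask to the pages ... */
                return 0;
        }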
  4. 04 Mar, 2015 1 commit
  5. 27 Feb, 2015 3 commits
    • arm64: cpuidle: add asm/proc-fns.h inclusion · af4819af
      Lorenzo Pieralisi authored
      The arm64 CPUidle driver requires the cpu_do_idle function so that it
      can be used to enter the shallowest idle state, and it is declared in
      asm/proc-fns.h.
      The current ARM64 CPUidle driver does not include asm/proc-fns.h
      explicitly and it has so far relied on implicit inclusion from other
      header files.
      Owing to some header dependencies reshuffling this currently triggers
      build failures when CONFIG_ARM64_64K_PAGES=y:
      drivers/cpuidle/cpuidle-arm64.c: In function "arm64_enter_idle_state"
      drivers/cpuidle/cpuidle-arm64.c:42:3: error: implicit declaration of
      function "cpu_do_idle" [-Werror=implicit-function-declaration]
      This patch fixes the build breakage by explicitly including
      asm/proc-fns.h in the arm64 asm/cpuidle.h header, the arch back-end
      CPUidle header (already included by the arm64 CPUidle driver) where
      arch-related CPUidle function declarations belong.
      Reported-by: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
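      The fix itself boils down to one line in the arm64 asm/cpuidle.h
      header:

        #include <asm/proc-fns.h>       /* declares cpu_do_idle() */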
    • arm64: compat: Fix siginfo_t -> compat_siginfo_t conversion on big endian · 9d42d48a
      Catalin Marinas authored
      The native (64-bit) sigval_t union contains sival_int (32-bit) and
      sival_ptr (64-bit). When a compat application invokes a syscall that
      takes a sigval_t value (as part of a larger structure, e.g.
      compat_sys_mq_notify, compat_sys_timer_create), the compat_sigval_t
      union is converted to the native sigval_t with sival_int overlapping
      with either the least or the most significant half of sival_ptr,
      depending on endianness. When the corresponding signal is delivered to a
      compat application, on big endian the current (compat_uptr_t)sival_ptr
      cast always returns 0 since sival_int corresponds to the top part of
      sival_ptr. This patch fixes copy_siginfo_to_user32() so that sival_int
      is copied to the compat_siginfo_t structure.
      Cc: <stable@vger.kernel.org>
      Reported-by: Bamvor Jian Zhang <bamvor.zhangjian@huawei.com>
      Tested-by: Bamvor Jian Zhang <bamvor.zhangjian@huawei.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
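      The shape of the fix in copy_siginfo_to_user32() (a sketch with error
      handling elided; the before/after pair shows the idea, not the full
      diff):

        /* before (broken on big endian): the cast reads the upper half
         * of sival_ptr, which is zero when the value was supplied via
         * the 32-bit sival_int member */
        err |= __put_user((compat_uptr_t)(unsigned long)from->si_ptr,
                          &to->si_ptr);

        /* after: copy the 32-bit sival_int member directly, which
         * aliases the correct half of sival_ptr on either endianness */
        err |= __put_user(from->si_int, &to->si_int);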
    • arm64: Increase the swiotlb buffer size to 64MB · a1e50a82
      Catalin Marinas authored
      With commit 3690951f ("arm64: Use swiotlb late initialisation"), the
      swiotlb buffer size is limited to MAX_ORDER_NR_PAGES. However, there are
      platforms with 32-bit only devices that require bounce buffering via
      swiotlb. This patch changes the swiotlb initialisation to an early 64MB
      memblock allocation. In order to get the swiotlb buffer correctly
      allocated (via memblock_virt_alloc_low_nopanic), this patch also defines
      ARCH_LOW_ADDRESS_LIMIT to the maximum physical address capable of
      32-bit DMA.
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
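      The change boils down to initialising swiotlb early, so its default
      64MB bounce buffer comes from a memblock allocation in low memory (a
      sketch of the idea; the ARCH_LOW_ADDRESS_LIMIT value shown is
      illustrative, not the exact definition):

        /* arch/arm64/mm/init.c: early, memblock-backed initialisation,
         * which ends up in memblock_virt_alloc_low_nopanic() */
        void __init mem_init(void)
        {
                swiotlb_init(1);
                /* ... */
        }

        /* cap "low" memblock allocations at what 32-bit DMA masters
         * can address */
        #define ARCH_LOW_ADDRESS_LIMIT  (SZ_4G - 1)    /* illustrative */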
  6. 26 Feb, 2015 6 commits
    • arm64: Fix text patching logic when using fixmap · f6242cac
      Marc Zyngier authored
      Patch 2f896d58 ("arm64: use fixmap for text patching") changed
      the way we patch the kernel text, using a fixmap when the kernel or
      modules are flagged as read only.
      Unfortunately, a flaw in the logic makes it fall over when patching
      modules without CONFIG_DEBUG_SET_MODULE_RONX enabled:
      [   32.032636] Call trace:
      [   32.032716] [<fffffe00003da0dc>] __copy_to_user+0x2c/0x60
      [   32.032837] [<fffffe0000099f08>] __aarch64_insn_write+0x94/0xf8
      [   32.033027] [<fffffe000009a0a0>] aarch64_insn_patch_text_nosync+0x18/0x58
      [   32.033200] [<fffffe000009c3ec>] ftrace_modify_code+0x58/0x84
      [   32.033363] [<fffffe000009c4e4>] ftrace_make_nop+0x3c/0x58
      [   32.033532] [<fffffe0000164420>] ftrace_process_locs+0x3d0/0x5c8
      [   32.033709] [<fffffe00001661cc>] ftrace_module_init+0x28/0x34
      [   32.033882] [<fffffe0000135148>] load_module+0xbb8/0xfc4
      [   32.034044] [<fffffe0000135714>] SyS_finit_module+0x94/0xc4
      This is triggered by the use of virt_to_page() on a module address,
      which ends up pointing to Nowhereland if you're lucky, or corrupting
      your precious data if not.
      This patch fixes the logic by mimicking what is done on arm:
      - If we're patching a module and CONFIG_DEBUG_SET_MODULE_RONX is set,
        use vmalloc_to_page().
      - If we're patching the kernel and CONFIG_DEBUG_RODATA is set,
        use virt_to_page().
      - Otherwise, use the provided address, as we can write to it directly.
      Tested on 4.0-rc1 as a KVM guest.
      Reported-by: Richard W.M. Jones <rjones@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Laura Abbott <lauraa@codeaurora.org>
      Tested-by: Richard W.M. Jones <rjones@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
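      The fixed page-selection logic reads roughly like this (close to the
      kernel's patch_map() helper, lightly simplified):

        static void *patch_map(void *addr, int fixmap)
        {
                bool module = !core_kernel_text((unsigned long)addr);
                struct page *page;

                if (module && IS_ENABLED(CONFIG_DEBUG_SET_MODULE_RONX))
                        page = vmalloc_to_page(addr);  /* module text lives in vmalloc space */
                else if (!module && IS_ENABLED(CONFIG_DEBUG_RODATA))
                        page = virt_to_page(addr);     /* kernel text is in the linear map */
                else
                        return addr;                   /* directly writable, no fixmap needed */

                return (void *)set_fixmap_offset(fixmap, page_to_phys(page) +
                                ((uintptr_t)addr & ~PAGE_MASK));
        }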
    • arm64: crypto: increase AES interleave to 4x · 0eee0fbd
      Ard Biesheuvel authored
      This patch increases the interleave factor for parallel AES modes
      to 4x. This improves performance on Cortex-A57 by ~35%. This is
      due to the 3-cycle latency of AES instructions on the A57's
      relatively deep pipeline (compared to Cortex-A53 where the AES
      instruction latency is only 2 cycles).
      At the same time, disable inline expansion of the core AES functions,
      as the performance benefit of this feature is negligible.
        Measured on AMD Seattle (using tcrypt.ko mode=500 sec=1):
        Baseline (2x interleave, inline expansion)
        testing speed of async cbc(aes) (cbc-aes-ce) decryption
        test 4 (128 bit key, 8192 byte blocks): 95545 operations in 1 seconds
        test 14 (256 bit key, 8192 byte blocks): 68496 operations in 1 seconds
        This patch (4x interleave, no inline expansion)
        testing speed of async cbc(aes) (cbc-aes-ce) decryption
        test 4 (128 bit key, 8192 byte blocks): 124735 operations in 1 seconds
        test 14 (256 bit key, 8192 byte blocks): 92328 operations in 1 seconds
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • arm64: enable PTE type bit in the mask for pte_modify · 6910fa16
      Feng Kan authored
      Caught during Trinity testing: pte_modify() did not allow the PTE
      type bit to be modified, which caused the test to hang the system.
      A PTE could not transition from an inaccessible page (b00) to a
      valid page (b11) because the mask did not permit it. This happens
      when a large block of mmapped memory is set to PROT_NONE and a small
      piece is then broken off and set to PROT_WRITE | PROT_READ, causing
      a huge page split.
      Signed-off-by: Feng Kan <fkan@apm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
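      In effect, pte_modify()'s preserved-bits mask gains the valid/type
      bit (a sketch modelled on arm64's pte_modify(); treat the exact bit
      list as illustrative of that era's headers):

        static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
        {
                /* with PTE_VALID in the mask, an invalid PROT_NONE
                 * entry (type b00) can become valid again (b11) */
                const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN |
                                      PTE_RDONLY | PTE_PROT_NONE |
                                      PTE_VALID | PTE_WRITE;

                pte_val(pte) = (pte_val(pte) & ~mask) |
                               (pgprot_val(newprot) & mask);
                return pte;
        }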
    • arm64: mm: remove unused functions and variable prototypes · 06ff87ba
      Yingjoe Chen authored
      The functions __cpu_flush_user_tlb_range and __cpu_flush_kern_tlb_range
      were removed in commit fa48e6f7 ("arm64: mm: Optimise tlb flush logic
      where we have >4K granule"). The global variable cpu_tlb was never
      used. Remove them.
      Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • arm64: psci: move psci firmware calls out of line · f5e0a12c
      Will Deacon authored
      An arm64 allmodconfig fails to build with GCC 5: the __asmeq
      assertions in the PSCI firmware calling code fire because mcount
      preambles break our assumptions about the register allocation of
      function arguments:
        /tmp/ccDqJsJ6.s: Assembler messages:
        /tmp/ccDqJsJ6.s:60: Error: .err encountered
        /tmp/ccDqJsJ6.s:61: Error: .err encountered
        /tmp/ccDqJsJ6.s:62: Error: .err encountered
        /tmp/ccDqJsJ6.s:99: Error: .err encountered
        /tmp/ccDqJsJ6.s:100: Error: .err encountered
        /tmp/ccDqJsJ6.s:101: Error: .err encountered
      This patch fixes the issue by moving the PSCI calls out-of-line into
      their own assembly files, which are safe from the compiler's meddling
      fingers.
      Reported-by: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • arm64: vdso: minor ABI fix for clock_getres · e1b6b6ce
      Nathan Lynch authored
      The vdso implementation of clock_getres currently returns 0 (success)
      whenever a null timespec is provided by the caller, regardless of the
      clock id supplied.
      This behavior is incorrect.  It should fall back to syscall when an
      unrecognized clock id is passed, even when the timespec argument is
      null.  This ensures that clock_getres always returns an error for
      invalid clock ids.
      Signed-off-by: Nathan Lynch <nathan_lynch@mentor.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
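      The arm64 vdso is written in assembly, but the required behaviour is
      easy to state in C (a sketch; syscall_fallback() is a stand-in for
      the vdso's fallback path):

        int vdso_clock_getres(clockid_t id, struct timespec *res)
        {
                if (id != CLOCK_REALTIME && id != CLOCK_MONOTONIC)
                        return syscall_fallback(id, res);  /* even if res == NULL */

                /* only valid clock ids may take the NULL-res shortcut */
                if (res) {
                        res->tv_sec = 0;
                        res->tv_nsec = MONOTONIC_RES_NSEC;
                }
                return 0;
        }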
  7. 25 Feb, 2015 1 commit
    • arm64: Add L2 cache topology to ARM Ltd boards/models · 7934d69a
      Sudeep Holla authored
      Commit 5d425c18 ("arm64: kernel: add support for cpu cache
      information") adds cacheinfo support for ARM64. Since there's no
      architectural way of detecting which cpus share a particular cache,
      the device tree can be used, and the core cacheinfo code already
      supports this.
      This patch adds the L2 cache topology on the Juno board, FVP/RTSM and
      foundation models.
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Liviu Dudau <Liviu.Dudau@arm.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  8. 23 Feb, 2015 3 commits
  9. 13 Feb, 2015 1 commit
    • mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() · cb9e3c29
      Andrey Ryabinin authored
      For instrumenting global variables, KASan needs shadow memory backing
      the memory for modules. So on module load we will need to allocate
      shadow memory and map it at the address in the shadow region that
      corresponds to the address returned by module_alloc().
      __vmalloc_node_range() could be used for this purpose, except that it
      puts a guard hole after the allocated area. A guard hole in shadow
      memory would be a problem because at some future point we might need
      shadow memory at the address occupied by the guard hole, and we could
      then fail to allocate shadow for module_alloc().
      Now that we have the VM_NO_GUARD flag disabling the guard page, we
      need a way to pass it into __vmalloc_node_range(). Add a new
      parameter 'vm_flags' to the __vmalloc_node_range() function.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
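      After the change the signature carries the extra flags argument; a
      caller that wants no trailing guard page passes VM_NO_GUARD (the call
      below is an illustrative module_alloc()-style user, not a quote from
      the patch):

        void *__vmalloc_node_range(unsigned long size, unsigned long align,
                                   unsigned long start, unsigned long end,
                                   gfp_t gfp_mask, pgprot_t prot,
                                   unsigned long vm_flags, int node,
                                   const void *caller);

        p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
                                 GFP_KERNEL, PAGE_KERNEL_EXEC, VM_NO_GUARD,
                                 NUMA_NO_NODE, __builtin_return_address(0));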
  10. 12 Feb, 2015 1 commit
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski authored
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
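      Schematically (struct excerpts heavily trimmed; the call-site change
      is representative):

        /* before: in the per-arch thread_info, which shares an
         * allocation with the kernel stack */
        struct thread_info {
                /* ... */
                struct restart_block restart_block;
        };

        /* after: a single copy in task_struct, away from the stack */
        struct task_struct {
                /* ... */
                struct restart_block restart_block;
        };

        /* call sites change from
         *      task_thread_info(tsk)->restart_block.fn = do_no_restart_syscall;
         * to */
        tsk->restart_block.fn = do_no_restart_syscall;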
  11. 11 Feb, 2015 2 commits
    • mm: make FIRST_USER_ADDRESS unsigned long on all archs · d016bf7e
      Kirill A. Shutemov authored
      LKP has triggered a compiler warning after my recent patch "mm: account
      pmd page tables to the process":
          mm/mmap.c: In function 'exit_mmap':
       >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
      The code:
       > 2857                WARN_ON(mm_nr_pmds(mm) >
         2858                                round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
      On tile, FIRST_USER_ADDRESS is defined as 0, a plain int, so
      round_up() yields an int as well, and the right shift by PUD_SHIFT
      is then a shift by at least the width of that type.
      I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
      long.  On every arch for consistency.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
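      The fix is mechanical; per-arch definitions go from an int literal to
      an unsigned long one:

        /* before: an int, so round_up(FIRST_USER_ADDRESS, PUD_SIZE) is
         * an int too, and ">> PUD_SHIFT" may shift by the full width of
         * the type */
        #define FIRST_USER_ADDRESS      0

        /* after */
        #define FIRST_USER_ADDRESS      0UL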
    • mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Naoya Horiguchi authored
      Currently we have many duplicates in the definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove them.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement
      arch-specific code only when the arch needs it.
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just
      returns ERR_PTR(-EINVAL).  So this patch sets returning
      ERR_PTR(-EINVAL) as the default behavior.
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc), it's
      never called (the callsite is optimized away) no matter how it is
      implemented.  So in such architectures, we don't need an
      arch-specific implementation.
      In some architectures (like mips, s390 and tile), the current
      arch-specific follow_huge_(pmd|pud)() is effectively identical to the
      common code, so this patch lets those architectures use the common code.
      One exception is metag, where pmd_huge() could return non-zero but
      follow_huge_pmd() is expected to always return NULL.  This means that
      we need an arch-specific implementation which returns NULL.  This
      behavior looks strange to me (because non-zero pmd_huge() implies that
      the architecture supports PMD-based hugepages, so follow_huge_pmd()
      can/should return some relevant value), but that's beyond this cleanup
      patch, so let's keep it.
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in
        common code. This patch forces these archs to use PMD_MASK, but
        it's OK because they are identical in both archs.
        In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3),
        but PTE_ORDER is always 0, so these are identical.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
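      The defaults are plain weak symbols in mm/hugetlb.c, so an
      architecture overrides one simply by providing its own strong
      definition (sketch of the follow_huge_addr() default described
      above):

        struct page * __weak
        follow_huge_addr(struct mm_struct *mm, unsigned long address,
                         int write)
        {
                return ERR_PTR(-EINVAL);  /* all arches except powerpc/ia64 */
        }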
  12. 10 Feb, 2015 1 commit
  13. 06 Feb, 2015 1 commit
    • kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini authored
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters prevent the guest from sending pointless flush requests,
      and at the same time they are not slow because of the battery-backed
      cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      With this patch performance reaches 60-65% of bare metal and, more
      importantly, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      Of course there is some price to pay; every time an otherwise idle vCPU
      is interrupted by an interrupt, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's cores for each idle guest vCPU.
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids Linux scheduling the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      While idle, a 4-VCPU Linux VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
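      A condensed sketch of the polling step in kvm_vcpu_block() (helper
      names are representative of the description above, not a verbatim
      excerpt; the waitqueue sleep path is elided):

        unsigned int halt_poll_ns;      /* 0 disables polling */
        module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);

        void kvm_vcpu_block(struct kvm_vcpu *vcpu)
        {
                if (halt_poll_ns) {
                        ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);

                        do {
                                /* did a wakeup condition arrive while polling? */
                                if (kvm_arch_vcpu_runnable(vcpu)) {
                                        ++vcpu->stat.halt_successful_poll;
                                        return;
                                }
                                cpu_relax();
                        } while (ktime_before(ktime_get(), stop));
                }

                /* ... otherwise sleep on the vcpu wait queue as before ... */
        }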
  14. 02 Feb, 2015 1 commit
  15. 29 Jan, 2015 6 commits
  16. 28 Jan, 2015 3 commits
    • arm64: mm: use *_sect to check for section maps · a1c76574
      Mark Rutland authored
      The {pgd,pud,pmd}_bad family of macros have slightly fuzzy
      cross-architecture semantics, and seem to imply a populated entry that
      is not a next-level table, rather than a particular type of entry (e.g.
      a section map).
      In arm64 code, for those cases where we care about whether an entry is a
      section mapping, we can instead use the {pud,pmd}_sect macros to
      explicitly check for this case. This helps to document precisely what we
      care about, making the code easier to read, and allows for future
      relaxation of the *_bad macros to check for other "bad" entries.
      To that end this patch updates the table dumping and initial table setup
      to check for section mappings with {pud,pmd}_sect, and adds/restores
      BUG_ON(*_bad(*p)) checks after we've handled the *_sect and *_none
      cases so as to catch remaining "bad" cases.
      In the fault handling code, show_pte is left with *_bad checks as it
      only cares about whether it can walk the next level table, and this path
      is used for both kernel and userspace fault handling. The former case
      will be followed by a die() where we'll report the address that
      triggered the fault, which can be useful context for debugging.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Steve Capper <steve.capper@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
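      For reference, the *_sect macros test the descriptor type directly,
      so walking code can say exactly what it means (the walk below is
      schematic):

        #define pmd_sect(pmd) \
                ((pmd_val(pmd) & PMD_TYPE_MASK) == PMD_TYPE_SECT)

        if (pmd_none(*pmd))
                return;                 /* nothing mapped */
        if (pmd_sect(*pmd)) {
                /* section mapping: no next-level table to walk */
        } else {
                BUG_ON(pmd_bad(*pmd));  /* anything else must be a table */
                /* ... walk the pte level ... */
        }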
    • arm64: drop unnecessary cache+tlb maintenance · a3bba370
      Mark Rutland authored
      In paging_init, we call flush_cache_all, but this is backed by Set/Way
      operations which may not achieve anything in the presence of cache line
      migration and/or system caches. If the caches are already in an
      inconsistent state at this point, there is nothing we can do (short of
      flushing the entire physical address space by VA) to empty architected
      and system caches. As such, flush_cache_all only serves to mask other
      potential bugs. Hence, this patch removes the boot-time call to
      flush_cache_all().
      Immediately after the cache maintenance we flush the TLBs, but this is
      also unnecessary. Before enabling the MMU, the TLBs are invalidated, and
      thus are initially clean. When changing the contents of active tables
      (e.g. in fixup_executable() for DEBUG_RODATA) we perform the required
      TLB maintenance following the update, and therefore no additional
      maintenance is required to ensure the new table entries are in effect.
      Since activating the MMU we will not have modified system register
      fields permitted to be cached in a TLB, and therefore do not need
      maintenance for any cached system register fields. Hence, the TLB flush
      is unnecessary.
      Shortly after the unnecessary TLB flush, we update TTBR0 to point to an
      empty zero page rather than the idmap, and flush the TLBs. This
      maintenance is necessary to remove the global idmap entries from the
      TLBs (as they would conflict with userspace mappings), and is retained.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Acked-by: Steve Capper <steve.capper@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • arm64: mm: free the useless initial page table · 523d6e9f
      zhichang.yuan authored
      For a 64K page system, after mapping a PMD section, the corresponding
      initial page table is not needed any more and that page can be freed.
      Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
      [catalin.marinas@arm.com: added BUG_ON() to catch late memblock freeing]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  17. 27 Jan, 2015 5 commits