1. 01 Sep, 2010 9 commits
    • powerpc: Dynamically allocate most lppaca structs · 93c22703
      Paul Mackerras authored
      
      
      This arranges for the lppaca structs for most cpus to be dynamically
      allocated in the same manner as the paca structs.  If we don't include
      support for legacy iSeries, only the first lppaca is statically
      allocated; the rest are dynamically allocated.  If we include legacy
      iSeries support, then we statically allocate the first 64 lppaca
      structs, since the iSeries hypervisor requires that the lppaca
      structs be present in the data section of the kernel image, but
      legacy iSeries supports at most 64 cpus.
      
      With CONFIG_NR_CPUS=1024, the kernel image size for a typical pSeries config
      went from:
      
         text    data     bss     dec     hex filename
      9524478 4734564 8469944 22728986        15ad11a ../test-1024/vmlinux
      
      to:
      
         text    data     bss     dec     hex filename
      9524482 3751508 8469944 21745934        14bd10e ../test-1024/vmlinux
      
      a reduction of 983052 bytes overall.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Abstract indexing of lppaca structs · 8154c5d2
      Paul Mackerras authored
      
      
      Currently we have the lppaca structs as a simple array of NR_CPUS
      entries, taking up space in the data section of the kernel image.
      In future we would like to allocate them dynamically, so this
      abstracts out the accesses to the array, making it easier to
      change how we locate the lppaca for a given cpu in future.
      Specifically, lppaca[cpu] changes to lppaca_of(cpu).
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
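
The accessor pattern this commit describes can be sketched in plain C. This is a minimal illustration, not the kernel's code: the `struct lppaca` below is a cut-down stand-in, `NR_CPUS` is an arbitrary small value, and `read_yield_count` is a hypothetical call site.

```c
#define NR_CPUS 4   /* illustrative value, not a real kernel config */

/* Cut-down stand-in for the real struct lppaca (assumption). */
struct lppaca {
    unsigned int yield_count;
};

/* Storage as described in the commit: a simple array of NR_CPUS entries. */
static struct lppaca lppaca[NR_CPUS];

/*
 * Every caller goes through this accessor. If the storage later becomes
 * dynamically allocated, only this one definition has to change.
 */
#define lppaca_of(cpu) (lppaca[(cpu)])

/* Hypothetical call site: what used to be lppaca[cpu].yield_count. */
static unsigned int read_yield_count(int cpu)
{
    return lppaca_of(cpu).yield_count;
}
```

Because `lppaca_of(cpu)` expands to an lvalue here, call sites can read and write fields through it without knowing how the structs are stored.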
    • powerpc: Move arch_sd_sibling_asym_packing() to smp.c · e1f0ece1
      Michael Neuling authored
      
      
      Simple cleanup by moving arch_sd_sibling_asym_packing from process.c to
      smp.c to save an #ifdef CONFIG_SMP.
      
      No functionality change.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Check end of stack canary at oops time · 28b54990
      Anton Blanchard authored
      
      
      Add a check for the stack canary when we oops, similar to x86. This should make
      it clear that we overran our stack:
      
      Unable to handle kernel paging request for data at address 0x24652f63700ac689
      Faulting instruction address: 0xc000000000063d24
      Thread overran stack, or stack corrupted
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
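
The idea of an end-of-stack canary can be shown with a small user-space model. This is a sketch under stated assumptions: `struct fake_stack`, `stack_init`, `stack_overrun` and `oops_hint` are hypothetical names, and the stack is modelled as a plain array whose first word is the far end. The magic value is the `STACK_END_MAGIC` constant Linux defines in `include/linux/magic.h`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STACK_END_MAGIC 0x57AC6E9DUL  /* value from include/linux/magic.h */
#define STACK_WORDS 64                /* illustrative stack size */

/* Hypothetical model of a stack that grows down towards words[0]. */
struct fake_stack {
    unsigned long words[STACK_WORDS];
};

/* Plant the canary at the far end when the stack is set up. */
static void stack_init(struct fake_stack *s)
{
    memset(s->words, 0, sizeof(s->words));
    s->words[0] = STACK_END_MAGIC;
}

/* The canary is gone if something wrote past the end of the stack. */
static bool stack_overrun(const struct fake_stack *s)
{
    return s->words[0] != STACK_END_MAGIC;
}

/* At "oops" time, return the extra diagnostic line, or NULL if intact. */
static const char *oops_hint(const struct fake_stack *s)
{
    return stack_overrun(s) ? "Thread overran stack, or stack corrupted"
                            : NULL;
}
```

The check costs a single load and compare, so it is cheap to run on every oops.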
    • powerpc: Feature nop out reservation clear when stcx checks address · f89451fb
      Anton Blanchard authored
      The POWER architecture does not require stcx to check that it is operating
      on the same address as the larx. This means it is possible for
      an exception handler to execute a larx, get a reservation, decide
      not to do the stcx and then return back with an active reservation. If the
      interrupted code was in the middle of a larx/stcx sequence the stcx could
      incorrectly succeed.
      
      All recent POWER CPUs check the address before letting the stcx succeed
      so we can create a CPU feature and nop it out. As Ben suggested, we can
      only do this in our syscall path because there is a remote possibility
      some kernel code gets interrupted by an exception that ends up operating
      on the same cacheline.
      
      Thanks to Paul Mackerras and Derek Williams for the idea.
      
      To test this I used a very simple null syscall (actually getppid) testcase
      at http://ozlabs.org/~anton/junkcode/null_syscall.c
      
      
      
      I tested against 2.6.35-git10 with the following changes against the
      pseries_defconfig:
      
      CONFIG_VIRT_CPU_ACCOUNTING=n
      CONFIG_AUDIT=n
      CONFIG_PPC_4K_PAGES=n
      CONFIG_PPC_64K_PAGES=y
      CONFIG_FORCE_MAX_ZONEORDER=9
      CONFIG_PPC_SUBPAGE_PROT=n
      CONFIG_FUNCTION_TRACER=n
      CONFIG_FUNCTION_GRAPH_TRACER=n
      CONFIG_IRQSOFF_TRACER=n
      CONFIG_STACK_TRACER=n
      
      to remove the overhead of virtual CPU accounting, syscall auditing and
      the ftrace mcount tracers. 64kB pages were enabled to minimise TLB misses.
      
      POWER6: +8.2%
      POWER7: +7.0%
      
      Another suggestion was to use a larx to something in the L1 instead of a stcx.
      This was almost as fast as removing the larx on POWER6, but only 3.5% faster
      on POWER7. We can use this to speed up the reservation clear in our
      exception exit code.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
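
The hazard described in this commit can be illustrated with a toy software model of one CPU's reservation state. This is only a sketch: `struct cpu_model`, `model_larx`, `model_stcx` and `stale_stcx_succeeds` are hypothetical names, and the model compares exact addresses, whereas real hardware tracks a reservation granule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a single CPU's reservation (assumption, not hardware). */
struct cpu_model {
    bool reserved;             /* is a reservation currently held? */
    uintptr_t reserved_addr;   /* address the last larx was issued on */
    bool stcx_checks_address;  /* does stcx compare addresses? */
};

/* larx: load the value and take a reservation on its address. */
static long model_larx(struct cpu_model *c, long *p)
{
    c->reserved = true;
    c->reserved_addr = (uintptr_t)p;
    return *p;
}

/* stcx: store only if the reservation is still valid; always clears it. */
static bool model_stcx(struct cpu_model *c, long *p, long val)
{
    bool ok = c->reserved &&
              (!c->stcx_checks_address || c->reserved_addr == (uintptr_t)p);

    c->reserved = false;
    if (ok)
        *p = val;
    return ok;
}

/*
 * Interrupted sequence: larx on a, then an exception handler does a larx
 * on b without a matching stcx, then the original code does its stcx on a.
 */
static bool stale_stcx_succeeds(bool checks_address)
{
    struct cpu_model c = { false, 0, checks_address };
    long a = 1, b = 2;
    long old = model_larx(&c, &a);

    (void)model_larx(&c, &b);   /* handler leaves a reservation behind */
    return model_stcx(&c, &a, old + 1);
}
```

In this model the stale stcx succeeds only when the address is not checked, which is the case the explicit reservation clear protects against.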
    • powerpc: Add 64bit csum_and_copy_to_user · 8c773914
      Anton Blanchard authored
      
      
      This adds the equivalent of csum_and_copy_from_user for the receive side so we
      can copy and checksum in one pass. It is modelled on the generic checksum
      routine.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Optimise 64bit csum_partial_copy_generic and add csum_and_copy_from_user · fdd374b6
      Anton Blanchard authored
      
      
      We use the same core loop as the new csum_partial, adding in the
      stores and exception handling code. To keep things simple we do all the
      exception fixup in csum_and_copy_from_user. This wrapper function is
      modelled on the generic checksum code and is careful to always calculate
      a complete checksum even if we only copied part of the data to userspace.
      
      To test this I forced checksumming on over loopback and ran socklib (a
      simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
      this patch. If I forced both the sender and receiver onto the same cpu
      (with the hope of shifting the benchmark from being cache bandwidth limited
      to cpu limited), adding this patch improved performance by 55%.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
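
Copying and checksumming in one pass can be sketched in portable C. This is not the kernel's assembly routine and does no user-space fault handling; `fold16` and `csum_and_copy` are hypothetical names, and the sum is the 16-bit ones' complement sum used by the Internet checksum, returned without the final inversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide sum down to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Copy len bytes from src to dst and return the folded ones' complement
 * sum of the data, treating it as big-endian 16-bit words. Each byte is
 * touched once, so the data is read a single time for both jobs.
 */
static uint16_t csum_and_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                      /* odd trailing byte, zero padded */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return fold16(sum);
}
```

A separate copy followed by a separate checksum would walk the buffer twice; merging the loops is what removes the second pass.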
    • powerpc: Optimise 64bit csum_partial · 9b83ecb0
      Anton Blanchard authored
      
      
      The main loop of csum_partial runs very slowly on recent POWER CPUs. After some
      analysis on both POWER6 and POWER7 I came up with the routine below. First we get
      the source aligned to a double word, ignoring any odd alignment to keep things
      simple. Then we do 64 bytes at a time, with an entry and exit limb of a further
      64 bytes. On both POWER6 and POWER7 this should be as fast as we can go since
      we are limited by the latency of the adde instructions.
      
      To test this I forced checksumming on over loopback and ran socklib (a
      simple TCP benchmark). On a POWER6 575 throughput improved by 11% with
      this patch.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
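
The dependency on add-with-carry that the message mentions can be seen in a portable C model of a 64-bit checksum loop. This is a sketch, not the optimised assembly: it handles 8 bytes per step rather than 64, ignores alignment, and `add_with_carry` and `csum64` are hypothetical names. Each step needs the carry from the previous one, which is why the real loop is bound by adde latency.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit add with end-around carry, the role adde plays in the real loop. */
static uint64_t add_with_carry(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);   /* add the carry-out back in */
}

/* Ones' complement sum of a buffer, folded to 16 bits (not inverted). */
static uint16_t csum64(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;

    while (len > 0) {
        uint64_t w = 0;
        size_t n = len < 8 ? len : 8;

        /* Build a big-endian doubleword, zero padding a short tail. */
        for (size_t i = 0; i < 8; i++)
            w = (w << 8) | (i < n ? p[i] : 0);
        sum = add_with_carry(sum, w);
        p += n;
        len -= n;
    }

    /* Fold 64 -> 32 -> 16 bits, each fold done twice to absorb carries. */
    sum = (sum >> 32) + (sum & 0xffffffff);
    sum = (sum >> 32) + (sum & 0xffffffff);
    sum = (sum >> 16) + (sum & 0xffff);
    sum = (sum >> 16) + (sum & 0xffff);
    return (uint16_t)sum;
}
```

Summing in 64-bit units and folding at the end gives the same result as summing 16-bit words one at a time, with a quarter of the additions.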
    • powerpc/pseries: Correct rtas_data_buf locking in dlpar code · 93f68f1e
      Nathan Fontenot authored
      
      
      The dlpar code can cause a deadlock to occur when making the RTAS
      configure-connector call.  This occurs because we make kmalloc calls,
      which can block, while parsing the rtas_data_buf and holding the
      rtas_data_buf_lock.  This can cause issues if someone else attempts
      to grab the rtas_data_buf_lock.
      
      This patch alleviates this issue by copying the contents of the rtas_data_buf
      to a local buffer before parsing.  This allows us to only hold the
      rtas_data_buf_lock around the RTAS configure-connector calls.
      Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
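
The locking change described here follows a common pattern: hold the lock only while the shared buffer is filled and copied, then do the allocating parse on the private copy. The sketch below is a user-space analogy with a pthread mutex standing in for the kernel lock; `shared_buf`, `fill_shared_buf`, `parse_name` and `configure_connector_step` are hypothetical names, and the buffer contents are made up for illustration.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 64   /* illustrative size */

static pthread_mutex_t shared_buf_lock = PTHREAD_MUTEX_INITIALIZER;
static char shared_buf[BUF_SIZE];

/* Hypothetical stand-in for the firmware call that fills the buffer. */
static void fill_shared_buf(void)
{
    strncpy(shared_buf, "node:example", BUF_SIZE - 1);
    shared_buf[BUF_SIZE - 1] = '\0';
}

/* Parsing allocates memory, so it must not run with the lock held. */
static char *parse_name(const char *buf)
{
    const char *colon = strchr(buf, ':');
    const char *name = colon ? colon + 1 : buf;
    char *copy = malloc(strlen(name) + 1);

    if (copy)
        strcpy(copy, name);
    return copy;
}

static char *configure_connector_step(void)
{
    char local[BUF_SIZE];

    /* Hold the lock only around the call and the copy. */
    pthread_mutex_lock(&shared_buf_lock);
    fill_shared_buf();
    memcpy(local, shared_buf, BUF_SIZE);
    pthread_mutex_unlock(&shared_buf_lock);

    /* The lock is released; allocation can now block safely. */
    return parse_name(local);
}
```

The cost is one extra copy of the buffer per call, in exchange for a much shorter critical section.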
  2. 31 Aug, 2010 6 commits
  3. 30 Aug, 2010 3 commits
    • powerpc: Don't use kernel stack with translation off · 54a83404
      Michael Neuling authored
      In f761622e we changed early_setup_secondary so it's called using the
      proper kernel stack rather than the emergency one.
      
      Unfortunately, this stack pointer can't be used when translation is off
      on PHYP as this stack pointer might be outside the RMO.  This results in
      the following on all non zero cpus:
        cpu 0x1: Vector: 300 (Data Access) at [c00000001639fd10]
            pc: 000000000001c50c
            lr: 000000000000821c
            sp: c00000001639ff90
           msr: 8000000000001000
           dar: c00000001639ffa0
         dsisr: 42000000
          current = 0xc000000016393540
          paca    = 0xc000000006e00200
            pid   = 0, comm = swapper
      
      The original patch was only tested on a bare metal system, so it never
      caught this problem.
      
      This changes __secondary_start so that we calculate the new stack
      pointer but only start using it after we've called early_setup_secondary.
      
      With this patch, the above problem goes away.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/perf_event: Reduce latency of calling perf_event_do_pending · b0d278b7
      Paul Mackerras authored
      Commit 0fe1ac48 ("powerpc/perf_event: Fix oops due to
      perf_event_do_pending call") moved the call to perf_event_do_pending
      in timer_interrupt() down so that it was after the irq_enter() call.
      Unfortunately this moved it after the code that checks whether it
      is time for the next decrementer clock event.  The result is that
      the call to perf_event_do_pending() won't happen until the next
      decrementer clock event is due.  This was pointed out by Milton
      Miller.
      
      This fixes it by moving the check for whether it's time for the
      next decrementer clock event down to the point where we're about
      to call the event handler, after we've called perf_event_do_pending.
      
      This has the side effect that on old pre-Core99 Powermacs where we
      use the ppc_n_lost_interrupts mechanism to replay interrupts, a
      replayed interrupt will incur a little more latency since it will
      now do the code from the irq_enter down to the irq_exit, that it
      used to skip.  However, these machines are now old and rare enough
      that this doesn't matter.  To make it clear that ppc_n_lost_interrupts
      is only used on Powermacs, and to speed up the code slightly on
      non-Powermac ppc32 machines, the code that tests ppc_n_lost_interrupts
      is now conditional on CONFIG_PMAC as well as CONFIG_PPC32.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Cc: stable@kernel.org
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/kexec: Adds correct calling convention for kexec purgatory · 4562c986
      Matthew McClintock authored
      
      
      Call kexec purgatory code correctly. We were getting lucky before.
      If you examine the powerpc 32bit kexec "purgatory" code you will
      see it expects the following:
      
      From kexec-tools: purgatory/arch/ppc/v2wrap_32.S
      -> calling convention:
      ->   r3 = physical number of this cpu (all cpus)
      ->   r4 = address of this chunk (master only)
      
      As such, we need to set r3 to the current core, r4 happens to be
      unused by purgatory at the moment but we go ahead and set it
      here as well.
      Signed-off-by: Matthew McClintock <msm@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  4. 29 Aug, 2010 3 commits
  5. 28 Aug, 2010 19 commits