1. 27 Jan, 2013 2 commits
    • Frederic Weisbecker's avatar
      kvm: Prepare to add generic guest entry/exit callbacks · c11f11fc
      Frederic Weisbecker authored
      
      
      Do some ground preparatory work before adding guest_enter()
      and guest_exit() context tracking callbacks. Those will
      be later used to read the guest cputime safely when we
      run in full dynticks mode.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c11f11fc
    • Frederic Weisbecker's avatar
      cputime: Generic on-demand virtual cputime accounting · abf917cd
      Frederic Weisbecker authored
      
      
      If we want to stop the tick further idle, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However implementing CONFIG_VIRT_CPU_ACCOUNTING require
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU needs by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      abf917cd
  2. 02 Jan, 2013 1 commit
    • Shan Hai's avatar
      powerpc/vdso: Remove redundant locking in update_vsyscall_tz() · ce73ec6d
      Shan Hai authored
      The locking in update_vsyscall_tz() is not only unnecessary because the vdso
      code copies the data unproteced in __kernel_gettimeofday() but also
      introduces a hard to reproduce race condition between update_vsyscall()
      and update_vsyscall_tz(), which causes user space process to loop
      forever in vdso code.
      
      The following patch removes the locking from update_vsyscall_tz().
      
      Locking is not only unnecessary because the vdso code copies the data
      unprotected in __kernel_gettimeofday() but also erroneous because updating
      the tb_update_count is not atomic and introduces a hard to reproduce race
      condition between update_vsyscall() and update_vsyscall_tz(), which further
      causes user space process to loop forever in vdso code.
      
      The below scenario describes the race condition,
      x==0	Boot CPU			other CPU
      	proc_P: x==0
      	    timer interrupt
      		update_vsyscall
      x==1		    x++;sync		settimeofday
      					    update_vsyscall_tz
      x==2						x++;sync
      x==3		    sync;x++
      						sync;x++
      	proc_P: x==3 (loops until x becomes even)
      
      Because the ++ operator would be implemented as three instructions and not
      atomic on powerpc.
      
      A similar change was made for x86 in commit 6c260d58
      
      
      ("x86: vdso: Remove bogus locking in update_vsyscall_tz")
      Signed-off-by: default avatarShan Hai <shan.hai@windriver.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ce73ec6d
  3. 20 Nov, 2012 1 commit
    • Frederic Weisbecker's avatar
      vtime: Warn if irqs aren't disabled on system time accounting APIs · 1b2852b1
      Frederic Weisbecker authored
      
      
      System time accounting APIs such as vtime_account_system() and
      vtime_account_idle() need to be irqsafe. Current callers include
      irq entry, exit and kvm, all of which have been checked against that
      requirement. Now it's better to grow that with an automatic check
      in case we have further callers or we missed something.
      Suggested-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      1b2852b1
  4. 19 Nov, 2012 3 commits
    • Frederic Weisbecker's avatar
      vtime: Consolidate a bit the ctx switch code · e3942ba0
      Frederic Weisbecker authored
      
      
      On ia64 and powerpc, vtime context switch only consists
      in flushing system and user pending time, plus a few
      arch housekeeping.
      
      Consolidate that into a generic implementation. s390 is
      a special case because pending user and system time accounting
      there is hard to dissociate. So it's keeping its own implementation.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      e3942ba0
    • Frederic Weisbecker's avatar
      vtime: Explicitly account pending user time on process tick · bcebdf84
      Frederic Weisbecker authored
      
      
      All vtime implementations just flush the user time on process
      tick. Consolidate that in generic code by calling a user time
      accounting helper. This avoids an indirect call in ia64 and
      prepare to also consolidate vtime context switch code.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      bcebdf84
    • Frederic Weisbecker's avatar
      vtime: Remove the underscore prefix invasion · fd25b4c2
      Frederic Weisbecker authored
      
      
      Prepending irq-unsafe vtime APIs with underscores was actually
      a bad idea as the result is a big mess in the API namespace that
      is even waiting to be further extended. Also these helpers
      are always called from irq safe callers except kvm. Just
      provide a vtime_account_system_irqsafe() for this specific
      case so that we can remove the underscore prefix on other
      vtime functions.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      fd25b4c2
  5. 29 Oct, 2012 1 commit
    • Frederic Weisbecker's avatar
      vtime: Make vtime_account_system() irqsafe · 11113334
      Frederic Weisbecker authored
      
      
      vtime_account_system() currently has only one caller with
      vtime_account() which is irq safe.
      
      Now we are going to call it from other places like kvm where
      irqs are not always disabled by the time we account the cputime.
      
      So let's make it irqsafe. The arch implementation part is now
      prefixed with "__".
      
      vtime_account_idle() arch implementation is prefixed accordingly
      to stay consistent.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      11113334
  6. 25 Sep, 2012 2 commits
    • Frederic Weisbecker's avatar
      vtime: Consolidate system/idle context detection · a7e1a9e3
      Frederic Weisbecker authored
      
      
      Move the code that finds out to which context we account the
      cputime into generic layer.
      
      Archs that consider the whole time spent in the idle task as idle
      time (ia64, powerpc) can rely on the generic vtime_account()
      and implement vtime_account_system() and vtime_account_idle(),
      letting the generic code to decide when to call which API.
      
      Archs that have their own meaning of idle time, such as s390
      that only considers the time spent in CPU low power mode as idle
      time, can just override vtime_account().
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      a7e1a9e3
    • Frederic Weisbecker's avatar
      cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Frederic Weisbecker authored
      
      
      Use a naming based on vtime as a prefix for virtual based
      cputime accounting APIs:
      
      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()
      
      It makes it easier to allow for further declension such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out the context we account to from generic code.
      
      This also make it better to know on which subsystem these APIs
      refer to.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bf9fae9f
  7. 24 Sep, 2012 2 commits
    • John Stultz's avatar
      time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD · 70639421
      John Stultz authored
      
      
      To help migrate archtectures over to the new update_vsyscall method,
      redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      70639421
    • John Stultz's avatar
      time: Move update_vsyscall definitions to timekeeper_internal.h · 189374ae
      John Stultz authored
      
      
      Since users will need to include timekeeper_internal.h, move
      update_vsyscall definitions to timekeeper_internal.h.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      189374ae
  8. 17 Sep, 2012 1 commit
    • Li Zhong's avatar
      powerpc/trace: Fix interrupt tracepoints vs. RCU · e72bbbab
      Li Zhong authored
      
      
      There are a few tracepoints in the interrupt code path, which is before
      irq_enter(), or after irq_exit(), like
      trace_irq_entry()/trace_irq_exit() in do_IRQ(),
      trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in
      timer_interrupt().
      
      If the interrupt is from idle(), and because tracepoint contains RCU
      read-side critical section, we could see following suspicious RCU usage
      reported:
      
      [  145.127743] ===============================
      [  145.127747] [ INFO: suspicious RCU usage. ]
      [  145.127752] 3.6.0-rc3+ #1 Not tainted
      [  145.127755] -------------------------------
      [  145.127759] /root/.workdir/linux/arch/powerpc/include/asm/trace.h:33
      suspicious rcu_dereference_check() usage!
      [  145.127765]
      [  145.127765] other info that might help us debug this:
      [  145.127765]
      [  145.127771]
      [  145.127771] RCU used illegally from idle CPU!
      [  145.127771] rcu_scheduler_active = 1, debug_locks = 0
      [  145.127777] RCU used illegally from extended quiescent state!
      [  145.127781] no locks held by swapper/0/0.
      [  145.127785]
      [  145.127785] stack backtrace:
      [  145.127789] Call Trace:
      [  145.127796] [c00000000108b530] [c000000000013c40] .show_stack
      +0x70/0x1c0 (unreliable)
      [  145.127806] [c00000000108b5e0]
      [c0000000000f59d8] .lockdep_rcu_suspicious+0x118/0x150
      [  145.127813] [c00000000108b680] [c00000000000fc58] .do_IRQ+0x498/0x500
      [  145.127820] [c00000000108b750] [c000000000003950]
      hardware_interrupt_common+0x150/0x180
      [  145.127828] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
      [  145.127828]     LR = .check_and_cede_processor+0x38/0x70
      [  145.127836] [c00000000108bab0] [c0000000000665dc] .shared_cede_loop
      +0x5c/0x100
      [  145.127844] [c00000000108bb70] [c000000000588ab0] .cpuidle_enter
      +0x30/0x50
      [  145.127850] [c00000000108bbe0]
      [c000000000588b0c] .cpuidle_enter_state+0x3c/0xb0
      [  145.127857] [c00000000108bc60] [c000000000589730] .cpuidle_idle_call
      +0x150/0x6c0
      [  145.127863] [c00000000108bd30] [c000000000058440] .pSeries_idle
      +0x10/0x40
      [  145.127870] [c00000000108bda0] [c00000000001683c] .cpu_idle
      +0x18c/0x2d0
      [  145.127876] [c00000000108be60] [c00000000000b434] .rest_init
      +0x124/0x1b0
      [  145.127884] [c00000000108bef0] [c0000000009d0d28] .start_kernel
      +0x568/0x588
      [  145.127890] [c00000000108bf90] [c000000000009660] .start_here_common
      +0x20/0x40
      
      This is because the RCU usage in interrupt context should be used in
      area marked by rcu_irq_enter()/rcu_irq_exit(), called in
      irq_enter()/irq_exit() respectively.
      
      Move them into the irq_enter()/irq_exit() area to avoid the reporting.
      Signed-off-by: default avatarLi Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      e72bbbab
  9. 05 Sep, 2012 1 commit
    • Paul Mackerras's avatar
      powerpc: Give hypervisor decrementer interrupts their own handler · dabe859e
      Paul Mackerras authored
      
      
      At the moment the handler for hypervisor decrementer interrupts is
      the same as for decrementer interrupts, i.e. timer_interrupt().
      This is bogus; if we ever do get a hypervisor decrementer interrupt
      it won't have anything to do with the next timer event.  In fact
      the only time we get hypervisor decrementer interrupts is when one
      is left pending on exit from a KVM guest.
      
      When we get a hypervisor decrementer interrupt we don't need to do
      anything special to clear it, since they are edge-triggered on the
      transition of HDEC from 0 to -1.  Thus this adds an empty handler
      function for them.  We don't need to have them masked when interrupts
      are soft-disabled, so we use STD_EXCEPTION_HV instead of
      MASKABLE_EXCEPTION_HV.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      dabe859e
  10. 20 Aug, 2012 1 commit
    • Frederic Weisbecker's avatar
      cputime: Consolidate vtime handling on context switch · baa36046
      Frederic Weisbecker authored
      
      
      The archs that implement virtual cputime accounting all
      flush the cputime of a task when it gets descheduled
      and sometimes set up some ground initialization for the
      next task to account its cputime.
      
      These archs all put their own hooks in their context
      switch callbacks and handle the off-case themselves.
      
      Consolidate this by creating a new account_switch_vtime()
      callback called in generic code right after a context switch
      and that these archs must implement to flush the prev task
      cputime and initialize the next task cputime related state.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      baa36046
  11. 07 Jun, 2012 1 commit
    • Paul Mackerras's avatar
      powerpc/time: Sanity check of decrementer expiration is necessary · 860aed25
      Paul Mackerras authored
      This reverts 68568add
      
       ("powerpc/time: Remove unnecessary sanity check
      of decrementer expiration").  We do need to check whether we have reached
      the expiration time of the next event, because we sometimes get an early
      decrementer interrupt, most notably when we set the decrementer to 1 in
      arch_irq_work_raise().  The effect of not having the sanity check is that
      if timer_interrupt() gets called early, we leave the decrementer set to
      its maximum value, which means we then don't get any more decrementer
      interrupts for about 4 seconds (or longer, depending on timebase
      frequency).  I saw these pauses as a consequence of getting a stray
      hypervisor decrementer interrupt left over from exiting a KVM guest.
      
      This isn't quite a straight revert because of changes to the surrounding
      code, but it restores the same algorithm as was previously used.
      
      Cc: stable@vger.kernel.org
      Acked-by: default avatarAnton Blanchard <anton@samba.org>
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      860aed25
  12. 06 May, 2012 1 commit
    • Bharat Bhushan's avatar
      KVM: PPC: Use clockevent multiplier and shifter for decrementer · 6e35994d
      Bharat Bhushan authored
      
      
      Time for which the hrtimer is started for decrementer emulation is calculated
      using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC
      reprogramming (if needed) and which calculate timebase ticks using the
      multiplier and shifter mechanism implemented within clockevent layer.
      
      It was observed that this conversion (timebase->time->timebase) are not
      correct because the mechanism are not consistent.
      In our setup it adds 2% jitter.
      
      With this patch clockevent multiplier and shifter mechanism are used when
      starting hrtimer for decrementer emulation. Now the jitter is < 0.5%.
      Signed-off-by: default avatarBharat Bhushan <bharat.bhushan@freescale.com>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      6e35994d
  13. 20 Mar, 2012 1 commit
  14. 08 Mar, 2012 1 commit
    • Benjamin Herrenschmidt's avatar
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt authored
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7
      7230c564
  15. 18 Dec, 2011 1 commit
  16. 24 Nov, 2011 6 commits
  17. 31 Oct, 2011 1 commit
  18. 01 Jul, 2011 1 commit
  19. 17 Apr, 2011 1 commit
  20. 31 Mar, 2011 1 commit
  21. 29 Mar, 2011 1 commit
    • Anton Blanchard's avatar
      powerpc: Fix accounting of softirq time when idle · ad5d1c88
      Anton Blanchard authored
      commit cf9efce0
      
       (powerpc: Account time using timebase rather
      than PURR) used in_irq() to detect if the time was spent in
      interrupt processing. This only catches hardirq context so if we
      are in softirq context and in the idle loop we end up accounting it
      as idle time. If we instead use in_interrupt() we catch both softirq
      and hardirq time.
      
      The issue was found when running a network intensive workload. top
      showed the following:
      
      0.0%us,  1.1%sy,  0.0%ni, 85.7%id,  0.0%wa,  9.9%hi,  3.3%si,  0.0%st
      
      85.7% idle. But this was wildly different to the perf events data.
      To confirm the suspicion I ran something to keep the core busy:
      
      # yes > /dev/null &
      
      8.2%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 10.3%hi, 81.4%si,  0.0%st
      
      We only got 8.2% of the CPU for the userspace task and softirq has
      shot up to 81.4%.
      
      With the patch below top shows the correct stats:
      
      0.0%us,  0.0%sy,  0.0%ni,  5.3%id,  0.0%wa, 13.3%hi, 81.3%si,  0.0%st
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ad5d1c88
  22. 20 Jan, 2011 1 commit
  23. 08 Dec, 2010 1 commit
    • Heiko Schocher's avatar
      powerpc/time: printk time stamp init not correct · 364a1246
      Heiko Schocher authored
      
      
      problem:
      
      I see sometimes on my mpc5200 based board such printk timing
      information:
      
      [    0.000000] NR_IRQS:512 nr_irqs:512 16
      [    0.000000] MPC52xx PIC is up and running!
      [    0.000000] clocksource: timebase mult[79364d9] shift[22] registered
      [    0.000000] console [ttyPSC0] enabled
      [  130.300633] pid_max: default: 32768 minimum: 301
      [  130.305647] Mount-cache hash table entries: 512
      [  130.315818] NET: Registered protocol family 16
      
      reason:
      if the tbu not starts from 0 when linux boots, boot_tb
      maybe could not store the real 64 bit tbu value, because
      boot_tp is only a 32 bit unsigned long.
      
      solution:
      change boot_tb to u64
      
      [BenH: Made it u64 instead of unsigned long long]
      Signed-off-by: default avatarHeiko Schocher <hs@denx.de>
      cc: Wolfgang Denk <wd@denx.de>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      364a1246
  24. 18 Oct, 2010 1 commit
    • Peter Zijlstra's avatar
      irq_work: Add generic hardirq context callbacks · e360adbe
      Peter Zijlstra authored
      
      
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- like wakeup a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this get to do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarKyle McMartin <kyle@mcmartin.ca>
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: default avatarHuang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      e360adbe
  25. 13 Oct, 2010 1 commit
  26. 01 Sep, 2010 2 commits
    • Paul Mackerras's avatar
      powerpc/pseries: Re-enable dispatch trace log userspace interface · 872e439a
      Paul Mackerras authored
      
      
      Since the cpu accounting code uses the hypervisor dispatch trace log
      now when CONFIG_VIRT_CPU_ACCOUNTING = y, the previous commit disabled
      access to it via files in the /sys/kernel/debug/powerpc/dtl/ directory
      in that case.  This restores those files.
      
      To do this, we now have a hook that the cpu accounting code will call
      as it processes each entry from the hypervisor dispatch trace log.
      The code in dtl.c now uses that to fill up its ring buffer, rather
      than having the hypervisor fill the ring buffer directly.
      
      This also fixes dtl_file_read() to handle overflow conditions a bit
      better and adds a spinlock to ensure that race conditions (multiple
      processes opening or reading the file concurrently) are handled
      correctly.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      872e439a
    • Paul Mackerras's avatar
      powerpc: Account time using timebase rather than PURR · cf9efce0
      Paul Mackerras authored
      
      
      Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
      PURR register for measuring the user and system time used by
      processes, as well as other related times such as hardirq and
      softirq times.  This turns out to be quite confusing for users
      because it means that a program will often be measured as taking
      less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
      than it does when run on a single-threaded processor (ST mode), even
      though the program takes longer to finish.  The discrepancy is
      accounted for as stolen time, which is also confusing, particularly
      when there are no other partitions running.
      
      This changes the accounting to use the timebase instead, meaning that
      the reported user and system times are the actual number of real-time
      seconds that the program was executing on the processor thread,
      regardless of which SMT mode the processor is in.  Thus a program will
      generally show greater user and system times when run on a
      multi-threaded processor than on a single-threaded processor.
      
      On pSeries systems on POWER5 or later processors, we measure the
      stolen time (time when this partition wasn't running) using the
      hypervisor dispatch trace log.  We check for new entries in the
      log on every entry from user mode and on every transition from
      kernel process context to soft or hard IRQ context (i.e. when
      account_system_vtime() gets called).  So that we can correctly
      distinguish time stolen from user time and time stolen from system
      time, without having to check the log on every exit to user mode,
      we store separate timestamps for exit to user mode and entry from
      user mode.
      
      On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
      in account_system_vtime() (as before), and then apportion the SPURR
      ticks since the last time we read it between scaled user time and
      scaled system time according to the relative proportions of user
      time and system time over the same interval.  This avoids having to
      read the SPURR on every kernel entry and exit.  On systems that have
      PURR but not SPURR (i.e., POWER5), we do the same using the PURR
      rather than the SPURR.
      
      This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
      for now since it conflicts with the use of the dispatch trace log
      by the time accounting code.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      cf9efce0
  27. 30 Aug, 2010 1 commit
    • Paul Mackerras's avatar
      powerpc/perf_event: Reduce latency of calling perf_event_do_pending · b0d278b7
      Paul Mackerras authored
      Commit 0fe1ac48
      
       ("powerpc/perf_event: Fix oops due to
      perf_event_do_pending call") moved the call to perf_event_do_pending
      in timer_interrupt() down so that it was after the irq_enter() call.
      Unfortunately this moved it after the code that checks whether it
      is time for the next decrementer clock event.  The result is that
      the call to perf_event_do_pending() won't happen until the next
      decrementer clock event is due.  This was pointed out by Milton
      Miller.
      
      This fixes it by moving the check for whether it's time for the
      next decrementer clock event down to the point where we're about
      to call the event handler, after we've called perf_event_do_pending.
      
      This has the side effect that on old pre-Core99 Powermacs where we
      use the ppc_n_lost_interrupts mechanism to replay interrupts, a
      replayed interrupt will incur a little more latency since it will
      now do the code from the irq_enter down to the irq_exit, that it
      used to skip.  However, these machines are now old and rare enough
      that this doesn't matter.  To make it clear that ppc_n_lost_interrupts
      is only used on Powermacs, and to speed up the code slightly on
      non-Powermac ppc32 machines, the code that tests ppc_n_lost_interrupts
      is now conditional on CONFIG_PMAC as well as CONFIG_PPC32.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b0d278b7
  28. 28 Jul, 2010 2 commits
    • Paul Mackerras's avatar
      powerpc: Clean up obsolete code relating to decrementer and timebase · d75d68cf
      Paul Mackerras authored
      
      
      Since the decrementer and timekeeping code was moved over to using
      the generic clockevents and timekeeping infrastructure, several
      variables and functions have been obsolete and effectively unused.
      This deletes them.
      
      In particular, wakeup_decrementer() is no longer needed since the
      generic code reprograms the decrementer as part of the process of
      resuming the timekeeping code, which happens during sysdev resume.
      Thus the wakeup_decrementer calls in the suspend_enter methods for
      52xx platforms have been removed.  The call in the powermac cpu
      frequency change code has been replaced by set_dec(1), which will
      cause a timer interrupt as soon as interrupts are enabled, and the
      generic code will then reprogram the decrementer with the correct
      value.
      
      This also simplifies the generic_suspend_en/disable_irqs functions
      and makes them static since they are not referenced outside time.c.
      The preempt_enable/disable calls are removed because the generic
      code has disabled all but the boot cpu at the point where these
      functions are called, so we can't be moved to another cpu.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d75d68cf
    • Paul Mackerras's avatar
      powerpc: Rework VDSO gettimeofday to prevent time going backwards · 0e469db8
      Paul Mackerras authored
      
      
      Currently it is possible for userspace to see the result of
      gettimeofday() going backwards by 1 microsecond, assuming that
      userspace is using the gettimeofday() in the VDSO.  The VDSO
      gettimeofday() algorithm computes the time in "xsecs", which are
      units of 2^-20 seconds, or approximately 0.954 microseconds,
      using the algorithm
      
      	now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec
      
      and then converts the time in xsecs to seconds and microseconds.
      
      The kernel updates the tb_orig_stamp and stamp_xsec values every
      tick in update_vsyscall().  If the length of the tick is not an
      integer number of xsecs, then some precision is lost in converting
      the current time to xsecs.  For example, with CONFIG_HZ=1000, the
      tick is 1ms long, which is 1048.576 xsecs.  That means that
      stamp_xsec will advance by either 1048 or 1049 on each tick.
      With the right conditions, it is possible for userspace to get
      (timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is
      slightly late in updating the vdso_datapage, and then for stamp_xsec
      to advance by 1048 when the kernel does update it, and for userspace
      to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to
      integer truncation.  The result is that time appears to go backwards
      by 1 microsecond.
      
      To fix this we change the VDSO gettimeofday to use a new field in the
      VDSO datapage which stores the nanoseconds part of the time as a
      fractional number of seconds in a 0.32 binary fraction format.
      (Or put another way, as a 32-bit number in units of 0.23283 ns.)
      This is convenient because we can use the mulhwu instruction to
      convert it to either microseconds or nanoseconds.
      
      Since it turns out that computing the time of day using this new field
      is simpler than either using stamp_xsec (as gettimeofday does) or
      stamp_xtime.tv_nsec (as clock_gettime does), this converts both
      gettimeofday and clock_gettime to use the new field.  The existing
      __do_get_tspec function is converted to use the new field and take
      a parameter in r7 that indicates the desired resolution, 1,000,000
      for microseconds or 1,000,000,000 for nanoseconds.  The __do_get_xsec
      function is then unused and is deleted.
      
      The new algorithm is
      
      	now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
      		+ (stamp_xtime_seconds << 32) + stamp_sec_fraction
      
      with 'now' in units of 2^-32 seconds.  That is then converted to
      seconds and either microseconds or nanoseconds with
      
      	seconds = now >> 32
      	partseconds = ((now & 0xffffffff) * resolution) >> 32
      
      The 32-bit VDSO code also makes a further simplification: it ignores
      the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary
      fraction.  Doing so gets rid of 4 multiply instructions.  Assuming
      a timebase frequency of 1GHz or less and an update interval of no
      more than 10ms, the upper 32 bits of tb_to_xs will be at least
      4503599, so the error from ignoring the low 32 bits will be at most
      2.2ns, which is more than an order of magnitude less than the time
      taken to do gettimeofday or clock_gettime on our fastest processors,
      so there is no possibility of seeing inconsistent values due to this.
      
      This also moves update_gtod() down next to its only caller, and makes
      update_vsyscall use the time passed in via the wall_time argument rather
      than accessing xtime directly.  At present, wall_time always points to
      xtime, but that could change in future.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0e469db8