Skip to content
  • Ingo Molnar's avatar
    [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code · 0888f06a
    Ingo Molnar authored
    Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
    xruns starting at the 2.6.18-rt kernel, and those problems persisted all
    until current -rt kernels. The latencies were serious and unjustified by
    system load, often in the milliseconds range.
    
    After a patient and heroic multi-month effort of Fernando, where he
    tested dozens of kernels, tried various configs, boot options,
    test-patches of mine and provided latency traces of those incidents, the
    following 'smoking gun' trace was captured by him:
    
                     _------=> CPU#
                    / _-----=> irqs-off
                   | / _----=> need-resched
                   || / _---=> hardirq/softirq
                   ||| / _--=> preempt-depth
                   |||| /
                   |||||     delay
       cmd     pid ||||| time  |   caller
          \   /    |||||   \   |   /
      IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
      IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
      IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
      IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
      IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
      ...
      <idle>-0     1...1   11us!: default_idle (cpu_idle)
      ...
      <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
      ...
       <...>-5856  0D..2  618us : __switch_to (__schedule)
       <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
       <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
       <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
       <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)
    
    what is visible in this trace is that CPU#1 ran try_to_wake_up() for
    PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
    for CPU#0. But it decided to not send an IPI that no CPU - due to
    TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
    and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
    result was a 600+ usecs latency and a missed wakeup!
    
    the bug turned out to be an idle-wakeup bug introduced into the mainline
    kernel this summer via an optimization in the x86_64 tree:
    
        commit 495ab9c0
    
    
        Author: Andi Kleen <ak@suse.de>
        Date:   Mon Jun 26 13:59:11 2006 +0200
    
        [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
    
        During some profiling I noticed that default_idle causes a lot of
        memory traffic. I think that is caused by the atomic operations
        to clear/set the polling flag in thread_info. There is actually
        no reason to make this atomic - only the idle thread does it
        to itself, other CPUs only read it. So I moved it into ti->status.
    
    the problem is this type of change:
    
            if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
    -               clear_thread_flag(TIF_POLLING_NRFLAG);
    +               current_thread_info()->status &= ~TS_POLLING;
                    smp_mb__after_clear_bit();
                    while (!need_resched()) {
                            local_irq_disable();
    
    this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
    clear_thread_flag() is defined as:
    
            clear_bit(flag, &ti->flags);
    
    and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:
    
      static inline void clear_bit(int nr, volatile unsigned long * addr)
      {
              __asm__ __volatile__( LOCK_PREFIX
                      "btrl %1,%0"
    
    hence smp_mb__after_clear_bit() is defined as a simple compile barrier:
    
      #define smp_mb__after_clear_bit()       barrier()
    
    but the explicit TS_POLLING clearing introduced by the patch:
    
    +               current_thread_info()->status &= ~TS_POLLING;
    
    is not an atomic op! So the clearing of the TS_POLLING bit is freely
    reorderable with the reading of the NEED_RESCHED bit - and both now
    reside in different memory addresses.
    
    CPU idle wakeup very much depends on ordered memory ops, the clearing of
    the TS_POLLING flag must always be done before we test need_resched()
    and hit the idle instruction(s). [Symmetrically, the wakeup code needs
    to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
    ordering is paramount.]
    
    Fernando's dual-core Athlon64 system has a sufficiently advanced memory
    ordering model so that it triggered this scenario very often.
    
    ( And it also turned out that the reason why these latencies never
      triggered on my testsystems is that i routinely use idle=poll, which
      was the only idle variant not affected by this bug. )
    
    The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
    act as an absolute barrier between the TS_POLLING write and the
    NEED_RESCHED read. This affects almost all idling methods (default,
    ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.
    
    Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
    Tested-by: default avatarFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
    Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    0888f06a