    sched/cpupri: Fix memory barriers for vec updates to always be in order · d473750b
    Steven Rostedt authored
    
    
    [ This patch actually compiles. Thanks to Mike Galbraith for pointing
    that out. I compiled and booted this patch with no issues. ]
    
    Re-examining the cpupri patch, I see there's a possible race because the
    updates of the two priority vectors' vec->count values are not ordered
    by a memory barrier.
    
    When an RT runqueue is overloaded and wants to push an RT task to
    another runqueue, it scans the RT priority vectors in a loop from the
    lowest priority to the highest.
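
    Roughly, the scan side looks like the following sketch (modeled on the
    cpupri_find() loop in kernel/sched_cpupri.c; the affinity check and the
    filling of lowest_mask are left out here):

        for (idx = 0; idx < task_pri; idx++) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[idx];

                /* a zero count means no runqueue is currently rated
                   at this priority, so the vector is skipped */
                if (!atomic_read(&vec->count))
                        continue;

                /* otherwise vec->mask names the candidate runqueues
                   for the push */
        }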
    
    When we queue or dequeue an RT task that changes a runqueue's highest
    priority task, we update the vectors to show that the runqueue is now
    rated at a different priority. To do this, we first set the new
    priority's mask and increment that vector's vec->count, and then
    decrement the old priority's vec->count and clear its mask.
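
    In sketch form (modeled on cpupri_set(); newpri and oldpri are the new
    and old ratings of this runqueue, and CPUPRI_INVALID marks a rating
    that has no vector):

        if (newpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];

                cpumask_set_cpu(cpu, vec->mask);   /* new mask first */
                atomic_inc(&vec->count);           /* then its count */
        }
        if (oldpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];

                atomic_dec(&vec->count);           /* old count next */
                cpumask_clear_cpu(cpu, vec->mask); /* then its mask  */
        }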
    
    If we are lowering the runqueue's RT priority rating, this will trigger
    an RT pull, and we do not care whether we miss pushing to this runqueue.
    
    But if we raise the priority, and the new priority is still lower than
    that of an RT task that is looking to be pushed, we must make sure that
    this runqueue is still seen by the push algorithm (the loop).
    
    Because the loop reads from lowest to highest, and the new priority is
    set before the old one is cleared, we will see either the new or the
    old priority set, and that vector will be checked.
    
    But! Since there's no memory barrier between the two updates, the
    decrement of the old count may become visible before the increment of
    the new count. The loop may then see the old count as zero and skip
    that vector, and also see the new count as zero before its update
    becomes visible. A runqueue that the RT task could move to could be
    missed entirely.
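
    As an illustration (made-up priorities; the runqueue is being raised
    from priority 2 to priority 4 while another CPU scans for somewhere to
    push a priority-5 task):

        CPU 0 (cpupri_set, 2 -> 4)          CPU 1 (push loop)
        --------------------------          -----------------
        atomic_inc(&vec[4]->count)
        atomic_dec(&vec[2]->count)
          (without a barrier, the decrement
           may become visible first)
                                            vec[2]->count == 0  -> skip
                                            vec[4]->count == 0  -> skip
                                            the runqueue is never considered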
    
    A conditional memory barrier is now placed between the two vec->count
    updates, and it is only executed when both updates are actually
    performed (that is, when the runqueue moves from one valid priority
    vector to another).
    
    The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
    as those barriers are not needed on archs that already synchronize
    atomic_inc()/atomic_dec().
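
    The write side then looks roughly like this (a sketch following the
    description above; do_mb records whether the new-priority half of the
    update was actually performed):

        int do_mb = 0;

        if (newpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];

                cpumask_set_cpu(cpu, vec->mask);
                /* make the new mask visible before its count */
                smp_mb__before_atomic_inc();
                atomic_inc(&vec->count);
                do_mb = 1;
        }
        if (oldpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];

                /*
                 * Order the increment of the new count before the
                 * decrement of the old one, but only when both
                 * updates are done.
                 */
                if (do_mb)
                        smp_mb__after_atomic_inc();

                atomic_dec(&vec->count);
                /* make the decrement visible before the mask clear */
                smp_mb__after_atomic_dec();
                cpumask_clear_cpu(cpu, vec->mask);
        }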
    
    The smp_rmb() has been moved so that it is called on every iteration of
    the loop, so that the ordering between the two updates is seen on each
    iteration, as an arch is otherwise free to optimize the reads of the
    counters across the loop.
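
    On the read side, the loop then becomes (again a sketch):

        for (idx = 0; idx < task_pri; idx++) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[idx];
                int skip = 0;

                if (!atomic_read(&vec->count))
                        skip = 1;

                /*
                 * Issued on every iteration, between the count read and
                 * the mask read, pairing with the write-side barriers.
                 */
                smp_rmb();

                if (skip)
                        continue;

                /* vec->mask is examined here */
        }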
    
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
    
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>