1. 12 Mar, 2012 1 commit
  2. 01 Mar, 2012 2 commits
  3. 22 Feb, 2012 1 commit
  4. 27 Jan, 2012 1 commit
    • Chanho Min's avatar
      sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW · cb297a3e
      Chanho Min authored
      
      
      This issue happens under the following conditions:
      
       1. preemption is off
       2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
       3. RT scheduling class
       4. SMP system
      
      Sequence is as follows:
      
       1.suppose current task is A. start schedule()
       2.task A is enqueued pushable task at the entry of schedule()
         __schedule
          prev = rq->curr;
          ...
          put_prev_task
           put_prev_task_rt
            enqueue_pushable_task
       4.pick the task B as next task.
         next = pick_next_task(rq);
       3.rq->curr set to task B and context_switch is started.
         rq->curr = next;
       4.At the entry of context_swtich, release this cpu's rq->lock.
         context_switch
          prepare_task_switch
           prepare_lock_switch
            raw_spin_unlock_irq(&rq->lock);
       5.Shortly after rq->lock is released, interrupt is occurred and start IRQ context
       6.try_to_wake_up() which called by ISR acquires rq->lock
          try_to_wake_up
           ttwu_remote
            rq = __task_rq_lock(p)
            ttwu_do_wakeup(rq, p, wake_flags);
              task_woken_rt
       7.push_rt_task picks the task A which is enqueued before.
         task_woken_rt
          push_rt_tasks(rq)
           next_task = pick_next_pushable_task(rq)
       8.At find_lock_lowest_rq(), If double_lock_balance() returns 0,
         lowest_rq can be the remote rq.
        (But,If preemption is on, double_lock_balance always return 1 and it
         does't happen.)
         push_rt_task
          find_lock_lowest_rq
           if (double_lock_balance(rq, lowest_rq))..
       9.find_lock_lowest_rq return the available rq. task A is migrated to
         the remote cpu/rq.
         push_rt_task
          ...
          deactivate_task(rq, next_task, 0);
          set_task_cpu(next_task, lowest_rq->cpu);
          activate_task(lowest_rq, next_task, 0);
       10. But, task A is on irq context at this cpu.
           So, task A is scheduled by two cpus at the same time until restore from IRQ.
           Task A's stack is corrupted.
      
      To fix it, don't migrate an RT task if it's still running.
      Signed-off-by: default avatarChanho Min <chanho.min@lge.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      cb297a3e
  5. 06 Dec, 2011 2 commits
    • Shan Hai's avatar
      sched/rt: Code cleanup, remove a redundant function call · 5b680fd6
      Shan Hai authored
      
      
      The second call to sched_rt_period() is redundant, because the value of the
      rt_runtime was already read and it was protected by the ->rt_runtime_lock.
      Signed-off-by: default avatarShan Hai <haishan.bai@gmail.com>
      Reviewed-by: default avatarKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1322535836-13590-2-git-send-email-haishan.bai@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      5b680fd6
    • Mike Galbraith's avatar
      sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles · 76854c7e
      Mike Galbraith authored
      
      
      rt.nr_cpus_allowed is always available, use it to bail from select_task_rq()
      when only one cpu can be used, and saves some cycles for pinned tasks.
      
      See the line marked with '*' below:
      
        # taskset -c 3 pipe-test
      
         PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
      ------------------------------------------------------------------------------------------------
      
                   Virgin                                    Patched
                   samples  pcnt function                    samples  pcnt function
                   _______ _____ ___________________________ _______ _____ ___________________________
      
                   2880.00 10.2% __schedule                  3136.00 11.3% __schedule
                   1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
                   1458.00  5.2% system_call                 1534.00  5.5% system_call
                   1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
                   1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
                   1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
                   1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
                    872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
                    687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
                    682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
                    643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
                    617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
                    612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
                    596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
                    542.00  1.9% sysret_check                 559.00  2.0% sysret_check
                    467.00  1.7% fget_light                   472.00  1.7% fget_light
                    462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
                    437.00  1.5% vfs_write                    442.00  1.6% vfs_write
                    431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
      *             413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
                    386.00  1.4% update_curr                  402.00  1.4% update_curr
                    385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
                    377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
                    369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
                    360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
                    342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net
      
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      76854c7e
  6. 17 Nov, 2011 2 commits
  7. 16 Nov, 2011 1 commit
  8. 14 Nov, 2011 1 commit
  9. 06 Oct, 2011 3 commits
  10. 18 Sep, 2011 1 commit
  11. 14 Aug, 2011 6 commits
  12. 01 Jul, 2011 1 commit
  13. 15 Jun, 2011 2 commits
  14. 28 May, 2011 1 commit
  15. 16 May, 2011 2 commits
  16. 14 Apr, 2011 2 commits
  17. 31 Mar, 2011 1 commit
  18. 04 Mar, 2011 1 commit
    • Balbir Singh's avatar
      sched: Fix sched rt group scheduling when hierachy is enabled · 0c3b9168
      Balbir Singh authored
      
      
      The current sched rt code is broken when it comes to hierarchical
      scheduling, this patch fixes two problems
      
      1. It adds redundant enqueuing (harmless) when it finds a queue
         has tasks enqueued, but it has no run time and it is not
         throttled.
      
      2. The most important change is in sched_rt_rq_enqueue/dequeue.
         The code just picks the rt_rq belonging to the current cpu
         on which the period timer runs, the patch fixes it, so that
         the correct rt_se is enqueued/dequeued.
      
      Tested with a simple hierarchy
      
      /c/d, c and d assigned similar runtimes of 50,000 and a while
      1 loop runs within "d". Both c and d get throttled, without
      the patch, the task just stops running and never runs (depends
      on where the sched_rt b/w timer runs). With the patch, the
      task is throttled and runs as expected.
      
      [ bharata, suggestions on how to pick the rt_se belong to the
        rt_rq and correct cpu ]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <20110303113435.GA2868@balbir.in.ibm.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      0c3b9168
  19. 03 Feb, 2011 1 commit
  20. 26 Jan, 2011 1 commit
    • Peter Zijlstra's avatar
      sched: Fix switch_from_fair() · da7a735e
      Peter Zijlstra authored
      
      
      When a task is taken out of the fair class we must ensure the vruntime
      is properly normalized because when we put it back in it will assume
      to be normalized.
      
      The case that goes wrong is when changing away from the fair class
      while sleeping. Sleeping tasks have non-normalized vruntime in order
      to make sleeper-fairness work. So treat the switch away from fair as a
      wakeup and preserve the relative vruntime.
      
      Also update sysrq-n to call the ->switch_{to,from} methods.
      Reported-by: default avatarOnkalo Samu <samu.p.onkalo@nokia.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      da7a735e
  21. 13 Dec, 2010 2 commits
    • Steven Rostedt's avatar
      sched: Change pick_next_task_rt from unlikely to likely · 8e54a2c0
      Steven Rostedt authored
      
      
      The if (unlikely(!rt_rq->rt_nr_running)) test in pick_next_task_rt()
      tests if there is another rt task ready to run. If so, then pick it.
      
      In most systems, only one RT task runs at a time most of the time.
      Running the branch unlikely annotator profiler on a system doing average
      work "running firefox, evolution, xchat, distcc builds, etc", it showed the
      following:
      
       correct incorrect  %        Function                  File              Line
       ------- ---------  -        --------                  ----              ----
        324344 135104992  99 _pick_next_task_rt             sched_rt.c           1064
      
      99% of the time the condition is true. When an RT task schedules out,
      it is unlikely that another RT task is waiting to run on that same run queue.
      
      Simply remove the unlikely() condition.
      Acked-by: default avatarGregory Haskins <ghaskins@novell.com>
      Cc:Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      8e54a2c0
    • Yong Zhang's avatar
      sched: Cleanup pre_schedule_rt · 33c3d6c6
      Yong Zhang authored
      Since [commit 9a897c5a
      
      :
      sched: RT-balance, replace hooks with pre/post schedule and wakeup methods]
      we must call pre_schedule_rt if prev is rt task.
      So condition rt_task(prev) is always true and the 'unlikely' declaration is
      simply incorrect.
      Signed-off-by: default avatarYong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      33c3d6c6
  22. 18 Nov, 2010 1 commit
  23. 18 Oct, 2010 2 commits
    • Venkatesh Pallipadi's avatar
      sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi authored
      
      
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to acoount only actual task time from
      currently running task. Now update_curr(), modifies the delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: default avatarVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      305e6835
    • Peter Zijlstra's avatar
      sched: Unindent labels · 49246274
      Peter Zijlstra authored
      
      
      Labels should be on column 0.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      49246274
  24. 21 Sep, 2010 2 commits
    • Steven Rostedt's avatar
      sched: Give CPU bound RT tasks preference · b3bc211c
      Steven Rostedt authored
      
      
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high RT task to another CPU first. Note, if all other CPUs are
      running higher priority tasks than the CPU bounded current task,
      then it will be preempted regardless.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      b3bc211c
    • Steven Rostedt's avatar
      sched: Try not to migrate higher priority RT tasks · 43fa5460
      Steven Rostedt authored
      
      
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      But in reality, the above does not happen much either, because the
      wake up of the lower prio task, which has already been boosted, if
      it was on the same CPU as the higher prio task, it would then
      migrate off of it. But anyway, we do not want to migrate them
      either.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The options used in this change log was to have priority
      inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter it does the
      busy loop, but the longer it spins. This is usually the case with
      RT tasks, the lower priority tasks usually run longer than higher
      priority tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread regardless of priority migrated a few hundred times.
      The higher priority tasks, were a little better but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher prority task blocked on a lower priority task, the lower
      priority task would inherit the higher priority task (which shows
      why task 6 was bumped so many times). When not using priority
      inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      Which shows a even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      43fa5460