Commit b8ae30ee authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'sched-core-for-linus' of...

Merge branch 'sched-core-for-linus' of git://

* 'sched-core-for-linus' of git:// (49 commits)
  stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
  sched, wait: Use wrapper functions
  sched: Remove a stale comment
  ondemand: Make the iowait-is-busy time a sysfs tunable
  ondemand: Solve a big performance issue by counting IOWAIT time as busy
  sched: Intoduce get_cpu_iowait_time_us()
  sched: Eliminate the ts->idle_lastupdate field
  sched: Fold updating of the last_update_time_info into update_ts_time_stats()
  sched: Update the idle statistics in get_cpu_idle_time_us()
  sched: Introduce a function to update the idle statistics
  sched: Add a comment to get_cpu_idle_time_us()
  cpu_stop: add dummy implementation for UP
  sched: Remove rq argument to the tracepoints
  rcu: need barrier() in UP synchronize_sched_expedited()
  sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
  sched: kill paranoia check in synchronize_sched_expedited()
  sched: replace migration_thread with cpu_stop
  stop_machine: reimplement using cpu_stop
  cpu_stop: implement stop_cpu[s]()
  sched: Fix select_idle_sibling() logic in select_task_rq_fair()
parents 4d7b4ac2 9c6f7e43
......@@ -182,16 +182,6 @@ Similarly, sched_expedited RCU provides the following:
sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0
sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0
sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0
state: -1 / 0:0 3:0 4:0
As before, the first four lines are similar to those for RCU.
The last line shows the task-migration state. The first number is
-1 if synchronize_sched_expedited() is idle, -2 if in the process of
posting wakeups to the migration kthreads, and N when waiting on CPU N.
Each of the colon-separated fields following the "/" is a CPU:state pair.
Valid states are "0" for idle, "1" for waiting for quiescent state,
"2" for passed through quiescent state, and "3" when a race with a
CPU-hotplug event forces use of the synchronize_sched() primitive.
......@@ -211,7 +211,7 @@ provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.
CONFIG_GROUP_SCHED strives to achieve exactly that. It lets tasks to be
CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks to be
grouped and divides CPU time fairly among such groups.
CONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and
......@@ -220,38 +220,11 @@ SCHED_RR) tasks.
CONFIG_FAIR_GROUP_SCHED permits to group CFS (i.e., SCHED_NORMAL and
At present, there are two (mutually exclusive) mechanisms to group tasks for
CPU bandwidth control purposes:
- Based on user id (CONFIG_USER_SCHED)
With this option, tasks are grouped according to their user id.
- Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)
This options needs CONFIG_CGROUPS to be defined, and lets the administrator
These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroups/cgroups.txt for more information about this filesystem.
Only one of these options to group tasks can be chosen and not both.
When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new
user and a "cpu_share" file is added in that directory.
# cd /sys/kernel/uids
# cat 512/cpu_share # Display user 512's CPU share
# echo 2048 > 512/cpu_share # Modify user 512's CPU share
# cat 512/cpu_share # Display user 512's CPU share
CPU bandwidth between two users is divided in the ratio of their CPU shares.
For example: if you would like user "root" to get twice the bandwidth of user
"guest," then set the cpu_share for both the users such that "root"'s cpu_share
is twice "guest"'s cpu_share.
When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
......@@ -273,24 +246,3 @@ task groups and modify their CPU share using the "cgroups" pseudo filesystem.
# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
8. Implementation note: user namespaces
User namespaces are intended to be hierarchical. But they are currently
only partially implemented. Each of those has ramifications for CFS.
First, since user namespaces are hierarchical, the /sys/kernel/uids
presentation is inadequate. Eventually we will likely want to use sysfs
tagging to provide private views of /sys/kernel/uids within each user
Second, the hierarchical nature is intended to support completely
unprivileged use of user namespaces. So if using user groups, then
we want the users in a user namespace to be children of the user
who created it.
That is currently unimplemented. So instead, every user in a new
user namespace will receive 1024 shares just like any user in the
initial user namespace. Note that at the moment creation of a new
user namespace requires each of CAP_SYS_ADMIN, CAP_SETUID, and
......@@ -126,23 +126,12 @@ priority!
2.3 Basis for grouping tasks
There are two compile-time settings for allocating CPU bandwidth. These are
configured using the "Basis for grouping tasks" multiple choice menu under
General setup > Group CPU Scheduler:
a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")
This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
each user .
The other option is:
.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.
This uses the /cgroup virtual file system and
"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
control group instead.
control group.
For more information on working with control groups, you should read
Documentation/cgroups/cgroups.txt as well.
......@@ -161,8 +150,7 @@ For now, this can be simplified to just the following (but see Future plans):
There is work in progress to make the scheduling period for each group
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
("/cgroup/<cgroup>/cpu.rt_period_us") configurable as well.
The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically its not very useful _yet_
......@@ -391,7 +391,6 @@ static void __init time_init_wq(void)
if (time_sync_wq)
time_sync_wq = create_singlethread_workqueue("timesync");
......@@ -73,6 +73,7 @@ enum {DBS_NORMAL_SAMPLE, DBS_SUB_SAMPLE};
struct cpu_dbs_info_s {
cputime64_t prev_cpu_idle;
cputime64_t prev_cpu_iowait;
cputime64_t prev_cpu_wall;
cputime64_t prev_cpu_nice;
struct cpufreq_policy *cur_policy;
......@@ -108,6 +109,7 @@ static struct dbs_tuners {
unsigned int down_differential;
unsigned int ignore_nice;
unsigned int powersave_bias;
unsigned int io_is_busy;
} dbs_tuners_ins = {
......@@ -148,6 +150,16 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu, cputime64_t *wall)
return idle_time;
static inline cputime64_t get_cpu_iowait_time(unsigned int cpu, cputime64_t *wall)
u64 iowait_time = get_cpu_iowait_time_us(cpu, wall);
if (iowait_time == -1ULL)
return 0;
return iowait_time;
* Find right freq to be set now with powersave_bias on.
* Returns the freq_hi to be used right now and will set freq_hi_jiffies,
......@@ -249,6 +261,7 @@ static ssize_t show_##file_name \
return sprintf(buf, "%u\n", dbs_tuners_ins.object); \
show_one(sampling_rate, sampling_rate);
show_one(io_is_busy, io_is_busy);
show_one(up_threshold, up_threshold);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
......@@ -299,6 +312,23 @@ static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b,
return count;
static ssize_t store_io_is_busy(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
unsigned int input;
int ret;
ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;
dbs_tuners_ins.io_is_busy = !!input;
return count;
static ssize_t store_up_threshold(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
......@@ -381,6 +411,7 @@ static struct global_attr _name = \
__ATTR(_name, 0644, show_##_name, store_##_name)
......@@ -392,6 +423,7 @@ static struct attribute *dbs_attributes[] = {
......@@ -470,14 +502,15 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
for_each_cpu(j, policy->cpus) {
struct cpu_dbs_info_s *j_dbs_info;
cputime64_t cur_wall_time, cur_idle_time;
unsigned int idle_time, wall_time;
cputime64_t cur_wall_time, cur_idle_time, cur_iowait_time;
unsigned int idle_time, wall_time, iowait_time;
unsigned int load, load_freq;
int freq_avg;
j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
cur_idle_time = get_cpu_idle_time(j, &cur_wall_time);
cur_iowait_time = get_cpu_iowait_time(j, &cur_wall_time);
wall_time = (unsigned int) cputime64_sub(cur_wall_time,
......@@ -487,6 +520,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
j_dbs_info->prev_cpu_idle = cur_idle_time;
iowait_time = (unsigned int) cputime64_sub(cur_iowait_time,
j_dbs_info->prev_cpu_iowait = cur_iowait_time;
if (dbs_tuners_ins.ignore_nice) {
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
......@@ -504,6 +541,16 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
idle_time += jiffies_to_usecs(cur_nice_jiffies);
* For the purpose of ondemand, waiting for disk IO is an
* indication that you're performance critical, and not that
* the system is actually idle. So subtract the iowait time
* from the cpu idle time.
if (dbs_tuners_ins.io_is_busy && idle_time >= iowait_time)
idle_time -= iowait_time;
if (unlikely(!wall_time || wall_time < idle_time))
......@@ -617,6 +664,29 @@ static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info)
* Not all CPUs want IO time to be accounted as busy; this dependson how
* efficient idling at a higher frequency/voltage is.
* Pavel Machek says this is not so for various generations of AMD and old
* Intel systems.
* Mike Chan (androidlcom) calis this is also not true for ARM.
* Because of this, whitelist specific known (series) of CPUs by default, and
* leave all others up to the user.
static int should_io_be_busy(void)
#if defined(CONFIG_X86)
* For Intel, Core 2 (model 15) andl later have an efficient idle.
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
boot_cpu_data.x86 == 6 &&
boot_cpu_data.x86_model >= 15)
return 1;
return 0;
static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
unsigned int event)
......@@ -679,6 +749,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
dbs_tuners_ins.sampling_rate =
dbs_tuners_ins.io_is_busy = should_io_be_busy();
......@@ -80,12 +80,6 @@ static void do_suspend(void)
shutting_down = SHUTDOWN_SUSPEND;
err = stop_machine_create();
if (err) {
printk(KERN_ERR "xen suspend: failed to setup stop_machine %d\n", err);
goto out;
/* If the kernel is preemptible, we need to freeze all the processes
to prevent them from being in the middle of a pagetable update
......@@ -93,7 +87,7 @@ static void do_suspend(void)
err = freeze_processes();
if (err) {
printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
goto out_destroy_sm;
goto out;
......@@ -136,12 +130,8 @@ out_resume:
shutting_down = SHUTDOWN_INVALID;
#endif /* CONFIG_PM_SLEEP */
......@@ -1140,8 +1140,7 @@ retry:
* ep_poll_callback() when events will become available.
init_waitqueue_entry(&wait, current);
wait.flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue(&ep->wq, &wait);
__add_wait_queue_exclusive(&ep->wq, &wait);
for (;;) {
......@@ -21,8 +21,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_locked(struct task_struct *p,
struct cpumask *mask);
extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
......@@ -69,9 +68,6 @@ struct seq_file;
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern int cpuset_mem_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
......@@ -105,10 +101,11 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
cpumask_copy(mask, cpu_possible_mask);
static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
struct cpumask *mask)
static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
cpumask_copy(mask, cpu_possible_mask);
cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
return cpumask_any(cpu_active_mask);
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
......@@ -157,9 +154,6 @@ static inline void cpuset_task_status_allowed(struct seq_file *m,
static inline void cpuset_lock(void) {}
static inline void cpuset_unlock(void) {}
static inline int cpuset_mem_spread_node(void)
return 0;
......@@ -64,8 +64,6 @@ static inline long rcu_batches_completed_bh(void)
return 0;
extern int rcu_expedited_torture_stats(char *page);
static inline void rcu_force_quiescent_state(void)
......@@ -36,7 +36,6 @@ extern void rcu_sched_qs(int cpu);
extern void rcu_bh_qs(int cpu);
extern void rcu_note_context_switch(int cpu);
extern int rcu_needs_cpu(int cpu);
extern int rcu_expedited_torture_stats(char *page);
......@@ -274,11 +274,17 @@ extern cpumask_var_t nohz_cpu_mask;
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern int select_nohz_load_balancer(int cpu);
extern int get_nohz_load_balancer(void);
extern int nohz_ratelimit(int cpu);
static inline int select_nohz_load_balancer(int cpu)
return 0;
static inline int nohz_ratelimit(int cpu)
return 0;
......@@ -953,6 +959,7 @@ struct sched_domain {
char *name;
unsigned int span_weight;
* Span of all CPUs in this domain.
......@@ -1025,12 +1032,17 @@ struct sched_domain;
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
#define ENQUEUE_HEAD 4
struct sched_class {
const struct sched_class *next;
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup,
bool head);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
......@@ -1039,7 +1051,8 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
int (*select_task_rq)(struct rq *rq, struct task_struct *p,
int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
......@@ -1076,36 +1089,8 @@ struct load_weight {
unsigned long weight, inv_weight;
* CFS stats for a schedulable entity (task, task-group etc)
* Current field usage histogram:
* 4 se->block_start
* 4 se->run_node
* 4 se->sleep_start
* 6 se->load.weight
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 last_wakeup;
u64 avg_overlap;
u64 nr_migrations;
u64 start_runtime;
u64 avg_wakeup;
struct sched_statistics {
u64 wait_start;
u64 wait_max;
u64 wait_count;
......@@ -1137,6 +1122,24 @@ struct sched_entity {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 nr_migrations;
struct sched_statistics statistics;
......@@ -1839,6 +1842,7 @@ extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
static inline void idle_task_exit(void) {}
/* "Bogolock": stop the entire machine, disable interrupts. This is a
very heavy lock, which is equivalent to grabbing every spinlock
(and more). So the "read" side to such a lock is anything which
disables preeempt. */
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/list.h>
#include <asm/system.h>
* stop_cpu[s]() is simplistic per-cpu maximum priority cpu
* monopolization mechanism. The caller can specify a non-sleeping
* function to be executed on a single or multiple cpus preempting all
* other processes and monopolizing those cpus until it finishes.
* Resources for this mechanism are preallocated when a cpu is brought
* up and requests are guaranteed to be served as long as the target
* cpus are online.
typedef int (*cpu_stop_fn_t)(void *arg);
struct cpu_stop_work {
struct list_head list; /* cpu_stopper->works */
cpu_stop_fn_t fn;
void *arg;
struct cpu_stop_done *done;
int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
#else /* CONFIG_SMP */
#include <linux/workqueue.h>
struct cpu_stop_work {
struct work_struct work;
cpu_stop_fn_t fn;
void *arg;
static inline int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
int ret = -ENOENT;
if (cpu == smp_processor_id())
ret = fn(arg);
return ret;
static void stop_one_cpu_nowait_workfn(struct work_struct *work)
struct cpu_stop_work *stwork =
container_of(work, struct cpu_stop_work, work);
static inline void stop_one_cpu_nowait(unsigned int cpu,
cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf)
if (cpu == smp_processor_id()) {
INIT_WORK(&work_buf->work, stop_one_cpu_nowait_workfn);
work_buf->fn = fn;
work_buf->arg = arg;
static inline int stop_cpus(const struct cpumask *cpumask,
cpu_stop_fn_t fn, void *arg)
if (cpumask_test_cpu(raw_smp_processor_id(), cpumask))