- Jul 27, 2011
-
-
Oleg Nesterov authored
sys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend() change ->blocked directly. This is not correct, see the changelog in e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()" Change them to use set_current_blocked(). Another change is that now we are doing ->saved_sigmask = ->blocked lockless, it doesn't make any sense to do this under ->siglock. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Reviewed-by:
Matt Fleming <matt.fleming@linux.intel.com> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 26, 2011
-
-
Arun Sharma authored
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by:
Arun Sharma <asharma@fb.com> Reviewed-by:
Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by:
Mike Frysinger <vapier@gentoo.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
When a kernel BUG or oops occurs, ChromeOS intends to panic and immediately reboot, with stacktrace and other messages preserved in RAM across reboot. But the longer we delay, the more likely the user is to poweroff and lose the info. panic_timeout (seconds before rebooting) is set by panic= boot option or sysctl or /proc/sys/kernel/panic; but 0 means wait forever, so at present we have to delay at least 1 second. Let a negative number mean reboot immediately (with the small cosmetic benefit of suppressing that newline-less "Rebooting in %d seconds.." message). Signed-off-by:
Hugh Dickins <hughd@chromium.org> Signed-off-by:
Mandeep Singh Baines <msb@chromium.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Olaf Hering <olaf@aepfle.de> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Vitaliy Ivanov authored
Selecting GCOV for UML causing configuration mismatch: warning: (GCOV_KERNEL) selects CONSTRUCTORS which has unmet direct dependencies (!UML) Constructors are not needed for UML. Signed-off-by:
Vitaliy Ivanov <vitalivanov@gmail.com> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Acked-by:
Richard Weinberger <richard@nod.at> Acked-by:
WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Vasiliy Kulikov authored
Add support for the shm_rmid_forced sysctl. If set to 1, all shared memory objects in current ipc namespace will be automatically forced to use IPC_RMID. The POSIX way of handling shmem allows one to create shm objects and call shmdt(), leaving shm object associated with no process, thus consuming memory not counted via rlimits. With shm_rmid_forced=1 the shared memory object is counted at least for one process, so OOM killer may effectively kill the fat process holding the shared memory. It obviously breaks POSIX - some programs relying on the feature would stop working. So set shm_rmid_forced=1 only if you're sure nobody uses "orphaned" memory. Use shm_rmid_forced=0 by default for compatability reasons. The feature was previously impemented in -ow as a configure option. [akpm@linux-foundation.org: fix documentation, per Randy] [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: readability/conventionality tweaks] [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout] Signed-off-by:
Vasiliy Kulikov <segoon@openwall.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Solar Designer <solar@openwall.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Daniel Rebelo de Oliveira authored
Signed-off-by:
Daniel Rebelo de Oliveira <psykon@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
[ This patch has already been accepted as commit 0ac0c0d0 but later reverted (commit 35926ff5) because it itroduced arch specific __node_random which was defined only for x86 code so it broke other archs. This is a followup without any arch specific code. Other than that there are no functional changes.] Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset. [akpm@linux-foundation.org: fix layout] [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration] [mhocko@suse.cz: Make it arch independent] [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build] Signed-off-by:
Jack Steiner <steiner@sgi.com> Signed-off-by:
Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by:
Michal Hocko <mhocko@suse.cz> Reviewed-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Menage <menage@google.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Robin Holt <holt@sgi.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 25, 2011
-
-
Stephen Boyd authored
If CONFIG_IKCONFIG=m but CONFIG_IKCONFIG_PROC=n we get a module that has no MODULE_LICENSE definition. Move the MODULE_*() definitions outside the CONFIG_IKCONFIG_PROC #ifdef to prevent this configuration from tainting the kernel. Signed-off-by:
Stephen Boyd <bebarino@gmail.com> Acked-by:
Randy Dunlap <rdunlap@xenotime.net> Acked-by:
WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Amerigo Wang authored
It is not necessary to share the same notifier.h. This patch already moves register_reboot_notifier() and unregister_reboot_notifier() from kernel/notifier.c to kernel/sys.c. [amwang@redhat.com: make allyesconfig succeed on ppc64] Signed-off-by:
WANG Cong <amwang@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by:
WANG Cong <amwang@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Maxin B John authored
devres uses the pointer value as key after it's freed, which is safe but triggers spurious use-after-free warnings on some static analysis tools. Rearrange code to avoid such warnings. Signed-off-by:
Maxin B. John <maxin.john@gmail.com> Reviewed-by:
Rolf Eike Beer <eike-kernel@sf-tec.de> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Benjamin Herrenschmidt authored
I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting succcess from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architecture are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by:
Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by:
Shan Hai <haishan.bai@gmail.com> Tested-by:
Shan Hai <haishan.bai@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <darren.hart@intel.com> Cc: <stable@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 24, 2011
-
-
Kay Sievers authored
Userspace wants to manage module parameters with udev rules. This currently only works for loaded modules, but not for built-in ones. To allow access to the built-in modules we need to re-trigger all module load events that happened before any userspace was running. We already do the same thing for all devices, subsystems(buses) and drivers. This adds the currently missing /sys/module/<name>/uevent files to all module entries. Signed-off-by:
Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (split & trivial fix)
-
Kay Sievers authored
This simplifies the next patch, where we have an attribute on a builtin module (ie. module == NULL). Signed-off-by:
Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (split into 2)
-
Jonas Bonn authored
The module loader code allows architectures to hook into the code by providing a small number of entry points that each arch must implement. This patch provides __weakly linked generic implementations of these entry points for architectures that don't need to do anything special. Signed-off-by:
Jonas Bonn <jonas@southpole.se> Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
Satoru Moriya authored
In STANDARD_PARAM_DEF, param_set_* handles the case in which strtolfn returns -EINVAL but it may return -ERANGE. If it returns -ERANGE, param_set_* may set uninitialized value to the paramerter. We should handle both cases. The one of the cases in which strtolfn() returns -ERANGE is following: *Type of module parameter is long *Set the parameter more than LONG_MAX Signed-off-by:
Satoru Moriya <satoru.moriya@hds.com> Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
- Jul 22, 2011
-
-
Lin Ming authored
No need to define a new "cfs_rq" variable in the "for" block. Just use the one at the top of the function. Signed-off-by:
Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Jul 21, 2011
-
-
Peter Zijlstra authored
Thomas noticed that a lock marked with lockdep_set_novalidate_class() will still trigger warnings for IRQ inversions. Cure this by skipping those when marking irq state. Reported-and-tested-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-2dp5vmpsxeraqm42kgww6ge2@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Lin Ming authored
PMU type id can be allocated dynamically, so perf_event_attr::type check when copying attribute from userspace to kernel is not valid. Signed-off-by:
Lin Ming <ming.m.lin@intel.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Stephan Baerwolf authored
"entity_key()" is only used in "__enqueue_entity()" and its only function is to subtract a tasks vruntime by its groups minvruntime. Before this patch a rbtree enqueue-decision is done by comparing two tasks in the style: "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))" which would be "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)" or (if reducing cfs_rq->min_vruntime out) "if (se->vruntime < entry->vruntime)" which is "if (entity_before(se, entry))" So we do not need "entity_key()". If "entity_before()" is inline we will also save one subtraction (only one, because "entity_key(cfs_rq, se)" was cached in "key") Signed-off-by:
Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Jan H. Schönherr authored
Clean up cfs/rt runqueue initialization by moving group scheduling related code into the corresponding functions. Also, keep group scheduling as an add-on, so that things are only done additionally, i. e. remove the init_*_rq() calls from init_tg_*_entry(). (This removes a redundant initalization during sched_init()). In case of group scheduling rt_rq->highest_prio.curr is now initialized twice, but adding another #ifdef seems not worth it. Signed-off-by:
Jan H. Schönherr <schnhrr@cs.tu-berlin.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310661163-16606-1-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Richard Kennedy authored
Reorder root_domain to remove 8 bytes of alignment padding on 64 bit builds, this shrinks the size from 1736 to 1728 bytes, therefore using one fewer cachelines. Signed-off-by:
Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310726492.1977.5.camel@castor.rsk Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Bianca Lutz authored
If a task group is to be created and alloc_fair_sched_group() fails, then the rt_bandwidth of the corresponding task group is not yet initialized. The caller, sched_create_group(), starts a clean up procedure which calls free_rt_sched_group() which unconditionally destroys the not yet initialized rt_bandwidth. This crashes or hangs the system in lock_hrtimer_base(): UP systems dereference a NULL pointer, while SMP systems loop endlessly on a condition that cannot become true. This patch simply avoids the destruction of rt_bandwidth when the initialization code path was not reached. (This was discovered by accident with a custom kernel modification.) Signed-off-by:
Bianca Lutz <sowilo@cs.tu-berlin.de> Signed-off-by:
Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-7-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Jan Schoenherr authored
The last reference to cpu_cfs_rq() was removed with commit 88ec22d3 ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus, remove this function, too. Signed-off-by:
Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Jan Schoenherr authored
This patch fixes a typo located in a comment. Signed-off-by:
Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-2-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this achieves that load_balance_fair() only iterates those task_groups that actually have tasks on busiest, and that we iterate bottom-up, trying to move light groups before the heavier ones. No idea if it will actually work out to be beneficial in practice, does anybody have a cgroup workload that might show a difference one way or the other? [ Also move update_h_load to sched_fair.c, loosing #ifdef-ery ] Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by:
Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Paul Turner authored
In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity with additional weight. However, we perform a double shares update on this entity as we continue the shares update traversal from this point, despite dequeue_entity() having already updated its queuing cfs_rq. Avoid this by starting from the parent when we resume. Signed-off-by:
Paul Turner <pjt@google.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Paul Turner authored
While looking at check_preempt_wakeup() I realized that we are potentially updating the wrong entity in the fair-group scheduling case. In this case the current task's cfs_rq may not be the same as the one used for the comparison between the waking task and the existing task's vruntime. This potentially results in us using a stale vruntime in the pre-emption decision, providing a small false preference for the previous task. The effects of this are bounded since we always perform a hierarchal update on the tick. Signed-off-by:
Paul Turner <pjt@google.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Oleg Nesterov authored
Simple test-case, int main(void) { int pid, status; pid = fork(); if (!pid) { pause(); assert(0); return 0x23; } assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); kill(pid, SIGCONT); // <--- also clears STOP_DEQUEUD assert(ptrace(PTRACE_CONT, pid, 0,0) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGCONT); assert(ptrace(PTRACE_CONT, pid, 0, SIGSTOP) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); kill(pid, SIGKILL); return 0; } Without the patch it hangs. After the patch SIGSTOP "injected" by the tracer is not ignored and stops the tracee. Note also that if this test-case uses, say, SIGWINCH instead of SIGCONT, everything works without the patch. This can't be right, and this is confusing. The problem is that SIGSTOP (or any other sig_kernel_stop() signal) has no effect without JOBCTL_STOP_DEQUEUED. This means it is simply ignored after PTRACE_CONT unless JOBCTL_STOP_DEQUEUED was set "by accident", say it wasn't cleared after initial SIGSTOP sent by PTRACE_ATTACH. At first glance we could change ptrace_signal() to add STOP_DEQUEUED after return from ptrace_stop(), but this is not right in case when the tracer does not change the reported SIGSTOP and SIGCONT comes in between. This is even more wrong with PT_SEIZED, SIGCONT adds JOBCTL_TRAP_NOTIFY which will be "lost" during the TRAP_STOP | TRAP_NOTIFY report. So lets add STOP_DEQUEUED _before_ we report the signal. It has no effect unless sig_kernel_stop() == T after the tracer resumes us, and in the latter case the pending STOP_DEQUEUED means no SIGCONT in between, we should stop. Note also that if SIGCONT was sent, PT_SEIZED tracee will correctly report PTRACE_EVENT_STOP/SIGTRAP and thus the tracer can notice the fact SIGSTOP was cancelled. Also, move the current->ptrace check from ptrace_signal() to its caller, get_signal_to_deliver(), this looks more natural. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Acked-by:
Tejun Heo <tj@kernel.org>
-
- Jul 20, 2011
-
-
Christoph Hellwig authored
Now that the last users is gone these can be removed. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
John Stultz authored
Terribly embarassing. Don't know how I committed this, but its KERN_WARNING not KERN_WARN. This fixes the following compile error: kernel/time/timekeeping.c: In function ‘__timekeeping_inject_sleeptime’: kernel/time/timekeeping.c:608: error: ‘KERN_WARN’ undeclared (first use in this function) kernel/time/timekeeping.c:608: error: (Each undeclared identifier is reported only once kernel/time/timekeeping.c:608: error: for each function it appears in.) kernel/time/timekeeping.c:608: error: expected ‘)’ before string constant make[2]: *** [kernel/time/timekeeping.o] Error 1 Reported-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
John Stultz <john.stultz@linaro.org>
-
Paul E. McKenney authored
The RCU callback free_head just calls kfree(), so we can use kfree_rcu() instead of call_rcu(). Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Josh Triplett <josh@joshtriplett.org>
-
Lai Jiangshan authored
The rcu callback __put_tree() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(__put_tree). Signed-off-by:
Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Reviewed-by:
Josh Triplett <josh@joshtriplett.org>
-
Paul E. McKenney authored
The __lock_task_sighand() function calls rcu_read_lock() with interrupts and preemption enabled, but later calls rcu_read_unlock() with interrupts disabled. It is therefore possible that this RCU read-side critical section will be preempted and later RCU priority boosted, which means that rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but with interrupts disabled. This results in lockdep splats, so this commit nests the RCU read-side critical section within the interrupt-disabled region of code. This prevents the RCU read-side critical section from being preempted, and thus prevents the attempt to deboost with interrupts disabled. It is quite possible that a better long-term fix is to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock. Signed-off-by:
Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
Peter Zijlstra authored
The rcu_read_unlock_special() function relies on in_irq() to exclude scheduler activity from interrupt level. This fails because exit_irq() can invoke the scheduler after clearing the preempt_count() bits that in_irq() uses to determine that it is at interrupt level. This situation can result in failures as follows: $task IRQ SoftIRQ rcu_read_lock() /* do stuff */ <preempt> |= UNLOCK_BLOCKED rcu_read_unlock() --t->rcu_read_lock_nesting irq_enter(); /* do stuff, don't use RCU */ irq_exit(); sub_preempt_count(IRQ_EXIT_OFFSET); invoke_softirq() ttwu(); spin_lock_irq(&pi->lock) rcu_read_lock(); /* do stuff */ rcu_read_unlock(); rcu_read_unlock_special() rcu_report_exp_rnp() ttwu() spin_lock_irq(&pi->lock) /* deadlock */ rcu_read_unlock_special(t); Ed can simply trigger this 'easy' because invoke_softirq() immediately does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first, but even without that the above happens. Cure this by also excluding softirqs from the rcu_read_unlock_special() handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done from full softirq context. [ Alternatively, delaying the ->rcu_read_lock_nesting decrement until after the special handling would make the thing more robust in the face of interrupts as well. And there is a separate patch for that. ] Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by:
Ed Tomlinson <edt@aei.ca> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
Peter Zijlstra authored
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual work. Traditionally we never did any actual work from the resched IPI and all magic happened in the return from interrupt path. Now that we do do some work, we need to ensure irq_{enter,exit} are called so that we don't confuse things. This affects things like timekeeping, NO_HZ and RCU, basically everything with a hook in irq_enter/exit. Explicit examples of things going wrong are: sched_clock_cpu() -- has a callback when leaving NO_HZ state to take a new reading from GTOD and TSC. Without this callback, time is stuck in the past. RCU -- needs in_irq() to work in order to avoid some nasty deadlocks Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
Paul E. McKenney authored
The addition of RCU read-side critical sections within runqueue and priority-inheritance lock critical sections introduced some deadlock cycles, for example, involving interrupts from __rcu_read_unlock() where the interrupt handlers call wake_up(). This situation can cause the instance of __rcu_read_unlock() invoked from interrupt to do some of the processing that would otherwise have been carried out by the task-level instance of __rcu_read_unlock(). When the interrupt-level instance of __rcu_read_unlock() is called with a scheduler lock held from interrupt-entry/exit situations where in_irq() returns false, deadlock can result. This commit resolves these deadlocks by using negative values of the per-task ->rcu_read_lock_nesting counter to indicate that an instance of __rcu_read_unlock() is in flight, which in turn prevents instances from interrupt handlers from doing any special processing. This patch is inspired by Steven Rostedt's earlier patch that similarly made __rcu_read_unlock() guard against interrupt-mediated recursion (see https://lkml.org/lkml/2011/7/15/326 ), but this commit refines Steven's approach to avoid the need for preemption disabling on the __rcu_read_unlock() fastpath and to also avoid the need for manipulating a separate per-CPU variable. This patch avoids need for preempt_disable() by instead using negative values of the per-task ->rcu_read_lock_nesting counter. Note that nested rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will never see ->rcu_read_lock_nesting go to zero, and will therefore never invoke rcu_read_unlock_special(), thus preventing them from seeing the RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special. This patch also adds a check for ->rcu_read_unlock_special being negative in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS bit from being set should a scheduling-clock interrupt occur while __rcu_read_unlock() is exiting from an outermost RCU read-side critical section. Of course, __rcu_read_unlock() can be preempted during the time that ->rcu_read_lock_nesting is negative. This could result in the setting of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it, and would also result it this task being queued on the corresponding rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side critical section would enter rcu_read_unlock_special() to clean up -- which could result in deadlock if that critical section happened to be in the scheduler where the runqueue or priority-inheritance locks were held. This situation is dealt with by making rcu_preempt_note_context_switch() check for negative ->rcu_read_lock_nesting, thus refraining from queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting from the outermost RCU read-side critical section (in other words, we really are no longer actually in that RCU read-side critical section). In addition, rcu_preempt_note_context_switch() invokes rcu_read_unlock_special() to carry out the cleanup in this case, which clears out the ->rcu_read_unlock_special bits and dequeues the task (if necessary), in turn avoiding needless delay of the current RCU grace period and needless RCU priority boosting. It is still illegal to call rcu_read_unlock() while holding a scheduler lock if the prior RCU read-side critical section has ever had either preemption or irqs enabled. However, the common use case is legal, namely where then entire RCU read-side critical section executes with irqs disabled, for example, when the scheduler lock is held across the entire lifetime of the RCU read-side critical section. Signed-off-by:
Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
Peter Zijlstra authored
When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This avoids single node systems from touching funny NUMA sched_domain creation code and reduces the risks of the new SD_OVERLAP code. Requested-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: mahesh@linux.vnet.ibm.com Cc: benh@kernel.crashing.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each-other. This is needed for machines with more than 16 nodes, because sched_domain_node_span() will generate a node mask from the 16 nearest nodes without regard if these masks have any overlap. Currently sched_domains have a sched_group that maps to their child sched_domain span, and since there is no overlap we share the sched_group between the sched_domains of the various CPUs. If however there is overlap, we would need to link the sched_group list in different ways for each cpu, and hence sharing isn't possible. In order to solve this, allocate private sched_groups for each CPU's sched_domain but have the sched_groups share a sched_group_power structure such that we can uniquely track the power. Reported-and-tested-by:
Anton Blanchard <anton@samba.org> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
In order to prepare for non-unique sched_groups per domain, we need to carry the cpu_power elsewhere, so put a level of indirection in. Reported-and-tested-by:
Anton Blanchard <anton@samba.org> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Jul 19, 2011
-
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-