1. 07 Feb, 2013 2 commits
  2. 05 Feb, 2013 1 commit
  3. 04 Feb, 2013 1 commit
  4. 03 Feb, 2013 1 commit
  5. 31 Jan, 2013 1 commit
  6. 27 Jan, 2013 7 commits
    • Frederic Weisbecker's avatar
      cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker authored
      
      
      While remotely reading the cputime of a task running in a
      full dynticks CPU, the values stored in utime/stime fields
      of struct task_struct may be stale. Its values may be those
      of the last kernel <-> user transition time snapshot and
      we need to add the tickless time spent since this snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      6a61671b
    • Frederic Weisbecker's avatar
      kvm: Prepare to add generic guest entry/exit callbacks · c11f11fc
      Frederic Weisbecker authored
      
      
      Do some ground preparatory work before adding guest_enter()
      and guest_exit() context tracking callbacks. Those will
      be later used to read the guest cputime safely when we
      run in full dynticks mode.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c11f11fc
    • Frederic Weisbecker's avatar
      cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker authored
      
      
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running in a full
      dynticks CPU, we'll need to do some extra-computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      6fac4829
    • Frederic Weisbecker's avatar
      cputime: Allow dynamic switch between tick/virtual based cputime accounting · 3f4724ea
      Frederic Weisbecker authored
      
      
      Allow to dynamically switch between tick and virtual based
      cputime accounting. This way we can provide a kind of "on-demand"
      virtual based cputime accounting. In this mode, the kernel relies
      on the context tracking subsystem to dynamically probe on kernel
      boundaries.
      
      This is in preparation for being able to stop the timer tick in
      more places than just the idle state. Doing so will depend on
      CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
      the cputime without the tick by hooking on kernel/user boundaries.
      
      Depending whether the tick is stopped or not, we can switch between
      tick and vtime based accounting anytime in order to minimize the
      overhead associated to user hooks.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      3f4724ea
    • Frederic Weisbecker's avatar
      cputime: Generic on-demand virtual cputime accounting · abf917cd
      Frederic Weisbecker authored
      
      
      If we want to stop the tick further idle, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However implementing CONFIG_VIRT_CPU_ACCOUNTING require
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU needs by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      abf917cd
    • Frederic Weisbecker's avatar
      cputime: Move default nsecs_to_cputime() to jiffies based cputime file · ae8dda5c
      Frederic Weisbecker authored
      
      
      If the architecture doesn't provide an implementation of
      nsecs_to_cputime(), the cputime accounting core uses a
      default one that converts the nanoseconds to jiffies. However
      this only makes sense if we use the jiffies based cputime.
      
      For now it doesn't matter much because this API is only
      called on code that uses jiffies based cputime accounting.
      
      But the code may evolve and this API may be used more
      broadly in the future. Keeping this default implementation
      around is very error prone as it may introduce a bug and
      hide it on architectures that don't override this API.
      
      Fix this by moving this definition to the jiffies based
      cputime headers as it is the only place where it belongs to.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ae8dda5c
    • Frederic Weisbecker's avatar
      cputime: Avoid multiplication overflow on utime scaling · 62188451
      Frederic Weisbecker authored
      We scale stime, utime values based on rtime (sum_exec_runtime
      converted to jiffies). During scaling we multiple rtime * utime,
      which seems to be fine, since both values are converted to u64,
      but it's not.
      
      Let assume HZ is 1000 - 1ms tick. Process consist of 64 threads,
      run for 1 day, threads utilize 100% cpu on user space. Machine
      has 64 cpus.
      
      Process rtime = utime will be 64 * 24 * 60 * 60 * 1000 jiffies,
      which is 0x149970000. Multiplication rtime * utime result is
      0x1a855771100000000, which can not be covered in 64 bits.
      
      Result of overflow is stall of utime values visible in user
      space (prev_utime in kernel), even if application still consume
      lot of CPU time.
      
      A solution to solve this is to perform the multiplication on
      stime instead of utime. It's easy to grow the utime value fast
      with a CPU bound thread in userspace for example. Now we assume
      that doing so with stime is much harder. In most cases a task
      shouldn't ever spend much time in kernel space as it tends to
      sleep waiting for jobs completion when they take long to
      achieve. IO is the typical example of that.
      
      Hence scaling the cputime by performing the multiplication on
      stime instead of utime should considerably reduce the chances of
      an overflow on most workloads.
      
      This is largely inspired by a patch from Stanislaw Gruszka:
      http://lkml.kernel.org/r/20130107113144.GA7544@redhat.com
      
      Inspired-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Reported-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1359217182-25184-1-git-send-email-fweisbec@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      62188451
  7. 26 Jan, 2013 1 commit
    • Frederic Weisbecker's avatar
      context_tracking: Export context state for generic vtime · 95a79fd4
      Frederic Weisbecker authored
      
      
      Export the context state: whether we run in user / kernel
      from the context tracking subsystem point of view.
      
      This is going to be used by the generic virtual cputime
      accounting subsystem that is needed to implement the full
      dynticks.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      95a79fd4
  8. 25 Jan, 2013 1 commit
    • Ying Xue's avatar
      sched/rt: Avoid updating RT entry timeout twice within one tick period · 57d2aa00
      Ying Xue authored
      
      
      The issue below was found in 2.6.34-rt rather than mainline rt
      kernel, but the issue still exists upstream as well.
      
      So please let me describe how it was noticed on 2.6.34-rt:
      
      On this version, each softirq has its own thread, it means there
      is at least one RT FIFO task per cpu. The priority of these
      tasks is set to 49 by default. If user launches an RT FIFO task
      with priority lower than 49 of softirq RT tasks, it's possible
      there are two RT FIFO tasks enqueued one cpu runqueue at one
      moment. By current strategy of balancing RT tasks, when it comes
      to RT tasks, we really need to put them off to a CPU that they
      can run on as soon as possible. Even if it means a bit of cache
      line flushing, we want RT tasks to be run with the least latency.
      
      When the user RT FIFO task which just launched before is
      running, the sched timer tick of the current cpu happens. In this
      tick period, the timeout value of the user RT task will be
      updated once. Subsequently, we try to wake up one softirq RT
      task on its local cpu. As the priority of current user RT task
      is lower than the softirq RT task, the current task will be
      preempted by the higher priority softirq RT task. Before
      preemption, we check to see if current can readily move to a
      different cpu. If so, we will reschedule to allow the RT push logic
      to try to move current somewhere else. Whenever the woken
      softirq RT task runs, it first tries to migrate the user FIFO RT
      task over to a cpu that is running a task of lesser priority. If
      migration is done, it will send a reschedule request to the found
      cpu by IPI interrupt. Once the target cpu responds the IPI
      interrupt, it will pick the migrated user RT task to preempt its
      current task. When the user RT task is running on the new cpu,
      the sched timer tick of the cpu fires. So it will tick the user
      RT task again. This also means the RT task timeout value will be
      updated again. As the migration may be done in one tick period,
      it means the user RT task timeout value will be updated twice
      within one tick.
      
      If we set a limit on the amount of cpu time for the user RT task
      by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
      upon reaching the soft limit.
      
      But exactly when the SIGXCPU signal should be sent depends on the
      RT task timeout value. In fact the timeout mechanism of sending
      the SIGXCPU signal assumes the RT task timeout is increased once
      every tick.
      
      However, currently the timeout value may be added twice per
      tick. So it results in the SIGXCPU signal being sent earlier
      than expected.
      
      To solve this issue, we prevent the timeout value from increasing
      twice within one tick time by remembering the jiffies value of
      last updating the timeout. As long as the RT task's jiffies is
      different with the global jiffies value, we allow its timeout to
      be updated.
      Signed-off-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarFan Du <fan.du@windriver.com>
      Reviewed-by: default avatarYong Zhang <yong.zhang0@gmail.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      57d2aa00
  9. 24 Jan, 2013 3 commits
  10. 22 Jan, 2013 4 commits
    • Tejun Heo's avatar
      async: fix __lowest_in_progress() · f56c3196
      Tejun Heo authored
      Commit 083b804c
      
       ("async: use workqueue for worker pool") made it
      possible that async jobs are moved from pending to running out-of-order.
      While pending async jobs will be queued and dispatched for execution in
      the same order, nothing guarantees they'll enter "1) move self to the
      running queue" of async_run_entry_fn() in the same order.
      
      Before the conversion, async implemented its own worker pool.  An async
      worker, upon being woken up, fetches the first item from the pending
      list, which kept the executing lists sorted.  The conversion to
      workqueue was done by adding work_struct to each async_entry and async
      just schedules the work item.  The queueing and dispatching of such work
      items are still in order but now each worker thread is associated with a
      specific async_entry and moves that specific async_entry to the
      executing list.  So, depending on which worker reaches that point
      earlier, which is non-deterministic, we may end up moving an async_entry
      with larger cookie before one with smaller one.
      
      This broke __lowest_in_progress().  running->domain may not be properly
      sorted and is not guaranteed to contain lower cookies than pending list
      when not empty.  Fix it by ensuring sort-inserting to the running list
      and always looking at both pending and running when trying to determine
      the lowest cookie.
      
      Over time, the async synchronization implementation became quite messy.
      We better restructure it such that each async_entry is linked to two
      lists - one global and one per domain - and not move it when execution
      starts.  There's no reason to distinguish pending and running.  They
      behave the same for synchronization purposes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f56c3196
    • Oleg Nesterov's avatar
      wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task · 9067ac85
      Oleg Nesterov authored
      
      
      wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
      Change it to use TASK_NORMAL and add the WARN_ON().
      
      TASK_ALL has no other users, probably can be killed.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9067ac85
    • Oleg Nesterov's avatar
      ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL · 9899d11f
      Oleg Nesterov authored
      
      
      putreg() assumes that the tracee is not running and pt_regs_access() can
      safely play with its stack.  However a killed tracee can return from
      ptrace_stop() to the low-level asm code and do RESTORE_REST, this means
      that debugger can actually read/modify the kernel stack until the tracee
      does SAVE_REST again.
      
      set_task_blockstep() can race with SIGKILL too and in some sense this
      race is even worse, the very fact the tracee can be woken up breaks the
      logic.
      
      As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace()
      call, this ensures that nobody can ever wakeup the tracee while the
      debugger looks at it.  Not only this fixes the mentioned problems, we
      can do some cleanups/simplifications in arch_ptrace() paths.
      
      Probably ptrace_unfreeze_traced() needs more callers, for example it
      makes sense to make the tracee killable for oom-killer before
      access_process_vm().
      
      While at it, add the comment into may_ptrace_stop() to explain why
      ptrace_stop() still can't rely on SIGKILL and signal_pending_state().
      Reported-by: default avatarSalman Qazi <sqazi@google.com>
      Reported-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9899d11f
    • Oleg Nesterov's avatar
      ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up() · 910ffdb1
      Oleg Nesterov authored
      
      
      Cleanup and preparation for the next change.
      
      signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
      actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
      necessary mask.
      
      Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
      signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
      which adds __TASK_TRACED.
      
      This way ptrace_signal_wake_up() can work "inside" ptrace_request()
      even if the tracee doesn't have the TASK_WAKEKILL bit set.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      910ffdb1
  11. 21 Jan, 2013 1 commit
    • Steven Rostedt's avatar
      ftrace: Be first to run code modification on modules · c1bf08ac
      Steven Rostedt authored
      If some other kernel subsystem has a module notifier, and adds a kprobe
      to a ftrace mcount point (now that kprobes work on ftrace points),
      when the ftrace notifier runs it will fail and disable ftrace, as well
      as kprobes that are attached to ftrace points.
      
      Here's the error:
      
       WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
       Hardware name: Bochs
       Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
       Pid: 8068, comm: modprobe Tainted: GF            3.7.0-0.rc8.git0.1.fc19.x86_64 #1
       Call Trace:
        [<ffffffff8105e70f>] warn_slowpath_common+0x7f/0xc0
        [<ffffffff81134106>] ? __probe_kernel_read+0x46/0x70
        [<ffffffffa0180000>] ? 0xffffffffa017ffff
        [<ffffffffa0180000>] ? 0xffffffffa017ffff
        [<ffffffff8105e76a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff810fd189>] ftrace_bug+0x239/0x280
        [<ffffffff810fd626>] ftrace_process_locs+0x376/0x520
        [<ffffffff810fefb7>] ftrace_module_notify+0x47/0x50
        [<ffffffff8163912d>] notifier_call_chain+0x4d/0x70
        [<ffffffff810882f8>] __blocking_notifier_call_chain+0x58/0x80
        [<ffffffff81088336>] blocking_notifier_call_chain+0x16/0x20
        [<ffffffff810c2a23>] sys_init_module+0x73/0x220
        [<ffffffff8163d719>] system_call_fastpath+0x16/0x1b
       ---[ end trace 9ef46351e53bbf80 ]---
       ftrace failed to modify [<ffffffffa0180000>] init_once+0x0/0x20 [fat]
        actual: cc:bb:d2:4b:e1
      
      A kprobe was added to the init_once() function in the fat module on load.
      But this happened before ftrace could have touched the code. As ftrace
      didn't run yet, the kprobe system had no idea it was a ftrace point and
      simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).
      
      Then when ftrace went to modify the location from a call to mcount/fentry
      into a nop, it didn't see a call op, but instead it saw the breakpoint op
      and not knowing what to do with it, ftrace shut itself down.
      
      The solution is to simply give the ftrace module notifier the max priority.
      This should have been done regardless, as the core code ftrace modification
      also happens very early on in boot up. This makes the module modification
      closer to core modification.
      
      Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org
      
      
      
      Cc: stable@vger.kernel.org
      Acked-by: default avatarMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Reported-by: default avatarFrank Ch. Eigler <fche@redhat.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      c1bf08ac
  12. 20 Jan, 2013 2 commits
  13. 19 Jan, 2013 1 commit
  14. 16 Jan, 2013 1 commit
    • Tejun Heo's avatar
      module, async: async_synchronize_full() on module init iff async is used · 774a1221
      Tejun Heo authored
      If the default iosched is built as module, the kernel may deadlock
      while trying to load the iosched module on device probe if the probing
      was running off async.  This is because async_synchronize_full() at
      the end of module init ends up waiting for the async job which
      initiated the module loading.
      
       async A				modprobe
      
       1. finds a device
       2. registers the block device
       3. request_module(default iosched)
      					4. modprobe in userland
      					5. load and init module
      					6. async_synchronize_full()
      
      Async A waits for modprobe to finish in request_module() and modprobe
      waits for async A to finish in async_synchronize_full().
      
      Because there's no easy to track dependency once control goes out to
      userland, implementing properly nested flushing is difficult.  For
      now, make module init perform async_synchronize_full() iff module init
      has queued async jobs as suggested by Linus.
      
      This avoids the described deadlock because iosched module doesn't use
      async and thus wouldn't invoke async_synchronize_full().  This is
      hacky and incomplete.  It will deadlock if async module loading nests;
      however, this works around the known problem case and seems to be the
      best of bad options.
      
      For more details, please refer to the following thread.
      
        http://thread.gmane.org/gmane.linux.kernel/1420814
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarAlex Riesen <raa.lkml@gmail.com>
      Tested-by: default avatarMing Lei <ming.lei@canonical.com>
      Tested-by: default avatarAlex Riesen <raa.lkml@gmail.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      774a1221
  15. 14 Jan, 2013 1 commit
  16. 11 Jan, 2013 7 commits
    • Rusty Russell's avatar
      module: put modules in list much earlier. · 1fb9341a
      Rusty Russell authored
      Prarit's excellent bug report:
      > In recent Fedora releases (F17 & F18) some users have reported seeing
      > messages similar to
      >
      > [   15.478160] kvm: Could not allocate 304 bytes percpu data
      > [   15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
      > reserved chunk failed
      >
      > during system boot.  In some cases, users have also reported seeing this
      > message along with a failed load of other modules.
      >
      > What is happening is systemd is loading an instance of the kvm module for
      > each cpu found (see commit e9bda3b3
      
      ).  When the module load occurs the kernel
      > currently allocates the modules percpu data area prior to checking to see
      > if the module is already loaded or is in the process of being loaded.  If
      > the module is already loaded, or finishes load, the module loading code
      > releases the current instance's module's percpu data.
      
      Now we have a new state MODULE_STATE_UNFORMED, we can insert the
      module into the list (and thus guarantee its uniqueness) before we
      allocate the per-cpu region.
      Reported-by: default avatarPrarit Bhargava <prarit@redhat.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Tested-by: default avatarPrarit Bhargava <prarit@redhat.com>
      1fb9341a
    • Rusty Russell's avatar
      module: add new state MODULE_STATE_UNFORMED. · 0d21b0e3
      Rusty Russell authored
      
      
      You should never look at such a module, so it's excised from all paths
      which traverse the modules list.
      
      We add the state at the end, to avoid gratuitous ABI break (ksplice).
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      0d21b0e3
    • Andrew Morton's avatar
      kernel/audit.c: avoid negative sleep durations · 82919919
      Andrew Morton authored
      
      
      audit_log_start() performs the same jiffies comparison in two places.
      If sufficient time has elapsed between the two comparisons, the second
      one produces a negative sleep duration:
      
        schedule_timeout: wrong timeout value fffffffffffffff0
        Pid: 6606, comm: trinity-child1 Not tainted 3.8.0-rc1+ #43
        Call Trace:
          schedule_timeout+0x305/0x340
          audit_log_start+0x311/0x470
          audit_log_exit+0x4b/0xfb0
          __audit_syscall_exit+0x25f/0x2c0
          sysret_audit+0x17/0x21
      
      Fix it by performing the comparison a single time.
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82919919
    • Kees Cook's avatar
      audit: catch possible NULL audit buffers · 0644ec0c
      Kees Cook authored
      
      
      It's possible for audit_log_start() to return NULL.  Handle it in the
      various callers.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Julien Tinnes <jln@google.com>
      Cc: Will Drewry <wad@google.com>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0644ec0c
    • Kees Cook's avatar
      audit: create explicit AUDIT_SECCOMP event type · 7b9205bd
      Kees Cook authored
      
      
      The seccomp path was using AUDIT_ANOM_ABEND from when seccomp mode 1
      could only kill a process.  While we still want to make sure an audit
      record is forced on a kill, this should use a separate record type since
      seccomp mode 2 introduces other behaviors.
      
      In the case of "handled" behaviors (process wasn't killed), only emit a
      record if the process is under inspection.  This change also fixes
      userspace examination of seccomp audit events, since it was considered
      malformed due to missing fields of the AUDIT_ANOM_ABEND event type.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Julien Tinnes <jln@google.com>
      Acked-by: default avatarWill Drewry <wad@chromium.org>
      Acked-by: default avatarSteve Grubb <sgrubb@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b9205bd
    • Jiri Kosina's avatar
      lockdep, rwsem: provide down_write_nest_lock() · 1b963c81
      Jiri Kosina authored
      
      
      down_write_nest_lock() provides a means to annotate locking scenario
      where an outer lock is guaranteed to serialize the order nested locks
      are being acquired.
      
      This is analogoue to already existing mutex_lock_nest_lock() and
      spin_lock_nest_lock().
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b963c81
    • Steven Rostedt's avatar
      tracing: Fix regression with irqsoff tracer and tracing_on file · 2df8f8a6
      Steven Rostedt authored
      Commit 02404baf
      
       "tracing: Remove deprecated tracing_enabled file"
      removed the tracing_enabled file as it never worked properly and
      the tracing_on file should be used instead. But the tracing_on file
      didn't call into the tracers start/stop routines like the
      tracing_enabled file did. This caused trace-cmd to break when it
      enabled the irqsoff tracer.
      
      If you just did "echo irqsoff > current_tracer" then it would work
      properly. But the tool trace-cmd disables tracing first by writing
      "0" into the tracing_on file. Then it writes "irqsoff" into
      current_tracer and then writes "1" into tracing_on. Unfortunately,
      the above commit changed the irqsoff tracer to check the tracing_on
      status instead of the tracing_enabled status. If it's disabled then
      it does not start the tracer internals.
      
      The problem is that writing "1" into tracing_on does not call the
      tracers "start" routine like writing "1" into tracing_enabled did.
      This makes the irqsoff tracer not start when using the trace-cmd
      tool, and is a regression for userspace.
      
      Simple fix is to have the tracing_on file call the tracers start()
      method when being enabled (and the stop() method when disabled).
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      2df8f8a6
  17. 10 Jan, 2013 1 commit
  18. 09 Jan, 2013 1 commit
    • Steven Rostedt's avatar
      tracing: Fix regression of trace_options file setting · a8dd2176
      Steven Rostedt authored
      
      
      The latest change to allow trace options to be set on the command
      line also broke the trace_options file.
      
      The zeroing of the last byte of the option name that is echoed into
      the trace_option file was removed with the consolidation of some
      of the code. The compare between the option and what was written to
      the trace_options file fails because the string holding the data
      written doesn't terminate with a null character.
      
      A zero needs to be added to the end of the string copied from
      user space.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      a8dd2176
  19. 05 Jan, 2013 2 commits
  20. 04 Jan, 2013 1 commit
    • Roland Dreier's avatar
      printk: fix incorrect length from print_time() when seconds > 99999 · 35dac27c
      Roland Dreier authored
      print_prefix() passes a NULL buf to print_time() to get the length of
      the time prefix; when printk times are enabled, the current code just
      returns the constant 15, which matches the format "[%5lu.%06lu] " used
      to print the time value.  However, this is obviously incorrect when the
      whole seconds part of the time gets beyond 5 digits (100000 seconds is a
      bit more than a day of uptime).
      
      The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
      length of the time prefix.  This could be micro-optimized but it seems
      better to have simpler, more readable code here.
      
      The bug leads to the syslog system call miscomputing which messages fit
      into the userspace buffer.  If there are enough messages to fill
      log_buf_len and some have a timestamp >= 100000, dmesg may fail with:
      
          # dmesg
          klogctl: Bad address
      
      When this happens, strace shows that the failure is indeed EFAULT due to
      the kernel mistakenly accessing past the end of dmesg's buffer, since
      dmesg asks the kernel how big a buffer it needs, allocates a bit more,
      and then gets an error when it asks the kernel to fill it:
      
          syslog(0xa, 0, 0)                       = 1048576
          mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
          syslog(0x3, 0x7fa4d25d2010, 0x100008)   = -1 EFAULT (Bad address)
      
      As far as I can see, the bug has been there as long as print_time(),
      which comes from commit 084681d1
      
       ("printk: flush continuation lines
      immediately to console") in 3.5-rc5.
      Signed-off-by: default avatarRoland Dreier <roland@purestorage.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Sylvain Munaut <s.munaut@whatever-company.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35dac27c