1. 25 Nov, 2015 1 commit
  2. 24 Nov, 2015 4 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Remove redundant update of page timestamp · 70004986
      Steven Rostedt (Red Hat) authored
      
      
      The first commit of a buffer page updates the timestamp of that page. No
      need to have the update to the next page add the timestamp too. It will only
      be replaced by the first commit on that page anyway.
      
      Only update to a page if it contains an event.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      70004986
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Use READ_ONCE() for most tail_page access · 8573636e
      Steven Rostedt (Red Hat) authored
      
      
      As cpu_buffer->tail_page may be modified by interrupts at almost any time,
      the flow of logic is very important. Do not let gcc get smart with
      re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
      accesses.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      8573636e
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Put back the length if crossed page with add_timestamp · bd1b7cd3
      Steven Rostedt (Red Hat) authored
      Commit fcc742ea "ring-buffer: Add event descriptor to simplify passing
      data" added a descriptor that holds various data instead of passing around
      several variables through parameters. The problem was that one of the
      parameters was modified in a function and the code was designed not to have
      an effect on that modified  parameter. Now that the parameter is a
      descriptor and any modifications to it are non-volatile, the size of the
      data could be unnecessarily expanded.
      
      Remove the extra space added if a timestamp was added and the event went
      across the page.
      
      Cc: stable@vger.kernel.org # 4.3+
      Fixes: fcc742ea
      
       "ring-buffer: Add event descriptor to simplify passing data"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      bd1b7cd3
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Update read stamp with first real commit on page · b81f472a
      Steven Rostedt (Red Hat) authored
      
      
      Do not update the read stamp after swapping out the reader page from the
      write buffer. If the reader page is swapped out of the buffer before an
      event is written to it, then the read_stamp may get an out of date
      timestamp, as the page timestamp is updated on the first commit to that
      page.
      
      rb_get_reader_page() only returns a page if it has an event on it, otherwise
      it will return NULL. At that point, check if the page being returned has
      events and has not been read yet. Then at that point update the read_stamp
      to match the time stamp of the reader page.
      
      Cc: stable@vger.kernel.org # 2.6.30+
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      b81f472a
  3. 02 Nov, 2015 4 commits
  4. 03 Sep, 2015 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated" · b7dc42fd
      Steven Rostedt (Red Hat) authored
      The commit a4543a2f
      
       "ring-buffer: Get timestamp after event is
      allocated" is needed for some future work. But after adding it, there is a
      race somewhere that causes the saved timestamp to have a slight shift, and
      get ahead of the actual timestamp and make it look like time goes backwards.
      
      I'm still looking into why this happens, but in the mean time, this is
      holding up other work to get in. I'm reverting the change for now (which
      makes the problem go away), and will add it back after I know what is wrong
      and fix it.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      b7dc42fd
  5. 20 Jul, 2015 5 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Reorganize function locations · d90fd774
      Steven Rostedt (Red Hat) authored
      
      
      Functions in ring-buffer.c have gotten interleaved between different
      use cases. Move the functions around to get like functions closer
      together. This may or may not help gcc keep cache locality, but it
      makes it a little easier to work with the code.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      d90fd774
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Make sure event has enough room for extend and padding · 7d75e683
      Steven Rostedt (Red Hat) authored
      
      
      Now that events only add time extends after it is committed, in case
      an event comes in before it can discard the allocated event, the time
      extend needs to be stored within the event. If the event is bigger
      than then size needed for the time extend, padding must be added.
      The minimum padding size is 8 bytes. Thus if the event is 12 bytes
      (size of time extend + 4), there will not be enough room to add both
      the time extend and padding. Make sure all events are either 8 bytes
      or 16 or more bytes.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      7d75e683
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Get timestamp after event is allocated · a4543a2f
      Steven Rostedt (Red Hat) authored
      
      
      Move the capturing of the timestamp to after an event is allocated.
      If the event is not a commit (where it is an event that preempted
      another event), then no timestamp is needed, because the delta of
      nested events is always zero.
      
      If the event starts on a new page, no delta needs to be calculated
      as the full timestamp will be added to the page header, and the
      event will have a delta of zero.
      
      Now if the event requires a time extend (the delta does not fit
      in the 27 bit delta slot in the header), then the event is discarded,
      the length is extended to hold the TIME_EXTEND event that allows for
      a 59 bit delta, and the commit is tried again.
      
      If the event can't be discarded (another event came in after it),
      then the TIME_EXTEND is added directly to the allocated event and
      the rest of the event is given padding.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      a4543a2f
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Move the adding of the extended timestamp out of line · 9826b273
      Steven Rostedt (Red Hat) authored
      
      
      Requiring a extended time stamp is an uncommon occurrence, and it is
      best to do it out of line when needed.
      
      Add a noinline function that handles the extended timestamp and
      have it called with an unlikely to completely move it out of the
      fast path.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      9826b273
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Add event descriptor to simplify passing data · fcc742ea
      Steven Rostedt (Red Hat) authored
      
      
      Add rb_event_info descriptor to pass event info to functions a bit
      easier than using a bunch of parameters. This will also allow for
      changing the code around a bit to find better fast paths.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      fcc742ea
  6. 29 May, 2015 1 commit
  7. 28 May, 2015 2 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Remove useless unused tracing_off_permanent() · 3c6296f7
      Steven Rostedt (Red Hat) authored
      
      
      The tracing_off_permanent() call is a way to disable all ring_buffers.
      Nothing uses it and nothing should use it, as tracing_off() and
      friends are better, as they disable the ring buffers related to
      tracing. The tracing_off_permanent() even disabled non tracing
      ring buffers. This is a bit drastic, and was added to handle NMIs
      doing outputs that could corrupt the ring buffer when only tracing
      used them. It is now obsolete and adds a little overhead, it should
      be removed.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      3c6296f7
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Give NMIs a chance to lock the reader_lock · 289a5a25
      Steven Rostedt (Red Hat) authored
      
      
      Currently, if an NMI does a dump of a ring buffer, it disables
      all ring buffers from ever doing any writes again. This is because
      it wont take the locks for the cpu_buffer and this can cause
      corruption if it preempted a read, or a read happens on another
      CPU for the current cpu buffer. This is a bit overkill.
      
      First, it should at least try to take the lock, and if it fails
      then disable it. Also, there's no need to disable all ring
      buffers, even those that are unrelated to what is being read.
      Only disable the per cpu ring buffer that is being read if
      it can not get the lock for it.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      289a5a25
  8. 27 May, 2015 3 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Add trace_recursive checks to ring_buffer_write() · 985e871b
      Steven Rostedt (Red Hat) authored
      
      
      The ring_buffer_write() function isn't protected by the trace recursive
      writes. Luckily, this function is not used as much and is unlikely
      to ever recurse. But it should still have the protection, because
      even a call to ring_buffer_lock_reserve() could cause ring buffer
      corruption if called when ring_buffer_write() is being used.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      985e871b
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Allways do the trace_recursive checks · 6776221b
      Steven Rostedt (Red Hat) authored
      
      
      Currently the trace_recursive checks are only done if CONFIG_TRACING
      is enabled. That was because there use to be a dependency with tracing
      for the recursive checks (it used the task_struct trace recursive
      variable). But now it uses its own variable and there is no dependency.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      6776221b
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Move recursive check to per_cpu descriptor · 58a09ec6
      Steven Rostedt (Red Hat) authored
      
      
      Instead of using a global per_cpu variable to perform the recursive
      checks into the ring buffer, use the already existing per_cpu descriptor
      that is part of the ring buffer itself.
      
      Not only does this simplify the code, it also allows for one ring buffer
      to be used within the guts of the use of another ring buffer. For example
      trace_printk() can now be used within the ring buffer to record changes
      done by an instance into the main ring buffer. The recursion checks
      will prevent the trace_printk() itself from causing recursive issues
      with the main ring buffer (it is just ignored), but the recursive
      checks wont prevent the trace_printk() from recording other ring buffers.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      58a09ec6
  9. 21 May, 2015 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Add unlikelys to make fast path the default · 3205f806
      Steven Rostedt (Red Hat) authored
      
      
      I was running the trace_event benchmark and noticed that the times
      to record a trace_event was all over the place. I looked at the assembly
      of the ring_buffer_lock_reserver() and saw this:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 1d                   jne    ffffffff8113c60d <ring_buffer_lock_reserve+0x2d>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
       +---- 74 12                   je     ffffffff8113c610 <ring_buffer_lock_reserve+0x30>
       |     65 ff 0d 5b e3 ec 7e    decl   %gs:0x7eece35b(%rip)        # a960 <__preempt_count>
       |     0f 84 85 00 00 00       je     ffffffff8113c690 <ring_buffer_lock_reserve+0xb0>
       |     31 c0                   xor    %eax,%eax
       |     5d                      pop    %rbp
       |     c3                      retq
       |     90                      nop
       +---> 65 44 8b 05 48 e3 ec    mov    %gs:0x7eece348(%rip),%r8d        # a960 <__preempt_count>
             7e
             41 81 e0 ff ff ff 7f    and    $0x7fffffff,%r8d
             b0 08                   mov    $0x8,%al
             65 8b 0d 58 36 ed 7e    mov    %gs:0x7eed3658(%rip),%ecx        # fc80 <current_context>
             41 f7 c0 00 ff 1f 00    test   $0x1fff00,%r8d
             74 1e                   je     ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 f7 c0 00 00 10 00    test   $0x100000,%r8d
             b0 01                   mov    $0x1,%al
             75 13                   jne    ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 81 e0 00 00 0f 00    and    $0xf0000,%r8d
             49 83 f8 01             cmp    $0x1,%r8
             19 c0                   sbb    %eax,%eax
             83 e0 02                and    $0x2,%eax
             83 c0 02                add    $0x2,%eax
             85 c8                   test   %ecx,%eax
             75 ab                   jne    ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e>
             09 c8                   or     %ecx,%eax
             65 89 05 24 36 ed 7e    mov    %eax,%gs:0x7eed3624(%rip)        # fc80 <current_context>
      
      The arrow is the fast path.
      
      After adding the unlikely's, the fast path looks a bit better:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 7b                   jne    ffffffff8113c66b <ring_buffer_lock_reserve+0x8b>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
             0f 85 9f 00 00 00       jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             65 8b 0d 57 e3 ec 7e    mov    %gs:0x7eece357(%rip),%ecx        # a960 <__preempt_count>
             81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
             b0 08                   mov    $0x8,%al
             65 8b 15 68 36 ed 7e    mov    %gs:0x7eed3668(%rip),%edx        # fc80 <current_context>
             f7 c1 00 ff 1f 00       test   $0x1fff00,%ecx
             75 50                   jne    ffffffff8113c670 <ring_buffer_lock_reserve+0x90>
             85 d0                   test   %edx,%eax
             75 7d                   jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             09 d0                   or     %edx,%eax
             65 89 05 53 36 ed 7e    mov    %eax,%gs:0x7eed3653(%rip)        # fc80 <current_context>
             65 8b 05 fc da ec 7e    mov    %gs:0x7eecdafc(%rip),%eax        # a130 <cpu_number>
             89 c2                   mov    %eax,%edx
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      3205f806
  10. 13 May, 2015 1 commit
  11. 30 Mar, 2015 1 commit
  12. 25 Mar, 2015 1 commit
    • Steven Rostedt's avatar
      ring-buffer: Replace this_cpu_*() with __this_cpu_*() · 80a9b64e
      Steven Rostedt authored
      It has come to my attention that this_cpu_read/write are horrible on
      architectures other than x86. Worse yet, they actually disable
      preemption or interrupts! This caused some unexpected tracing results
      on ARM.
      
         101.356868: preempt_count_add <-ring_buffer_lock_reserve
         101.356870: preempt_count_sub <-ring_buffer_lock_reserve
      
      The ring_buffer_lock_reserve has recursion protection that requires
      accessing a per cpu variable. But since preempt_disable() is traced, it
      too got traced while accessing the variable that is suppose to prevent
      recursion like this.
      
      The generic version of this_cpu_read() and write() are:
      
       #define this_cpu_generic_read(pcp)					\
       ({	typeof(pcp) ret__;						\
      	preempt_disable();						\
      	ret__ = *this_cpu_ptr(&(pcp));					\
      	preempt_enable();						\
      	ret__;								\
       })
      
       #define this_cpu_generic_to_op(pcp, val, op)				\
       do {									\
      	unsigned long flags;						\
      	raw_local_irq_save(flags);					\
      	*__this_cpu_ptr(&(pcp)) op val;					\
      	raw_local_irq_restore(flags);					\
       } while (0)
      
      Which is unacceptable for locations that know they are within preempt
      disabled or interrupt disabled locations.
      
      Paul McKenney stated that __this_cpu_() versions produce much better code on
      other architectures than this_cpu_() does, if we know that the call is done in
      a preempt disabled location.
      
      I also changed the recursive_unlock() to use two local variables instead
      of accessing the per_cpu variable twice.
      
      Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
      
      
      
      Cc: stable@vger.kernel.org
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Reported-by: default avatarUwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
      Tested-by: default avatarUwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      80a9b64e
  13. 11 Feb, 2015 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
      Steven Rostedt (Red Hat) authored
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
      Code was added where if the page is not full, to just sleep again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
      events (99.9969% of events!!), is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa
      
       "tracing: Do not busy wait in buffer splice"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      1e0d6714
  14. 22 Jan, 2015 1 commit
  15. 19 Nov, 2014 1 commit
  16. 10 Nov, 2014 1 commit
    • Rabin Vincent's avatar
      tracing: Do not busy wait in buffer splice · e30f53aa
      Rabin Vincent authored
      On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
      lockup:
      
       # trace-cmd record -e raw_syscalls:* -F false
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
       ...
       Call Trace:
        [<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
        [<ffffffff81092e25>] wait_on_pipe+0x35/0x40
        [<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
        [<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
        [<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
        [<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
        [<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff8112415d>] do_splice_to+0x6d/0x90
        [<ffffffff81126971>] SyS_splice+0x7c1/0x800
        [<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
      
      The problem is this: tracing_buffers_splice_read() calls
      ring_buffer_wait() to wait for data in the ring buffers.  The buffers
      are not empty so ring_buffer_wait() returns immediately.  But
      tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
      meaning it only wants to read a full page.  When the full page is not
      available, tracing_buffers_splice_read() tries to wait again with
      ring_buffer_wait(), which again returns immediately, and so on.
      
      Fix this by adding a "full" argument to ring_buffer_wait() which will
      make ring_buffer_wait() wait until the writer has left the reader's
      page, i.e.  until full-page reads will succeed.
      
      Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
      
      Cc: stable@vger.kernel.org # 3.16+
      Fixes: b1169cc6
      
       ("tracing: Remove mock up poll wait function")
      Signed-off-by: default avatarRabin Vincent <rabin@rab.in>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      e30f53aa
  17. 02 Oct, 2014 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Fix infinite spin in reading buffer · 24607f11
      Steven Rostedt (Red Hat) authored
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues down when
      its not interrupted by a consuming reader. If a consuming read
      occurs, the iterator starts from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      Then if the iterator calls rb_iter_reset(), it will go into an
      infinite loop by checking if the cache doesn't match, doing the reset
      and retrying, just to see that the cache still doesn't match! Which
      should never happen as the reset is suppose to set the cache to the
      current value and there's locks that keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2
      
       "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      24607f11
  18. 25 Aug, 2014 1 commit
    • Josef Bacik's avatar
      trace: Fix epoll hang when we race with new entries · 4ce97dbf
      Josef Bacik authored
      Epoll on trace_pipe can sometimes hang in a weird case.  If the ring buffer is
      empty when we set waiters_pending but an event shows up exactly at that moment
      we can miss being woken up by the ring buffers irq work.  Since
      ring_buffer_empty() is inherently racey we will sometimes think that the buffer
      is not empty.  So we don't get woken up and we don't think there are any events
      even though there were some ready when we added the watch, which makes us hang.
      This patch fixes this by making sure that we are actually on the wait list
      before we set waiters_pending, and add a memory barrier to make sure
      ring_buffer_empty() is going to be correct.
      
      Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com
      
      
      
      Cc: stable@vger.kernel.org # 3.10+
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      4ce97dbf
  19. 06 Aug, 2014 2 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Always reset iterator to reader page · 651e22f2
      Steven Rostedt (Red Hat) authored
      When performing a consuming read, the ring buffer swaps out a
      page from the ring buffer with a empty page and this page that
      was swapped out becomes the new reader page. The reader page
      is owned by the reader and since it was swapped out of the ring
      buffer, writers do not have access to it (there's an exception
      to that rule, but it's out of scope for this commit).
      
      When reading the "trace" file, it is a non consuming read, which
      means that the data in the ring buffer will not be modified.
      When the trace file is opened, a ring buffer iterator is allocated
      and writes to the ring buffer are disabled, such that the iterator
      will not have issues iterating over the data.
      
      Although the ring buffer disabled writes, it does not disable other
      reads, or even consuming reads. If a consuming read happens, then
      the iterator is reset and starts reading from the beginning again.
      
      My tests would sometimes trigger this bug on my i386 box:
      
      WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
      Modules linked in:
      CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
      Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
       00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
       f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
       ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
      Call Trace:
       [<c18796b3>] dump_stack+0x4b/0x75
       [<c103a0e3>] warn_slowpath_common+0x7e/0x95
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c103a185>] warn_slowpath_fmt+0x33/0x35
       [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
       [<c10bed04>] trace_find_cmdline+0x40/0x64
       [<c10c3c16>] trace_print_context+0x27/0xec
       [<c10c4360>] ? trace_seq_printf+0x37/0x5b
       [<c10c0b15>] print_trace_line+0x319/0x39b
       [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
       [<c10c13b1>] s_show+0x192/0x1ab
       [<c10bfd9a>] ? s_next+0x5a/0x7c
       [<c112e76e>] seq_read+0x267/0x34c
       [<c1115a25>] vfs_read+0x8c/0xef
       [<c112e507>] ? seq_lseek+0x154/0x154
       [<c1115ba2>] SyS_read+0x54/0x7f
       [<c188488e>] syscall_call+0x7/0xb
      ---[ end trace 3f507febd6b4cc83 ]---
      >>>> ##### CPU 1 buffer started ####
      
      Which was the __trace_find_cmdline() function complaining about the pid
      in the event record being negative.
      
      After adding more test cases, this would trigger more often. Strangely
      enough, it would never trigger on a single test, but instead would trigger
      only when running all the tests. I believe that was the case because it
      required one of the tests to be shutting down via delayed instances while
      a new test started up.
      
      After spending several days debugging this, I found that it was caused by
      the iterator becoming corrupted. Debugging further, I found out why
      the iterator became corrupted. It happened with the rb_iter_reset().
      
      As consuming reads may not read the full reader page, and only part
      of it, there's a "read" field to know where the last read took place.
      The iterator, must also start at the read position. In the rb_iter_reset()
      code, if the reader page was disconnected from the ring buffer, the iterator
      would start at the head page within the ring buffer (where writes still
      happen). But the mistake there was that it still used the "read" field
      to start the iterator on the head page, where it should always start
      at zero because readers never read from within the ring buffer where
      writes occur.
      
      I originally wrote a patch to have it set the iter->head to 0 instead
      of iter->head_page->read, but then I questioned why it wasn't always
      setting the iter to point to the reader page, as the reader page is
      still valid.  The list_empty(reader_page->list) just means that it was
      successful in swapping out. But the reader_page may still have data.
      
      There was a bug report a long time ago that was not reproducible that
      had something about trace_pipe (consuming read) not matching trace
      (iterator read). This may explain why that happened.
      
      Anyway, the correct answer to this bug is to always use the reader page
      an not reset the iterator to inside the writable ring buffer.
      
      Cc: stable@vger.kernel.org # 2.6.28+
      Fixes: d769041f
      
       "ring_buffer: implement new locking"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      651e22f2
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Up rb_iter_peek() loop count to 3 · 021de3d9
      Steven Rostedt (Red Hat) authored
      After writting a test to try to trigger the bug that caused the
      ring buffer iterator to become corrupted, I hit another bug:
      
       WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
       Modules linked in: ipt_MASQUERADE sunrpc [...]
       CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
        ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
        ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
       Call Trace:
        [<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
        [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
        [<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
        [<ffffffff810c74a3>] ? s_start+0xd7/0x17b
        [<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
        [<ffffffff8114cf94>] ? seq_read+0x148/0x361
        [<ffffffff81132d98>] ? vfs_read+0x93/0xf1
        [<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
        [<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
      
      Debugging this bug, which triggers when the rb_iter_peek() loops too
      many times (more than 2 times), I discovered there's a case that can
      cause that function to legitimately loop 3 times!
      
      rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
      only deals with the reader page (it's for consuming reads). The
      rb_iter_peek() is for traversing the buffer without consuming it, and as
      such, it can loop for one more reason. That is, if we hit the end of
      the reader page or any page, it will go to the next page and try again.
      
      That is, we have this:
      
       1. iter->head > iter->head_page->page->commit
          (rb_inc_iter() which moves the iter to the next page)
          try again
      
       2. event = rb_iter_head_event()
          event->type_len == RINGBUF_TYPE_TIME_EXTEND
          rb_advance_iter()
          try again
      
       3. read the event.
      
      But we never get to 3, because the count is greater than 2 and we
      cause the WARNING and return NULL.
      
      Up the counter to 3.
      
      Cc: stable@vger.kernel.org # 2.6.37+
      Fixes: 69d1b839
      
       "ring-buffer: Bind time extend and data events together"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      021de3d9
  20. 23 Jul, 2014 1 commit
  21. 18 Jul, 2014 1 commit
  22. 15 Jul, 2014 1 commit
    • Martin Lau's avatar
      ring-buffer: Fix polling on trace_pipe · 97b8ee84
      Martin Lau authored
      ring_buffer_poll_wait() should always put the poll_table to its wait_queue
      even there is immediate data available.  Otherwise, the following epoll and
      read sequence will eventually hang forever:
      
      1. Put some data to make the trace_pipe ring_buffer read ready first
      2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
      3. epoll_wait()
      4. read(trace_pipe_fd) till EAGAIN
      5. Add some more data to the trace_pipe ring_buffer
      6. epoll_wait() -> this epoll_wait() will block forever
      
      ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
        ring_buffer_poll_wait() returns immediately without adding poll_table,
        which has poll_table->_qproc pointing to ep_poll_callback(), to its
        wait_queue.
      ~ During the epoll_wait() call in step 3 and step 6,
        ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
        because the poll_table->_qproc is NULL and it is how epoll works.
      ~ When there is new data available in step 6, ring_buffer does not know
        it has to call ep_poll_callback() because it is not in its wait queue.
        Hence, block forever.
      
      Other poll implementation seems to call poll_wait() unconditionally as the very
      first thing to do.  For example, tcp_poll() in tcp.c.
      
      Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
      
      Cc: stable@vger.kernel.org # 2.6.27
      Fixes: 2a2cc8f7
      
       "ftrace: allow the event pipe to be polled"
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarMartin Lau <kafai@fb.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      97b8ee84
  23. 10 Jun, 2014 1 commit
  24. 20 Mar, 2014 1 commit
    • Srivatsa S. Bhat's avatar
      trace, ring-buffer: Fix CPU hotplug callback registration · d39ad278
      Srivatsa S. Bhat authored
      
      
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Instead, the correct and race-free way of performing the callback
      registration is:
      
      	cpu_notifier_register_begin();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	/* Note the use of the double underscored version of the API */
      	__register_cpu_notifier(&foobar_cpu_notifier);
      
      	cpu_notifier_register_done();
      
      Fix the tracing ring-buffer code by using this latter form of callback
      registration.
      
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      d39ad278
  25. 11 Feb, 2014 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Fix first commit on sub-buffer having non-zero delta · d651aa1d
      Steven Rostedt (Red Hat) authored
      Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
      that page use a 27 bit delta against that timestamp in order to save on
      bits written to the ring buffer. If the time between events is larger than
      what the 27 bits can hold, a "time extend" event is added to hold the
      entire 64 bit timestamp again and the events after that hold a delta from
      that timestamp.
      
      As a "time extend" is always paired with an event, it is logical to just
      allocate the event with the time extend, to make things a bit more efficient.
      
      Unfortunately, when the pairing code was written, it removed the "delta = 0"
      from the first commit on a page, causing the events on the page to be
      slightly skewed.
      
      Fixes: 69d1b839
      
       "ring-buffer: Bind time extend and data events together"
      Cc: stable@vger.kernel.org # 2.6.37+
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      d651aa1d
  26. 13 Jan, 2014 1 commit
    • Peter Zijlstra's avatar
      sched/clock, x86: Use a static_key for sched_clock_stable · 35af99e6
      Peter Zijlstra authored
      
      
      In order to avoid the runtime condition and variable load turn
      sched_clock_stable into a static_key.
      
      Also provide a shorter implementation of local_clock() and
      cpu_clock(int) when sched_clock_stable==1.
      
                              MAINLINE   PRE       POST
      
          sched_clock_stable: 1          1         1
          (cold) sched_clock: 329841     221876    215295
          (cold) local_clock: 301773     234692    220773
          (warm) sched_clock: 38375      25602     25659
          (warm) local_clock: 100371     33265     27242
          (warm) rdtsc:       27340      24214     24208
          sched_clock_stable: 0          0         0
          (cold) sched_clock: 382634     235941    237019
          (cold) local_clock: 396890     297017    294819
          (warm) sched_clock: 38194      25233     25609
          (warm) local_clock: 143452     71234     71232
          (warm) rdtsc:       27345      24245     24243
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      35af99e6