1. 11 Oct, 2011 1 commit
  2. 04 Oct, 2011 9 commits
  3. 30 Sep, 2011 1 commit
    • Peter Zijlstra's avatar
      posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra authored
      David reported:
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
        For example:
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
        The diff of 'process' should always be >= the diff of 'thread'.
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
        static pthread_barrier_t barrier;
        static void *chew_cpu(void *arg)
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        int main(void)
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      	  return 0;
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      Aside of that it makes the function safe on 32 bit systems. The old
      code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
      64bit value and could be changed on another cpu at the same time.
      Reported-by: default avatarDavid Miller <davem@davemloft.net>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Tested-by: default avatarDavid Miller <davem@davemloft.net>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  4. 28 Sep, 2011 1 commit
    • Richard Cochran's avatar
      ptp: fix L2 event message recognition · f75159e9
      Richard Cochran authored
      The IEEE 1588 standard defines two kinds of messages, event and general
      messages. Event messages require time stamping, and general do not. When
      using UDP transport, two separate ports are used for the two message
      The BPF designed to recognize event messages incorrectly classifies L2
      general messages as event messages. This commit fixes the issue by
      extending the filter to check the message type field for L2 PTP packets.
      Event messages are be distinguished from general messages by testing
      the "general" bit.
      Signed-off-by: default avatarRichard Cochran <richard.cochran@omicron.at>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 27 Sep, 2011 1 commit
    • Linus Torvalds's avatar
      vfs: remove LOOKUP_NO_AUTOMOUNT flag · b6c8069d
      Linus Torvalds authored
      That flag no longer makes sense, since we don't look up automount points
      as eagerly any more.  Additionally, it turns out that the NO_AUTOMOUNT
      handling was buggy to begin with: it would avoid automounting even for
      cases where we really *needed* to do the automount handling, and could
      return ENOENT for autofs entries that hadn't been instantiated yet.
      With our new non-eager automount semantics, one discussion has been
      about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
      newfstatat() and fstatat64() system calls), but it's probably not worth
      it: you can always force at least directory automounting by simply
      adding the final '/' to the filename, which works for *all* of the stat
      family system calls, old and new.
      So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
      result of our bad default behavior.
      Acked-by: default avatarIan Kent <raven@themaw.net>
      Acked-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 26 Sep, 2011 2 commits
    • Linus Torvalds's avatar
      vfs pathname lookup: Add LOOKUP_AUTOMOUNT flag · d94c177b
      Linus Torvalds authored
      Since we've now turned around and made LOOKUP_FOLLOW *not* force an
      automount, we want to add the ability to force an automount event on
      lookup even if we don't happen to have one of the other flags that force
      Most cases will never want to use this, since you'd normally want to
      delay automounting as long as possible, which usually implies
      LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
      the automount any more).
      But Trond argued sufficiently forcefully that at a minimum bind mounting
      a file and quotactl will want to force the automount lookup.  Some other
      cases (like nfs_follow_remote_path()) could use it too, although
      LOOKUP_DIRECTORY would work there as well.
      This commit just adds the flag and logic, no users yet, though.  It also
      doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
      was made irrelevant by the same change that made us not follow on
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Peter Zijlstra's avatar
      sched, tracing: Show PREEMPT_ACTIVE state in trace_sched_switch · 557ab425
      Peter Zijlstra authored
      We had need to see the difference between scheduling a runnable task and
      a runnable task being involuntarily preempted.
      No app should rely on the old string output (the binary
      trace event record format is not changed).
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1316164603.10174.11.camel@twins
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
  7. 20 Sep, 2011 2 commits
  8. 18 Sep, 2011 1 commit
  9. 16 Sep, 2011 3 commits
  10. 15 Sep, 2011 2 commits
    • Michael S. Tsirkin's avatar
      net: copy userspace buffers on device forwarding · 48c83012
      Michael S. Tsirkin authored
      dev_forward_skb loops an skb back into host networking
      stack which might hang on the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case.
      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and made it clear
      the SKBTX_DEV_ZEROCOPY flag automatically instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: Change possible SYN flooding messages · 946cedcc
      Eric Dumazet authored
      "Possible SYN flooding on port xxxx " messages can fill logs on servers.
      Change logic to log the message only once per listener, and add two new
      SNMP counters to track :
      TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
      TCPReqQFullDrop : number of times a SYN request was dropped because
      syncookies were not enabled.
      Based on a prior patch from Tom Herbert, and suggestions from David.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 14 Sep, 2011 2 commits
  12. 08 Sep, 2011 2 commits
  13. 06 Sep, 2011 3 commits
  14. 30 Aug, 2011 2 commits
  15. 29 Aug, 2011 1 commit
    • Stephane Eranian's avatar
      perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian authored
      The current cgroup context switch code was incorrect leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount as demonstrated by a
      simple ping/pong example:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      That's a 37% penalty.
      Note that pong is not even in the monitored cgroup.
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
       Performance counter stats for 'sleep 100':
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      Start perf stat with cgroup:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      Run pong outside the cgroup:
       $ /pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      The penalty is now less than 2%.
      And the results for perf stat are correct:
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
       Performance counter stats for 'sleep 10':
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      Now perf stat reports the correct counts for
      for the non cgroup event.
      If we run pong inside the cgroup, then we also get the
      correct counts:
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
       Performance counter stats for 'sleep 10':
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
            10.001457237 seconds time elapsed
      Signed-off-by: default avatarStephane Eranian <eranian@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
  16. 26 Aug, 2011 1 commit
  17. 25 Aug, 2011 5 commits
    • Dilan Lee's avatar
      backlight: add a callback 'notify_after' for backlight control · cc7993f6
      Dilan Lee authored
      We need a callback to do some things after pwm_enable, pwm_disable
      and pwm_config.
      Signed-off-by: default avatarDilan Lee <dilee@nvidia.com>
      Reviewed-by: default avatarRobert Morell <rmorell@nvidia.com>
      Reviewed-by: default avatarArun Murthy <arun.murthy@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Alexandre Bounine's avatar
      rapidio: fix use of non-compatible registers · 284fb68d
      Alexandre Bounine authored
      Replace/remove use of RIO v.1.2 registers/bits that are not
      forward-compatible with newer versions of RapidIO specification.
      RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR,
      Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR.
      Use of removed (since RIO v.1.3) register bits affects users of
      currently available 1.3 and 2.x compliant devices who may use not so
      recent kernel versions.
      Removing checks for unsupported bits makes corresponding routines
      compatible with all versions of RapidIO specification.  Therefore,
      backporting makes stable kernel versions compliant with RIO v.1.3 and
      later as well.
      Signed-off-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Thomas Moll <thomas.moll@sysgo.com>
      Cc: Chul Kim <chul.kim@idt.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Evgeniy Polyakov's avatar
    • Josh Boyer's avatar
      lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer authored
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_sem.
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          which lock already depends on the new lock.
          the existing dependency chain (in reverse order) is:
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andi Kleen's avatar
      Add a personality to report 2.6.x version numbers · be27425d
      Andi Kleen authored
      I ran into a couple of programs which broke with the new Linux 3.0
      version.  Some of those were binary only.  I tried to use LD_PRELOAD to
      work around it, but it was quite difficult and in one case impossible
      because of a mix of 32bit and 64bit executables.
      For example, all kind of management software from HP doesnt work, unless
      we pretend to run a 2.6 kernel.
        $ uname -a
        Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
        $ hpacucli ctrl all show
        Error: No controllers detected.
        $ rpm -qf /usr/sbin/hpacucli
      Another notable case is that Python now reports "linux3" from
      sys.platform(); which in turn can break things that were checking
      sys.platform() == "linux2":
      It seems pretty clear to me though it's a bug in the apps that are using
      '==' instead of .startswith(), but this allows us to unbreak broken
      This patch adds a UNAME26 personality that makes the kernel report a
      2.6.40+x version number instead.  The x is the x in 3.x.
      I know this is somewhat ugly, but I didn't find a better workaround, and
      compatibility to existing programs is important.
      Some programs also read /proc/sys/kernel/osrelease.  This can be worked
      around in user space with mount --bind (and a mount namespace)
      To use:
        wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
        gcc -o uname26 uname26.c
        ./uname26 program
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  18. 24 Aug, 2011 1 commit