1. 11 Oct, 2011 1 commit
  2. 04 Oct, 2011 9 commits
  3. 30 Sep, 2011 1 commit
    • Peter Zijlstra's avatar
      posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra authored
      David reported:
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
        For example:
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
        The diff of 'process' should always be >= the diff of 'thread'.
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
        static pthread_barrier_t barrier;
        static void *chew_cpu(void *arg)
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        int main(void)
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      	  return 0;
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      Aside of that it makes the function safe on 32 bit systems. The old
      code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
      64bit value and could be changed on another cpu at the same time.
      Reported-by: default avatarDavid Miller <davem@davemloft.net>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Tested-by: default avatarDavid Miller <davem@davemloft.net>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  4. 28 Sep, 2011 1 commit
    • Richard Cochran's avatar
      ptp: fix L2 event message recognition · f75159e9
      Richard Cochran authored
      The IEEE 1588 standard defines two kinds of messages, event and general
      messages. Event messages require time stamping, and general do not. When
      using UDP transport, two separate ports are used for the two message
      The BPF designed to recognize event messages incorrectly classifies L2
      general messages as event messages. This commit fixes the issue by
      extending the filter to check the message type field for L2 PTP packets.
      Event messages are be distinguished from general messages by testing
      the "general" bit.
      Signed-off-by: default avatarRichard Cochran <richard.cochran@omicron.at>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 27 Sep, 2011 1 commit
    • Linus Torvalds's avatar
      vfs: remove LOOKUP_NO_AUTOMOUNT flag · b6c8069d
      Linus Torvalds authored
      That flag no longer makes sense, since we don't look up automount points
      as eagerly any more.  Additionally, it turns out that the NO_AUTOMOUNT
      handling was buggy to begin with: it would avoid automounting even for
      cases where we really *needed* to do the automount handling, and could
      return ENOENT for autofs entries that hadn't been instantiated yet.
      With our new non-eager automount semantics, one discussion has been
      about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
      newfstatat() and fstatat64() system calls), but it's probably not worth
      it: you can always force at least directory automounting by simply
      adding the final '/' to the filename, which works for *all* of the stat
      family system calls, old and new.
      So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
      result of our bad default behavior.
      Acked-by: default avatarIan Kent <raven@themaw.net>
      Acked-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 26 Sep, 2011 1 commit
    • Linus Torvalds's avatar
      vfs pathname lookup: Add LOOKUP_AUTOMOUNT flag · d94c177b
      Linus Torvalds authored
      Since we've now turned around and made LOOKUP_FOLLOW *not* force an
      automount, we want to add the ability to force an automount event on
      lookup even if we don't happen to have one of the other flags that force
      Most cases will never want to use this, since you'd normally want to
      delay automounting as long as possible, which usually implies
      LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
      the automount any more).
      But Trond argued sufficiently forcefully that at a minimum bind mounting
      a file and quotactl will want to force the automount lookup.  Some other
      cases (like nfs_follow_remote_path()) could use it too, although
      LOOKUP_DIRECTORY would work there as well.
      This commit just adds the flag and logic, no users yet, though.  It also
      doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
      was made irrelevant by the same change that made us not follow on
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 20 Sep, 2011 2 commits
  8. 15 Sep, 2011 2 commits
    • Michael S. Tsirkin's avatar
      net: copy userspace buffers on device forwarding · 48c83012
      Michael S. Tsirkin authored
      dev_forward_skb loops an skb back into host networking
      stack which might hang on the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case.
      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and made it clear
      the SKBTX_DEV_ZEROCOPY flag automatically instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: Change possible SYN flooding messages · 946cedcc
      Eric Dumazet authored
      "Possible SYN flooding on port xxxx " messages can fill logs on servers.
      Change logic to log the message only once per listener, and add two new
      SNMP counters to track :
      TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
      TCPReqQFullDrop : number of times a SYN request was dropped because
      syncookies were not enabled.
      Based on a prior patch from Tom Herbert, and suggestions from David.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 14 Sep, 2011 2 commits
  10. 08 Sep, 2011 1 commit
  11. 06 Sep, 2011 1 commit
  12. 29 Aug, 2011 1 commit
    • Stephane Eranian's avatar
      perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian authored
      The current cgroup context switch code was incorrect leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount as demonstrated by a
      simple ping/pong example:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      That's a 37% penalty.
      Note that pong is not even in the monitored cgroup.
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
       Performance counter stats for 'sleep 100':
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      Start perf stat with cgroup:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      Run pong outside the cgroup:
       $ /pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      The penalty is now less than 2%.
      And the results for perf stat are correct:
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
       Performance counter stats for 'sleep 10':
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      Now perf stat reports the correct counts for
      for the non cgroup event.
      If we run pong inside the cgroup, then we also get the
      correct counts:
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
       Performance counter stats for 'sleep 10':
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
            10.001457237 seconds time elapsed
      Signed-off-by: default avatarStephane Eranian <eranian@google.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
  13. 26 Aug, 2011 1 commit
  14. 25 Aug, 2011 5 commits
    • Dilan Lee's avatar
      backlight: add a callback 'notify_after' for backlight control · cc7993f6
      Dilan Lee authored
      We need a callback to do some things after pwm_enable, pwm_disable
      and pwm_config.
      Signed-off-by: default avatarDilan Lee <dilee@nvidia.com>
      Reviewed-by: default avatarRobert Morell <rmorell@nvidia.com>
      Reviewed-by: default avatarArun Murthy <arun.murthy@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Alexandre Bounine's avatar
      rapidio: fix use of non-compatible registers · 284fb68d
      Alexandre Bounine authored
      Replace/remove use of RIO v.1.2 registers/bits that are not
      forward-compatible with newer versions of RapidIO specification.
      RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR,
      Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR.
      Use of removed (since RIO v.1.3) register bits affects users of
      currently available 1.3 and 2.x compliant devices who may use not so
      recent kernel versions.
      Removing checks for unsupported bits makes corresponding routines
      compatible with all versions of RapidIO specification.  Therefore,
      backporting makes stable kernel versions compliant with RIO v.1.3 and
      later as well.
      Signed-off-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Thomas Moll <thomas.moll@sysgo.com>
      Cc: Chul Kim <chul.kim@idt.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Evgeniy Polyakov's avatar
    • Josh Boyer's avatar
      lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer authored
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_sem.
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          which lock already depends on the new lock.
          the existing dependency chain (in reverse order) is:
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andi Kleen's avatar
      Add a personality to report 2.6.x version numbers · be27425d
      Andi Kleen authored
      I ran into a couple of programs which broke with the new Linux 3.0
      version.  Some of those were binary only.  I tried to use LD_PRELOAD to
      work around it, but it was quite difficult and in one case impossible
      because of a mix of 32bit and 64bit executables.
      For example, all kind of management software from HP doesnt work, unless
      we pretend to run a 2.6 kernel.
        $ uname -a
        Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
        $ hpacucli ctrl all show
        Error: No controllers detected.
        $ rpm -qf /usr/sbin/hpacucli
      Another notable case is that Python now reports "linux3" from
      sys.platform(); which in turn can break things that were checking
      sys.platform() == "linux2":
      It seems pretty clear to me though it's a bug in the apps that are using
      '==' instead of .startswith(), but this allows us to unbreak broken
      This patch adds a UNAME26 personality that makes the kernel report a
      2.6.40+x version number instead.  The x is the x in 3.x.
      I know this is somewhat ugly, but I didn't find a better workaround, and
      compatibility to existing programs is important.
      Some programs also read /proc/sys/kernel/osrelease.  This can be worked
      around in user space with mount --bind (and a mount namespace)
      To use:
        wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
        gcc -o uname26 uname26.c
        ./uname26 program
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 24 Aug, 2011 1 commit
  16. 23 Aug, 2011 3 commits
    • Jiri Slaby's avatar
      TTY: pty, fix pty counting · 24d406a6
      Jiri Slaby authored
      tty_operations->remove is normally called like:
      However tty_shutdown() is called from queue_release_one_tty() only if
      tty_operations->shutdown is NULL. But for pty, it is not.
      pty_unix98_shutdown() is used there as ->shutdown.
      So tty_operations->remove of pty (i.e. pty_unix98_remove()) is never
      called. This results in invalid pty_count. I.e. what can be seen in
      I see this was already reported at:
      But it was not fixed since then.
      This patch is kind of a hackish way. The problem lies in ->install. We
      allocate there another tty (so-called tty->link). So ->install is
      called once, but ->remove twice, for both tty and tty->link. The fix
      here is to count both tty and tty->link and divide the count by 2 for
      And to have ->remove called, let's make tty_driver_remove_tty() global
      and call that from pty_unix98_shutdown() (tty_operations->shutdown).
      While at it, let's document that when ->shutdown is defined,
      tty_shutdown() is not called.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable <stable@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
    • Christoph Hellwig's avatar
      block: separate priority boosting from REQ_META · 65299a3b
      Christoph Hellwig authored
      Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
      and lave REQ_META purely for marking requests as metadata in blktrace.
      All existing callers of REQ_META except for XFS are updated to also
      set REQ_PRIO for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
    • Christoph Hellwig's avatar
      block: remove READ_META and WRITE_META · 5dc06c5a
      Christoph Hellwig authored
      Replace all occurnanced of the undocumented READ_META with READ | REQ_META
      and remove the unused WRITE_META define.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
  17. 22 Aug, 2011 1 commit
    • Pavan Savoy's avatar
      drivers:misc:ti-st: platform hooks for chip states · 0d7c5f25
      Pavan Savoy authored
      Certain platform specific or Host-WiLink Interface specific actions would be
      required to be taken when the chip is being enabled and after the chip is
      disabled such as configuration of the mux modes for the GPIO of host connected
      to the nshutdown of the chip or relinquishing UART after the chip is disabled.
      Similar actions can also be taken when the chip is in deep sleep or when the
      chip is awake. Performance enhancements such as configuring the host to run
      faster when chip is awake and slower when chip is asleep can also be made
      Signed-off-by: default avatarPavan Savoy <pavan_savoy@ti.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
  18. 19 Aug, 2011 1 commit
    • Wu Fengguang's avatar
      squeeze max-pause area and drop pass-good area · bb082295
      Wu Fengguang authored
      Revert the pass-good area introduced in ffd1f609
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      Using deadline scheduler also has a regression, but not that big as CFQ,
      so this suggests we have some write starvation.
      The test logs show that
      - the disks are sometimes under utilized
      - global dirty pages sometimes rush high to the pass-good area for
        several hundred seconds, while in the mean time some bdi dirty pages
        drop to very low value (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages dropped under global dirty threshold and bdi_dirty
        rush very high (for example, 2 times higher than bdi_thresh). During
        which time balance_dirty_pages() is not called at all.
      So the problems are
      1) The random writes progress so slow that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".
      2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility
         for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
         and then (bdi_thresh >> bdi_dirty) for others.
      3) The higher max-pause/pass-good thresholds somehow leads to the bad
         swing of dirty pages.
      The fix is to allow the task to slightly dirty over task_bdi_thresh, but
      no way to exceed bdi_dirty and/or global dirty_thresh.
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      Reported-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Tested-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
  19. 17 Aug, 2011 2 commits
  20. 15 Aug, 2011 1 commit
    • Jeff Moyer's avatar
      block: fix flush machinery for stacking drivers with differring flush flags · 4853abaa
      Jeff Moyer authored
      Commit ae1b1539
      , block: reimplement
      FLUSH/FUA to support merge, introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      static inline struct request *__elv_next_request(struct request_queue *q)
              struct request *rq;
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
              unsigned skip = 0;
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
      		if (!has_fua)
      			rq->cmd_flags &= ~REQ_FUA;
      	        return rq;
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      make it function as designed.
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
  21. 14 Aug, 2011 2 commits