1. 06 Sep, 2011 1 commit
  2. 26 Aug, 2011 1 commit
  3. 25 Aug, 2011 5 commits
    • Dilan Lee's avatar
      backlight: add a callback 'notify_after' for backlight control · cc7993f6
      Dilan Lee authored
      
      
      We need a callback to do some things after pwm_enable, pwm_disable
      and pwm_config.
      
      Signed-off-by: default avatarDilan Lee <dilee@nvidia.com>
      Reviewed-by: default avatarRobert Morell <rmorell@nvidia.com>
      Reviewed-by: default avatarArun Murthy <arun.murthy@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc7993f6
    • Alexandre Bounine's avatar
      rapidio: fix use of non-compatible registers · 284fb68d
      Alexandre Bounine authored
      
      
      Replace/remove use of RIO v.1.2 registers/bits that are not
      forward-compatible with newer versions of RapidIO specification.
      
      RapidIO specification v.1.3 removed Write Port CSR, Doorbell CSR,
      Mailbox CSR and Mailbox and Doorbell bits of the PEF CAR.
      
      Use of removed (since RIO v.1.3) register bits affects users of
      currently available 1.3 and 2.x compliant devices who may use not so
      recent kernel versions.
      
      Removing checks for unsupported bits makes corresponding routines
      compatible with all versions of RapidIO specification.  Therefore,
      backporting makes stable kernel versions compliant with RIO v.1.3 and
      later as well.
      
      Signed-off-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Thomas Moll <thomas.moll@sysgo.com>
      Cc: Chul Kim <chul.kim@idt.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      284fb68d
    • Evgeniy Polyakov's avatar
    • Josh Boyer's avatar
      lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer authored
      
      
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_sem.
      
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
      
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          vfs_readdir+0x5b/0xb4
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e096d0c7
    • Andi Kleen's avatar
      Add a personality to report 2.6.x version numbers · be27425d
      Andi Kleen authored
      I ran into a couple of programs which broke with the new Linux 3.0
      version.  Some of those were binary only.  I tried to use LD_PRELOAD to
      work around it, but it was quite difficult and in one case impossible
      because of a mix of 32bit and 64bit executables.
      
      For example, all kind of management software from HP doesnt work, unless
      we pretend to run a 2.6 kernel.
      
        $ uname -a
        Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
      
        $ hpacucli ctrl all show
      
        Error: No controllers detected.
      
        $ rpm -qf /usr/sbin/hpacucli
        hpacucli-8.75-12.0
      
      Another notable case is that Python now reports "linux3" from
      sys.platform(); which in turn can break things that were checking
      sys.platform() == "linux2":
      
        https://bugzilla.mozilla.org/show_bug.cgi?id=664564
      
      It seems pretty clear to me though it's a bug in the apps that are using
      '==' instead of .startswith(), but this allows us to unbreak broken
      programs.
      
      This patch adds a UNAME26 personality that makes the kernel report a
      2.6.40+x version number instead.  The x is the x in 3.x.
      
      I know this is somewhat ugly, but I didn't find a better workaround, and
      compatibility to existing programs is important.
      
      Some programs also read /proc/sys/kernel/osrelease.  This can be worked
      around in user space with mount --bind (and a mount namespace)
      
      To use:
      
        wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
      
      
        gcc -o uname26 uname26.c
        ./uname26 program
      
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be27425d
  4. 23 Aug, 2011 1 commit
    • Jiri Slaby's avatar
      TTY: pty, fix pty counting · 24d406a6
      Jiri Slaby authored
      tty_operations->remove is normally called like:
      queue_release_one_tty
       ->tty_shutdown
         ->tty_driver_remove_tty
           ->tty_operations->remove
      
      However tty_shutdown() is called from queue_release_one_tty() only if
      tty_operations->shutdown is NULL. But for pty, it is not.
      pty_unix98_shutdown() is used there as ->shutdown.
      
      So tty_operations->remove of pty (i.e. pty_unix98_remove()) is never
      called. This results in invalid pty_count. I.e. what can be seen in
      /proc/sys/kernel/pty/nr.
      
      I see this was already reported at:
        https://lkml.org/lkml/2009/11/5/370
      
      
      But it was not fixed since then.
      
      This patch is kind of a hackish way. The problem lies in ->install. We
      allocate there another tty (so-called tty->link). So ->install is
      called once, but ->remove twice, for both tty and tty->link. The fix
      here is to count both tty and tty->link and divide the count by 2 for
      user.
      
      And to have ->remove called, let's make tty_driver_remove_tty() global
      and call that from pty_unix98_shutdown() (tty_operations->shutdown).
      
      While at it, let's document that when ->shutdown is defined,
      tty_shutdown() is not called.
      
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable <stable@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      24d406a6
  5. 22 Aug, 2011 1 commit
    • Pavan Savoy's avatar
      drivers:misc:ti-st: platform hooks for chip states · 0d7c5f25
      Pavan Savoy authored
      
      
      Certain platform specific or Host-WiLink Interface specific actions would be
      required to be taken when the chip is being enabled and after the chip is
      disabled such as configuration of the mux modes for the GPIO of host connected
      to the nshutdown of the chip or relinquishing UART after the chip is disabled.
      
      Similar actions can also be taken when the chip is in deep sleep or when the
      chip is awake. Performance enhancements such as configuring the host to run
      faster when chip is awake and slower when chip is asleep can also be made
      here.
      
      Signed-off-by: default avatarPavan Savoy <pavan_savoy@ti.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      0d7c5f25
  6. 19 Aug, 2011 1 commit
    • Wu Fengguang's avatar
      squeeze max-pause area and drop pass-good area · bb082295
      Wu Fengguang authored
      Revert the pass-good area introduced in ffd1f609
      
       ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      Using deadline scheduler also has a regression, but not that big as CFQ,
      so this suggests we have some write starvation.
      
      The test logs show that
      
      - the disks are sometimes under utilized
      
      - global dirty pages sometimes rush high to the pass-good area for
        several hundred seconds, while in the mean time some bdi dirty pages
        drop to very low value (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages dropped under global dirty threshold and bdi_dirty
        rush very high (for example, 2 times higher than bdi_thresh). During
        which time balance_dirty_pages() is not called at all.
      
      So the problems are
      
      1) The random writes progress so slow that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".
      
      2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility
         for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
         and then (bdi_thresh >> bdi_dirty) for others.
      
      3) The higher max-pause/pass-good thresholds somehow leads to the bad
         swing of dirty pages.
      
      The fix is to allow the task to slightly dirty over task_bdi_thresh, but
      no way to exceed bdi_dirty and/or global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      
      Reported-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Tested-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      bb082295
  7. 17 Aug, 2011 2 commits
  8. 15 Aug, 2011 1 commit
    • Jeff Moyer's avatar
      block: fix flush machinery for stacking drivers with differring flush flags · 4853abaa
      Jeff Moyer authored
      Commit ae1b1539
      
      , block: reimplement
      FLUSH/FUA to support merge, introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
              REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
      		if (!has_fua)
      			rq->cmd_flags &= ~REQ_FUA;
      	        return rq;
      	}
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      make it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      4853abaa
  9. 14 Aug, 2011 1 commit
  10. 13 Aug, 2011 1 commit
  11. 11 Aug, 2011 3 commits
    • Vasiliy Kulikov's avatar
      move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
      Vasiliy Kulikov authored
      The patch http://lkml.org/lkml/2003/7/13/226
      
       introduced an RLIMIT_NPROC
      check in set_user() to check for NPROC exceeding via setuid() and
      similar functions.
      
      Before the check there was a possibility to greatly exceed the allowed
      number of processes by an unprivileged user if the program relied on
      rlimit only.  But the check created new security threat: many poorly
      written programs simply don't check setuid() return code and believe it
      cannot fail if executed with root privileges.  So, the check is removed
      in this patch because of too often privilege escalations related to
      buggy programs.
      
      The NPROC can still be enforced in the common code flow of daemons
      spawning user processes.  Most of daemons do fork()+setuid()+execve().
      The check introduced in execve() (1) enforces the same limit as in
      setuid() and (2) doesn't create similar security issues.
      
      Neil Brown suggested to track what specific process has exceeded the
      limit by setting PF_NPROC_EXCEEDED process flag.  With the change only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
      Solar Designer suggested to re-check whether NPROC limit is still
      exceeded at the moment of execve().  If the process was sleeping for
      days between set*uid() and execve(), and the NPROC counter step down
      under the limit, the defered execve() failure because NPROC limit was
      exceeded days ago would be unexpected.  If the limit is not exceeded
      anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
      
      Similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
      
      Reviewed-by: default avatarJames Morris <jmorris@namei.org>
      Signed-off-by: default avatarVasiliy Kulikov <segoon@openwall.com>
      Acked-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72fa5997
    • Namhyung Kim's avatar
      blktrace: add FLUSH/FUA support · c09c47ca
      Namhyung Kim authored
      
      
      Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
      FUA follows WRITE, use the same 'F' flag for both cases and
      distinguish them by their (relative) position. The end results
      look like (other flags might be shown also):
      
       - WRITE:            W
       - WRITE_FLUSH:      FW
       - WRITE_FUA:        WF
       - WRITE_FLUSH_FUA:  FWF
      
      Note that we reuse TC_BARRIER due to lack of bit space of act_mask
      so that the older versions of blktrace tools will report flush
      requests as barriers from now on.
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      c09c47ca
    • Matthew Wilcox's avatar
      Move some REQ flags to the common bio/request area · 8e4bf844
      Matthew Wilcox authored
      
      
      REQ_SECURE, REQ_FLUSH and REQ_FUA may all be set on a bio as well as
      on a request, so relocate them to the shared part of the enum.
      
      Signed-off-by: default avatarMatthew Wilcox <willy@linux.intel.com>
      Signed-off-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      8e4bf844
  12. 09 Aug, 2011 1 commit
  13. 08 Aug, 2011 4 commits
  14. 07 Aug, 2011 2 commits
  15. 06 Aug, 2011 5 commits
    • Linus Torvalds's avatar
      vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds authored
      
      
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ddcd056
    • Linus Torvalds's avatar
      vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds authored
      
      
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
      tree.
      
      And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      830c0f0e
    • David S. Miller's avatar
      net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea
      David S. Miller authored
      
      
      Computers have become a lot faster since we compromised on the
      partial MD4 hash which we use currently for performance reasons.
      
      MD5 is a much safer choice, and is inline with both RFC1948 and
      other ISS generators (OpenBSD, Solaris, etc.)
      
      Furthermore, only having 24-bits of the sequence number be truly
      unpredictable is a very serious limitation.  So the periodic
      regeneration and 8-bit counter have been removed.  We compute and
      use a full 32-bit sequence number.
      
      For ipv6, DCCP was found to use a 32-bit truncated initial sequence
      number (it needs 43-bits) and that is fixed here as well.
      
      Reported-by: default avatarDan Kaminsky <dan@doxpara.com>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e5714ea
    • David S. Miller's avatar
      crypto: Move md5_transform to lib/md5.c · bc0b96b5
      David S. Miller authored
      
      
      We are going to use this for TCP/IP sequence number and fragment ID
      generation.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bc0b96b5
    • Mandeep Singh Baines's avatar
      lib/sha1: use the git implementation of SHA-1 · 1eb19a12
      Mandeep Singh Baines authored
      
      
      For ChromiumOS, we use SHA-1 to verify the integrity of the root
      filesystem.  The speed of the kernel sha-1 implementation has a major
      impact on our boot performance.
      
      To improve boot performance, we investigated using the heavily optimized
      sha-1 implementation used in git.  With the git sha-1 implementation, we
      see a 11.7% improvement in boot time.
      
      10 reboots, remove slowest/fastest.
      
      Before:
      
        Mean: 6.58 seconds Stdev: 0.14
      
      After (with git sha-1, this patch):
      
        Mean: 5.89 seconds Stdev: 0.07
      
      The other cool thing about the git SHA-1 implementation is that it only
      needs 64 bytes of stack for the workspace while the original kernel
      implementation needed 320 bytes.
      
      Signed-off-by: default avatarMandeep Singh Baines <msb@chromium.org>
      Cc: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
      Cc: Nicolas Pitre <nico@cam.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: linux-crypto@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1eb19a12
  16. 05 Aug, 2011 1 commit
  17. 04 Aug, 2011 2 commits
  18. 03 Aug, 2011 7 commits
    • Hugh Dickins's avatar
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins authored
      
      
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • Hugh Dickins's avatar
      tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins authored
      
      
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarPekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69f07ec9
    • Hugh Dickins's avatar
      tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins authored
      
      
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of thats three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aa3b1895
    • Hugh Dickins's avatar
      tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins authored
      
      
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41ffe5d5
    • Hugh Dickins's avatar
      tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins authored
      
      
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      285b2c4f
    • Hugh Dickins's avatar
      mm: let swap use exceptional entries · a2c16d6c
      Hugh Dickins authored
      
      
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      A part of the reason for eliminating shmem_readpage() earlier, was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2c16d6c
    • Hugh Dickins's avatar
      radix_tree: exceptional entries and indices · 6328650b
      Hugh Dickins authored
      
      
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() allow for an optional indices argument,
      output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6328650b