1. 23 Jun, 2016 3 commits
    • Eric W. Biederman's avatar
      vfs: Generalize filesystem nodev handling. · a2982cc9
      Eric W. Biederman authored
      Introduce a function may_open_dev that tests MNT_NODEV and a new
      superblock flab SB_I_NODEV.  Use this new function in all of the
      places where MNT_NODEV was previously tested.
      
      Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
      filesystems should never support device nodes, and a simple superblock
      flags makes that very hard to get wrong.  With SB_I_NODEV set if any
      device nodes somehow manage to show up on on a filesystem those
      device nodes will be unopenable.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      a2982cc9
    • Eric W. Biederman's avatar
      ipc/mqueue: The mqueue filesystem should never contain executables · 3ee69014
      Eric W. Biederman authored
      Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
      do not result in executable on mqueuefs by accident.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      3ee69014
    • Eric W. Biederman's avatar
      vfs: Pass data, ns, and ns->userns to mount_ns · d91ee87d
      Eric W. Biederman authored
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystems .mount method.
      Thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unpvileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN) which guarantees extra permission check will
      pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      d91ee87d
  2. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  3. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  4. 14 Jan, 2016 1 commit
    • Vladimir Davydov's avatar
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov authored
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  5. 06 Aug, 2015 1 commit
    • Marcus Gelderie's avatar
      ipc: modify message queue accounting to not take kernel data structures into account · de54b9ac
      Marcus Gelderie authored
      A while back, the message queue implementation in the kernel was
      improved to use btrees to speed up retrieval of messages, in commit
      d6629859 ("ipc/mqueue: improve performance of send/recv").
      
      That patch introducing the improved kernel handling of message queues
      (using btrees) has, as a by-product, changed the meaning of the QSIZE
      field in the pseudo-file created for the queue.  Before, this field
      reflected the size of the user-data in the queue.  Since, it also takes
      kernel data structures into account.  For example, if 13 bytes of user
      data are in the queue, on my machine the file reports a size of 61
      bytes.
      
      There was some discussion on this topic before (for example
      https://lkml.org/lkml/2014/10/1/115).  Commenting on a th lkml, Michael
      Kerrisk gave the following background
      (https://lkml.org/lkml/2015/6/16/74):
      
          The pseudofiles in the mqueue filesystem (usually mounted at
          /dev/mqueue) expose fields with metadata describing a message
          queue. One of these fields, QSIZE, as originally implemented,
          showed the total number of bytes of user data in all messages in
          the message queue, and this feature was documented from the
          beginning in the mq_overview(7) page. In 3.5, some other (useful)
          work happened to break the user-space API in a couple of places,
          including the value exposed via QSIZE, which now includes a measure
          of kernel overhead bytes for the queue, a figure that renders QSIZE
          useless for its original purpose, since there's no way to deduce
          the number of overhead bytes consumed by the implementation.
          (The other user-space breakage was subsequently fixed.)
      
      This patch removes the accounting of kernel data structures in the
      queue.  Reporting the size of these data-structures in the QSIZE field
      was a breaking change (see Michael's comment above).  Without the QSIZE
      field reporting the total size of user-data in the queue, there is no
      way to deduce this number.
      
      It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
      against the worst-case size of the queue (in both the old and the new
      implementation).  Therefore, the kernel overhead accounting in QSIZE is
      not necessary to help the user understand the limitations RLIMIT imposes
      on the processes.
      Signed-off-by: 's avatarMarcus Gelderie <redmnic@gmail.com>
      Acked-by: 's avatarDoug Ledford <dledford@redhat.com>
      Acked-by: 's avatarMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: 's avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: John Duffy <jb_duffy@btinternet.com>
      Cc: Arto Bendiken <arto@bendiken.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      de54b9ac
  6. 08 May, 2015 1 commit
    • Davidlohr Bueso's avatar
      ipc/mqueue: Implement lockless pipelined wakeups · fa6004ad
      Davidlohr Bueso authored
      This patch moves the wakeup_process() invocation so it is not done under
      the info->lock by making use of a lockless wake_q. With this change, the
      waiter is woken up once it is STATE_READY and it does not need to loop
      on SMP if it is still in STATE_PENDING. In the timeout case we still need
      to grab the info->lock to verify the state.
      
      This change should also avoid the introduction of preempt_disable() in -rt
      which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
      change if the waiter has a higher priority compared to the waker.
      
      Additionally, this patch micro-optimizes wq_sleep by using the cheaper
      cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
      matter what, thus get rid of the implied barrier.
      Signed-off-by: 's avatarDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: 's avatarGeorge Spelvin <linux@horizon.com>
      Acked-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.netSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      fa6004ad
  7. 15 Apr, 2015 1 commit
  8. 19 Nov, 2014 1 commit
    • Al Viro's avatar
      new helper: audit_file() · 9f45f5bf
      Al Viro authored
      ... for situations when we don't have any candidate in pathnames - basically,
      in descriptor-based syscalls.
      
      [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      9f45f5bf
  9. 07 Apr, 2014 1 commit
  10. 25 Feb, 2014 1 commit
    • Davidlohr Bueso's avatar
      ipc,mqueue: remove limits for the amount of system-wide queues · f3713fd9
      Davidlohr Bueso authored
      Commit 93e6f119 ("ipc/mqueue: cleanup definition names and
      locations") added global hardcoded limits to the amount of message
      queues that can be created.  While these limits are per-namespace,
      reality is that it ends up breaking userspace applications.
      Historically users have, at least in theory, been able to create up to
      INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
      for some workloads and use cases.  For instance, Madars reports:
      
       "This update imposes bad limits on our multi-process application.  As
        our app uses approaches that each process opens its own set of queues
        (usually something about 3-5 queues per process).  In some scenarios
        we might run up to 3000 processes or more (which of-course for linux
        is not a problem).  Thus we might need up to 9000 queues or more.  All
        processes run under one user."
      
      Other affected users can be found in launchpad bug #1155695:
        https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695
      
      Instead of increasing this limit, revert it entirely and fallback to the
      original way of dealing queue limits -- where once a user's resource
      limit is reached, and all memory is used, new queues cannot be created.
      Signed-off-by: 's avatarDavidlohr Bueso <davidlohr@hp.com>
      Reported-by: 's avatarMadars Vitolins <m@silodev.com>
      Acked-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3713fd9
  11. 27 Jan, 2014 2 commits
  12. 08 Nov, 2013 1 commit
    • J. Bruce Fields's avatar
      locks: break delegations on unlink · b21996e3
      J. Bruce Fields authored
      We need to break delegations on any operation that changes the set of
      links pointing to an inode.  Start with unlink.
      
      Such operations also hold the i_mutex on a parent directory.  Breaking a
      delegation may require waiting for a timeout (by default 90 seconds) in
      the case of a unresponsive NFS client.  To avoid blocking all directory
      operations, we therefore drop locks before waiting for the delegation.
      The logic then looks like:
      
      	acquire locks
      	...
      	test for delegation; if found:
      		take reference on inode
      		release locks
      		wait for delegation break
      		drop reference on inode
      		retry
      
      It is possible this could never terminate.  (Even if we take precautions
      to prevent another delegation being acquired on the same inode, we could
      get a different inode on each retry.)  But this seems very unlikely.
      
      The initial test for a delegation happens after the lock on the target
      inode is acquired, but the directory inode may have been acquired
      further up the call stack.  We therefore add a "struct inode **"
      argument to any intervening functions, which we use to pass the inode
      back up to the caller in the case it needs a delegation synchronously
      broken.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: 's avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: 's avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      b21996e3
  13. 09 Jul, 2013 1 commit
    • Jeff Layton's avatar
      audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record · 79f6530c
      Jeff Layton authored
      The old audit PATH records for mq_open looked like this:
      
        type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
        dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
        obj=system_u:object_r:tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
        dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
        obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
      
      ...with the audit related changes that went into 3.7, they now look like this:
      
        type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
        dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
        obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
        dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
        obj=system_u:object_r:tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"
      
      Both of these look wrong to me.  As Steve Grubb pointed out:
      
       "What we need is 1 PATH record that identifies the MQ.  The other PATH
        records probably should not be there."
      
      Fix it to record the mq root as a parent, and flag it such that it
      should be hidden from view when the names are logged, since the root of
      the mq filesystem isn't terribly interesting.  With this change, we get
      a single PATH record that looks more like this:
      
        type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
        dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
        obj=unconfined_u:object_r:user_tmpfs_t:s0
      
      In order to do this, a new audit_inode_parent_hidden() function is
      added.  If we do it this way, then we avoid having the existing callers
      of audit_inode needing to do any sort of flag conversion if auditing is
      inactive.
      Signed-off-by: 's avatarJeff Layton <jlayton@redhat.com>
      Reported-by: 's avatarJiri Jaburek <jjaburek@redhat.com>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      79f6530c
  14. 27 Mar, 2013 1 commit
    • Eric W. Biederman's avatar
      ipc: Restrict mounting the mqueue filesystem · a636b702
      Eric W. Biederman authored
      Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
      rights over the ipc namespace.   The principle here is if you create
      or have capabilities over it you can mount it, otherwise you get to live
      with what other people have mounted.
      
      This information is not particularly sensitive and mqueue essentially
      only reports which posix messages queues exist.  Still when creating a
      restricted environment for an application to live any extra
      information may be of use to someone with sufficient creativity.  The
      historical if imperfect way this information has been restricted has
      been not to allow mounts and restricting this to ipc namespace
      creators maintains the spirit of the historical restriction.
      
      Cc: stable@vger.kernel.org
      Acked-by: 's avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      a636b702
  15. 22 Mar, 2013 1 commit
    • Vladimir Davydov's avatar
      mqueue: sys_mq_open: do not call mnt_drop_write() if read-only · 38d78e58
      Vladimir Davydov authored
      mnt_drop_write() must be called only if mnt_want_write() succeeded,
      otherwise the mnt_writers counter will diverge.
      
      mnt_writers counters are used to check if remounting FS as read-only is
      OK, so after an extra mnt_drop_write() call, it would be impossible to
      remount mqueue FS as read-only.  Besides, on umount a warning would be
      printed like this one:
      
        =====================================
        [ BUG: bad unlock balance detected! ]
        3.9.0-rc3 #5 Not tainted
        -------------------------------------
        a.out/12486 is trying to release lock (sb_writers) at:
        mnt_drop_write+0x1f/0x30
        but there are no more locks to release!
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      38d78e58
  16. 22 Feb, 2013 1 commit
  17. 27 Jan, 2013 1 commit
  18. 12 Oct, 2012 2 commits
    • Jeff Layton's avatar
      audit: make audit_inode take struct filename · adb5c247
      Jeff Layton authored
      Keep a pointer to the audit_names "slot" in struct filename.
      
      Have all of the audit_inode callers pass a struct filename ponter to
      audit_inode instead of a string pointer. If the aname field is already
      populated, then we can skip walking the list altogether and just use it
      directly.
      Signed-off-by: 's avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      adb5c247
    • Jeff Layton's avatar
      vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton authored
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: 's avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      91a27b2a
  19. 11 Oct, 2012 1 commit
    • Jeff Layton's avatar
      audit: set the name_len in audit_inode for parent lookups · bfcec708
      Jeff Layton authored
      Currently, this gets set mostly by happenstance when we call into
      audit_inode_child. While that might be a little more efficient, it seems
      wrong. If the syscall ends up failing before audit_inode_child ever gets
      called, then you'll have an audit_names record that shows the full path
      but has the parent inode info attached.
      
      Fix this by passing in a parent flag when we call audit_inode that gets
      set to the value of LOOKUP_PARENT. We can then fix up the pathname for
      the audit entry correctly from the get-go.
      
      While we're at it, clean up the no-op macro for audit_inode in the
      !CONFIG_AUDITSYSCALL case.
      Signed-off-by: 's avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      bfcec708
  20. 09 Oct, 2012 1 commit
  21. 26 Sep, 2012 2 commits
  22. 18 Aug, 2012 1 commit
  23. 22 Jul, 2012 1 commit
  24. 14 Jul, 2012 2 commits
  25. 31 May, 2012 9 commits
    • Doug Ledford's avatar
      ipc/mqueue: add rbtree node caching support · ce2d52cc
      Doug Ledford authored
      When I wrote the first patch that added the rbtree support for message
      queue insertion, it sped up the case where the queue was very full
      drastically from the original code.  It, however, slowed down the case
      where the queue was empty (not drastically though).
      
      This patch caches the last freed rbtree node struct so we can quickly
      reuse it when we get a new message.  This is the common path for any queue
      that very frequently goes from 0 to 1 then back to 0 messages in queue.
      
      Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
      msg_insert, so this patch attempts to speculatively allocate a new node
      struct outside of the spin lock when we know we need it, but will still
      fall back to a GFP_ATOMIC allocation if it has to.
      
      Once I added the caching, the necessary various ret = ; spin_unlock
      gyrations in mq_timedsend were getting pretty ugly, so this also slightly
      refactors that function to streamline the flow of the code and the
      function exit.
      
      Finally, while working on getting performance back I made sure that all of
      the node structs were always fully initialized when they were first used,
      rendering the use of kzalloc unnecessary and a waste of CPU cycles.
      
      The net result of all of this is:
      
      1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
         on it when necessary.
      
      2) We will speculatively allocate a node struct using GFP_KERNEL if our
         cache is empty (and save the struct to our cache if it's still empty
         after we have obtained the spin lock).
      
      3) The performance of the common queue empty case has significantly
         improved and is now much more in line with the older performance for
         this case.
      
      The performance changes are:
      
                  Old mqueue      new mqueue      new mqueue + caching
      queue empty
      send/recv   305/288ns       349/318ns       310/322ns
      
      I don't think we'll ever be able to get the recv performance back, but
      that's because the old recv performance was a direct result and
      consequence of the old methods abysmal send performance.  The recv path
      simply must do more so that the send path does not incur such a penalty
      under higher queue depths.
      
      As it turns out, the new caching code also sped up the various queue full
      cases relative to my last patch.  That could be because of the difference
      between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
      change in code flow in the mq_timedsend routine.  Regardless, I'll take
      it.  It wasn't huge, and I *would* say it was within the margin for error,
      but after many repeated runs what I'm seeing is that the old numbers trend
      slightly higher (about 10 to 20ns depending on which test is the one
      running).
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce2d52cc
    • Doug Ledford's avatar
      ipc/mqueue: strengthen checks on mqueue creation · 113289cc
      Doug Ledford authored
      We already check the mq attr struct if it's passed in, but now that the
      admin can set system wide defaults separate from maximums, it's actually
      possible to set the defaults to something that would overflow.  So, if
      there is no attr struct passed in to the open call, check the default
      values.
      
      While we are at it, simplify mq_attr_ok() by making it return 0 or an
      error condition, so that way if we add more tests to it later, we have the
      option of what error should be returned instead of the calling location
      having to pick a possibly inaccurate error code.
      
      [akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      113289cc
    • Doug Ledford's avatar
      ipc/mqueue: correct mq_attr_ok test · 2c12ea49
      Doug Ledford authored
      While working on the other parts of the mqueue stuff, I noticed that the
      calculation for overflow in mq_attr_ok didn't actually match reality (this
      is especially true since my last patch which changed how we account memory
      slightly).
      
      In particular, we used to test for overflow using:
        msgs * msgsize + msgs * sizeof(struct msg_msg *)
      
      That was never really correct because each message we allocate via
      load_msg() is actually a struct msg_msg followed by the data for the
      message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
      allocating struct msg_msgseg structs too, but accounting for them would
      get really tedious, so let's ignore those...they're only a pointer in size
      anyway).  This patch updates the calculation to be more accurate in
      regards to maximum possible memory consumption by the mqueue.
      
      [akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c12ea49
    • Doug Ledford's avatar
      ipc/mqueue: improve performance of send/recv · d6629859
      Doug Ledford authored
      The existing implementation of the POSIX message queue send and recv
      functions is, well, abysmal.  Even worse than abysmal.  I submitted a
      patch to increase the maximum POSIX message queue limit to 65536 due to
      customer needs, however, upon looking over the send/recv implementation, I
      realized that my customer needs help with that too even if they don't know
      it.  The basic problem is that, given the fairly typical use case scenario
      for a large queue of queueing lots of messages all at the same priority (I
      verified with my customer that this is indeed what their app does), the
      msg_insert routine is basically a frikkin' bubble sort.  I mean, whoa,
      that's *so* middle school.
      
      OK, OK, to not slam the original author too much, I'm sure they didn't
      envision a queue depth of 50,000+ messages.  No one would think that
      moving elements in an array, one at a time, and dereferencing each pointer
      in that array to check priority of the message being pointed too, again
      one at a time, for 50,000+ times would be good.  So let's assume that, as
      is typical, the users have found a way to break our code simply by using
      it in a way we didn't envision.  Fair enough.
      
      "So, just how broken is it?", you ask.  I wondered the same thing, so I
      wrote an app to let me know.  It's my next patch.  It gave me some
      interesting results.  Here's what it tested:
      
      Interference with other apps - In continuous mode, the app just sits there
      and hits a message queue forever, while you go do something productive on
      another terminal using other CPUs.  You then measure how long it takes you
      to do that something productive.  Then you restart the app in fake
      continuous mode, and it sits in a tight loop on a CPU while you repeat
      your tests.  The whole point of this is to keep one CPU tied up (so it
      can't be used in your other work) but in one case tied up hitting the
      mqueue code so we can see the effect of walking that 65,528 element array
      one pointer at a time on the global CPU cache.  If it's bad, then it will
      slow down your app on the other CPUs just by polluting cache mercilessly.
      In the fake case, it will be in a tight loop, but not polluting cache.
      Testing the mqueue subsystem directly - Here we just run a number of tests
      to see how the mqueue subsystem performs under different conditions.  A
      couple conditions are known to be worst case for the old system, and some
      routines, so this tests all of them.
      
      So, on to the results already:
      
      Subsystem/Test                  Old                         New
      
      Time to compile linux
      kernel (make -j12 on a
      6 core CPU)
        Running mqueue test     user 49m10.744s             user 45m26.294s
      			   sys  5m51.924s              sys  4m59.894s
      			 total 55m02.668s            total 50m26.188s
      
        Running fake test       user 45m32.686s             user 45m18.552s
                                 sys  5m12.465s              sys  4m56.468s
                               total 50m45.151s            total 50m15.020s
      
        % slowdown from mqueue
          cache thrashing            ~8%                         ~.5%
      
      Avg time to send/recv (in nanoseconds per message)
        when queue empty            305/288                    349/318
        when queue full (65528 messages)
          constant priority      526589/823                    362/314
          increasing priority    403105/916                    495/445
          decreasing priority     73420/594                    482/409
          random priority        280147/920                    546/436
      
      Time to fill/drain queue (65528 messages, in seconds)
        constant priority         17.37/.12                    .13/.12
        increasing priority        4.14/.14                    .21/.18
        decreasing priority       12.93/.13                    .21/.18
        random priority            8.88/.16                    .22/.17
      
      So, I think the results speak for themselves.  It's possible this
      implementation could be improved by cacheing at least one priority level
      in the node tree (that would bring the queue empty performance more in
      line with the old implementation), but this works and is *so* much better
      than what we had, especially for the common case of a single priority in
      use, that further refinements can be in follow on patches.
      
      [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
      [levinsasha928@gmail.com: use correct gfp flags in msg_insert]
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6629859
    • KOSAKI Motohiro's avatar
      mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro authored
      Commit b231cca4 ("message queues: increase range limits") changed
      mqueue default value when attr parameter is specified NULL from hard
      coded value to fs.mqueue.{msg,msgsize}_max sysctl value.
      
      This made large side effect.  When user need to use two mqueue
      applications 1) using !NULL attr parameter and it require big message
      size and 2) using NULL attr parameter and only need small size message,
      app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
      memory size even though it doesn't need.
      
      Doug Ledford propsed to switch back it to static hard coded value.
      However it also has a compatibility problem.  Some applications might
      started depend on the default value is tunable.
      
      The solution is to separate default value from maximum value.
      Signed-off-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Acked-by: 's avatarDoug Ledford <dledford@redhat.com>
      Acked-by: 's avatarJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: 's avatarSerge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      cef0184c
    • KOSAKI Motohiro's avatar
      mqueue: don't use kmalloc with KMALLOC_MAX_SIZE · fd1f87d2
      KOSAKI Motohiro authored
      KMALLOC_MAX_SIZE is not a good threshold.  It is extremely high and
      problematic.  Unfortunately, some silly drivers depend on this and we
      can't change it.  But any new code needn't use such extreme ugly high
      order allocations.  It brings us awful fragmentation issues and system
      slowdown.
      Signed-off-by: 's avatarKOSAKI Motohiro <mkosaki@jp.fujitsu.com>
      Acked-by: 's avatarDoug Ledford <dledford@redhat.com>
      Acked-by: 's avatarJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      fd1f87d2
    • Doug Ledford's avatar
      ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford authored
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: 's avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b5c4d1a
    • Doug Ledford's avatar
      ipc/mqueue: enforce hard limits · 02967ea0
      Doug Ledford authored
      In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
      In preparation for making more reasonable hard limits, start enforcing
      them even on CAP_SYS_RESOURCE.
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      02967ea0
    • Doug Ledford's avatar
      ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford authored
      Commit b231cca4 ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, set the maximums in /proc, then run the
      application which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call, and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using a attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: 's avatarDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: 's avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      858ee378
  26. 05 May, 2012 1 commit