1. 09 May, 2016 1 commit
    • cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces · 4f41fc59
      Serge E. Hallyn authored
      Patch summary:
      When showing a cgroupfs entry in mountinfo, show the path of the mount
      root dentry relative to the reader's cgroup namespace root.
      Short explanation (courtesy of mkerrisk):
      If we create a new cgroup namespace, then we want both /proc/self/cgroup
      and /proc/self/mountinfo to show cgroup paths that are correctly
      virtualized with respect to the cgroup mount point.  Previous to this
      patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
      does not.
      Long version:
      When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
      namespace, and then mounts a new instance of the freezer cgroup, the new
      mount will be rooted at /a/b.  The root dentry field of the mountinfo
      entry will show '/a/b'.
       cat > /tmp/do1 << EOF
       mount -t cgroup -o freezer freezer /mnt
       grep freezer /proc/self/mountinfo
       EOF
       unshare -Gm  bash /tmp/do1
       > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
       > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
      The task's freezer cgroup entry in /proc/self/cgroup will simply show
       grep freezer /proc/self/cgroup
      If instead the same task simply bind mounts the /a/b cgroup directory,
      the resulting mountinfo entry will again show /a/b for the dentry root.
      However in this case the task will find its own cgroup at /mnt/a/b,
      not at /mnt:
       mount --bind /sys/fs/cgroup/freezer/a/b /mnt
       130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
      In other words, there is no way for the task to know, based on what is
      in mountinfo, which cgroup directory is its own.
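To make the ambiguity concrete, here is a minimal Python sketch that parses the two mountinfo entries quoted above. It assumes the field layout described in proc(5) (mount ID, parent ID, major:minor, root, mount point, options, optional fields, a "-" separator, then filesystem type, source and super options). The helper name `parse_mountinfo_line` is hypothetical and used only for illustration.

```python
def parse_mountinfo_line(line):
    """Split one mountinfo line into named fields.

    Layout assumed from proc(5). Octal escapes in paths are not decoded.
    """
    fields = line.split()
    sep = fields.index("-")  # marks the end of the optional fields
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "dev": fields[2],
        "root": fields[3],
        "mount_point": fields[4],
        "options": fields[5],
        "optional": fields[6:sep],
        "fstype": fields[sep + 1],
        "source": fields[sep + 2],
        "super_options": fields[sep + 3],
    }


# Entry produced by mounting a new freezer instance inside the namespace.
new_mount = parse_mountinfo_line(
    "355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer")

# Entry produced by bind mounting /sys/fs/cgroup/freezer/a/b on /mnt.
bind_mount = parse_mountinfo_line(
    "130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 "
    "- cgroup cgroup rw,freezer")

# Both entries report the same root and mount point, so these two fields
# alone cannot tell the reader where its own cgroup directory lives.
same_view = (new_mount["root"], new_mount["mount_point"]) == \
            (bind_mount["root"], bind_mount["mount_point"])
print(same_view)
```

The two entries differ only in mount options and optional fields, which is why the root field has to be reported relative to the reader's cgroup namespace to be useful.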
      Example (by mkerrisk):
      First, a little script to save some typing and verbiage:
      echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
      cat /proc/self/mountinfo | grep freezer |
              awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
      Create cgroup, place this shell into the cgroup, and look at the state
      of the /proc files:
      2653                         # Our shell
      14254                        # cat(1)
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
      Create a shell in new cgroup and mount namespaces. The act of creating
      a new cgroup namespace causes the process's current cgroups directories
      to become its cgroup root directories. (Here, I'm using my own version
      of the "unshare" utility, which takes the same options as the util-linux
      Look at the state of the /proc files:
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /sys/fs/cgroup/freezer
      The third entry in /proc/self/cgroup (the pathname of the cgroup inside
      the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
      is rooted at /a/b in the outer namespace.
      However, the info in /proc/self/mountinfo is not for this cgroup
      namespace, since we are seeing a duplicate of the mount from the
      old mount namespace, and the info there does not correspond to the
      new cgroup namespace. However, trying to create a new mount still
      doesn't show us the right information in mountinfo:
                                            # propagating to other mountns
              /proc/self/cgroup:      7:freezer:/
              mountinfo:              /a/b    /mnt/freezer
      The act of creating a new cgroup namespace caused the process's
      current freezer directory, "/a/b", to become its cgroup freezer root
      directory. In other words, the pathname of the directory
      within the newly mounted cgroup filesystem should be "/",
      but mountinfo wrongly shows us "/a/b". The consequence of this is
      that the process in the cgroup namespace cannot correctly construct
      the pathname of its cgroup root directory from the information in
      With this patch, the dentry root field in mountinfo is shown relative
      to the reader's cgroup namespace.  So the same steps as above:
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /../..  /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /mnt/freezer
      cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
      cgroup.procs           freezer.self_freezing    notify_on_release
      2653                   # First shell that was placed in this cgroup
      3164                   # Shell started by 'unshare'
      14197                  # cat(1)
      Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 30 Apr, 2016 1 commit
    • kernfs: Move faulting copy_user operations outside of the mutex · e4234a1f
      Chris Wilson authored
      A fault in a user provided buffer may lead anywhere, and lockdep warns
      that we have a potential deadlock between the mm->mmap_sem and the
      kernfs file mutex:
      [   82.811702] ======================================================
      [   82.811705] [ INFO: possible circular locking dependency detected ]
      [   82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
      [   82.811711] -------------------------------------------------------
      [   82.811714] kms_setmode/5859 is trying to acquire lock:
      [   82.811717]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
      [   82.811731]
      but task is already holding lock:
      [   82.811734]  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
      [   82.811745]
      which lock already depends on the new lock.
      [   82.811749]
      the existing dependency chain (in reverse order) is:
      [   82.811752]
      -> #3 (&mm->mmap_sem){++++++}:
      [   82.811761]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811766]        [<ffffffff8118bc65>] __might_fault+0x75/0xa0
      [   82.811771]        [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
      [   82.811787]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
      [   82.811792]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
      [   82.811797]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
      [   82.811801]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.811807]
      -> #2 (s_active#6){++++.+}:
      [   82.811814]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811819]        [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
      [   82.811823]        [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
      [   82.811828]        [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
      [   82.811832]        [<ffffffff815318d4>] device_del+0x124/0x250
      [   82.811837]        [<ffffffff81531a19>] device_unregister+0x19/0x60
      [   82.811841]        [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
      [   82.811846]        [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
      [   82.811851]        [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
      [   82.811856]        [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
      [   82.811860]        [<ffffffff810786de>] cpu_notify+0x1e/0x40
      [   82.811865]        [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
      [   82.811869]        [<ffffffff81078ac3>] _cpu_down+0x233/0x340
      [   82.811874]        [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
      [   82.811878]        [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
      [   82.811883]        [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
      [   82.811888]        [<ffffffff810d1b77>] state_store+0x77/0xe0
      [   82.811892]        [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
      [   82.811897]        [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
      [   82.811902]        [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
      [   82.811906]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
      [   82.811910]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
      [   82.811914]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
      [   82.811918]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.811923]
      -> #1 (cpu_hotplug.lock){+.+.+.}:
      [   82.811929]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.811933]        [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
      [   82.811940]        [<ffffffff810784c1>] get_online_cpus+0x61/0x80
      [   82.811944]        [<ffffffff811170eb>] stop_machine+0x1b/0xe0
      [   82.811949]        [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
      [   82.812009]        [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
      [   82.812045]        [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
      [   82.812081]        [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
      [   82.812117]        [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
      [   82.812154]        [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
      [   82.812192]        [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
      [   82.812232]        [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
      [   82.812278]        [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
      [   82.812318]        [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
      [   82.812323]        [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
      [   82.812328]        [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
      [   82.812360]        [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
      [   82.812366]        [<ffffffff81535f89>] driver_probe_device+0x229/0x450
      [   82.812371]        [<ffffffff81536233>] __driver_attach+0x83/0x90
      [   82.812375]        [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
      [   82.812380]        [<ffffffff81535879>] driver_attach+0x19/0x20
      [   82.812384]        [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
      [   82.812388]        [<ffffffff81536e9b>] driver_register+0x5b/0xe0
      [   82.812393]        [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
      [   82.812398]        [<ffffffff81514866>] drm_pci_init+0xd6/0x100
      [   82.812402]        [<ffffffffa027c094>] 0xffffffffa027c094
      [   82.812406]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
      [   82.812412]        [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
      [   82.812417]        [<ffffffff81106160>] load_module+0x1c20/0x2480
      [   82.812422]        [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
      [   82.812428]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.812433]
      -> #0 (&dev->struct_mutex){+.+.+.}:
      [   82.812439]        [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
      [   82.812443]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.812456]        [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
      [   82.812460]        [<ffffffff81196a14>] mmap_region+0x334/0x580
      [   82.812466]        [<ffffffff81196fc4>] do_mmap+0x364/0x410
      [   82.812470]        [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
      [   82.812474]        [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
      [   82.812479]        [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
      [   82.812484]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      [   82.812489]
      other info that might help us debug this:
      [   82.812493] Chain exists of:
        &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem
      [   82.812502]  Possible unsafe locking scenario:
      [   82.812506]        CPU0                    CPU1
      [   82.812508]        ----                    ----
      [   82.812510]   lock(&mm->mmap_sem);
      [   82.812514]                                lock(s_active#6);
      [   82.812519]                                lock(&mm->mmap_sem);
      [   82.812522]   lock(&dev->struct_mutex);
      [   82.812526]
       *** DEADLOCK ***
      [   82.812531] 1 lock held by kms_setmode/5859:
      [   82.812533]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
      [   82.812541]
      stack backtrace:
      [   82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
      [   82.812550] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
      [   82.812553]  0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
      [   82.812560]  ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
      [   82.812566]  ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
      [   82.812573] Call Trace:
      [   82.812578]  [<ffffffff813f8505>] dump_stack+0x67/0x92
      [   82.812582]  [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
      [   82.812586]  [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
      [   82.812590]  [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
      [   82.812594]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
      [   82.812599]  [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
      [   82.812603]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
      [   82.812608]  [<ffffffff81196a14>] mmap_region+0x334/0x580
      [   82.812612]  [<ffffffff81196fc4>] do_mmap+0x364/0x410
      [   82.812616]  [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
      [   82.812629]  [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
      [   82.812633]  [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
      [   82.812637]  [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
      Highly unlikely though this scenario is, we can avoid the issue entirely
      by moving the copy operation out from under the kernfs_get_active()
      tracking and giving the preallocated buffer its own mutex. The
      temporary buffer allocation doesn't require mutex locking as it is
      entirely local.
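The locking shape described here can be sketched in user space. The following Python analogy is not the kernel code: the class and attribute names are hypothetical, `threading.Lock` stands in for the kernel mutexes, and a slice assignment stands in for the faulting copy from user memory. It only shows that the copy happens before the file mutex is taken, and that the shared preallocated buffer is serialised by a lock of its own.

```python
import threading


class OpenFile:
    """Toy stand-in for an open kernfs file; all names are hypothetical."""

    def __init__(self, prealloc_size=0):
        self.mutex = threading.Lock()           # guards the write handler
        self.prealloc_mutex = threading.Lock()  # guards only the shared buffer
        self.prealloc_buf = bytearray(prealloc_size) if prealloc_size else None
        self.stored = []

    def write(self, user_data):
        if self.prealloc_buf is not None:
            # Shared buffer: its users are serialised by its own lock,
            # which is taken before (and outside) the file mutex.
            with self.prealloc_mutex:
                n = min(len(user_data), len(self.prealloc_buf))
                self.prealloc_buf[:n] = user_data[:n]  # the "faulting" copy
                self._handle(bytes(self.prealloc_buf[:n]))
            return n
        # Temporary buffer: local to this call, so the copy needs no lock.
        self._handle(bytes(user_data))
        return len(user_data)

    def _handle(self, data):
        # Only the handler itself runs under the file mutex.
        with self.mutex:
            self.stored.append(data)


prealloc_file = OpenFile(prealloc_size=4)
plain_file = OpenFile()
print(prealloc_file.write(b"abcdef"), plain_file.write(b"xy"))
```

In this sketch the file mutex is never held while data is being copied in, which is the property the commit is after.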
      The locked section was extended by the addition of the preallocated buf
      to speed up md user operations in
      commit 2b75869b
      Author: NeilBrown <neilb@suse.de>
      Date:   Mon Oct 13 16:41:28 2014 +1100
          sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 16 Feb, 2016 2 commits
  4. 20 Nov, 2015 1 commit
  5. 18 Aug, 2015 1 commit
  6. 01 Jul, 2015 1 commit
  7. 18 Jun, 2015 1 commit
  8. 13 Feb, 2015 1 commit
    • kernfs: remove KERNFS_STATIC_NAME · dfeb0750
      Tejun Heo authored
      When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
      making a separate copy of its name.  It's currently only used for sysfs
      attributes whose filenames are required to stay accessible and unchanged.
      There are rare exceptions where these names are allocated and formatted
      dynamically but for the vast majority of cases they're consts in the
      rodata section.
      Now that kernfs is converted to use kstrdup_const() and kfree_const(),
      there's little point in keeping KERNFS_STATIC_NAME around.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 07 Nov, 2014 1 commit
    • sysfs/kernfs: allow attributes to request write buffer be pre-allocated. · 2b75869b
      NeilBrown authored
      md/raid allows metadata management to be performed in user-space.
      At various times, particularly on device failure, the metadata needs
      to be updated before further writes can be permitted.
      This means that the user-space program which updates metadata must
      not block on writeout, and so must not allocate memory.
      mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
      memory allocation issues for user-memory, but that does not help
      kernel memory.
      Several kernel objects can be pre-allocated.  e.g. files opened before
      any writes to the array are permitted.
      However some kernel allocation happens in places that cannot be
      In particular, writes to sysfs files (to tell md that it can now
      allow writes to the array) allocate a buffer using GFP_KERNEL.
      This patch allows attributes to be marked as "PREALLOC".  In that case
      the maximal buffer is allocated when the file is opened, and then used
      on each write instead of allocating a new buffer.
      As the same buffer is now shared for all writes on the same file
      description, the mutex is extended to cover full use of the buffer
      including the copy_from_user().
      The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
      inspected by sysfs_add_file_mode_ns() to determine if the file should be
      marked as requiring prealloc.
      Despite the comment, we *do* use ->seq_show together with ->prealloc
      in this patch.  The next patch fixes that.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 02 Jul, 2014 1 commit
    • kernfs: kernfs_notify() must be useable from non-sleepable contexts · ecca47ce
      Tejun Heo authored
      d911d987 ("kernfs: make kernfs_notify() trigger inotify events
      too") added fsnotify triggering to kernfs_notify() which requires a
      sleepable context.  There are already existing users of
      kernfs_notify() which invoke it from an atomic context and in general
      it's silly to require a sleepable context for triggering a
      The following is an invalid context bug triggered by md invoking
      sysfs_notify() from IO completion path.
       BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       2 locks held by swapper/1/0:
        #0:  (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
        #1:  (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
       irq event stamp: 33518
       hardirqs last  enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
       hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
       softirqs last  enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
       softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
        0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
        0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
       Call Trace:
        <IRQ>  [<ffffffff81807b4c>] dump_stack+0x4d/0x66
        [<ffffffff810d4f14>] __might_sleep+0x184/0x240
        [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
        [<ffffffff812d76a0>] kernfs_notify+0x90/0x150
        [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
        [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
        [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
        [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
        [<ffffffff813acb8b>] bio_endio+0x6b/0xa0
        [<ffffffff813b46c4>] blk_update_request+0x94/0x420
        [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
        [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
        [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
        [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
        [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
        [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
        [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
        [<ffffffff8111647d>] handle_irq_event+0x3d/0x60
        [<ffffffff81119436>] handle_edge_irq+0x66/0x130
        [<ffffffff8101c3e4>] handle_irq+0x84/0x150
        [<ffffffff818146ad>] do_IRQ+0x4d/0xe0
        [<ffffffff818122f2>] common_interrupt+0x72/0x72
        <EOI>  [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
        [<ffffffff81025454>] default_idle+0x24/0x230
        [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
        [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
        [<ffffffff8104df1b>] start_secondary+0x25b/0x300
      This patch fixes it by punting the notification delivery through a
      work item.  This ends up adding an extra pointer to kernfs_elem_attr
      enlarging kernfs_node by a pointer, which is not ideal but not a very
      big deal either.  If this turns out to be an actual issue, we can move
      kernfs_elem_attr->size to kernfs_node->iattr later.
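The idea of punting delivery through a work item can be illustrated with a small user-space analogy. This Python sketch is not the kernel implementation: the `Notifier` class and its methods are hypothetical, a `queue.SimpleQueue` stands in for the pending-notification list, and a background thread stands in for the workqueue. The point it shows is that `notify()` only enqueues and never does the potentially sleeping work itself.

```python
import queue
import threading


class Notifier:
    """Defer notification delivery to a worker (hypothetical names)."""

    def __init__(self, deliver):
        self._deliver = deliver              # may block or sleep
        self._pending = queue.SimpleQueue()  # unbounded, put() does not block
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def notify(self, node):
        # Cheap enough for a caller that is not allowed to sleep:
        # record the node and return immediately.
        self._pending.put(node)

    def _run(self):
        # Runs in the worker, where blocking is acceptable.
        while True:
            node = self._pending.get()
            if node is None:  # shutdown marker
                break
            self._deliver(node)

    def close(self):
        self._pending.put(None)
        self._worker.join()


delivered = []
notifier = Notifier(delivered.append)
notifier.notify("node-a")
notifier.notify("node-b")
notifier.close()  # waits until everything queued so far is delivered
print(delivered)
```

The cost of this design is one extra piece of per-node bookkeeping for the pending queue, which mirrors the extra pointer mentioned above.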
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 30 Jun, 2014 1 commit
  12. 03 Jun, 2014 1 commit
  13. 27 May, 2014 1 commit
  14. 13 May, 2014 1 commit
    • kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs · 555724a8
      Tejun Heo authored
      The kernfs open method - kernfs_fop_open() - inherited extra
      permission checks from sysfs.  While the vfs layer allows ignoring the
      read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
      sysfs explicitly denied open regardless of the cap if the file doesn't
      have any of the UGO perms of the requested access or doesn't implement
      the requested operation.  It can be debated whether this was a good
      idea or not but the behavior is too subtle and dangerous to change at
      this point.
      After cgroup got converted to kernfs, this extra perm check also got
      applied to cgroup breaking libcgroup which opens write-only files with
      O_RDWR as root.  This patch gates the extra open permission check with
      a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
      For sysfs, nothing changes.  For cgroup, root now can perform any
      operation regardless of the permissions as it was before kernfs
      conversion.  Note that kernfs still fails unimplemented operations
      with -EINVAL.
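The gating described above can be modelled with a few lines of Python. This is a sketch under stated assumptions, not the kernel code: the function and parameter names are hypothetical, only the extra open-time check is modelled (ordinary VFS permission checks and the later -EINVAL for unimplemented operations are not), and 0o444 / 0o222 stand for "any of the user, group or other read / write bits".

```python
ANY_READ_BITS = 0o444   # read permission for user, group or other
ANY_WRITE_BITS = 0o222  # write permission for user, group or other


def extra_open_check_passes(mode, want_read, want_write, has_read, has_write):
    """sysfs-style check: the requested access needs a matching permission
    bit and an implemented operation, regardless of capabilities."""
    if want_write and (not (mode & ANY_WRITE_BITS) or not has_write):
        return False
    if want_read and (not (mode & ANY_READ_BITS) or not has_read):
        return False
    return True


def open_allowed(root_has_extra_check, mode, want_read, want_write,
                 has_read, has_write):
    """Apply the extra check only when the root asked for it."""
    if root_has_extra_check:
        return extra_open_check_passes(mode, want_read, want_write,
                                       has_read, has_write)
    return True  # left to the normal permission checks, not modelled here


# The libcgroup case: a write-only file opened for reading and writing.
libcgroup_open = dict(mode=0o200, want_read=True, want_write=True,
                      has_read=False, has_write=True)
print(open_allowed(True, **libcgroup_open),
      open_allowed(False, **libcgroup_open))
```

With the flag set (the sysfs behaviour) the open is refused, and without it (the cgroup behaviour after this patch) the extra check no longer stands in the way.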
      While at it, add comments explaining KERNFS_ROOT flags.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Tested-by: Andrey Wagin <avagin@gmail.com>
      Cc: Li Zefan <lizefan@huawei.com>
      References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
      Fixes: 2bd59d48 ("cgroup: convert to kernfs")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  15. 25 Apr, 2014 1 commit
  16. 08 Mar, 2014 1 commit
    • kernfs: cache atomic_write_len in kernfs_open_file · b7ce40cf
      Tejun Heo authored
      While implementing atomic_write_len, 4d3773c4 ("kernfs: implement
      kernfs_ops->atomic_write_len") moved data copy from userland inside
      kernfs_get_active() and kernfs_open_file->mutex so that
      kernfs_ops->atomic_write_len can be accessed before copying buffer
      from userland; unfortunately, this could lead to locking order
      inversion involving mmap_sem if copy_from_user() takes a page fault.
        [ INFO: possible circular locking dependency detected ]
        3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G        W
        trinity-c236/10658 is trying to acquire lock:
         (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
        but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0
        which lock already depends on the new lock.
        the existing dependency chain (in reverse order) is:
       -> #1 (&mm->mmap_sem){++++++}:
      	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
      	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
      	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
      	 [<mm/memory.c:4188>] might_fault+0x7e/0xb0
      	 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190
      	 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0
      	 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0
      	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
       -> #0 (&of->mutex#2){+.+.+.}:
      	 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
      	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
      	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
      	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
      	 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
      	 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
      	 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
      	 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
      	 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
      	 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
      	 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
      	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0                    CPU1
      	 ----                    ----
         *** DEADLOCK ***
        1 lock held by trinity-c236/10658:
         #0:  (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0
        stack backtrace:
        CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G        W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
         0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
         0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
         ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
        Call Trace:
         [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f
         [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160
         [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
         [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550
         [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
         [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
         [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0
         [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50
         [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
         [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
         [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
         [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
         [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0
         [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
         [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0
         [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0
         [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
         [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
         [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
      Fix it by caching atomic_write_len in kernfs_open_file during open so
      that it can be determined without accessing kernfs_ops in
      kernfs_fop_write().  This restores the structure of kernfs_fop_write()
      before 4d3773c4 with updated @len determination logic.
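A rough Python model of the caching idea follows. It is a sketch, not the kernel code: the class and function names are hypothetical, the page size is assumed to be 4096, and the exact length rules (reject writes larger than a non-zero atomic_write_len, otherwise cap at one page) are an assumption about what "updated @len determination logic" means. What it illustrates is that the length is decided from a value cached at open time, so the ops structure does not have to be consulted, and no lock has to be held, before the copy from userland.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class WriteTooBig(Exception):
    """Stands in for the error returned on an oversized atomic write."""


class OpenFile:
    def __init__(self, ops_atomic_write_len):
        # Cached once at open time; later writes never look at the ops.
        self.atomic_write_len = ops_atomic_write_len


def write_len(of, count):
    """Decide how many bytes a write may copy, using only cached state."""
    if of.atomic_write_len:
        if count > of.atomic_write_len:
            raise WriteTooBig(count)
        return count               # atomic writes are all-or-nothing
    return min(count, PAGE_SIZE)   # otherwise at most one page per write


atomic_file = OpenFile(ops_atomic_write_len=8192)
plain_file = OpenFile(ops_atomic_write_len=0)
print(write_len(atomic_file, 6000), write_len(plain_file, 6000))
```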
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  17. 25 Feb, 2014 1 commit
    • sysfs: fix namespace refcnt leak · fed95bab
      Li Zefan authored
      As mount() and kill_sb() are not a one-to-one match, we shouldn't get
      ns refcnt unconditionally in sysfs_mount(), and instead we should
      get the refcnt only when kernfs_mount() allocated a new superblock.
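The leak and its fix can be shown with a toy model. In this Python sketch every name is hypothetical (the `*_sketch` functions only imitate the roles of kernfs_mount(), sysfs_mount() and kill_sb()), and superblocks are simply shared per namespace. It assumes, as the text says, that several mounts may reuse one superblock while kill_sb runs once per superblock.

```python
class Namespace:
    def __init__(self):
        self.refcnt = 0


class SuperBlock:
    def __init__(self, ns):
        self.ns = ns


_superblocks = {}  # one shared superblock per namespace in this toy model


def kernfs_mount_sketch(ns):
    """Return (superblock, created); created is True only for a new one."""
    sb = _superblocks.get(id(ns))
    if sb is not None:
        return sb, False
    sb = SuperBlock(ns)
    _superblocks[id(ns)] = sb
    return sb, True


def sysfs_mount_sketch(ns):
    sb, created = kernfs_mount_sketch(ns)
    if created:
        ns.refcnt += 1  # pin the namespace once per superblock, not per mount
    return sb


def kill_sb_sketch(sb):
    # Runs once per superblock, so it drops exactly the one reference.
    del _superblocks[id(sb.ns)]
    sb.ns.refcnt -= 1


ns = Namespace()
first = sysfs_mount_sketch(ns)
second = sysfs_mount_sketch(ns)  # reuses the superblock, takes no new ref
kill_sb_sketch(first)
print(ns.refcnt)
```

Taking the reference unconditionally in `sysfs_mount_sketch` would leave the count at 1 after the superblock is gone, which is the leak being fixed.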
      - Changed the name of the new argument, suggested by Tejun.
      - Made the argument optional, suggested by Tejun.
      - Make the new argument as second-to-last arg, suggested by Tejun.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
       fs/kernfs/mount.c      | 8 +++++++-
       fs/sysfs/mount.c       | 5 +++--
       include/linux/kernfs.h | 9 +++++----
       3 files changed, 15 insertions(+), 7 deletions(-)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 07 Feb, 2014 14 commits
    • kernfs: add CONFIG_KERNFS · ba341d55
      Tejun Heo authored
      As sysfs was kernfs's only user, kernfs has been piggybacking on
      CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
      soon.  Introduce a separate config option CONFIG_KERNFS which is to be
      selected by kernfs users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends · 3eef34ad
      Tejun Heo authored
      kernfs_node->parent and ->name are currently marked as "published"
      indicating that kernfs users may access them directly; however, those
      fields may get updated by kernfs_rename[_ns]() and unrestricted access
      may lead to erroneous values or oops.
      Protect ->parent and ->name updates with an irq-safe spinlock
      kernfs_rename_lock and implement the following accessors for these
      * kernfs_name()		- format the node's name into the specified buffer
      * kernfs_path()		- format the node's path into the specified buffer
      * pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
      * pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
      * kernfs_get_parent()	- pin and return a node's parent
      All can be called under any context.  The recursive sysfs_pathname()
      in fs/sysfs/dir.c is replaced with kernfs_path() and
      sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
      dereferencing parent directly.
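      As a rough illustration of what formatting a node's path into a
      caller-supplied buffer involves, here is a minimal userspace sketch
      that walks parent pointers and fills the buffer from the end.  The
      struct and function names are hypothetical, and locking is left out
      entirely; in the commit the corresponding accesses are protected by
      kernfs_rename_lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a node with a parent link and a name. */
struct sketch_node {
    struct sketch_node *parent;   /* NULL for the root */
    const char *name;
};

/*
 * Format the path of @n into @buf (of size @buflen), filling from the end.
 * Returns a pointer into @buf, or NULL if the path does not fit.
 */
static char *sketch_path(const struct sketch_node *n, char *buf, size_t buflen)
{
    char *p;

    if (buflen < 2)
        return NULL;
    p = buf + buflen;
    *--p = '\0';

    if (n->parent == NULL) {      /* the root is just "/" */
        *--p = '/';
        return p;
    }
    for (; n->parent != NULL; n = n->parent) {
        size_t len = strlen(n->name);

        if ((size_t)(p - buf) < len + 1)
            return NULL;          /* no room for "/" plus the name */
        p -= len;
        memcpy(p, n->name, len);
        *--p = '/';
    }
    return p;
}
```

      Building the string back to front avoids recursion, which is one
      plausible reason to prefer this shape over a recursive helper.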
      v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
          static inline making it cause a lot of build warnings.  Add it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() · 0c23b225
      Tejun Heo authored
      Implement helpers to determine node from dentry and root from
      super_block.  Also add a kernfs_rename_ns() wrapper which assumes NULL
      namespace.  These generally make sense and will be used by cgroup.
      v2: Some dummy implementations for !CONFIG_SYSFS were missing.  Fixed.
          Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: add kernfs_open_file->priv · 2536390d
      Tejun Heo authored
      Add a private data field to be used by kernfs file operations.  This
      generally makes sense and will be used by cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_ops->atomic_write_len · 4d3773c4
      Tejun Heo authored
      A write to a kernfs_node is buffered through a kernel buffer.  Writes
      <= PAGE_SIZE are performed atomically, while larger ones are executed
      in PAGE_SIZE chunks.  While this is enough for sysfs, cgroup which is
      scheduled to be converted to use kernfs needs a bit more control over
      This patch adds kernfs_ops->atomic_write_len.  If not set (zero), the
      behavior stays the same.  If set, writes up to the size are executed
      atomically and larger writes are rejected with -E2BIG.
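      The size policy described above can be sketched in userspace as
      follows.  sketch_first_chunk() and SKETCH_PAGE_SIZE are hypothetical
      names; the function only models how many bytes the first call to the
      write method would receive, not the actual kernfs write path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed page size for the sketch; the real value is per-architecture. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Returns the number of bytes handed to the write method in its first
 * invocation, or -E2BIG if the write is rejected outright.
 */
static long sketch_first_chunk(size_t count, size_t atomic_write_len)
{
    if (atomic_write_len) {
        /* Limit set: all-or-nothing, oversized writes are refused. */
        if (count > atomic_write_len)
            return -E2BIG;
        return (long)count;
    }
    /* Limit unset (zero): writes are chunked at page size as before. */
    return (long)(count < SKETCH_PAGE_SIZE ? count : SKETCH_PAGE_SIZE);
}
```

      With the limit unset, a 10000-byte write is first delivered as a
      4096-byte chunk; with the limit set to 8192, the same write fails with
      -E2BIG while an 8000-byte write goes through in one piece.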
      A different implementation strategy would be allowing configuring
      chunking size while making the original write size available to the
      write method; however, such strategy, while being more complicated,
      doesn't really buy anything.  If the write implementation has to
      handle chunking, the specific chunk size shouldn't matter all that
      much.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: allow nodes to be created in the deactivated state · d35258ef
      Tejun Heo authored
      Currently, kernfs_nodes are made visible to userland on creation,
      which makes it difficult for kernfs users to atomically succeed or
      fail creation of multiple nodes.  In addition, if something fails
      after creating some nodes, the created nodes might already be in use
      and their active refs need to be drained for removal, which has the
      potential to introduce tricky reverse locking dependency on active_ref
      depending on how the error path is synchronized.
      This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
      If set, all nodes under the root are created in the deactivated state
      and stay invisible to userland until explicitly enabled by the new
      kernfs_activate() API.  Also, nodes which have never been activated
      are guaranteed to bypass draining on removal thus allowing error paths
      to not worry about locking dependency on active_ref draining.
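      The create-then-activate pattern can be modelled with a short
      userspace sketch.  The types and helpers below are hypothetical; the
      per-root flag and the activation call only mimic the behaviour
      described for KERNFS_ROOT_CREATE_DEACTIVATED and kernfs_activate().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4

/* Hypothetical node: visible to "userland" only while active. */
struct sketch_node {
    const char *name;
    bool active;
};

/* Hypothetical root carrying the per-root creation policy. */
struct sketch_root {
    bool create_deactivated;
    struct sketch_node nodes[SKETCH_MAX_NODES];
    size_t nr;
};

/* Add a node; it starts hidden if the root asks for deactivated creation. */
static bool sketch_add(struct sketch_root *r, const char *name)
{
    if (r->nr == SKETCH_MAX_NODES)
        return false;
    r->nodes[r->nr].name = name;
    r->nodes[r->nr].active = !r->create_deactivated;
    r->nr++;
    return true;
}

/* Make every node created so far visible in one step. */
static void sketch_activate(struct sketch_root *r)
{
    for (size_t i = 0; i < r->nr; i++)
        r->nodes[i].active = true;
}

/* Count the nodes a lookup from userland would currently find. */
static size_t sketch_visible(const struct sketch_root *r)
{
    size_t n = 0;

    for (size_t i = 0; i < r->nr; i++)
        if (r->nodes[i].active)
            n++;
    return n;
}
```

      A caller can add several nodes, bail out on failure while nothing is
      visible yet, and call sketch_activate() only once the whole set has
      been created.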
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options() · 6a7fed4e
      Tejun Heo authored
      Add two super_block related syscall callbacks ->remount_fs() and
      ->show_options() to kernfs_syscall_ops.  These simply forward the
      matching super_operations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: rename kernfs_dir_ops to kernfs_syscall_ops · 90c07c89
      Tejun Heo authored
      We're gonna need non-dir syscall callbacks, which will make dir_ops a
      misnomer.  Let's rename kernfs_dir_ops to kernfs_syscall_ops.
      This is a pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: invoke dir_ops while holding active ref of the target node · 07c7530d
      Tejun Heo authored
      kernfs_dir_ops are currently being invoked without any active
      reference, which makes it tricky for the invoked operations to
      determine whether the objects associated with those nodes are safe to
      access and will remain that way for the duration of such operations.
      kernfs already has active_ref mechanism to deal with this which makes
      the removal of a given node the synchronization point for gating the
      file operations.  There's no reason for dir_ops to be any different.
      Update the dir_ops handling so that active_ref is held while the
      dir_ops are executing.  This guarantees that while a dir_ops is
      executing the target nodes stay alive.
      As kernfs_dir_ops doesn't have any in-kernel user at this point, this
      doesn't affect anybody.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers · 6b0afc2a
      Tejun Heo authored
      Sometimes it's necessary to implement a node which wants to delete
      nodes including itself.  This isn't straightforward because of kernfs
      active reference.  While a file operation is in progress, an active
      reference is held and kernfs_remove() waits for all such references to
      drain before completing.  For a self-deleting node, this is a deadlock
      as kernfs_remove() ends up waiting for an active reference that it is
      itself sitting on top of.
      This currently is worked around in the sysfs layer using
      sysfs_schedule_callback() which makes such removals asynchronous.
      While it works, it's rather cumbersome and inherently breaks
      synchronicity of the operation - the file operation which triggered
      the operation may complete before the removal is finished (or even
      started) and the removal may fail asynchronously.  If a removal
      operation is immediately followed by another operation which expects
      the specific name to be available (e.g. removal followed by rename
      onto the same name), there's no way to make the latter operation
      The thing is there's no inherent reason for this to be asynchronous.
      All that's necessary to make this synchronous is a dedicated operation
      which drops its own active ref and deactivates self.  This patch
      implements kernfs_remove_self() and its wrappers in sysfs and driver
      core.  kernfs_remove_self() is to be called from one of the file
      operations, drops the active ref the task is holding, removes the self
      node, and restores active ref to the dead node so that the ref is
      balanced afterwards.  __kernfs_remove() is updated so that it takes an
      early exit if the target node is already fully removed so that the
      active ref restored by kernfs_remove_self() after removal doesn't
      confuse the deactivation path.
      This makes implementing self-deleting nodes very easy.  The normal
      removal path doesn't even need to be changed to use
      kernfs_remove_self() for the self-deleting node.  The method can
      invoke kernfs_remove_self() on itself before proceeding with the normal
      removal path.  kernfs_remove() invoked on the node by the normal
      deletion path will simply be ignored.
      This will replace sysfs_schedule_callback().  A subtle feature of
      sysfs_schedule_callback() is that it collapses multiple invocations -
      even if multiple removals are triggered, the removal callback is run
      only once.  An equivalent effect can be achieved by testing the return
      value of kernfs_remove_self() - only the one which gets %true return
      value should proceed with actual deletion.  All other instances of
      kernfs_remove_self() will wait till the enclosing kernfs operation
      which invoked the winning instance of kernfs_remove_self() finishes
      and then return %false.  This trivially makes all users of
      kernfs_remove_self() automatically show correct synchronous behavior
      even when there are multiple concurrent operations - all "echo 1 >
      delete" instances will finish only after the whole operation is
      completed by one of the instances.
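      The "only the winner proceeds" rule can be sketched with a C11 atomic
      flag.  This is a simplified userspace model under stated assumptions:
      the names are hypothetical, and losing callers here return %false
      immediately, whereas the commit describes them waiting until the
      winning operation has finished before returning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical node with a one-shot "removal claimed" flag. */
struct sketch_node {
    atomic_bool removing;
    int deletions;          /* how many times the real deletion ran */
};

/*
 * Returns true for exactly one caller, the one that claims the removal.
 * Unlike the behaviour described in the commit, losers do not wait here.
 */
static bool sketch_remove_self(struct sketch_node *n)
{
    return !atomic_exchange(&n->removing, true);
}

/* Stand-in for a "delete" file's write method. */
static void sketch_delete_store(struct sketch_node *n)
{
    if (sketch_remove_self(n))
        n->deletions++;     /* only the winner performs the deletion */
}
```

      Calling sketch_delete_store() several times on the same node runs the
      deletion once, which corresponds to the collapsing of multiple
      invocations mentioned above.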
      Note that manipulation of active ref is implemented in separate public
      functions - kernfs_[un]break_active_protection().
      kernfs_remove_self() is the only user at the moment but this will be
      used to cater to more complex cases.
      v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
          and sysfs_remove_file_self() had incorrect return type.  Fix it.
          Reported by kbuild test bot.
      v3: kernfs_[un]break_active_protection() separated out from
          kernfs_remove_self() and exposed as public API.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_REMOVED · 81c173cb
      Tejun Heo authored
      KERNFS_REMOVED is used to mark half-initialized and dying nodes so
      that they don't show up in lookups and deny adding new nodes under or
      renaming it; however, its role overlaps that of deactivation.
      It's necessary to deny addition of new children while removal is in
      progress; however, this role considerably intersects with deactivation
      - KERNFS_REMOVED prevents new children while deactivation prevents new
      file operations.  There's no reason to have them separate making
      things more complex than necessary.
      This patch removes KERNFS_REMOVED.
      * Instead of KERNFS_REMOVED, each node now starts its life
        deactivated.  This means that we now use both atomic_add() and
        atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN.  The compiler
        generates overflow warnings when negating INT_MIN as the negation
        can't be represented as a positive number.  Nothing is actually
        broken but let's bump BIAS by one to avoid the warnings for archs
        which negate the subtrahend.
      * A new helper kernfs_active() which tests whether kn->active >= 0 is
        added for convenience and lockdep annotation.  All KERNFS_REMOVED
        tests are replaced with negated kernfs_active() tests.
      * __kernfs_remove() is updated to deactivate, but not drain, all nodes
        in the subtree instead of setting KERNFS_REMOVED.  This removes
        deactivation from kernfs_deactivate(), which is now renamed to
      * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
        checks on the active ref.
      * Some comment style updates in the affected area.
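      The bias arithmetic from the first two bullets can be tried out in
      userspace.  The sketch below uses a plain int where the kernel uses an
      atomic counter, and every name in it is hypothetical; it only shows
      why a bias of INT_MIN + 1 can be both added and subtracted safely and
      how an "active >= 0" test separates live nodes from deactivated ones.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* INT_MIN itself cannot be negated; INT_MIN + 1 negates to INT_MAX. */
#define SKETCH_DEACTIVATED_BIAS (INT_MIN + 1)

/* Hypothetical node; a plain int stands in for the atomic counter. */
struct sketch_node { int active; };

/* Every node starts its life deactivated. */
static void sketch_init(struct sketch_node *n)
{
    n->active = SKETCH_DEACTIVATED_BIAS;
}

/* A node is usable only while the counter is non-negative. */
static bool sketch_active(const struct sketch_node *n)
{
    return n->active >= 0;
}

static void sketch_activate(struct sketch_node *n)
{
    n->active -= SKETCH_DEACTIVATED_BIAS;
}

static void sketch_deactivate(struct sketch_node *n)
{
    n->active += SKETCH_DEACTIVATED_BIAS;
}

/* Take an active reference; refused once the node is deactivated. */
static bool sketch_get_active(struct sketch_node *n)
{
    if (!sketch_active(n))
        return false;
    n->active++;
    return true;
}

static void sketch_put_active(struct sketch_node *n)
{
    n->active--;
}
```

      After deactivation the counter still encodes how many references are
      outstanding (bias plus count), so a remover could tell when the last
      one has been dropped.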
      v2: Reordered before removal path restructuring.  kernfs_active()
          dropped and kernfs_get/put_active() used instead.  RB_EMPTY_NODE()
          used in the lookup paths.
      v3: Reverted most of v2 except for creating a new node with
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep() · 182fd64b
      Tejun Heo authored
      There currently are two mechanisms gating active ref lockdep
      annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
      The former disables lockdep annotations in kernfs_get/put_active()
      while the latter disables all of kernfs_deactivate().
      While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
      deactivation step for non-file nodes, the benefit is marginal and it
      needlessly diverges code paths.  Let's drop KERNFS_ACTIVE_REF.
      While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
      flag so that it's more convenient and the related code can be compiled
      out when not enabled.
      v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
          KERNFS_LOCKDEP flag").  As the earlier patch already added
          KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are
          dropped from this patch and the existing ones are simply converted
          to kernfs_lockdep().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove kernfs_addrm_cxt · 988cd7af
      Tejun Heo authored
      kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
      added because there were operations which should be performed outside
      kernfs_mutex after adding and removing kernfs_nodes.  The necessary
      operations were recorded in kernfs_addrm_cxt and performed by
      kernfs_addrm_finish(); however, after the recent changes which
      relocated deactivation and unmapping so that they're performed
      directly during removal, the only operation kernfs_addrm_finish()
      performs is kernfs_put(), which can be moved inside the removal path
      This patch moves the kernfs_put() of the base ref to __kernfs_remove()
      and removes kernfs_addrm_cxt and kernfs_addrm_start/finish().
      * kernfs_add_one() is updated to grab and release kernfs_mutex itself.
        sysfs_addrm_start/finish() invocations around it are removed from
        all users.
      * __kernfs_remove() puts an unlinked node directly instead of chaining
        it to kernfs_addrm_cxt.  Its callers are updated to grab and release
        kernfs_mutex instead of calling kernfs_addrm_start/finish() around
      v2: Rebased on top of "kernfs: associate a new kernfs_node with its
          parent on creation" which dropped @parent from kernfs_add_one().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq · abd54f02
      Tejun Heo authored
      kernfs_node->u.completion is used to notify deactivation completion
      from kernfs_put_active() to kernfs_deactivate().  We now allow
      multiple racing removals of the same node and the current removal
      scheme is no longer correct - kernfs_remove() invocation may return
      before the node is properly deactivated if it races against another
      removal.  The removal path will be restructured to address the issue.
      To help such restructuring, which requires supporting multiple waiters,
      this patch replaces kernfs_node->u.completion with
      kernfs_root->deactivate_waitq.  This makes deactivation event
      notifications share a per-root waitqueue_head; however, the wait path
      is quite cold and this will also allow shaving one pointer off
      v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor
          KERNFS_LOCKDEP flag").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  19. 17 Jan, 2014 1 commit
  20. 13 Jan, 2014 7 commits