1. 16 Feb, 2016 1 commit
    • cgroup: introduce cgroup namespaces · a79a908f
      Aditya Kali authored
      Introduce the ability to create a new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred to as cgroupns-root).
      The main purpose of a cgroup namespace is to virtualize the contents
      of the /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly set up container this enables container tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking the system-level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
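The path virtualization described in a79a908f can be modeled in userspace. The sketch below is not the kernel's implementation: the function name `cgroup_path_in_ns` is hypothetical, paths are assumed to be absolute with no trailing slash, and rendering a cgroup outside the namespace root with leading `/..` components is an assumption about how the "relative path from their cgroupns-root" would look.

```c
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix shared by a and b that ends on a
 * path-component boundary (a '/' or the end of either string). */
static size_t common_prefix_len(const char *a, const char *b)
{
    size_t i = 0, last = 0;

    while (a[i] != '\0' && a[i] == b[i]) {
        if (a[i] == '/')
            last = i;
        i++;
    }
    if ((a[i] == '\0' || a[i] == '/') && (b[i] == '\0' || b[i] == '/'))
        last = i;
    return last;
}

/* Hypothetical model: write into out how the cgroup `path` would appear
 * to a task whose cgroup namespace root is `ns_root`.
 * Returns 0 on success, -1 if out is too small. */
static int cgroup_path_in_ns(const char *ns_root, const char *path,
                             char *out, size_t len)
{
    size_t used = 0, common, rest_len;
    const char *p, *rest;

    /* Treat the hierarchy root "/" as the empty prefix. */
    if (strcmp(ns_root, "/") == 0)
        ns_root = "";
    if (strcmp(path, "/") == 0)
        path = "";
    if (len < 2)
        return -1;

    common = common_prefix_len(ns_root, path);

    /* One "/.." for every component of ns_root below the common ancestor. */
    for (p = ns_root + common; *p != '\0'; p++) {
        if (*p == '/') {
            if (used + 3 >= len)
                return -1;
            memcpy(out + used, "/..", 3);
            used += 3;
        }
    }

    /* Then the part of path below the common ancestor. */
    rest = path + common;
    rest_len = strlen(rest);
    if (used + rest_len >= len)
        return -1;
    memcpy(out + used, rest, rest_len);
    used += rest_len;

    if (used == 0)
        out[used++] = '/';   /* the task sits exactly at its cgroupns-root */
    out[used] = '\0';
    return 0;
}
```

Under these assumptions, a task in `/docker/abc/app` whose namespace root is `/docker/abc` would see `/app`, and the same task moved to `/docker/other` would see `/../other`.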
  2. 29 Jul, 2014 1 commit
    • namespaces: Use task_lock and not rcu to protect nsproxy · 728dba3a
      Eric W. Biederman authored
      The synchronous synchronize_rcu in switch_task_namespaces makes setns
      a sufficiently expensive system call that people have complained.

      Upon inspection, nsproxy no longer needs rcu protection for remote reads.
      Remote reads are rare.  So optimize for same-process reads and writes
      by switching to task_lock instead.
      
      This yields a simpler to understand lock, and a faster setns system call.
      
      In particular this fixes a performance regression observed
      by Rafael David Tinoco <rafael.tinoco@canonical.com>.
      
      This is effectively a revert of Pavel Emelyanov's commit
      cf7b708c Make access to task's nsproxy lighter
      from 2007.  The race this originally fixed no longer exists as
      do_notify_parent uses task_active_pid_ns(parent) instead of
      parent->nsproxy.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
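The locking scheme in 728dba3a can be illustrated with a userspace analogy. This is a sketch under stated assumptions, not kernel code: a pthread mutex stands in for task_lock, and every type and function name (`task_model`, `nsproxy_model`, and so on) is hypothetical. The point is that a remote reader takes its reference while holding the per-task lock, so the writer can swap the pointer and drop the old reference immediately, with no grace period to wait for.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted nsproxy. */
struct nsproxy_model {
    atomic_int refcount;
    int id;
};

/* Hypothetical stand-in for a task; the mutex plays the role of task_lock. */
struct task_model {
    pthread_mutex_t lock;
    struct nsproxy_model *nsproxy;
};

static struct nsproxy_model *nsproxy_model_new(int id)
{
    struct nsproxy_model *ns = malloc(sizeof(*ns));

    if (ns == NULL)
        return NULL;
    atomic_init(&ns->refcount, 1);   /* the caller's reference */
    ns->id = id;
    return ns;
}

static void nsproxy_model_put(struct nsproxy_model *ns)
{
    /* Free on the last reference; atomic_fetch_sub returns the old value. */
    if (ns != NULL && atomic_fetch_sub(&ns->refcount, 1) == 1)
        free(ns);
}

static void task_model_init(struct task_model *t, struct nsproxy_model *ns)
{
    pthread_mutex_init(&t->lock, NULL);
    t->nsproxy = ns;                 /* the task takes over the caller's reference */
}

/* Remote read: take a reference to another task's nsproxy under its lock.
 * Returns NULL if the task no longer has one. */
static struct nsproxy_model *task_model_get_nsproxy(struct task_model *t)
{
    struct nsproxy_model *ns;

    pthread_mutex_lock(&t->lock);
    ns = t->nsproxy;
    if (ns != NULL)
        atomic_fetch_add(&ns->refcount, 1);
    pthread_mutex_unlock(&t->lock);
    return ns;
}

/* Write: swap the pointer under the lock, then drop the old reference
 * right away instead of waiting for readers to drain. */
static void task_model_switch_namespaces(struct task_model *t,
                                         struct nsproxy_model *new_ns)
{
    struct nsproxy_model *old;

    pthread_mutex_lock(&t->lock);
    old = t->nsproxy;
    t->nsproxy = new_ns;
    pthread_mutex_unlock(&t->lock);
    nsproxy_model_put(old);
}
```

A reader that obtained a reference before the switch keeps the old object alive until it calls `nsproxy_model_put()`. Depending on the toolchain, building this may require `-pthread`.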
  3. 27 Aug, 2013 1 commit
  4. 20 Nov, 2012 1 commit
  5. 19 Jul, 2011 1 commit
  6. 26 May, 2011 1 commit
  7. 31 Mar, 2009 1 commit
  8. 24 Nov, 2008 1 commit
    • User namespaces: set of cleanups (v2) · 18b6e041
      Serge Hallyn authored
      The user_ns is moved from nsproxy to user_struct, so that a struct
      cred by itself is sufficient to determine access (which it otherwise
      would not be).  Corresponding ecryptfs fixes (by David Howells) are
      here as well.
      
      Fix refcounting.  The following rules now apply:
              1. The task pins the user struct.
              2. The user struct pins its user namespace.
              3. The user namespace pins the struct user which created it.
      
      User namespaces are cloned during copy_creds().  Unsharing a new user_ns
      is no longer possible.  (We could re-add that, but it'll cause code
      duplication and doesn't seem useful if PAM doesn't need to clone user
      namespaces).
      
      When a user namespace is created, its first user (uid 0) gets empty
      keyrings and a clean group_info.
      
      This incorporates a previous patch by David Howells.  Here
      is his original patch description:
      
      >I suggest adding the attached incremental patch.  It makes the following
      >changes:
      >
      > (1) Provides a current_user_ns() macro to wrap accesses to current's user
      >     namespace.
      >
      > (2) Fixes eCryptFS.
      >
      > (3) Renames create_new_userns() to create_user_ns() to be more consistent
      >     with the other associated functions and because the 'new' in the name is
      >     superfluous.
      >
      > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
      >     beginning of do_fork() so that they're done prior to making any attempts
      >     at allocation.
      >
      > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
      >     to fill in rather than have it return the new root user.  I don't imagine
      >     the new root user being used for anything other than filling in a cred
      >     struct.
      >
      >     This also permits me to get rid of a get_uid() and a free_uid(), as the
      >     reference the creds were holding on the old user_struct can just be
      >     transferred to the new namespace's creator pointer.
      >
      > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
      >     preparation rather than doing it in copy_creds().
      >
      >David
      
      >Signed-off-by: David Howells <dhowells@redhat.com>
      
      Changelog:
      	Oct 20: integrate dhowells comments
      		1. leave thread_keyring alone
      		2. use current_user_ns() in set_user()
      Signed-off-by: Serge Hallyn <serue@us.ibm.com>
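The three pinning rules listed in 18b6e041 form a reference chain: task → user struct → user namespace → creating user struct. The following is a minimal userspace model of that chain, with plain integer counts and hypothetical names (`user_model`, `user_ns_model`, `live_objects`); it only illustrates the ownership direction and is not the kernel's data layout.

```c
#include <stdlib.h>

struct user_ns_model;

/* Hypothetical model of a user struct: pins its user namespace (rule 2). */
struct user_model {
    int refs;
    struct user_ns_model *ns;
};

/* Hypothetical model of a user namespace: pins its creator (rule 3). */
struct user_ns_model {
    int refs;
    struct user_model *creator;   /* NULL for the initial namespace */
};

static int live_objects;          /* bookkeeping for the example only */

static void put_user_model(struct user_model *u);

static void put_ns_model(struct user_ns_model *ns)
{
    if (ns == NULL || --ns->refs > 0)
        return;
    put_user_model(ns->creator);  /* release the creator pinned by rule 3 */
    free(ns);
    live_objects--;
}

static void put_user_model(struct user_model *u)
{
    if (u == NULL || --u->refs > 0)
        return;
    put_ns_model(u->ns);          /* release the namespace pinned by rule 2 */
    free(u);
    live_objects--;
}

/* Create a namespace; the returned object carries one reference for the caller. */
static struct user_ns_model *new_ns_model(struct user_model *creator)
{
    struct user_ns_model *ns = calloc(1, sizeof(*ns));

    if (ns == NULL)
        return NULL;
    ns->refs = 1;
    ns->creator = creator;
    if (creator != NULL)
        creator->refs++;          /* rule 3 */
    live_objects++;
    return ns;
}

/* Create a user in ns; the returned reference models the task's pin (rule 1). */
static struct user_model *new_user_model(struct user_ns_model *ns)
{
    struct user_model *u = calloc(1, sizeof(*u));

    if (u == NULL)
        return NULL;
    u->refs = 1;
    u->ns = ns;
    ns->refs++;                   /* rule 2 */
    live_objects++;
    return u;
}
```

In this model, dropping the last task reference to a user in a child namespace releases the child namespace, which in turn releases the user that created it.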
  9. 25 Jul, 2008 1 commit
    • cgroup_clone: use pid of newly created task for new cgroup · e885dcde
      Serge E. Hallyn authored
      cgroup_clone creates a new cgroup with the pid of the task.  This works
      correctly for unshare, but for clone cgroup_clone is called from
      copy_namespaces inside copy_process, which happens before the new pid is
      created.  As a result, the new cgroup was created with current's pid.
      This patch:
      
      	1. Moves the call inside copy_process to after the new pid
      	   is created
      	2. Passes the struct pid into ns_cgroup_clone (as it is not
      	   yet attached to the task)
      	3. Passes a name from ns_cgroup_clone() into cgroup_clone()
      	   so as to keep cgroup_clone() itself simpler
      	4. Uses pid_vnr() to get the process id value, so that the
      	   pid used to name the new cgroup is always the pid as it
      	   would be known to the task which did the cloning or
      	   unsharing.  I think that is the most intuitive thing to
      	   do.  This way, task t1 does clone(CLONE_NEWPID) to get
      	   t2, which does clone(CLONE_NEWPID) to get t3, then the
      	   cgroup for t3 will be named for the pid by which t2 knows
      	   t3.
      
      (Thanks to Dan Smith for finding the main bug)
      
      Changelog:
      	June 11: Incorporate Paul Menage's feedback:  don't pass
      	         NULL to ns_cgroup_clone from unshare, and reduce
      		 patch size by using 'nodename' in cgroup_clone.
      	June 10: Original version
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Serge Hallyn <serge@us.ibm.com>
      Acked-by: Paul Menage <menage@google.com>
      Tested-by: Dan Smith <danms@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 19 Oct, 2007 2 commits
    • Make access to task's nsproxy lighter · cf7b708c
      Pavel Emelyanov authored
      When someone wants to deal with some other task's namespaces it has to lock
      the task and then get the desired namespace, if one exists.  This is
      slow on read-only paths and may be impossible in some cases.
      
      E.g.  Oleg recently noticed a race between unshare() and the (sent for
      review in cgroups) pid namespaces - when the task notifies the parent it
      has to know the parent's namespace, but taking the task_lock() is
      impossible there - the code runs under the write-locked tasklist lock.
      
      On the other hand, switching the namespace on a task (daemonize) and releasing
      the namespace (after the last task exits) are rather rare operations and we
      can sacrifice their speed to solve the issues above.
      
      The access to other task namespaces is proposed to be performed
      like this:
      
           rcu_read_lock();
           nsproxy = task_nsproxy(tsk);
           if (nsproxy != NULL) {
                   /*
                    * work with the namespaces here
                    * e.g. get the reference on one of them
                    */
           } /*
              * NULL task_nsproxy() means that this task is
              * almost dead (zombie)
              */
           rcu_read_unlock();
      
      This patch has passed review by Eric and Oleg :) and,
      of course, has been tested.
      
      [clg@fr.ibm.com: fix unshare()]
      [ebiederm@xmission.com: Update get_net_ns_by_pid]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: implement namespace tracking subsystem · 858d72ea
      Serge E. Hallyn authored
      When a task enters a new namespace via a clone() or unshare(), a new cgroup
      is created and the task moves into it.
      
      This version names cgroups which are automatically created using
      cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
      cloned process.  (Thanks Pavel for the idea) This is safe because if the
      process unshares again, it will create
      
      	/cgroups/(...)/node_<pid>/node_<pid>
      
      The only possibilities (AFAICT) for a -EEXIST on unshare are
      
      	1. pid wraparound
      	2. a process fails an unshare, then tries again.
      
      Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
      node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
      succeed.
      
      Changelog:
      	Name cloned cgroups as "node_<pid>".
      
      [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
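The "node_<pid>" naming and the -EEXIST recovery described in 858d72ea can be sketched in userspace with ordinary directories. This is an illustration under assumptions, not the kernel's cgroup_clone(): the helpers `node_name` and `make_node_dir` are hypothetical, and plain mkdir()/rmdir() stand in for cgroup creation and removal. It mirrors case 2 of the message: a leftover empty node_<pid> is removed so that the retry succeeds.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: format the automatic name "node_<pid>". */
static int node_name(char *buf, size_t len, pid_t pid)
{
    int n = snprintf(buf, len, "node_%ld", (long)pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: create <parent>/node_<pid> and store its path.
 * If the directory already exists, assume it is a leftover from a failed
 * attempt, remove it (rmdir only succeeds when it is empty) and retry once.
 * Returns 0 on success, -1 on failure. */
static int make_node_dir(const char *parent, pid_t pid, char *path, size_t len)
{
    char name[32];
    int n;

    if (node_name(name, sizeof(name), pid) != 0)
        return -1;
    n = snprintf(path, len, "%s/%s", parent, name);
    if (n < 0 || (size_t)n >= len) {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (mkdir(path, 0755) == 0)
        return 0;
    if (errno != EEXIST)
        return -1;
    if (rmdir(path) != 0)
        return -1;                /* not empty: the node is still in use */
    return mkdir(path, 0755);
}
```

Calling `make_node_dir()` again with the previous result as `parent` produces the nested `node_<pid>/node_<pid>` layout the message mentions for a process that unshares twice.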
  11. 17 Oct, 2007 1 commit
  12. 10 Oct, 2007 1 commit
  13. 16 Jul, 2007 2 commits
  14. 08 May, 2007 1 commit
    • Merge sys_clone()/sys_unshare() nsproxy and namespace handling · e3222c4e
      Badari Pulavarty authored
      sys_clone() and sys_unshare() both make copies of nsproxy and its associated
      namespaces.  But they have different code paths.
      
      This patch merges all the nsproxy and its associated namespace copy/clone
      handling (as much as possible).  Posted on container list earlier for
      feedback.
      
      - Create a new nsproxy and its associated namespaces and pass it back to
        caller to attach it to right process.
      
      - Changed all copy_*_ns() routines to return a new copy of namespace
        instead of attaching it to task->nsproxy.
      
      - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
      
      - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
        just in case.
      
      - Get rid of all individual unshare_*_ns() routines and make use of
        copy_*_ns() instead.
      
      [akpm@osdl.org: cleanups, warning fix]
      [clg@fr.ibm.com: remove dup_namespaces() declaration]
      [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
      [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <containers@lists.osdl.org>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 30 Jan, 2007 2 commits
  16. 13 Dec, 2006 1 commit
  17. 08 Dec, 2006 3 commits
  18. 25 Nov, 2006 1 commit
    • [PATCH] mountstats NULL pointer dereference · 701e054e
      Vasily Tarasov authored
      The OpenVZ developers team encountered the following problem in the
      2.6.19-rc6 kernel. After a few seconds of running the script
      
      while [[ 1 ]]
      do
      	find  /proc -name mountstats | xargs cat
      done
      
      this Oops appears:
      
      BUG: unable to handle kernel NULL pointer dereference at virtual address
      00000010
       printing eip:
      c01a6b70
      *pde = 00000000
      Oops: 0000 [#1]
      SMP
      Modules linked in: xt_length ipt_ttl xt_tcpmss ipt_TCPMSS iptable_mangle
      iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables x_tables
      parport_pc lp parport sunrpc af_packet thermal processor fan button battery
      asus_acpi ac ohci_hcd ehci_hcd usbcore i2c_nforce2 i2c_core tg3 floppy
      pata_amd
      ide_cd cdrom sata_nv libata
      CPU:    1
      EIP:    0060:[<c01a6b70>]    Not tainted VLI
      EFLAGS: 00010246   (2.6.19-rc6 #2)
      EIP is at mountstats_open+0x70/0xf0
      eax: 00000000   ebx: e6247030   ecx: e62470f8   edx: 00000000
      esi: 00000000   edi: c01a6b00   ebp: c33b83c0   esp: f4105eb4
      ds: 007b   es: 007b   ss: 0068
      Process cat (pid: 6044, ti=f4105000 task=f4104a70 task.ti=f4105000)
      Stack: c33b83c0 c04ee940 f46a4a80 c33b83c0 e4df31b4 c01a6b00 f4105000 c0169231
             e4df31b4 c33b83c0 c33b83c0 f4105f20 00000003 f4105000 c0169445 f2503cf0
             f7f8c4c0 00008000 c33b83c0 00000000 00008000 c0169350 f4105f20 00008000
      Call Trace:
       [<c01a6b00>] mountstats_open+0x0/0xf0
       [<c0169231>] __dentry_open+0x181/0x250
       [<c0169445>] nameidata_to_filp+0x35/0x50
       [<c0169350>] do_filp_open+0x50/0x60
       [<c01873d6>] seq_read+0xc6/0x300
       [<c0169511>] get_unused_fd+0x31/0xc0
       [<c01696d3>] do_sys_open+0x63/0x110
       [<c01697a7>] sys_open+0x27/0x30
       [<c01030bd>] sysenter_past_esp+0x56/0x79
       =======================
      Code: 45 74 8b 54 24 20 89 44 24 08 8b 42 f0 31 d2 e8 47 cb f8 ff 85 c0 89 c3
      74 51 8d 80 a0 04 00 00 e8 46 06 2c 00 8b 83 48 04 00 00 <8b> 78 10 85 ff 74
      03
      f0 ff 07 b0 01 86 83 a0 04 00 00 f0 ff 4b
      EIP: [<c01a6b70>] mountstats_open+0x70/0xf0 SS:ESP 0068:f4105eb4
      
      The problem is that task->nsproxy can be NULL for some time during
      task exit. This patch fixes the BUG.
      Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  19. 02 Oct, 2006 4 commits