1. 15 Apr, 2015 6 commits
    • kernel/reboot.c: add orderly_reboot for graceful reboot · 7a54f46b
      Joel Stanley authored
      The kernel has orderly_poweroff which allows the kernel to initiate a
      graceful shutdown of userspace, by running /sbin/poweroff.  This adds
      orderly_reboot that will cause userspace to shut itself down by calling
      /sbin/reboot.
      
      This will be used for shutdown initiated by a system controller on
      platforms that do not use ACPI.
      
      orderly_reboot() should be used when the system wants to allow userspace
      to gracefully shut itself down.  For cases where the system may imminently
      catch on fire, the existing emergency_restart() provides an immediate
      reboot without involving userspace.
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
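The choice between the two reboot paths described in this commit can be sketched in userspace C. This is only a model under stated assumptions: `stub_orderly_reboot()` and `stub_emergency_restart()` are stand-ins that record which path ran (they are not the kernel functions), and `handle_controller_reboot()` with its `hardware_at_risk` flag is a hypothetical caller invented for illustration.

```c
#include <stdbool.h>

/* Which reboot path the model took. */
enum reboot_path { PATH_NONE, PATH_ORDERLY, PATH_EMERGENCY };

static enum reboot_path last_path = PATH_NONE;

/* Stand-in for orderly_reboot(): userspace would be asked to run /sbin/reboot. */
static void stub_orderly_reboot(void)
{
	last_path = PATH_ORDERLY;
}

/* Stand-in for emergency_restart(): immediate reboot, userspace not involved. */
static void stub_emergency_restart(void)
{
	last_path = PATH_EMERGENCY;
}

/* Hypothetical handler for a reboot request from a system controller. */
enum reboot_path handle_controller_reboot(bool hardware_at_risk)
{
	if (hardware_at_risk)
		stub_emergency_restart();	/* no time for a graceful shutdown */
	else
		stub_orderly_reboot();		/* let userspace shut itself down */
	return last_path;
}
```

The graceful path is the default in this sketch; the immediate path is reserved for the case the commit message describes as imminent hardware danger.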
    • kernel/hung_task.c: change hung_task.c to use for_each_process_thread() · 972fae69
      Aaron Tomlin authored
      In check_hung_uninterruptible_tasks() avoid the use of deprecated
      while_each_thread().
      
      The "max_count" logic will prevent a livelock - see commit 0c740d0a
      ("introduce for_each_thread() to replace the buggy while_each_thread()").
      Having said this, let's use for_each_process_thread().
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Wysochanski <dwysocha@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
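A rough userspace analogue of a bounded walk over every thread of every process may help picture the loop shape. Everything here is hypothetical: the `toy_` types, the `toy_for_each_process_thread()` macro and `count_blocked()` only mimic the idea of a nested iteration capped by a "max_count" budget, and they do not reproduce the kernel's task lists or RCU locking.

```c
#include <stddef.h>

struct toy_thread {
	int uninterruptible;		/* non-zero models an uninterruptible task */
};

struct toy_process {
	struct toy_thread *threads;
	size_t nr_threads;
};

/* Toy nested walk: every thread of every process, in order. */
#define toy_for_each_process_thread(procs, nr, p, t)			\
	for ((p) = (procs); (p) < (procs) + (nr); (p)++)		\
		for ((t) = (p)->threads;				\
		     (t) < (p)->threads + (p)->nr_threads; (t)++)

/* Count blocked threads, giving up once max_count threads were examined. */
size_t count_blocked(struct toy_process *procs, size_t nr, size_t max_count)
{
	struct toy_process *p;
	struct toy_thread *t;
	size_t blocked = 0;

	toy_for_each_process_thread(procs, nr, p, t) {
		if (!max_count--)
			goto out;	/* budget spent: leave both loops */
		if (t->uninterruptible)
			blocked++;
	}
out:
	return blocked;
}
```

The budget check bounds the work done in one pass regardless of how many threads exist, which is the role the commit message assigns to "max_count".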
    • kernel/resource.c: remove deprecated __check_region() and friends · 96831c0a
      Jakub Sitnicki authored
      All users of __check_region(), check_region(), and check_mem_region() are
      gone.  We got rid of the last user in v4.0-rc1.  Remove them.
      
      bloat-o-meter on x86_64 shows:
      
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
      function                                     old     new   delta
      __kstrtab___check_region                      15       -     -15
      __ksymtab___check_region                      16       -     -16
      __check_region                                71       -     -71
      Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel: conditionally support non-root users, groups and capabilities · 2813893f
      Iulia Manda authored
      There are a lot of embedded systems that run most or all of their
      functionality in init, running as root:root.  For these systems,
      supporting multiple users is not necessary.
      
      This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
      non-root users, non-root groups, and capabilities optional.  It is enabled
      under the CONFIG_EXPERT menu.
      
      When this symbol is not defined, UID and GID are zero in any possible case
      and processes always have all capabilities.
      
      The following syscalls are compiled out: setuid, setregid, setgid,
      setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
      getgroups, setfsuid, setfsgid, capget, capset.
      
      Also, groups.c is compiled out completely.
      
      In kernel/capability.c, the capable() function was moved in order to avoid
      adding two ifdef blocks.
      
      This change saves about 25 KB on a defconfig build.  The most minimal
      kernels have total text sizes in the high hundreds of KB rather than
      low MB.  (The 25 KB goes down a bit with allnoconfig, but not that much.)
      
      The kernel was booted in QEMU.  All common functionality works.
      Adding users/groups is not possible, failing with -ENOSYS.
      
      Bloat-o-meter output:
      add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
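The effect of leaving the symbol undefined can be pictured with a small compile-time switch. This is a sketch, not the kernel's code: `toy_setuid()`, `toy_getuid()` and `toy_capable()` are hypothetical names, the multiuser branch is deliberately oversimplified, and CONFIG_MULTIUSER is left undefined here to model the single-user build.

```c
#include <errno.h>

/*
 * CONFIG_MULTIUSER is intentionally not defined in this sketch, so the
 * single-user branch is the one that gets compiled.
 */
#ifdef CONFIG_MULTIUSER

static unsigned int toy_uid;	/* oversimplified per-system UID */

long toy_setuid(unsigned int uid)
{
	toy_uid = uid;
	return 0;
}

unsigned int toy_getuid(void)
{
	return toy_uid;
}

int toy_capable(int cap)
{
	(void)cap;
	return toy_uid == 0;	/* crude model: only UID 0 is capable */
}

#else /* !CONFIG_MULTIUSER */

/* The syscall is compiled out, so callers see -ENOSYS. */
long toy_setuid(unsigned int uid)
{
	(void)uid;
	return -ENOSYS;
}

/* UID is zero in every possible case. */
unsigned int toy_getuid(void)
{
	return 0;
}

/* Processes always have all capabilities. */
int toy_capable(int cap)
{
	(void)cap;
	return 1;
}

#endif
```

Building the same file with `-DCONFIG_MULTIUSER` selects the other branch, which is how one option can remove the code rather than disable it at run time.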
    • mm: allow compaction of unevictable pages · 5bbe3547
      Eric B Munson authored
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
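The policy that the sysctl selects reduces to a single predicate. The sketch below is a userspace model under stated assumptions: the variable merely mirrors the name and default value given in the commit message, and `toy_may_compact()` is a hypothetical helper, not a function from the compaction code.

```c
#include <stdbool.h>

/* Models the compact_unevictable_allowed sysctl; 1 is the stated default. */
int compact_unevictable_allowed = 1;

/* May compaction move this page? Evictable pages are always candidates. */
bool toy_may_compact(bool page_unevictable)
{
	if (!page_unevictable)
		return true;
	return compact_unevictable_allowed != 0;
}
```

With the value at 0, locked pages stay where they are and fragmented memory may stay fragmented; with the value at 1, locked pages can be moved so that large contiguous areas can be formed.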
    • memcg: zap mem_cgroup_lookup() · adbe427b
      Vladimir Davydov authored
      mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
      checks that id != 0 before issuing the function call.  Today, there is
      no point in this additional check apart from optimization, because there
      is no css with id <= 0, so that css_from_id, called by
      mem_cgroup_from_id, will return NULL for any id <= 0.
      
      Since mem_cgroup_from_id is only called from mem_cgroup_lookup, let us
      zap mem_cgroup_lookup, substituting calls to it with mem_cgroup_from_id
      and moving the check if id > 0 to css_from_id.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 14 Apr, 2015 10 commits
  3. 13 Apr, 2015 1 commit
    • cpu: Defer smpboot kthread unparking until CPU known to scheduler · 00df35f9
      Paul E. McKenney authored
      Currently, smpboot_unpark_threads() is invoked before the incoming CPU
      has been added to the scheduler's runqueue structures.  This might
      potentially cause the unparked kthread to run on the wrong CPU, since the
      correct CPU isn't fully set up yet.
      
      That causes a sporadic, hard to debug boot crash triggering on some
      systems, reported by Borislav Petkov, and bisected down to:
      
        2a442c9c ("x86: Use common outgoing-CPU-notification code")
      
      This patch places smpboot_unpark_threads() in a CPU hotplug
      notifier with priority set so that these kthreads are unparked just after
      the CPU has been added to the runqueues.
      Reported-and-tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 09 Apr, 2015 1 commit
    • locking/mutex: Further simplify mutex_spin_on_owner() · 01ac33c1
      Jason Low authored
      Similar to what Linus suggested for rwsem_spin_on_owner(), in
      mutex_spin_on_owner() instead of having while (true) and
      breaking out of the spin loop on lock->owner != owner, we can
      have the loop directly check for while (lock->owner == owner) to
      improve the readability of the code.
      
      It also shrinks the code a bit:
      
         text    data     bss     dec     hex filename
         3721       0       0    3721     e89 mutex.o.before
         3705       0       0    3705     e79 mutex.o.after
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/1428521960-5268-2-git-send-email-jason.low2@hp.com
      [ Added code generation info. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
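The two loop shapes contrasted in this commit can be compared side by side in a single-threaded model. The sketch assumes a hypothetical `struct toy_lock` whose owner disappears after a fixed number of polls; the real function also has to consider rescheduling and whether the owner is still running, which this model leaves out.

```c
#include <stdbool.h>

struct toy_lock {
	int owner;			/* 0 means unowned */
	int polls_until_release;	/* owner vanishes after this many polls */
};

/* One poll of the owner field; models the lock being released over time. */
static int toy_read_owner(struct toy_lock *lock)
{
	if (lock->polls_until_release-- <= 0)
		lock->owner = 0;
	return lock->owner;
}

/* Before: endless loop with an explicit break on owner change. */
int spin_before(struct toy_lock *lock, int owner)
{
	int spins = 0;

	while (true) {
		if (toy_read_owner(lock) != owner)
			break;
		spins++;
	}
	return spins;
}

/* After: the exit condition is the loop condition itself. */
int spin_after(struct toy_lock *lock, int owner)
{
	int spins = 0;

	while (toy_read_owner(lock) == owner)
		spins++;
	return spins;
}
```

Both functions spin the same number of times for the same input; only the readability of the exit condition differs.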
  5. 08 Apr, 2015 6 commits
    • Copy the kernel module data from user space in chunks · 3afe9f84
      Linus Torvalds authored
      Unlike most (all?) other copies from user space, kernel module loading
      is almost unlimited in size.  So we do a potentially huge
      "copy_from_user()" when we copy the module data from user space to the
      kernel buffer, which can be a latency concern when preemption is
      disabled (or voluntary).
      
      Also, because 'copy_from_user()' clears the tail of the kernel buffer on
      failures, even a *failed* copy can end up wasting a lot of time.
      
      Normally neither of these are concerns in real life, but they do trigger
      when doing stress-testing with trinity.  Running in a VM seems to add
      its own overhead, causing trinity module load testing to even trigger
      the watchdog.
      
      The simple fix is to just chunk up the module loading, so that it never
      tries to copy insanely big areas in one go.  That bounds the latency,
      and also the amount of (unnecessarily, in this case) cleared memory for
      the failure case.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
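The chunking idea can be illustrated with a userspace analogue. This is a sketch under assumptions, not the patch itself: `toy_copy()` stands in for copy_from_user() (it returns the number of bytes not copied), `TOY_CHUNK_SIZE` is a hypothetical value because the commit message does not state one, and the `chunks` counter exists only so the behaviour can be observed.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK_SIZE 4096u	/* hypothetical chunk size for this sketch */

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied. */
static size_t toy_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Copy len bytes in bounded pieces; returns 0 on success, -1 on failure. */
int copy_chunked(void *dst, const void *src, size_t len, size_t *chunks)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	*chunks = 0;
	while (len) {
		size_t n = len < TOY_CHUNK_SIZE ? len : TOY_CHUNK_SIZE;

		if (toy_copy(d, s, n) != 0)
			return -1;	/* at most one chunk of work is wasted */
		/* A kernel version could offer to reschedule between chunks. */
		d += n;
		s += n;
		len -= n;
		(*chunks)++;
	}
	return 0;
}
```

Because no single copy exceeds the chunk size, both the time spent in one call and the amount of memory touched on a failed copy are bounded.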
    • genirq: Allow the irqchip state of an IRQ to be save/restored · 1b7047ed
      Marc Zyngier authored
      There are a number of cases where a kernel subsystem may want to
      introspect the state of an interrupt at the irqchip level:
      
      - When a peripheral is shared between virtual machines,
        its interrupt state becomes part of the guest's state,
        and must be switched accordingly. KVM on arm/arm64 requires
        this for its guest-visible timer.
      - Some GPIO controllers seem to require peeking into the
        interrupt controller they are connected to in order to report
        their internal state.
      
      This seems to be a pattern that is common enough for the core code
      to try and support this without too many horrible hacks. Introduce
      a pair of accessors (irq_get_irqchip_state/irq_set_irqchip_state)
      to retrieve the bits that can be of interest to another subsystem:
      pending, active, and masked.
      
      - irq_get_irqchip_state returns the state of the interrupt according
        to a parameter set to IRQCHIP_STATE_PENDING, IRQCHIP_STATE_ACTIVE,
        IRQCHIP_STATE_MASKED or IRQCHIP_STATE_LINE_LEVEL.
      - irq_set_irqchip_state similarly sets the state of the interrupt.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
      Tested-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Phong Vo <pvo@apm.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Tin Huynh <tnhuynh@apm.com>
      Cc: Y Vo <yvo@apm.com>
      Cc: Toan Le <toanle@apm.com>
      Cc: Bjorn Andersson <bjorn@kryo.se>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/1426676484-21812-2-git-send-email-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
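A get/set accessor pair keyed by a state selector can be modelled in userspace as follows. All names carry a `toy_` prefix to make clear they are hypothetical: the enum mirrors the four selectors listed in the commit message, while the per-IRQ bitmask stands in for whatever a real interrupt controller driver would read from or write to hardware.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the selectors named in the commit message. */
enum toy_irqchip_state {
	TOY_STATE_PENDING,
	TOY_STATE_ACTIVE,
	TOY_STATE_MASKED,
	TOY_STATE_LINE_LEVEL,
};

#define TOY_NR_IRQS 8

/* One bit per state, per interrupt; a real chip keeps this in hardware. */
static unsigned int toy_chip_state[TOY_NR_IRQS];

/* Set or clear one state bit of an interrupt. */
int toy_set_irqchip_state(unsigned int irq, enum toy_irqchip_state which,
			  bool val)
{
	if (irq >= TOY_NR_IRQS)
		return -EINVAL;
	if (val)
		toy_chip_state[irq] |= 1u << which;
	else
		toy_chip_state[irq] &= ~(1u << which);
	return 0;
}

/* Read one state bit of an interrupt into *state. */
int toy_get_irqchip_state(unsigned int irq, enum toy_irqchip_state which,
			  bool *state)
{
	if (irq >= TOY_NR_IRQS)
		return -EINVAL;
	*state = (toy_chip_state[irq] >> which) & 1u;
	return 0;
}
```

Saving a guest's interrupt state would then amount to reading the bits of interest with the getter, and restoring it to writing them back with the setter.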
    • genirq: MSI: Fix freeing of unallocated MSI · fe0c52fc
      Marc Zyngier authored
      While debugging an unrelated issue with the GICv3 ITS driver, the
      following trace triggered:
      
      WARNING: CPU: 1 PID: 1 at kernel/irq/irqdomain.c:1121 irq_domain_free_irqs+0x160/0x17c()
      NULL pointer, cannot free irq
      Modules linked in:
      CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W      3.19.0-rc6+ #3690
      Hardware name: FVP Base (DT)
      Call trace:
      [<ffffffc000089398>] dump_backtrace+0x0/0x13c
      [<ffffffc0000894e4>] show_stack+0x10/0x1c
      [<ffffffc00066d134>] dump_stack+0x74/0x94
      [<ffffffc0000a92f8>] warn_slowpath_common+0x9c/0xd4
      [<ffffffc0000a938c>] warn_slowpath_fmt+0x5c/0x80
      [<ffffffc0000ee04c>] irq_domain_free_irqs+0x15c/0x17c
      [<ffffffc0000ef918>] msi_domain_free_irqs+0x58/0x74
      [<ffffffc000386f58>] free_msi_irqs+0xb4/0x1c0
      
          // The msi_prepare callback fails here
      
      [<ffffffc0003872c0>] pci_enable_msix+0x25c/0x3d4
      [<ffffffc00038746c>] pci_enable_msix_range+0x34/0x80
      [<ffffffc0003924ac>] vp_try_to_find_vqs+0xec/0x528
      [<ffffffc000392954>] vp_find_vqs+0x6c/0xa8
      [<ffffffc0003ee2a8>] init_vq+0x120/0x248
      [<ffffffc0003eefb0>] virtblk_probe+0xb0/0x6bc
      [<ffffffc00038fc34>] virtio_dev_probe+0x17c/0x214
      [<ffffffc0003d4a04>] driver_probe_device+0x7c/0x23c
      [<ffffffc0003d4cb0>] __driver_attach+0x98/0xa0
      [<ffffffc0003d2c60>] bus_for_each_dev+0x60/0xb4
      [<ffffffc0003d455c>] driver_attach+0x1c/0x28
      [<ffffffc0003d41b0>] bus_add_driver+0x150/0x208
      [<ffffffc0003d54c0>] driver_register+0x64/0x130
      [<ffffffc00038f9e8>] register_virtio_driver+0x24/0x68
      [<ffffffc00091320c>] init+0x70/0xac
      [<ffffffc0000828f0>] do_one_initcall+0x94/0x1d0
      [<ffffffc0008e9b00>] kernel_init_freeable+0x144/0x1e4
      [<ffffffc00066a434>] kernel_init+0xc/0xd8
      ---[ end trace f9ee562a77cc7bae ]---
      
      The ITS msi_prepare callback having failed, we end up trying to
      free MSIs that have never been allocated. Oddly enough, the kernel
      is pretty upset about it.
      
      It turns out that this behaviour was expected before the MSI domain
      was introduced (and dealt with in arch_teardown_msi_irqs).
      
      The obvious fix is to detect this early enough and bail out.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422299419-6051-1-git-send-email-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • tracing: Add enum_map file to show enums that have been mapped · 9828413d
      Steven Rostedt (Red Hat) authored
      Add an enum_map file in the tracing directory to show which enums have
      been saved for conversion in the print fmt files.
      
      As this requires the enum mapping to be persistent in memory, it is only
      created if the new config option CONFIG_TRACE_ENUM_MAP_FILE is enabled.
      This is for debugging and will increase the persistent memory footprint
      of the kernel.
      
      Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Allow for modules to convert their enums to values · 3673b8e4
      Steven Rostedt (Red Hat) authored
      Update the infrastructure such that modules that declare TRACE_DEFINE_ENUM()
      will have those enums converted into their values in the tracepoint
      print fmt strings.
      
      Link: http://lkml.kernel.org/r/87vbhjp74q.fsf@rustcorp.com.au
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values · 0c564a53
      Steven Rostedt (Red Hat) authored
      Several tracepoints use the helper functions __print_symbolic() or
      __print_flags() and pass in enums that do the mapping between the
      binary data stored and the value to print. This works well for reading
      the ASCII trace files, but when the data is read via userspace tools
      such as perf and trace-cmd, the conversion of the binary value to a
      human string format is lost if an enum is used, as userspace does not
      have access to what the ENUM is.
      
      For example, the tracepoint trace_tlb_flush() has:
      
       __print_symbolic(REC->reason,
          { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
          { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
          { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
          { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
      
      Which maps the enum values to the strings they represent. But perf and
      trace-cmd do not know what value TLB_LOCAL_MM_SHOOTDOWN is, and would
      not be able to map it.
      
      With TRACE_DEFINE_ENUM(), developers can place these in the event header
      files and ftrace will convert the enums to their values.
      
      After adding:
      
       TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
       TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
       TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
       TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
      
       $ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format
      [...]
       __print_symbolic(REC->reason,
          { 0, "flush on task switch" },
          { 1, "remote shootdown" },
          { 2, "local shootdown" },
          { 3, "local mm shootdown" })
      
      The above is what userspace expects to see, and tools do not need to
      be modified to parse them.
      
      Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
      
      Cc: Guilherme Cox <cox@computer.org>
      Cc: Tony Luck <tony.luck@gmail.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
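The conversion of enum names into numbers inside a print format string can be approximated in userspace. This is a rough sketch, not ftrace's implementation: `struct toy_enum_map` and `toy_replace_enums()` are hypothetical, the map is supplied by hand instead of being collected by a macro, and quoted strings are skipped with a simple toggle that ignores escaped quotes.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct toy_enum_map {
	const char *name;	/* enum identifier as written in the format */
	long value;		/* numeric value to substitute */
};

static int is_ident(int c)
{
	return isalnum(c) || c == '_';
}

/*
 * Copy fmt into out, replacing whole identifiers found in maps with their
 * decimal values. Text inside double quotes is copied unchanged.
 * Returns 0 on success, -1 if out is too small.
 */
int toy_replace_enums(const char *fmt, const struct toy_enum_map *maps,
		      size_t nr, char *out, size_t outsz)
{
	size_t used = 0;
	int in_quote = 0;

	while (*fmt) {
		size_t len = 0, i;
		int n;

		if (*fmt == '"')
			in_quote = !in_quote;
		if (in_quote || !is_ident((unsigned char)*fmt)) {
			if (used + 1 >= outsz)
				return -1;
			out[used++] = *fmt++;
			continue;
		}
		/* Measure the whole identifier so only full names match. */
		while (is_ident((unsigned char)fmt[len]))
			len++;
		for (i = 0; i < nr; i++)
			if (strlen(maps[i].name) == len &&
			    strncmp(fmt, maps[i].name, len) == 0)
				break;
		if (i < nr)
			n = snprintf(out + used, outsz - used, "%ld",
				     maps[i].value);
		else
			n = snprintf(out + used, outsz - used, "%.*s",
				     (int)len, fmt);
		if (n < 0 || (size_t)n >= outsz - used)
			return -1;
		used += (size_t)n;
		fmt += len;
	}
	if (used >= outsz)
		return -1;
	out[used] = '\0';
	return 0;
}
```

Matching whole identifiers keeps a longer name that merely begins with a mapped enum from being rewritten, and leaving quoted text alone keeps the human-readable strings intact.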
  6. 07 Apr, 2015 1 commit
  7. 06 Apr, 2015 2 commits
  8. 03 Apr, 2015 13 commits