1. 25 Oct, 2016 1 commit
    • Basic lcd module create, run, and destroy. · e0193fa4
      Charlie Jacobsen authored
      This code is ugly, but it's working.
      
      Tested with basic module, and appears to be working
      properly. I will soon incorporate the patched
      modprobe into the kernel tree, and then this code
      will be usable by everyone.
      
      The ipc code is still unimplemented. The only
      hypercall handled is yield. Also note that other
      exit conditions (e.g. external interrupt) have not
      been fully tested.
      
      Overview:
      -- kernel code calls lcd_create_as_module with
         the module's name
      -- lcd_create_as_module loads the module using
         request_lcd_module (request_lcd_module calls
         the patched modprobe to load the module, and
         the patched modprobe calls back into the lcd
         driver via the ioctl interface to load the
         module)
      -- lcd_create_as_module then finds the loaded
         module, spawns a kernel thread and passes off
         the module to it
      -- the kernel thread initializes the lcd and
         maps the module inside it, then suspends itself
      -- lcd_run_as_module wakes up the kernel thread
         and tells it to run
      -- lcd_delete_as_module stops the kernel thread
         and deletes the module from the host kernel
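
The create/run/delete handshake in the overview can be pictured with a small user-space model. This is only a sketch in Python, not the driver code: the class name, the command strings, and the log messages are hypothetical, and a `queue.Queue` stands in for the kernel thread suspending itself and being woken up.

```python
import queue
import threading


class LcdThread:
    """Toy stand-in for the kernel thread that owns one lcd (hypothetical)."""

    def __init__(self, module_name):
        self.module_name = module_name
        self.log = []
        self._ready = threading.Event()   # set once init and mapping are done
        self._commands = queue.Queue()    # carries "run" / "stop" requests
        self._thread = threading.Thread(target=self._main)

    def _main(self):
        # Initialize the lcd and map the module, then wait for instructions.
        self.log.append("init+map " + self.module_name)
        self._ready.set()
        while True:
            command = self._commands.get()   # blocks: the thread is suspended
            if command == "stop":
                break
            self.log.append("run " + self.module_name)
        self.log.append("stopped")


def create_as_module(module_name):
    """Spawn the thread and return once it has set itself up and parked."""
    lcd = LcdThread(module_name)
    lcd._thread.start()
    lcd._ready.wait()
    return lcd


def run_as_module(lcd):
    """Wake the parked thread and tell it to run."""
    lcd._commands.put("run")


def delete_as_module(lcd):
    """Stop the thread and wait for it to finish."""
    lcd._commands.put("stop")
    lcd._thread.join()
```

Calling `create_as_module`, `run_as_module`, and `delete_as_module` in that order leaves a log with one entry per phase, which mirrors the ordering the overview describes.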
      
      File-by-file details:
      
      arch/x86/include/asm/lcd-domains-arch.h
      arch/x86/lcd-domains/lcd-domains-arch-tests.c
      arch/x86/lcd-domains/lcd-domains-arch.c
      -- lcd was not running in 64-bit mode, and my
         checks had one subtle bug
      -- fixed %cr3 load to properly load vmcs first
      -- fixed set program counter to use guest virtual
         rather than guest physical address
      
      include/linux/sched.h
      -- added struct lcd to task_struct
      
      include/linux/init_task.h
      -- lcd pointer set to null when task_struct is
         initialized
      
      include/linux/module.h
      kernel/module.c
      -- made init_module and delete_module system calls
         callable from kernel code
      -- available in module.h via do_sys_init_module and
         do_sys_delete_module
      -- simply moved the majority of the guts of the
         system calls into a non-system call, exported
         routine
      -- take an extra flag, for_lcd; when set, the init
         code skips over running (and deallocating) the
         module's init code, and the delete code skips
         over running the module exit
      -- system calls from user code set for_lcd = 0; this
         ensures existing code still works
      
      include/linux/kmod.h
      kernel/kmod.c
      kernel/sysctl.c
      -- changed __request_module to __do_request_module; takes
         one extra argument, for_lcd
      -- __request_module   ==>  __do_request_module with for_lcd = 0
      -- request_lcd_module ==>  __do_request_module with for_lcd = 1
      -- call_modprobe conditionally uses lcd_modprobe_path, the path
         to a patched modprobe accessible via sysfs
      
      include/lcd-domains/lcd-domains.h
      -- added lcd status enum; see source code doc
      -- three routines for creating/running/destroying
         lcd's that use modules; see source code doc
      
      include/uapi/linux/lcd-domains.h
      -- added interface defns for patched modprobe to call into
         lcd driver for module init; lcd driver loads
         module (via slightly refactored module.c code) on behalf
         of modprobe
      
      virt/lcd-domains/lcd-domains.c
      -- implementation of routines for modules inside lcd's
      -- implementation of module init / delete for lcd's
         (uses patched module.c code)
      
      virt/lcd-domains/Kconfig
      virt/lcd-domains/Makefile
      virt/lcd-domains/lcd-module-load-test.c
      virt/lcd-domains/lcd-tests.c
      -- added test module for lcd module code
      -- test runs automatically when lcd module is inserted
  2. 26 Aug, 2016 1 commit
    • sysctl: handle error writing UINT_MAX to u32 fields · e7d316a0
      Subash Abhinov Kasiviswanathan authored
      We have scripts which write to certain fields on 3.18 kernels but this
      seems to be failing on 4.4 kernels.  An entry which we write to here is
      xfrm_aevent_rseqth which is u32.
      
        echo 4294967295  > /proc/sys/net/core/xfrm_aevent_rseqth
      
      Commit 230633d1 ("kernel/sysctl.c: detect overflows when converting
      to int") prevented writing to sysctl entries when integer overflow
      occurs.  However, this does not apply to unsigned integers.
      
      Heinrich suggested that we introduce a new option to handle 64 bit
      limits and set min as 0 and max as UINT_MAX.  This might not work as it
      leads to issues similar to __do_proc_doulongvec_minmax.  Alternatively,
      we would need to change the datatype of the entry to 64 bit.
      
        static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
        {
            i = (unsigned long *) data;   // This cast reads beyond the size of data (u32)
            vleft = table->maxlen / sizeof(unsigned long); // vleft is 0 because maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64
      
      Introduce a new proc handler proc_douintvec.  Individual proc entries
      will need to be updated to use the new handler.
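
The range problem described above can be illustrated outside the kernel. The following is a minimal Python sketch, not the proc handler itself: `parse_checked` and `accepts` are hypothetical helpers, and the sizes in the last lines assume x86_64, as the code fragment above does.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1   # range of a 32-bit signed int
UINT_MAX = 2**32 - 1                   # largest value of a u32


def parse_checked(text, low, high):
    """Convert text to an integer, rejecting values outside [low, high]."""
    value = int(text)
    if not low <= value <= high:
        raise OverflowError(f"{value} outside [{low}, {high}]")
    return value


def accepts(text, low, high):
    """Return True if parse_checked() would accept the text."""
    try:
        parse_checked(text, low, high)
        return True
    except OverflowError:
        return False


# The value from the failing script is a valid u32 but overflows an int.
print(accepts("4294967295", INT_MIN, INT_MAX))   # False
print(accepts("4294967295", 0, UINT_MAX))        # True

# Why the unsigned long handler does not help for a u32 field on x86_64:
# maxlen is sizeof(u32) == 4 and sizeof(unsigned long) == 8, so vleft is 0.
print(4 // 8)                                    # 0
```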
      
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: 230633d1 ("kernel/sysctl.c:detect overflows when converting to int")
      Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Aug, 2016 1 commit
    • printk: add kernel parameter to control writes to /dev/kmsg · 750afe7b
      Borislav Petkov authored
      Add a "printk.devkmsg" kernel command line parameter which controls how
      userspace writes into /dev/kmsg.  It has three options:
      
       * ratelimit - ratelimit logging from userspace.
       * on  - unlimited logging from userspace
       * off - logging from userspace gets ignored
      
      The default setting is to ratelimit the messages written to it.
      
      This changes the kernel default setting from "on" to "ratelimit",
      because we want to keep userspace from spamming /dev/kmsg beyond sane
      levels.  This matters most when a small kernel log buffer wraps
      around and messages get lost.  The ratelimit setting gives kernel
      messages a somewhat higher chance of surviving all the spamming.
      
      It additionally does not limit logging to /dev/kmsg while the system is
      booting if we haven't disabled it on the command line.
      
      Furthermore, we can control the logging from a lower priority sysctl
      interface - kernel.printk_devkmsg.
      
      That interface will succeed only if printk.devkmsg *hasn't* been
      supplied on the command line.  If it has, then printk.devkmsg is a
      one-time setting which remains for the duration of the system lifetime.
      This "locking" of the setting is to prevent userspace from changing the
      logging on us through sysctl(2).
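
The precedence between the command-line parameter and the sysctl can be summarized in a few lines. This is a toy Python model of the behaviour described above, not kernel code: `DevkmsgControl` and `sysctl_write` are hypothetical names, and a refused write is reported as `False` rather than as a specific error code.

```python
VALID_SETTINGS = ("ratelimit", "on", "off")


class DevkmsgControl:
    """Toy model of the printk.devkmsg setting (not kernel code)."""

    def __init__(self, cmdline_value=None):
        if cmdline_value is not None and cmdline_value not in VALID_SETTINGS:
            raise ValueError("unknown setting: " + cmdline_value)
        # Default is "ratelimit"; a command-line value locks the setting.
        self.value = cmdline_value if cmdline_value is not None else "ratelimit"
        self.locked = cmdline_value is not None

    def sysctl_write(self, new_value):
        """Return True if the write took effect, False if it was refused."""
        if new_value not in VALID_SETTINGS or self.locked:
            return False
        self.value = new_value
        return True
```

With no command-line value, `DevkmsgControl().sysctl_write("off")` succeeds; with `DevkmsgControl("on")`, the same write is refused and the setting stays at "on".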
      
      This patch is based on previous patches from Linus and Steven.
      
      [bp@suse.de: fixes]
        Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
      Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Franck Bui <fbui@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 28 Jul, 2016 1 commit
  5. 15 Jun, 2016 1 commit
    • rcu: sysctl: Panic on RCU Stall · 088e9d25
      Daniel Bristot de Oliveira authored
      It is not always easy to determine the cause of an RCU stall just by
      analysing the RCU stall messages, mainly when the problem is caused
      by the indirect starvation of rcu threads. For example, when preempt_rcu
      is not awakened due to the starvation of a timer softirq.
      
      We have been hard coding panic() in the RCU stall functions for
      some time while testing the kernel-rt. But this is not possible in
      some scenarios, like when supporting customers.
      
      This patch implements the sysctl kernel.panic_on_rcu_stall. If
      set to 1, the system will panic() when an RCU stall takes place,
      enabling the capture of a vmcore. The vmcore provides a way to analyze
      all kernel/task states, helping to pinpoint the culprit and the
      solution for the stall.
      
      The kernel.panic_on_rcu_stall sysctl is disabled by default.
      
      Changes from v1:
      - Fixed a typo in the git log
      - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
      - Fixed the CONFIG_TINY_RCU compilation issue
      - The var sysctl_panic_on_rcu_stall is now __read_mostly
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  6. 19 May, 2016 1 commit
    • mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Hugh Dickins authored
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      sometimes go negative.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 16 May, 2016 2 commits
    • perf core: Separate accounting of contexts and real addresses in a stack trace · c85b0334
      Arnaldo Carvalho de Melo authored
      The perf_sample->ip_callchain->nr value counts every entry in the
      ip_callchain->ip[] array, both real addresses and
      PERF_CONTEXT_{KERNEL,USER,etc} markers.  Users, however, expect the limit
      in the kernel.perf_event_max_stack sysctl, or in the upcoming per-event
      perf_event_attr.sample_max_stack knob, to apply only to the IP addresses
      in the stack trace.
      
      So allocate a bunch of extra entries for contexts, and do the accounting
      via perf_callchain_entry_ctx struct members.
      
      A new sysctl, kernel.perf_event_max_contexts_per_stack is also
      introduced for investigating possible bugs in the callchain
      implementation by some arch.
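
The idea of counting context markers and real addresses against separate limits can be sketched as follows. This is a simplified Python model, not the kernel structure: `CallchainCtx`, its field names, and the limit values used in the usage note are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CallchainCtx:
    """Toy model of per-sample callchain accounting (field names assumed)."""
    max_stack: int                 # cap on real addresses
    max_contexts: int              # cap on context markers
    entries: list = field(default_factory=list)
    nr: int = 0                    # real addresses stored so far
    contexts: int = 0              # context markers stored so far

    def capacity(self):
        """Buffer slots needed: addresses plus the extra entries for contexts."""
        return self.max_stack + self.max_contexts


def store_context(ctx, marker):
    """Store a context marker; it does not count against max_stack."""
    if ctx.contexts >= ctx.max_contexts:
        return False
    ctx.entries.append(marker)
    ctx.contexts += 1
    return True


def store_ip(ctx, ip):
    """Store a real address; only these count against max_stack."""
    if ctx.nr >= ctx.max_stack:
        return False
    ctx.entries.append(ip)
    ctx.nr += 1
    return True
```

With `max_stack=2`, a trace made of a kernel marker, one address, a user marker, and a second address fits completely. A third address is rejected, even though four entries were stored in total.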
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf core: Generalize max_stack sysctl handler · a831100a
      Arnaldo Carvalho de Melo authored
      So that it can be used for other stack related knobs, such as the
      upcoming one to tweak the max number of contexts per stack sample.
      
      In all those cases we can only change the value if there are no perf
      sessions collecting stacks, so they need to grab that mutex, etc.
      
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-8t3fk94wuzp8m2z1n4gc0s17@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  8. 27 Apr, 2016 1 commit
    • perf core: Allow setting up max frame stack depth via sysctl · c5dfd78e
      Arnaldo Carvalho de Melo authored
      The default remains 127, which is good for most cases, and not even hit
      most of the time, but then for some cases, as reported by Brendan, 1024+
      deep frames are appearing on the radar for things like groovy, ruby.
      
      And in some workloads putting a _lower_ cap on this may make sense. One
      that is per event still needs to be put in place tho.
      
      The new file is:
      
        # cat /proc/sys/kernel/perf_event_max_stack
        127
      
      Changing it:
      
        # echo 256 > /proc/sys/kernel/perf_event_max_stack
        # cat /proc/sys/kernel/perf_event_max_stack
        256
      
      But as soon as there is some event using callchains we get:
      
        # echo 512 > /proc/sys/kernel/perf_event_max_stack
        -bash: echo: write error: Device or resource busy
        #
      
      This works because we only allocate the callchain percpu data structures
      when there is a user, which makes changing the max easy: it's just a
      matter of having no callchain users at that point.
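
The "busy while in use" rule shown in the transcript above can be modelled in a few lines. This is a hedged Python sketch, not the sysctl handler: `CallchainConfig` and its method names are hypothetical, and the user count stands in for whatever bookkeeping the kernel keeps for events that use callchains.

```python
import errno


class CallchainConfig:
    """Toy model of a limit that may only change while nobody uses it."""

    def __init__(self, max_stack=127):
        self.max_stack = max_stack
        self.nr_users = 0          # events currently using callchains

    def get_buffers(self):
        """An event starts using callchains (buffers sized from max_stack)."""
        self.nr_users += 1

    def put_buffers(self):
        """An event stops using callchains."""
        self.nr_users -= 1

    def write_max_stack(self, value):
        """Return 0 on success or -EBUSY while any user exists."""
        if self.nr_users > 0:
            return -errno.EBUSY
        self.max_stack = value
        return 0
```

Writing 256 with no users succeeds. Once `get_buffers()` has been called, writing 512 returns `-errno.EBUSY` and the limit stays at 256, matching the "Device or resource busy" error in the shell session.
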
      Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David Ahern <dsahern@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: He Kuang <hekuang@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Milian Wolff <milian.wolff@kdab.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Zefan Li <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  9. 17 Mar, 2016 1 commit
    • mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner authored
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
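
The new step size can be checked with a little arithmetic. The sketch below is a simplified Python model of the formula as the text describes it, not the kernel code: the function and parameter names are hypothetical, and the 64 MiB reserve is an assumed value, chosen so that the old 25% rule yields the 16M step quoted above.

```python
def watermark_step_mib(memory_mib, reserve_mib, scale_per_10000=10):
    """Distance between min/low and low/high watermarks, in MiB.

    scale_per_10000=10 corresponds to the 0.1% of memory in the text.
    """
    proportional = memory_mib * scale_per_10000 / 10000
    floor = reserve_mib / 4          # the old formula: 25% of the reserve
    return max(proportional, floor)


# 140G machine with the assumed 64 MiB reserve.
print(round(watermark_step_mib(140 * 1024, 64)))   # 143
# Small machine: the 25% floor still applies.
print(watermark_step_mib(512, 64))                  # 16.0
```
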
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 09 Feb, 2016 1 commit
    • sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Mel Gorman authored
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken from a 48-core 2-socket
      machine with Xeon(R) E5-2670 v3 cpus, although the patch was also tested on a
      single-socket 8-core machine with an Intel i7-3770 processor.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here, UDP_STREAM showed nothing interesting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat.  The
      gains on the 8-core machine were slightly different.
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In contrast, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result will depend on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses and while the variance data is not included,
      it's close to the noise. The UMA machine did not show anything particularly
      different.
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling to
      alleviate the problem. For tracepoints, there is a simple warning as it's
      not safe to activate schedstats in the context when it's known the tracepoint
      may be wanted but is unavailable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 20 Jan, 2016 1 commit
  12. 19 Jan, 2016 1 commit
    • pipe: limit the per-user amount of pages allocated in pipes · 759c0114
      Willy Tarreau authored
      On not-so-small systems, it is possible for a single process to cause an
      OOM condition by filling large pipes with data that are never read. A
      typical process filling 4000 pipes with 1 MB of data will use 4 GB of
      memory. On small systems it may be tricky to set the pipe max size to
      prevent this from happening.
      
      This patch makes it possible to enforce a per-user soft limit above
      which new pipes will be limited to a single page, effectively limiting
      them to 4 kB each, as well as a hard limit above which no new pipes may
      be created for this user. This has the effect of protecting the system
      against memory abuse without hurting other users, and still allowing
      pipes to work correctly though with less data at once.
      
      The limits are controlled by two new sysctls: pipe-user-pages-soft and
      pipe-user-pages-hard. Both may be disabled by setting them to zero. The
      default soft limit allows the default number of FDs per process (1024)
      to create pipes of the default size (64kB), thus reaching a limit of 64MB
      before starting to create only smaller pipes. With 256 processes limited
      to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
      1084 MB of memory allocated for a user. The hard limit is disabled by
      default to avoid breaking existing applications that make intensive use
      of pipes (eg: for splicing).
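
The 1084 MB figure above can be reproduced with a short simulation. This is a simplified Python model of the soft/hard limit behaviour as the text describes it, not the kernel implementation: the helper names are hypothetical, sizes are tracked in kB rather than pages, and, like the text, it counts one pipe per file descriptor.

```python
PAGE_KB = 4                    # page size assumed in the text
DEFAULT_PIPE_KB = 64           # default pipe size
FDS_PER_PROCESS = 1024         # default per-process FD limit


def new_pipe_kb(user_kb, soft_kb, hard_kb):
    """Size granted to a new pipe, or None if creation is refused.

    A limit of 0 means that limit is disabled.
    """
    if hard_kb and user_kb >= hard_kb:
        return None
    if soft_kb and user_kb >= soft_kb:
        return PAGE_KB
    return DEFAULT_PIPE_KB


def total_kb(num_pipes, soft_kb, hard_kb=0):
    """Memory one user ends up holding after trying to create num_pipes."""
    used = 0
    for _ in range(num_pipes):
        size = new_pipe_kb(used, soft_kb, hard_kb)
        if size is None:
            break
        used += size
    return used


soft_kb = FDS_PER_PROCESS * DEFAULT_PIPE_KB               # 64 MB default soft limit
print(total_kb(256 * FDS_PER_PROCESS, soft_kb) // 1024)   # 1084
```

The first 1024 pipes get the full 64 kB each; every later pipe is limited to a single 4 kB page, which is where the second term of the sum in the text comes from.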
      
      Reported-by: socketpair@gmail.com
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  13. 14 Jan, 2016 1 commit
    • mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman authored
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  With the
      hard-coded 8 bits, the mmap ASLR is expected to be defeated, on
      average, after just over 10 minutes (128 tries at 5 seconds
      apiece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
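
      The arithmetic above can be reproduced with a short sketch.  It assumes
      the attacker needs, on average, half of the 2^bits possible offsets, and
      that every guess costs one respawn interval.  The function names are
      illustrative only and are not part of the patch.

```c
#include <stdint.h>

/* Average number of guesses needed to hit one value out of 2^bits equally
 * likely offsets: half of the search space. Valid for bits in 0..64. */
uint64_t expected_tries(unsigned int bits)
{
    return bits == 0 ? 1 : (UINT64_C(1) << (bits - 1));
}

/* Average attack time in seconds when each guess costs one respawn. */
uint64_t expected_seconds(unsigned int bits, uint64_t respawn_seconds)
{
    return expected_tries(bits) * respawn_seconds;
}
```

      With 8 bits and a 5 second respawn this gives 128 tries and 640 seconds;
      with 16 bits it gives 32768 tries and 163840 seconds, a little over 45
      hours.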
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When deciding whether or not to change the default value, a system
      developer should consider that the mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
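
      As a rough illustration of what the number of bits controls, the sketch
      below masks a random word down to the configured number of bits and
      scales it to whole pages.  This is an assumption about how an
      architecture might derive the offset, not a copy of any arch code; the
      4 KiB page size and both function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Keep the low `bits` bits of a random word and scale the result to whole
 * pages. `bits` is expected to be small (for example 8 to 32). */
uint64_t mmap_rnd_offset(uint64_t random_word, unsigned int bits)
{
    uint64_t mask = (bits >= 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);

    return (random_word & mask) << SKETCH_PAGE_SHIFT;
}

/* Largest displacement that `bits` bits of entropy can produce. */
uint64_t mmap_rnd_max_offset(unsigned int bits)
{
    return mmap_rnd_offset(UINT64_MAX, bits);
}
```

      Under these assumptions 8 bits displace mmap_base by just under 1 MiB
      and 16 bits by just under 256 MiB, which is the space that can no longer
      be counted on for one very large mapping.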
      Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d07e2259
  14. 04 Jan, 2016 1 commit
  15. 05 Nov, 2015 2 commits
  16. 12 Oct, 2015 1 commit
    • Alexei Starovoitov's avatar
      bpf: enable non-root eBPF programs · 1be7f75d
      Alexei Starovoitov authored
      In order to let unprivileged users load and execute eBPF programs,
      teach the verifier to prevent pointer leaks.
      The verifier will prevent:
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc.,
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set to true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged processes,
      and the toggle cannot be set back to false.
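
      The one-way nature of the toggle can be pictured as a latch.  The sketch
      below is hypothetical: the function name and the choice of error codes
      are assumptions made for illustration and do not describe the actual
      sysctl handler.

```c
#include <errno.h>

/* Hypothetical latch: 0 -> 1 is allowed, 1 -> 0 is refused.
 * Returns 0 on success or a negative errno value. */
int set_unprivileged_bpf_disabled(int *state, int new_value)
{
    if (new_value != 0 && new_value != 1)
        return -EINVAL;
    if (*state == 1 && new_value == 0)
        return -EPERM;  /* once true, it cannot be set back to false */
    *state = new_value;
    return 0;
}
```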
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1be7f75d
  17. 10 Sep, 2015 2 commits
    • Ilya Dryomov's avatar
      sysctl: fix int -> unsigned long assignments in INT_MIN case · 9a5bc726
      Ilya Dryomov authored
      The following
      
          if (val < 0)
              *lvalp = (unsigned long)-val;
      
      is incorrect because the compiler is free to assume -val to be positive
      and use a sign-extend instruction for extending the bit pattern.  This is
      a problem if val == INT_MIN:
      
          # echo -2147483648 >/proc/sys/dev/scsi/logging_level
          # cat /proc/sys/dev/scsi/logging_level
          -18446744071562067968
      
      Cast to unsigned long before negation - that way we first sign-extend and
      then negate an unsigned, which is well defined.  With this:
      
          # cat /proc/sys/dev/scsi/logging_level
          -2147483648
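
      A standalone sketch of the cast-then-negate order described above.  The
      helper name is made up for this example; the point is only that the
      negation happens in unsigned arithmetic, which wraps and is well defined
      even for INT_MIN.

```c
#include <limits.h>

/* Magnitude of a negative int as an unsigned long, safe for INT_MIN.
 * The cast to unsigned long happens first (sign-extending on 64-bit),
 * then the subtraction from zero is done in unsigned arithmetic. */
unsigned long magnitude_of_negative(int val)
{
    return 0UL - (unsigned long)val;
}
```

      Writing `(unsigned long)-val` instead negates in signed int first, which
      overflows when val is INT_MIN.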
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Cc: Mikulas Patocka <mikulas@twibright.com>
      Cc: Robert Xiao <nneonneo@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a5bc726
    • Dave Young's avatar
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young authored
      There are two kexec load syscalls, kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.  In
      this patch I split the kexec_load syscall code out into kernel/kexec.c.
      
      Also add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o, who wants the kexec kernel
      signature to be checked when CONFIG_KEXEC_VERIFY_SIG is enabled.  But
      kexec-tools, which uses the kexec_load syscall, can bypass the check.
      
      Vivek Goyal proposed to create a common kconfig option so users can compile
      in only one syscall for loading a kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because general code needs CONFIG_KEXEC_CORE, I updated all the
      architecture Kconfig files with a new option KEXEC_CORE, and let KEXEC
      select KEXEC_CORE in the arch Kconfig.  Also updated general kernel code
      related to the kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2965faa5
  18. 01 Jul, 2015 1 commit
  19. 24 Jun, 2015 1 commit
    • Chris Metcalf's avatar
      watchdog: add watchdog_cpumask sysctl to assist nohz · fe4ba3c3
      Chris Metcalf authored
      Change the default behavior of watchdog so it only runs on the
      housekeeping cores when nohz_full is enabled at build and boot time.
      Allow modifying the set of cores the watchdog is currently running on
      with a new kernel.watchdog_cpumask sysctl.
      
      In the current system, the watchdog subsystem runs a periodic timer that
      schedules the watchdog kthread to run.  However, nohz_full cores are
      designed to allow userspace application code running on those cores to
      have 100% access to the CPU.  So the watchdog system prevents the
      nohz_full application code from being able to run the way it wants to,
      thus the motivation to suppress the watchdog on nohz_full cores, which
      this patchset provides by default.
      
      However, if we disable the watchdog globally, then the housekeeping
      cores can't benefit from the watchdog functionality.  So we allow
      disabling it only on some cores.  See Documentation/lockup-watchdogs.txt
      for more information.
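
      The default described above amounts to "all cores except the nohz_full
      ones".  A minimal sketch with a plain 64-bit mask standing in for a
      cpumask follows; the type and function names are hypothetical and the
      64-CPU limit is an assumption of the sketch only.

```c
#include <stdint.h>

/* One bit per CPU; assumption: at most 64 CPUs in this sketch. */
typedef uint64_t cpu_mask_t;

/* Default watchdog mask: every possible CPU that is not a nohz_full CPU,
 * i.e. the housekeeping cores. With an empty nohz_full mask this is simply
 * all possible CPUs. */
cpu_mask_t default_watchdog_mask(cpu_mask_t possible, cpu_mask_t nohz_full)
{
    return possible & ~nohz_full;
}

/* Does the watchdog run on the given CPU under this mask? */
int watchdog_runs_on(cpu_mask_t watchdog_mask, unsigned int cpu)
{
    return cpu < 64 && ((watchdog_mask >> cpu) & 1u);
}
```

      Writing a different mask through the sysctl would then replace the
      value returned by default_watchdog_mask().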
      
      [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
      Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe4ba3c3
  20. 19 Jun, 2015 1 commit
    • Thomas Gleixner's avatar
      timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner authored
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I considered using a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ,
      migration is enabled, unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
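
      The rule in the paragraph above can be condensed into a cached flag that
      is recomputed whenever either input changes, so that timer insertion
      only reads one field.  The struct and function below are a hypothetical
      sketch of that idea, not the kernel's data structures.

```c
#include <stdbool.h>

/* Stand-in for the per cpu timer base; field names are assumptions. */
struct timer_base_sketch {
    bool nohz_active;        /* has the kernel switched to NOHZ? */
    bool migration_enabled;  /* cached decision, read on timer insertion */
};

/* Recompute the cached flag: migration needs both NOHZ being active and
 * the user-visible sysctl being left enabled. */
void update_migration(struct timer_base_sketch *base, bool sysctl_enabled)
{
    base->migration_enabled = base->nohz_active && sysctl_enabled;
}
```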
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc7a34b8
  21. 17 Apr, 2015 2 commits
  22. 15 Apr, 2015 1 commit
    • Eric B Munson's avatar
      mm: allow compaction of unevictable pages · 5bbe3547
      Eric B Munson authored
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bbe3547
  23. 14 Apr, 2015 1 commit
    • Ulrich Obergfell's avatar
      watchdog: enable the new user interface of the watchdog mechanism · 195daf66
      Ulrich Obergfell authored
      With the current user interface of the watchdog mechanism it is only
      possible to disable or enable both lockup detectors at the same time.
      This series introduces new kernel parameters and changes the semantics of
      some existing kernel parameters, so that the hard lockup detector and the
      soft lockup detector can be disabled or enabled individually.  With this
      series applied, the user interface is as follows.
      
      - parameters in /proc/sys/kernel
      
        . soft_watchdog
          This is a new parameter to control and examine the run state of
          the soft lockup detector.
      
        . nmi_watchdog
          The semantics of this parameter have changed. It can now be used
          to control and examine the run state of the hard lockup detector.
      
        . watchdog
          This parameter is still available to control the run state of both
          lockup detectors at the same time. If this parameter is examined,
          it shows the logical OR of soft_watchdog and nmi_watchdog.
      
        . watchdog_thresh
          The semantics of this parameter are not affected by the patch.
      
      - kernel command line parameters
      
        . nosoftlockup
          The semantics of this parameter have changed. It can now be used
          to disable the soft lockup detector at boot time.
      
        . nmi_watchdog=0 or nmi_watchdog=1
          Disable or enable the hard lockup detector at boot time. The patch
          introduces '=1' as a new option.
      
        . nowatchdog
          The semantics of this parameter are not affected by the patch. It
          is still available to disable both lockup detectors at boot time.
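
      The relationship between the three /proc/sys/kernel parameters listed
      above can be sketched with two flag bits.  The bit values and function
      names are hypothetical and chosen only for this illustration.

```c
#define SOFT_WATCHDOG_BIT 0x1u  /* hypothetical flag values */
#define NMI_WATCHDOG_BIT  0x2u

/* Value shown when 'watchdog' is examined: the logical OR of the soft and
 * hard lockup detector states. */
int watchdog_read(unsigned int enabled)
{
    return (enabled & (SOFT_WATCHDOG_BIT | NMI_WATCHDOG_BIT)) != 0;
}

/* Writing 'watchdog' switches both detectors on or off at the same time. */
unsigned int watchdog_write(int on)
{
    return on ? (SOFT_WATCHDOG_BIT | NMI_WATCHDOG_BIT) : 0u;
}

/* Writing soft_watchdog or nmi_watchdog changes only that detector. */
unsigned int detector_write(unsigned int enabled, unsigned int bit, int on)
{
    return on ? (enabled | bit) : (enabled & ~bit);
}
```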
      
      Also, remove the proc_dowatchdog() function which is no longer needed.
      
      [dzickus@redhat.com: wrote changelog]
      [dzickus@redhat.com: update documentation for kernel params and sysctl]
      Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      195daf66
  24. 25 Mar, 2015 1 commit
  25. 17 Mar, 2015 1 commit
  26. 10 Feb, 2015 1 commit
  27. 15 Dec, 2014 1 commit
  28. 10 Dec, 2014 1 commit
    • Prarit Bhargava's avatar
      kernel: add panic_on_warn · 9e3961a0
      Prarit Bhargava authored
      There have been several times where I have had to rebuild a kernel to
      cause a panic when hitting a WARN() in the code in order to get a crash
      dump from a system.  Sometimes this is easy to do, other times (such as
      in the case of a remote admin) it is not trivial to send new images to
      the user.
      
      A much easier method would be a switch to change the WARN() over to a
      panic.  This makes debugging easier in that I can now test the actual
      image the WARN() was seen on and I do not have to engage in remote
      debugging.
      
      This patch adds a panic_on_warn kernel parameter and
      /proc/sys/kernel/panic_on_warn, either of which makes the
      warn_slowpath_common() path call panic().  The function will still print out the
      location of the warning.
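
      The ordering described above (print the location first, then panic if
      the switch is set) can be mimicked in user space.  The sketch below is
      purely illustrative: it formats text into a buffer rather than calling
      any kernel function, and the function name and simplified warning
      format are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Writes what the warning path would emit into `out` and returns true if
 * the sketch would go on to panic. The location line is always produced;
 * the panic line is appended only when panic_on_warn is set. */
bool warn_sketch(char *out, size_t outlen, const char *file, int line,
                 bool panic_on_warn)
{
    if (outlen == 0)
        return panic_on_warn;

    snprintf(out, outlen, "WARNING: at %s:%d\n", file, line);
    if (panic_on_warn) {
        size_t used = strlen(out);

        snprintf(out + used, outlen - used,
                 "Kernel panic - not syncing: panic_on_warn set ...\n");
    }
    return panic_on_warn;
}
```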
      
      An example of the panic_on_warn output:
      
      The first line below is from the WARN_ON() to output the WARN_ON()'s
      location.  After that the panic() output is displayed.
      
          WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
          Kernel panic - not syncing: panic_on_warn set ...
      
          CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
          Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
           0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
           0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
           ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
          Call Trace:
           [<ffffffff81665190>] dump_stack+0x46/0x58
           [<ffffffff8165e2ec>] panic+0xd0/0x204
           [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
           [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
           [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81002144>] do_one_initcall+0xd4/0x210
           [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
           [<ffffffff810f8889>] load_module+0x16a9/0x1b30
           [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
           [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
           [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
           [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
      
      Successfully tested by me.
      
      hpa said: There is another very valid use for this: many operators would
      rather have a machine shut down than be potentially compromised either
      functionally or security-wise.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e3961a0
  29. 28 Oct, 2014 1 commit
    • Kirill Tkhai's avatar
      sched/fair: Fix division by zero sysctl_numa_balancing_scan_size · 64192658
      Kirill Tkhai authored
      File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
      
      This bash command reproduces the problem:
      
      $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
      	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
      
      	divide error: 0000 [#1] SMP
      	Modules linked in:
      	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
      	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
      	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
      	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
      	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
      	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
      	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
      	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
      	Stack:
      	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
      	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
      	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
      	Call Trace:
      	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
      	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
      	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
      	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
      	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
      	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
      	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
      	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
      	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
      	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
      	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
      	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
      	 [<ffffffff8150d322>] page_fault+0x22/0x30
      	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP <ffff880037a6bce0>
      	---[ end trace 9a826d16936c04de ]---
      
      Also fix race in task_scan_min (it depends on compiler behaviour).
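
      Both halves of the fix can be sketched in user space: refuse a zero on
      the write side, and on the read side take a single snapshot of the
      shared value so that the comparison and the division use the same
      number.  Everything below is an assumption made for illustration: the
      names, the 2560 constant, and the use of volatile as a stand-in for a
      read-once access (real concurrent code would need proper atomics).

```c
#include <errno.h>

#define MAX_SCAN_WINDOW_MB 2560u  /* assumption for this sketch */

/* Stand-in for the sysctl variable, which another task may rewrite. */
static volatile unsigned int scan_size_mb = 256;

/* Write side: reject 0 so that readers can never divide by zero. */
int scan_size_write(unsigned int new_mb)
{
    if (new_mb < 1)
        return -EINVAL;
    scan_size_mb = new_mb;
    return 0;
}

/* Read side: read the shared value once into a local, then test and
 * divide using that same copy. */
unsigned int scan_windows(void)
{
    unsigned int size = scan_size_mb;  /* single read of the shared value */
    unsigned int windows = 1;

    if (size < MAX_SCAN_WINDOW_MB)
        windows = MAX_SCAN_WINDOW_MB / size;
    return windows;
}
```

      Without the local copy, a compiler may reload the variable between the
      comparison and the division, which is the compiler-dependent race
      mentioned above.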
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64192658
  30. 09 Oct, 2014 1 commit
  31. 16 Sep, 2014 1 commit
  32. 06 Aug, 2014 1 commit
  33. 23 Jun, 2014 2 commits
    • Aaron Tomlin's avatar
      kernel/watchdog.c: print traces for all cpus on lockup detection · ed235875
      Aaron Tomlin authored
      A 'softlockup' is defined as a bug that causes the kernel to loop in
      kernel mode for more than a predefined period of time, without giving
      other tasks a chance to run.
      
      Currently, upon detection of this condition by the per-cpu watchdog
      task, debug information (including a stack trace) is sent to the system
      log.
      
      On some occasions, we have observed that the "victim" rather than the
      actual "culprit" (i.e.  the owner/holder of the contended resource) is
      reported to the user.  Often this information has proven to be
      insufficient to assist debugging efforts.
      
      To avoid loss of useful debug information, for architectures which
      support NMI, this patch makes it possible to improve soft lockup
      reporting.  This is accomplished by issuing an NMI to each cpu to obtain
      a stack trace.
      
      If NMI is not supported we just revert back to the old method.  A sysctl
      and a boot-time parameter are available to toggle this feature.
      
      [dzickus@redhat.com: add CONFIG_SMP in certain areas]
      [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
      [mq@suse.cz: fix warning]
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed235875
    • David Rientjes's avatar
      mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      David Rientjes authored
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
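
      The accepted values can be summarised in a small validation sketch: 0
      restores the default, 1 to 7 is rejected with EINVAL, and 8 or more is
      stored.  The function names and the pagelist_high() helper are
      hypothetical; only the value ranges come from the description above.

```c
#include <errno.h>

#define MIN_PERCPU_PAGELIST_FRACTION 8

/* Stores the new value and returns 0, or returns -EINVAL for values in
 * [1, 7] (and for negative values). 0 means "use the kernel default". */
int pagelist_fraction_write(int *fraction, int new_value)
{
    if (new_value < 0)
        return -EINVAL;
    if (new_value != 0 && new_value < MIN_PERCPU_PAGELIST_FRACTION)
        return -EINVAL;
    *fraction = new_value;
    return 0;
}

/* High mark for the per-cpu list: a fraction of the managed pages, or the
 * supplied default when the fraction is 0, so 0 is never used as divisor. */
unsigned long pagelist_high(unsigned long managed_pages, int fraction,
                            unsigned long default_high)
{
    return fraction ? managed_pages / (unsigned long)fraction : default_high;
}
```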
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cd2b0a3
  34. 06 Jun, 2014 2 commits
    • Joe Perches's avatar
      sysctl: convert use of typedef ctl_table to struct ctl_table · 6f8fd1d7
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f8fd1d7
    • Kees Cook's avatar
      sysctl: allow for strict write position handling · f4aacea2
      Kees Cook authored
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl control the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior is unlike regular POSIX files enough that doing audits
      of software that interact with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
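
      The difference between the legacy and the strict handling of string
      writes can be sketched as follows.  This is a user-space illustration
      under stated assumptions, not the kernel handler: the function name and
      signature are made up, the warning of the default mode is not modelled,
      and writes starting beyond the current end of the string are simply
      dropped in strict mode.

```c
#include <stddef.h>
#include <string.h>

/* Applies one write of `len` bytes at file position `pos` to a
 * NUL-terminated sysctl string buffer of capacity `maxlen`.
 * strict == 0: legacy, every write restarts at the beginning.
 * strict != 0: the file position is respected, so writes append. */
void sysctl_string_write(char *buf, size_t maxlen, size_t pos,
                         const char *data, size_t len, int strict)
{
    size_t room;

    if (maxlen == 0)
        return;
    room = maxlen - 1;               /* keep one byte for the NUL */

    if (!strict)
        pos = 0;
    else if (pos > strlen(buf))
        return;                      /* would leave a hole: ignore */
    if (pos > room)
        return;
    if (len > room - pos)
        len = room - pos;            /* cap the string at maxlen */

    memcpy(buf + pos, data, len);
    buf[pos + len] = '\0';
}
```

      In legacy mode a long write followed by a short one leaves only the
      short string; in strict mode the first write fills the buffer up to
      maxlen and the second one is ignored, while two short writes append.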
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4aacea2