1. 23 Jun, 2014 1 commit
    • David Rientjes's avatar
      mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      David Rientjes authored
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  2. 06 Jun, 2014 4 commits
    • Joe Perches's avatar
      sysctl: convert use of typedef ctl_table to struct ctl_table · 6f8fd1d7
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Kees Cook's avatar
      sysctl: allow for strict write position handling · f4aacea2
      Kees Cook authored
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
        $ cat /proc/sys/kernel/modprobe
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      The old behavior is unlike regular POSIX files enough that doing audits
      of software that interact with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Kees Cook's avatar
      sysctl: refactor sysctl string writing logic · 2ca9bb45
      Kees Cook authored
      Consolidate buffer length checking with new-line/end-of-line checking.
      Additionally, instead of reading user memory twice, just do the
      assignment during the loop.
      This change doesn't affect the potential races here.  It was already
      possible to read a sysctl that was in the middle of a write.  In both
      cases, the string will always be NULL terminated.  The pre-existing race
      remains a problem to be solved.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Kees Cook's avatar
      sysctl: clean up char buffer arguments · f8808300
      Kees Cook authored
      When writing to a sysctl string, each write, regardless of VFS position,
      began writing the string from the start.  This meant the contents of the
      last write to the sysctl controlled the string contents instead of the
      This misbehavior was featured in an exploit against Chrome OS.  While
      it's not in itself a vulnerability, it's a weirdness that isn't on the
      mind of most auditors: "This filter looks correct, the first line
      written would not be meaningful to sysctl" doesn't apply here, since the
      size of the write and the contents of the final write are what matter
      when writing to sysctls.
      This adds the sysctl kernel.sysctl_writes_strict to control the write
      behavior.  The default (0) reports when VFS position is non-0 on a
      write, but retains legacy behavior, -1 disables the warning, and 1
      enables the position-respecting behavior.
      The long-term plan here is to wait for userspace to be fixed in response
      to the new warning and to then switch the default kernel behavior to the
      new position-respecting behavior.
      This patch (of 4):
      The char buffer arguments are needlessly cast in weird places.  Clean it
      up so things are easier to read.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 18 May, 2014 1 commit
  4. 14 May, 2014 1 commit
  5. 05 May, 2014 1 commit
  6. 25 Apr, 2014 1 commit
  7. 07 Apr, 2014 1 commit
  8. 03 Apr, 2014 1 commit
  9. 13 Mar, 2014 1 commit
    • Jens Axboe's avatar
      block: remove old blk_iopoll_enabled variable · 89f8b33c
      Jens Axboe authored
      This was a debugging measure to toggle enabled/disabled
      when testing. But for real production setups, it's not
      safe to toggle this setting without either reloading
      drivers of quiescing IO first. Neither of which the toggle
      Additionally, it makes drivers deal with the conditional
      Remove it completely. It's up to the driver whether iopoll
      is enabled or not.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  10. 28 Jan, 2014 1 commit
  11. 25 Jan, 2014 2 commits
  12. 23 Jan, 2014 2 commits
    • Kees Cook's avatar
      kexec: add sysctl to disable kexec_load · 7984754b
      Kees Cook authored
      For general-purpose (i.e.  distro) kernel builds it makes sense to build
      with CONFIG_KEXEC to allow end users to choose what kind of things they
      want to do with kexec.  However, in the face of trying to lock down a
      system with such a kernel, there needs to be a way to disable kexec_load
      (much like module loading can be disabled).  Without this, it is too easy
      for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
      and modules_disabled are set.  With this change, it is still possible to
      load an image for use later, then disable kexec_load so the image (or lack
      of image) can't be altered.
      The intention is for using this in environments where "perfect"
      enforcement is hard.  Without a verified boot, along with verified
      modules, and along with verified kexec, this is trying to give a system a
      better chance to defend itself (or at least grow the window of
      discoverability) against attack in the face of a privilege escalation.
      In my mind, I consider several boot scenarios:
      1) Verified boot of read-only verified root fs loading fd-based
         verification of kexec images.
      2) Secure boot of writable root fs loading signed kexec images.
      3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
      4) Regular boot with no control of kexec image at all.
      1 and 2 don't exist yet, but will soon once the verified kexec series has
      landed.  4 is the state of things now.  The gap between 2 and 4 is too
      large, so this change creates scenario 3, a middle-ground above 4 when 2
      and 1 are not possible for a system.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andi Kleen's avatar
      numa: add a sysctl for numa_balancing · 54a43d54
      Andi Kleen authored
      Add a working sysctl to enable/disable automatic numa memory balancing
      at runtime.
      This allows us to track down performance problems with this feature and
      is generally a good idea.
      This was possible earlier through debugfs, but only with special
      debugging options set.  Also fix the boot message.
      [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 21 Jan, 2014 1 commit
    • Jerome Marchand's avatar
      mm: add overcommit_kbytes sysctl variable · 49f0ce5f
      Jerome Marchand authored
      Some applications that run on HPC clusters are designed around the
      availability of RAM and the overcommit ratio is fine tuned to get the
      maximum usage of memory without swapping.  With growing memory, the
      1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
      for these workload (on a 2TB machine it represents no less than 20GB).
      This patch adds the new overcommit_kbytes sysctl variable that allow a
      much finer grain.
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 13 Jan, 2014 2 commits
    • Peter Zijlstra's avatar
      sched/deadline: Remove the sysctl_sched_dl knobs · 1724813d
      Peter Zijlstra authored
      Remove the deadline specific sysctls for now. The problem with them is
      that the interaction with the exisiting rt knobs is nearly impossible
      to get right.
      The current (as per before this patch) situation is that the rt and dl
      bandwidth is completely separate and we enforce rt+dl < 100%. This is
      undesirable because this means that the rt default of 95% leaves us
      hardly any room, even though dl tasks are saver than rt tasks.
      Another proposed solution was (a discarted patch) to have the dl
      bandwidth be a fraction of the rt bandwidth. This is highly
      confusing imo.
      Furthermore neither proposal is consistent with the situation we
      actually want; which is rt tasks ran from a dl server. In which case
      the rt bandwidth is a direct subset of dl.
      So whichever way we go, the introduction of dl controls at this point
      is painful. Therefore remove them and instead share the rt budget.
      This means that for now the rt knobs are used for dl admission control
      and the dl runtime is accounted against the rt runtime. I realise that
      this isn't entirely desirable either; but whatever we do we appear to
      need to change the interface later, so better have a small interface
      for now.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Dario Faggioli's avatar
      sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks · 332ac17e
      Dario Faggioli authored
      In order of deadline scheduling to be effective and useful, it is
      important that some method of having the allocation of the available
      CPU bandwidth to tasks and task groups under control.
      This is usually called "admission control" and if it is not performed
      at all, no guarantee can be given on the actual scheduling of the
      -deadline tasks.
      Since when RT-throttling has been introduced each task group have a
      bandwidth associated to itself, calculated as a certain amount of
      runtime over a period. Moreover, to make it possible to manipulate
      such bandwidth, readable/writable controls have been added to both
      procfs (for system wide settings) and cgroupfs (for per-group
      Therefore, the same interface is being used for controlling the
      bandwidth distrubution to -deadline tasks and task groups, i.e.,
      new controls but with similar names, equivalent meaning and with
      the same usage paradigm are added.
      However, more discussion is needed in order to figure out how
      we want to manage SCHED_DEADLINE bandwidth at the task group level.
      Therefore, this patch adds a less sophisticated, but actually
      very sensible, mechanism to ensure that a certain utilization
      cap is not overcome per each root_domain (the single rq for !SMP
      Another main difference between deadline bandwidth management and
      RT-throttling is that -deadline tasks have bandwidth on their own
      (while -rt ones doesn't!), and thus we don't need an higher level
      throttling mechanism to enforce the desired bandwidth.
      This patch, therefore:
       - adds system wide deadline bandwidth management by means of:
          * /proc/sys/kernel/sched_dl_runtime_us,
          * /proc/sys/kernel/sched_dl_period_us,
         that determine (i.e., runtime / period) the total bandwidth
         available on each CPU of each root_domain for -deadline tasks;
       - couples the RT and deadline bandwidth management, i.e., enforces
         that the sum of how much bandwidth is being devoted to -rt
         -deadline tasks to stay below 100%.
      This means that, for a root_domain comprising M CPUs, -deadline tasks
      can be created until the sum of their bandwidths stay below:
          M * (sched_dl_runtime_us / sched_dl_period_us)
      It is also possible to disable this bandwidth management logic, and
      be thus free of oversubscribing the system up to any arbitrary level.
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
  15. 17 Dec, 2013 1 commit
  16. 12 Nov, 2013 1 commit
  17. 16 Oct, 2013 1 commit
  18. 09 Oct, 2013 3 commits
  19. 04 Oct, 2013 1 commit
  20. 23 Sep, 2013 1 commit
  21. 11 Sep, 2013 1 commit
  22. 10 Sep, 2013 1 commit
    • Glauber Costa's avatar
      fs: bump inode and dentry counters to long · 3942c07c
      Glauber Costa authored
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      This patch:
      There are situations in very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when umounting a filesystem, where eventually since every live object will
      eventually be discarded.
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch just moves int to longs.  Machines where it matters should have a
      big long anyway.
      Signed-off-by: default avatarGlauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  23. 26 Jul, 2013 1 commit
  24. 23 Jun, 2013 1 commit
    • Dave Hansen's avatar
      perf: Drop sample rate when sampling is too slow · 14c63f17
      Dave Hansen authored
      This patch keeps track of how long perf's NMI handler is taking,
      and also calculates how many samples perf can take a second.  If
      the sample length times the expected max number of samples
      exceeds a configurable threshold, it drops the sample rate.
      This way, we don't have a runaway sampling process eating up the
      This patch can tend to drop the sample rate down to level where
      perf doesn't work very well.  *BUT* the alternative is that my
      system hangs because it spends all of its time handling NMIs.
      I'll take a busted performance tool over an entire system that's
      busted and undebuggable any day.
      BTW, my suspicion is that there's still an underlying bug here.
      Using the HPET instead of the TSC is definitely a contributing
      factor, but I suspect there are some other things going on.
      But, I can't go dig down on a bug like that with my machine
      hanging all the time.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: paulus@samba.org
      Cc: acme@ghostprotocols.net
      Cc: Dave Hansen <dave@sr71.net>
      [ Prettified it a bit. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  25. 20 Jun, 2013 1 commit
    • Frederic Weisbecker's avatar
      watchdog: Rename confusing state variable · 3c00ea82
      Frederic Weisbecker authored
      We have two very conflicting state variable names in the
      * watchdog_enabled: This one reflects the user interface. It's
      set to 1 by default and can be overriden with boot options
      or sysctl/procfs interface.
      * watchdog_disabled: This is the internal toggle state that
      tells if watchdog threads, timers and NMI events are currently
      running or not. This state mostly depends on the user settings.
      It's a convenient state latch.
      Now we really need to find clearer names because those
      are just too confusing to encourage deep review.
      watchdog_enabled now becomes watchdog_user_enabled to reflect
      its purpose as an interface.
      watchdog_disabled becomes watchdog_running to suggest its
      role as a pure internal state.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Anish Singh <anish198519851985@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Don Zickus <dzickus@redhat.com>
  26. 19 Jun, 2013 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Disable tracing on warning · de7edd31
      Steven Rostedt (Red Hat) authored
      Add a traceoff_on_warning option in both the kernel command line as well
      as a sysctl option. When set, any WARN*() function that is hit will cause
      the tracing_on variable to be cleared, which disables writing to the
      ring buffer.
      This is useful especially when tracing a bug with function tracing. When
      a warning is hit, the print caused by the warning can flood the trace with
      the functions that producing the output for the warning. This can make the
      resulting trace useless by either hiding where the bug happened, or worse,
      by overflowing the buffer and losing the trace of the bug totally.
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
  27. 28 May, 2013 1 commit
  28. 29 Apr, 2013 3 commits
    • Andrew Shewmaker's avatar
      mm: replace hardcoded 3% with admin_reserve_pages knob · 4eeab4f5
      Andrew Shewmaker authored
      Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
      memory reserve to something other than 3%, which may be multiple
      gigabytes on large memory systems.  Only about 8MB is necessary to
      enable recovery in the default mode, and only a few hundred MB are
      required even when overcommit is disabled.
      admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
      I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
      Please see first patch in this series for full background, motivation,
      testing, and full changelog.
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_admin_reserve() static]
      Signed-off-by: default avatarAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andrew Shewmaker's avatar
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker authored
      Add user_reserve_kbytes knob.
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      user_reserve_pages defaults to min(3% free pages, 128MB)
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      This only affects OVERCOMMIT_NEVER mode.
      1. user reserve
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      bash: fork: Cannot allocate memory
      2. admin reserve
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      Also, the admin reserve would be more useful if it didn't shrink too much.
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      * "bash: fork: Cannot allocate memory"
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      * root-cant-log-in problem
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
       * Memory cgroups provide a more flexible way to limit application memory.
         Not everyone wants to set up cgroups or deal with their overhead.
       * We could create a fourth overcommit mode which provides smaller reserves.
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
       * Force users to initialize all of their memory or use calloc.
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
         admin_reserve_pages defaults to min(3% free memory, 8MB)
         So, anyone switching to 'never' mode needs to adjust
       * How do you calculate a minimum useful reserve?
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
       * What happens if user_reserve_pages is set to 0?
         Note, this only affects overcomitt 'never' mode.
         Then a user will be able to allocate all available memory minus
         However, they will easily see a message such as:
         "bash: fork: Cannot allocate memory"
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
       * What's the difference between overcommit 'guess' and 'never'?
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      Test Summary
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      1. maximize user-allocatable memory, running close to the edge of
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      Test Description
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      In overcommit 'never' mode, memory_ratio=100
      Test Results
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      User and Admin Recovery show their respective reserves, if applicable.
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      * process would successfully mlock, then the oom killer would pick it
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5359/5448       no     10MB no         10MB barely
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      * more memtesters were launched, able to allocate approximately another 100MB
      Future Work
       - Test larger memory systems.
       - Test an embedded image.
       - Test other architectures.
       - Time malloc microbenchmarks.
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: default avatarAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Josh Triplett's avatar
      fs: don't compile in drop_caches.c when CONFIG_SYSCTL=n · 146732ce
      Josh Triplett authored
      drop_caches.c provides code only invokable via sysctl, so don't compile it
      in when CONFIG_SYSCTL=n.
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  29. 27 Feb, 2013 1 commit
  30. 23 Feb, 2013 1 commit