- Aug 09, 2010
-
-
David Rientjes authored
This is a complete rewrite of the oom killer's badness() heuristic, which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space are used instead. This is a better indication of the amount of memory that will be freeable if the oom-killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory-hogging task.

The baseline for the heuristic is the proportion of memory that each task is currently using, in RAM plus swap, compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that a task with a badness() score of 500 consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM system-wide, so that mempolicies and cpusets may operate in isolation; they need not know the true size of the machine they are running on if they are bound to a specific set of nodes or mems, respectively.

Root tasks are given 3% extra memory, just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It is not possible to redefine the meaning of /proc/pid/oom_adj with a new scale, since the ABI cannot be changed for backward compatibility. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score, so a value of -500, for example, means to discount 50% of the task's memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken for userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by:
David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
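A minimal sketch of the scoring described above, assuming a simplified shape of the new heuristic; get_task_swap_pages() is a hypothetical stand-in for the swap accounting, and the clamping and root-bonus details are approximations of, not a copy of, the committed oom_badness():

	/* sketch: 0 = never kill, 1000 = always kill; totalpages is whatever
	 * "allowable" memory means in context (system, cpuset, mempolicy
	 * nodes, or memcg limit) */
	static long badness_sketch(struct task_struct *p, unsigned long totalpages)
	{
		long points;
		long adj = p->signal->oom_score_adj;		/* -1000 .. +1000 */

		if (adj == OOM_SCORE_ADJ_MIN)			/* -1000: never kill */
			return 0;

		/* proportion of allowable memory in RAM plus swap, scaled to 0..1000 */
		points = (get_mm_rss(p->mm) + get_task_swap_pages(p))	/* hypothetical helper */
				* 1000 / totalpages;

		/* root tasks get roughly a 3% bonus, as __vm_enough_memory() does in LSMs */
		if (has_capability_noaudit(p, CAP_SYS_ADMIN))
			points -= 30;

		points += adj;					/* oom_score_adj is added directly */

		if (points < 0)
			return 0;
		return points > 1000 ? 1000 : points;
	}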
-
Andrew Morton authored
Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
KOSAKI Motohiro authored
If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because the callers take care of it correctly. But there is one exception: /proc/<pid>/oom_score calls badness() directly and doesn't check whether the task is a regular process. As another example, /proc/1/oom_score returns a non-zero value even though init is unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
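A hedged fragment showing the kind of guard the fix implies; whether the check lives in badness() itself or only in the /proc/<pid>/oom_score path, and the exact predicate, are assumptions drawn from the description above:

	/* sketch: tasks that can never be oom killed must not report a score */
	if ((p->flags & PF_KTHREAD) || is_global_init(p))
		return 0;	/* kernel threads (even ones borrowing a mm via
				 * use_mm()) and init always score 0 */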
-
- Aug 07, 2010
-
-
David Howells authored
Fix the module init error handling. There are a bunch of goto labels for aborting the init procedure at different points and just undoing what needs undoing - they aren't all in the right places, however. This can lead to an oops like the following:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff81042a31>] destroy_workqueue+0x17/0xc0
...
Modules linked in: kafs(+) dns_resolver rxkad af_rxrpc fscache
Pid: 2171, comm: insmod Not tainted 2.6.35-cachefs+ #319 DG965RY/ ...
Process insmod (pid: 2171, threadinfo ffff88003ca6a000, task ffff88003dcc3050)
...
Call Trace:
 [<ffffffffa0055994>] afs_callback_update_kill+0x10/0x12 [kafs]
 [<ffffffffa007d1c5>] afs_init+0x190/0x1ce [kafs]
 [<ffffffffa007d035>] ? afs_init+0x0/0x1ce [kafs]
 [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
 [<ffffffff8105f7ee>] sys_init_module+0x9c/0x1de
 [<ffffffff81001eab>] system_call_fastpath+0x16/0x1b

Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
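A generic sketch of the unwind pattern the commit is fixing, with hypothetical step names (the real afs_init() uses different steps and labels); the point is that each error label must undo exactly what succeeded before the failure, in reverse order:

	static int __init example_init(void)	/* names are illustrative, not kafs */
	{
		int ret;

		ret = step_a_init();
		if (ret < 0)
			goto error_a;
		ret = step_b_init();
		if (ret < 0)
			goto error_b;
		ret = step_c_init();
		if (ret < 0)
			goto error_c;
		return 0;

	error_c:
		step_b_exit();	/* undo only what actually completed ... */
	error_b:
		step_a_exit();	/* ... in reverse order of initialisation */
	error_a:
		return ret;
	}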
-
J. Bruce Fields authored
Commit f9d7562f "nfsd4: share file descriptors between stateid's" didn't correctly account for O_RDWR opens. Symptoms include leaked files, resulting in failures to unmount and/or warnings about orphaned inodes on reboot. Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
- Aug 06, 2010
-
-
J. Bruce Fields authored
It's harmless to set this after the server is created, but also ineffective, since the value is only used at the time of svc_create_pooled(). So fail the attempt, in keeping with the pattern set by write_versions, write_{lease,grace}time and write_recoverydir. (This could break userspace that tried to write to nfsd/max_block_size between setting up sockets and starting the server. However, such code wouldn't have worked anyway, and I don't know of any examples--rpc.nfsd in nfs-utils, probably the only user of the interface, doesn't do that.) Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
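A hedged sketch of the check described above, assuming the nfsd_serv/nfsd_mutex globals of that era; the surrounding parsing code in write_maxblksize() is omitted and the exact error path may differ:

	/* inside write_maxblksize(), before accepting a new value (sketch) */
	mutex_lock(&nfsd_mutex);
	if (nfsd_serv) {
		/* the value is only consumed by svc_create_pooled() at server
		 * creation time, so changing it now would be silently ignored */
		mutex_unlock(&nfsd_mutex);
		return -EBUSY;
	}
	nfsd_max_blksize = bsize;
	mutex_unlock(&nfsd_mutex);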
-
J. Bruce Fields authored
Commit 59db4a0c "nfsd: move more into nfsd_startup()" inadvertently moved the initialization of nfsd_versions after nfsd_create_svc(). On older distributions using an rpc.nfsd that does not explicitly set the list of nfsd versions, this results in svc_create_pooled() being called with an empty versions array. The resulting incomplete initialization leads to a NULL dereference in svc_process_common() the first time a client accesses the server. Move nfsd_reset_versions() back before svc_create_pooled(); this time, put it closer to the svc_create_pooled() call, to make this mistake more difficult to repeat in the future. Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
Boaz Harrosh authored
If a callback is retried at nfsd4_cb_recall_done() due to some error, the returned rpc reply crashes here:

@@ -514,6 +514,7 @@ decode_cb_sequence(struct xdr_stream *xdr, struct nfsd4_cb_sequence *res,
 	u32 dummy;
 	__be32 *p;

+	BUG_ON(!res);
 	if (res->cbs_minorversion == 0)
 		return 0;

[BUG_ON added for demonstration]

This is because nfsd4_cb_done_sequence() has NULLed out the task->tk_msg.rpc_resp pointer. Also, eventually the rpc would use the new slot without making sure it is free by calling nfsd41_cb_setup_sequence(). This problem was introduced by a 4.1 protocol addition patch: [0421b5c5] nfsd41: Backchannel: Implement cb_recall over NFSv4.1, which overlooked the possibility of RPC callback retries. For the non-4.1 case, redoing the _prepare is harmless. Signed-off-by:
Boaz Harrosh <bharrosh@panasas.com> Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
J. Bruce Fields authored
We must create the server before we can call init_socks or check the number of threads. Symptoms were a NULL pointer dereference in nfsd_svc(). Problem identified by Jeff Layton. Also fix a minor cleanup-on-error case in nfsd_startup(). Reported-by:
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by:
J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Mark it as 'experimental' instead, since in practice, NFSv4.1 should now be relatively stable. Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Bryan Schumaker authored
Add a flag so we know if we mounted the NFS server using the legacy binary interface. If we used the legacy interface, then we should not show the mountd options. Signed-off-by:
Bryan Schumaker <bjschuma@netapp.com> Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
The delegation is protected by RCU now, so we need to replace the nfsi->rwsem protection with an RCU-protected section. Reported-by:
Fred Isaman <iisaman@netapp.com> Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
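A hedged sketch of the read-side pattern implied above; the field and helper names follow the NFS client delegation code of that era but should be treated as illustrative rather than the committed diff:

	/* sketch: replace down_read(&nfsi->rwsem)/up_read() with an RCU read section */
	rcu_read_lock();
	delegation = rcu_dereference(NFS_I(inode)->delegation);
	if (delegation != NULL) {
		/* only fields stable under RCU may be touched here; anything
		 * that can sleep or write must take the delegation's own lock */
		memcpy(stateid, &delegation->stateid, sizeof(*stateid));
	}
	rcu_read_unlock();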
-
David Howells authored
Make /dev/console get initialised before any initialisation routine that invokes modprobe because if modprobe fails, it's going to want to open /dev/console, presumably to write an error message to. The problem with that is that if the /dev/console driver is not yet initialised, the chardev handler will call request_module() to invoke modprobe, which will fail, because we never compile /dev/console as a module. This will lead to a modprobe loop, showing the following in the kernel log:

request_module: runaway loop modprobe char-major-5-1
request_module: runaway loop modprobe char-major-5-1
request_module: runaway loop modprobe char-major-5-1
request_module: runaway loop modprobe char-major-5-1
request_module: runaway loop modprobe char-major-5-1

This can happen, for example, when the built-in md5 module can't find the built-in cryptomgr module (because the latter fails to initialise). The md5 module comes before the call to tty_init(), presumably because 'crypto' comes before 'drivers' alphabetically. Fix this by calling tty_init() from chrdev_init(). Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Aug 05, 2010
-
-
Jean Delvare authored
sysfs_chmod_file doesn't change the attribute it operates on, so this attribute can be marked const. Signed-off-by:
Jean Delvare <khali@linux-fr.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
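A hedged illustration of what "marked const" means here; the attribute and kobject names are hypothetical, and the point is only that sysfs_chmod_file() never writes through the pointer it is given, so the object can live in read-only data:

	/* hypothetical driver attribute that can now be declared const */
	static const struct attribute demo_attr = {
		.name = "demo",
		.mode = 0444,
	};

	/* later, e.g. when the device becomes writable */
	err = sysfs_chmod_file(&dev->kobj, &demo_attr, 0644);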
-
Aditya Kali authored
If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an error. This error is returned to the caller, ext4_mb_regular_allocator() and then to ext4_mb_new_blocks(). But ext4_mb_new_blocks() did not check for the return value of ext4_mb_regular_allocator() and would repeatedly try to load the bitmap block. The fix simply catches the return value and exits out of the 'repeat' loop after cleanup. We also take the opportunity to clean up the error handling in ext4_mb_new_blocks(). Google-Bug-Id: 2853530 Signed-off-by:
Aditya Kali <adityakali@google.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
Satoru Takeuchi authored
- sys_io_destroy(): actually return -EINVAL if the context pointed to is invalid.
- sys_io_getevents(): the argument specifying the timeout is not 'when' but 'timeout'.
- sys_io_getevents(): describe what is returned if this syscall succeeds.

Signed-off-by:
Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by:
Randy Dunlap <randy.dunlap@oracle.com> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
In data=journal mode, we still use block_write_begin() to prepare a page for writing. This function can occasionally mark a buffer dirty, which violates journalling assumptions - when a buffer is part of a transaction, it should be dirty, and a buffer can already be part of a forget list of some transaction when block_write_begin() gets called. This violation of journalling assumptions then results in "JBD: Spotted dirty metadata buffer..." warnings. In fact, temporarily dirtying the buffer while the page is still locked does not really cause problems for the journalling, because we won't write the buffer until the page gets unlocked. So we just have to make sure to clear the dirty bits before unlocking the page. Reviewed-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Jan Kara <jack@suse.cz>
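A hedged sketch of "clear the dirty bits before unlocking the page"; the real fix lives in the ext4 write_begin/write_end paths and may be structured differently, but the buffer walk itself uses standard buffer-head helpers:

	/* sketch: before unlock_page(page) in the data=journal write path */
	if (ext4_should_journal_data(inode) && page_has_buffers(page)) {
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		do {
			/* the journal, not the VM, decides when these are written */
			clear_buffer_dirty(bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}
	unlock_page(page);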
-
Julia Lawall authored
hlist_for_each_entry binds its first argument to a non-null value, and thus any null test on the value of that argument is superfluous. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/ )

// <smpl>
@@
iterator I;
expression x,E,E1,E2;
statement S,S1,S2;
@@

I(x,...) { <...
- (x != NULL) &&
  E
...> }
// </smpl>

Signed-off-by:
Julia Lawall <julia@diku.dk> Signed-off-by:
David Teigland <teigland@redhat.com>
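A hedged before/after illustration of this class of change; the structure and field names are hypothetical, and note that in kernels of this era hlist_for_each_entry() took a separate struct hlist_node cursor as its second argument:

	/* hypothetical entry type hanging off an hlist */
	struct entry {
		int id;
		struct hlist_node node;
	};

	/* before: the NULL test on 'e' is superfluous - the iterator only
	 * runs the body after binding 'e' to a real entry */
	hlist_for_each_entry(e, pos, head, node) {
		if (e != NULL && e->id == wanted)
			return e;
	}

	/* after */
	hlist_for_each_entry(e, pos, head, node) {
		if (e->id == wanted)
			return e;
	}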
-
Changli Gao authored
Signed-off-by:
Changli Gao <xiaosuo@gmail.com> Signed-off-by:
David Teigland <teigland@redhat.com>
-
Jan Kara authored
In data=journal mode, we still use block_write_begin() to prepare a page for writing. This function can occasionally mark a buffer dirty, which violates journalling assumptions - when a buffer is part of a transaction, it should be dirty, and a buffer can already be part of a forget list of some transaction when block_write_begin() gets called. This violation of journalling assumptions then results in "JBD: Spotted dirty metadata buffer..." warnings. In fact, temporarily dirtying the buffer while the page is still locked does not really cause problems for the journalling, because we won't write the buffer until the page gets unlocked. So we just have to make sure to clear the dirty bits before unlocking the page. Signed-off-by:
Jan Kara <jack@suse.cz>
-
Wang Lei authored
Add DNS query support for AFS so that it can get the IP addresses of Volume Location servers from the DNS using an AFSDB record. This requires userspace support. /etc/request-key.conf must be configured to invoke a helper for dns_resolver type keys with a subtype of "afsdb:" in the description. Signed-off-by:
Wang Lei <wang840925@gmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
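A hedged example of the /etc/request-key.conf line implied above; the helper path (/usr/sbin/key.dns_resolver ships with keyutils) and the exact match pattern are assumptions about the local setup, not part of this commit:

	#OP     TYPE          DESCRIPTION   CALLOUT-INFO   PROGRAM + ARGS
	create  dns_resolver  afsdb:*       *              /usr/sbin/key.dns_resolver %k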
-
Wang Lei authored
Separate out the DNS resolver key type from the CIFS filesystem into its own module so that it can be made available for general use, including the AFS filesystem module. This facility makes it possible for the kernel to upcall to userspace to have it issue DNS requests, package up the replies and present them to the kernel in a useful form. The kernel is then able to cache the DNS replies as keys can be retained in keyrings. Resolver keys are of type "dns_resolver" and have a case-insensitive description that is of the form "[<type>:]<domain_name>". The optional <type> indicates the particular DNS lookup and packaging that's required. The <domain_name> is the query to be made. If <type> isn't given, a basic hostname to IP address lookup is made, and the result is stored in the key in the form of a printable string consisting of a comma-separated list of IPv4 and IPv6 addresses. This key type is supported by userspace helpers driven from /sbin/request-key and configured through /etc/request-key.conf. The cifs.upcall utility is invoked for UNC path server name to IP address resolution. The CIFS functionality is encapsulated by the dns_resolve_unc_to_ip() function, which is used to resolve a UNC path to an IP address for CIFS filesystem. This part remains in the CIFS module for now. See the added Documentation/networking/dns_resolver.txt for more information. Signed-off-by:
Wang Lei <wang840925@gmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> Acked-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
Jeff Layton authored
The commit that added the creduid=0x%x parameter failed to increase the buffer allocation to account for it. Reported-by:
J. Bruce Fields <bfields@fieldses.org> Signed-off-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
Jeff Layton authored
It turns out that not all directory inodes with dentries on the i_dentry list are unusable here. We only consider them unusable if they are still hashed or if they have a root dentry attached. Full disclosure -- this check is inherently racy. There's nothing that stops someone from slapping a new dentry onto this inode just after this check, or hashing an existing one that's already attached. So, this is really a "best effort" thing to work around misbehaving servers. Signed-off-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
David Howells authored
Make cifs_convert_address() take a const src pointer and a length so that all the strlen() calls in there can be cut out and so that it is unnecessary to modify the src string. Also return the data length from dns_resolve_server_name_to_ip() so that a strlen() can be cut out of cifs_compose_mount_options() too. Acked-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
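A hedged sketch of the reworked prototypes as described above; the exact parameter order and types in the committed code may differ:

	/* takes a const source plus an explicit length, so callers need not
	 * NUL-terminate or strlen() the address substring (sketch) */
	int cifs_convert_address(struct sockaddr *dst, const char *src, int len);

	/* now returns the length of the resolved address string on success,
	 * or a negative error (assumption based on the description above) */
	int dns_resolve_server_name_to_ip(const char *unc, char **ip_addr);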
-
Suresh Jayaraman authored
Fixed the nit pointed out by Jeff. From: Suresh Jayaraman <sjayaraman@suse.de> Subject: [PATCH 1/2] cifs: show features compiled in as part of DebugData This patch adds the features that are compiled in to the CIFS debugging data as shown below:

$ cat /proc/fs/cifs/DebugData
Display Internal CIFS Data Structures for Debugging
---------------------------------------------------
CIFS Version 1.64
Features: dfs fscache posix spnego xattr
Active VFS Requests: 0
...

This patch provides a definitive way to tell what features are currently enabled in the running kernel. This could also help debugging. Signed-off-by:
Suresh Jayaraman <sjayaraman@suse.de> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
Suresh Jayaraman authored
Update the README file to reflect that now DebugData shows all the features enabled. Signed-off-by:
Suresh Jayaraman <sjayaraman@suse.de> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
- Aug 04, 2010
-
-
Eric Sandeen authored
commit 3d0518f4, "ext4: New rec_len encoding for very large blocksizes" made several changes to this path, but from a perf perspective, un-inlining ext4_rec_len_from_disk() seems most significant. This function is called from ext4_check_dir_entry(), which on a file-creation workload is called extremely often. I tested this with bonnie:

# bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32

(this does 200 iterations) and got this for the file creations:

ext4 stock:   Average = 21206.8 files/s
ext4 inlined: Average = 22346.7 files/s (+5%)

Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu>
-
Tejun Heo authored
bd_prepare_to_claim() incorrectly allowed multiple attempts for exclusive open to progress in parallel if the attempting holders are identical. This triggered BUG_ON() as reported in the following bug. https://bugzilla.kernel.org/show_bug.cgi?id=16393 __bd_abort_claiming() is used to finish claiming blocks and doesn't work if multiple openers are inside a claiming block. Allowing multiple parallel open attempts to continue doesn't gain anything as those are serialized down in the call chain anyway. Fix it by always allowing only single open attempt in a claiming block. This problem can easily be reproduced by adding a delay after bd_prepare_to_claim() and attempting to mount two partitions of a disk. stable: only applicable to v2.6.35 Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Markus Trippelsdorf <markus@trippelsdorf.de> Cc: stable@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Uwe Kleine-König authored
Signed-off-by:
Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by:
Jiri Kosina <jkosina@suse.cz>
-
Trond Myklebust authored
This will allow us to save the original generic cred in rpc_message, so that if we migrate from one server to another, we can generate a new bound cred without having to punt back to the NFS layer. Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Latchesar Ionkov authored
Pass the correct end of the buffer to p9stat_read. Signed-off-by:
Latchesar Ionkov <lucho@ionkov.net> Signed-off-by:
Eric Van Hensbergen <ericvh@gmail.com>
-
- Aug 03, 2010
-
-
Trond Myklebust authored
Fix up those functions that depend on knowing whether or not rpc_restart_call() was successful. Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
There is no real reason to have RPC_ASSASSINATED() checks in the NFS code. As far as it is concerned, this is just an RPC error... Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
In RFC5661, an NFS4ERR_DELAY error on a SEQUENCE operation has the special meaning that the server is not finished processing the request. In this case we want to just retry the request without touching the slot. Also fix a bug whereby we would fail to update the sequence id if the server returned any error other than NFS_OK/NFS4ERR_DELAY. Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
We don't really support nfs servers that invalidate the file handle after a rename, so precautions such as flushing out dirty data before renaming the file are superfluous. Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
There is no need to flush out writes before calling nfs_wb_all(). Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Trond Myklebust authored
Christoph points out that the VFS will always flush out data before calling nfs_fsync(), so we can dispense with a full call to nfs_wb_all(), and replace that with a simpler call to nfs_commit_inode(). Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-