- Jul 27, 2011
-
-
Chris Mason authored
The btrfs transaction code will return any errors that come from reserve_metadata_bytes. We need to make sure we don't return funny things like 1 or EAGAIN. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
David Howells authored
Since __proc_create() appends the name it is given to the end of the PDE structure that it allocates, there isn't a need to store a name pointer. Instead we can just replace the name pointer with a terminal char array of _unspecified_ length. The compiler will simply append the string to statically defined variables of PDE type overlapping any hole at the end of the structure and, unlike specifying an explicitly _zero_ length array, won't give a warning if you try to statically initialise it with a string of more than zero length. Also, whilst we're at it: (1) Move namelen to end just prior to name and reduce it to a single byte (name shouldn't be longer than NAME_MAX). (2) Move pde_unload_lock two places further on so that if it's four bytes in size on a 64-bit machine, it won't cause an unused hole in the PDE struct. Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Chris Mason authored
Now that we are using regular file crcs for the free space cache, we can deadlock if we try to read the free_space_inode while we are updating the crc tree. This commit fixes things by using the commit_root to read the crcs. This is safe because we the free space cache file would already be loaded if that block group had been changed in the current transaction. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
For metadata buffers that don't straddle pages (all of them), btrfs can safely use the page uptodate bits and extent_buffer uptodate bit instead of needing to use the extent_state tree. This greatly reduces contention on the state tree lock. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
Before the reader/writer locks, btrfs_next_leaf needed to keep the path blocking to avoid making lockdep upset. Now that btrfs_next_leaf only takes read locks, this isn't required. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
This patch was originally from Tejun Heo. lockdep complains about the btrfs locking because we sometimes take btree locks from two different trees at the same time. The current classes are based only on level in the btree, which isn't enough information for lockdep to figure out if the lock is safe. This patch makes a class for each type of tree, and lumps all the FS trees that actually have files and directories into the same class. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
Hit this nice little deadlock. What happens is this __btrfs_end_transaction with throttle set, --use_count so it equals 0 btrfs_commit_transaction <somebody else actually manages to start the commit> btrfs_end_transaction --use_count so now its -1 <== BAD we just return and wait on the transaction This is bad because we just return after our use_count is -1 and don't let go of our num_writer count on the transaction, so the guy committing the transaction just sits there forever. Fix this by inc'ing our use_count if we're going to call commit_transaction so that if we call btrfs_end_transaction it's valid. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
The extent_buffers have a very complex interface where we use HIGHMEM for metadata and try to cache a kmap mapping to access the memory. The next commit adds reader/writer locks, and concurrent use of this kmap cache would make it even more complex. This commit drops the ability to use HIGHMEM with extent buffers, and rips out all of the related code. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
When we balanced the chunks across the devices, BUG_ON() in __finish_chunk_alloc() was triggered. ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:2568! [SNIP] Call Trace: [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs] [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs] [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs] [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs] [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs] [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs] [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs] [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs] [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs] [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs] [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs] [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs] [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs] [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160 [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs] [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs] [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30 [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs] [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs] [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs] [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs] [<ffffffff81159c63>] ? generic_permission+0x23/0xb0 [<ffffffff8115b415>] vfs_create+0xa5/0xc0 [<ffffffff8115ce6e>] do_last+0x5fe/0x880 [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0 [<ffffffff8115e029>] do_filp_open+0x49/0xa0 [<ffffffff8116a965>] ? alloc_fd+0x95/0x160 [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0 [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff8114f1e0>] sys_open+0x20/0x30 [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs] The reason is: Task1 Space balance task do_chunk_alloc() __finish_chunk_alloc() update device info in the chunk tree alloc system metadata block relocate system metadata block group set system metadata block group readonly, This block group is the only one that can allocate space. So there is no free space that can be allocated now. find no space and don't try to alloc new chunk, and then return ENOSPC BUG_ON() in __finish_chunk_alloc() was triggered. Fix this bug by allocating a new system metadata chunk before relocating the old one if we find there is no free space which can be allocated after setting the old block group to be read-only. Reported-by:
Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Tested-by:
Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
Everybody else does this, we need to do it too. If we're syncing, we need to tag the pages we're going to write for writeback so we don't end up writing the same stuff over and over again if somebody is constantly redirtying our file. This will keep us from having latencies with heavy sync workloads. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
Kill the check to see if we have 512mb of reserved space in delalloc and shrink_delalloc if we do. This causes unexpected latencies and we have other logic to see if we need to throttle. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we need GFP_NOFS so we don't deadlock. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com>
-
Josef Bacik authored
A user reported a deadlock when copying a bunch of files. This is because they were low on memory and kthreadd got hung up trying to migrate pages for an allocation when starting the caching kthread. The page was locked by the person starting the caching kthread. To fix this we just need to use the async thread stuff so that the threads are already created and we don't have to worry about deadlocks. Thanks, Reported-by:
Roman Mamedov <rm@romanrm.ru> Signed-off-by:
Josef Bacik <josef@redhat.com>
-
- Jul 26, 2011
-
-
Christoph Hellwig authored
Since the addition of file capabilities every write needs to read xattrs to check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added a flag to cache the fact that we do not have any attributes on an inode. Make sure to already mark a file as not having any attributes when reading it from disk in case it doesn't even have an attribute fork. Based on an earlier patch from Andi Kleen. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Christoph Hellwig authored
We need to take some locks to prevent new ioends from coming in when we wait for all existing ones to go away. Up to Linux 3.0 that was done using the i_mutex held by the VFS fsync code, but now that we are called without it we need to take care of it ourselves. Use the I/O lock instead of i_mutex just like we do in other places. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Christoph Hellwig authored
Now that REQ_META bios aren't treated specially in the CFQ I/O schedule anymore, we can tag all buffers as metadata to make blktrace traces more meaningful. Note that we use buffers also to zero out partial blocks in the preallocation / hole punching code, and while they operate on data blocks the zeros written certainly aren't data. I think this case is borderline metadata enough to not bother special casing it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Alex Elder authored
Pull into a helper function some debug-only code that validates a xfs_da_blkinfo structure that's been read from disk. Signed-off-by:
Alex Elder <aelder@sgi.com> Reviewed-by:
Christoph Hellwig <hch@infradead.org>
-
Arun Sharma authored
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by:
Arun Sharma <asharma@fb.com> Reviewed-by:
Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by:
Mike Frysinger <vapier@gentoo.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
acct_arg_size() takes ->page_table_lock around add_mm_counter() if !SPLIT_RSS_COUNTING. This is not needed after commit 172703b0 ("mm: delete non-atomic mm counter implementation"). Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Reviewed-by:
Matt Fleming <matt.fleming@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
If CONFIG_MODULES=n, it makes no sense to retry the list of binary formats handler because the list will not be modified by request_module(). Signed-off-by:
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
Currently, search_binary_handler() tries to load binary loader module using request_module() if a loader for the requested program is not yet loaded. But second attempt of request_module() does not affect the result of search_binary_handler(). If request_module() triggered recursion, calling request_module() twice causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is not an infinite loop but is sufficient for users to consider as a hang up. Therefore, this patch changes not to call request_module() twice, making 1 to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion. Signed-off-by:
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by:
Richard Weinberger <richard@nod.at> Tested-by:
Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Commit a8bef8ff ("mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time one, so BUILD_BUG_ON is more appropriate. Signed-off-by:
Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Richard Weinberger <richard@nod.at> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Vasiliy Kulikov authored
If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges. Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise. Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race. The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand(). Signed-off-by:
Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Daisuke Ogino authored
Change the return value to ENOENT. This return value is then returned when opening the proc entry that have been removed. For example, open("/proc/bus/pci/XX/YY") when the corresponding device is being hot-removed. Signed-off-by:
Daisuke Ogino <ogino.daisuke@jp.fujitsu.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by:
Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
do_coredump() assumes that if format_corename() fails it should return -ENOMEM. This is not true, for example cn_print_exe_file() can propagate the error from d_path. Even if it was true, this is too fragile. Change the code to check "ispipe < 0". Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Jiri Slaby <jslaby@suse.cz> Reviewed-by:
Neil Horman <nhorman@tuxdriver.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jiri Slaby authored
Change every occurence of / in comm and hostname to !. If the process changes its name to contain /, the core is not dumped (if the directory tree doesn't exist like that). The same with hostname being something like myhost/3. Fix this behaviour by using the escape loop used in %E. (We extract it to a separate function.) Now both with comm == myprocess/1 and hostname == myhost/1, the core is dumped like (kernel.core_pattern='core.%p.%e.%h): core.2349.myprocess!1.myhost!1 Signed-off-by:
Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jiri Slaby authored
If we don't know the file corresponding to the binary (i.e. exe_file is unknown), use "task->comm (path unknown)" instead of simple "(unknown)" as suggested by ak. The fallback is the same as %e except it will append "(path unknown)". Signed-off-by:
Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Al Viro authored
The kludge in question is undocumented and doesn't work for 32bit binaries on amd64, sparc64 and s390. Passing (mode_t)-1 as mode had (since 0.99.14v and contrary to behaviour of any other Unix, prescriptions of POSIX, SuS and our own manpages) was kinda-sorta no-op. Note that any software relying on that (and looking for examples shows none) would be visibly broken on sparc64, where practically all userland is built 32bit. No such complaints noticed... Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
mode_t is not a bitmap... Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Sage Weil authored
For the most part we don't care about racing with rename when directing MDS requests; either the old or new parent is fine. Document that, and do some minor cleanup. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash valud. While we're here, simpify the flow through ceph_encode_fh() so that there is a single exit point and cleanup. Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Protect d_parent with d_lock. Carry a reference. Simplify the flow so that there is a single exit point and cleanup. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by:
Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
The ->lookup() and prepopulate_readdir() callers are working with unhashed dentries, so we don't have to worry. The export.c callers, though, need to initialize something they got back from d_obtain_alias() and are potentially racing with other callers. Make sure we don't return unless the dentry is properly initialized (by us or someone else). Reported-by:
Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Curretly ceph_add_cap clears the complete bit if we are newly issued the FILE_SHARED cap, which is normally the case for a newly issue cap on a new directory. That means we clear the just-set bit. Move the check that sets the flag to after the cap is added/updated. Reviewed-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-