- Apr 07, 2009
-
-
Jens Axboe authored
When CFQ is waiting for a new request from a process, currently it'll immediately restart queuing when it sees such a request. This doesn't work very well with streamed IO, since we then end up splitting IO that would otherwise have been merged nicely. For a simple dd test, this causes 10x as many requests to be issued as we should have. Normally this goes unnoticed due to the low overhead of requests at the device side, but some hardware is very sensitive to request sizes and there it can cause big slowdowns. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Jens Axboe authored
The request inherits the unplug flag from the bio, but it isn't actually used. The bio flag stops at __make_request(), which tells it to unplug after submission. Passing it on to the request doesn't make any sense. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Jens Axboe authored
We only manipulate the must_dispatch and queue_new flags, they are not tested anymore. So get rid of them. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Jens Axboe authored
The IO scheduler core calls into the IO scheduler dispatch_request hook to move requests from the IO scheduler and into the driver dispatch list. It only does so when the dispatch list is empty. CFQ moves several requests to the dispatch list, which can cause higher latencies if we suddenly have to switch to some important sync IO. Change the logic to move one request at a time instead. This should almost be functionally equivalent to what we did before, except that we now honor 'quantum' as the maximum queue depth at the device side from any single cfqq. If there's just a single active cfqq, we allow up to 4 times the normal quantum. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
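A minimal sketch of the dispatch cap described in the entry above, assuming the CFQ field names of that era (cfq_quantum, busy_queues, dispatched); the helper name itself is hypothetical:

static bool cfq_may_dispatch_more(struct cfq_data *cfqd,
                                  struct cfq_queue *cfqq)
{
        unsigned int max_dispatch = cfqd->cfq_quantum;

        /* a lone active queue may drive the device up to 4x the quantum */
        if (cfqd->busy_queues == 1)
                max_dispatch *= 4;

        return cfqq->dispatched < max_dispatch;
}

The dispatch hook would consult this before handing a single request to the driver, instead of moving a whole batch onto the dispatch list in one go.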
-
Jerome Marchand authored
This forces in_flight to be zero when I/O stat accounting is turned on or off, and stops updating I/O stats in attempt_merge() when accounting is turned off. Signed-off-by:
Jerome Marchand <jmarchan@redhat.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
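A hedged sketch of the attempt_merge() guard described above; the function name and the partition/stat helpers shown (part_stat_lock, disk_map_sector_rcu, part_round_stats) follow the block layer of that era, but their exact use here is an assumption. The point is simply that merge-time accounting is skipped entirely when the io_stat flag is off, so in_flight cannot drift:

static void blk_account_io_merge(struct request *req)
{
        if (blk_do_io_stat(req->q) && req->rq_disk) {
                struct hd_struct *part;
                int cpu;

                cpu = part_stat_lock();
                part = disk_map_sector_rcu(req->rq_disk, req->sector);
                part_round_stats(cpu, part);
                /* the in_flight decrement follows the same guarded pattern */
                part_stat_unlock();
        }
}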
-
Jens Axboe authored
Simple helper functions to quiesce the request queue. These are currently only used for switching IO schedulers on-the-fly, but we can use them to properly switch IO accounting on and off as well. Signed-off-by:
Jerome Marchand <jmarchan@redhat.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
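A minimal usage sketch, assuming helpers named elv_quiesce_start()/elv_quiesce_end() (the entry above does not name them): the accounting flag is only flipped while the queue is drained, which is what makes the in_flight reset in the previous patch safe:

static void queue_set_iostats(struct request_queue *q, int enable)
{
        spin_lock_irq(q->queue_lock);
        elv_quiesce_start(q);           /* wait for in-flight requests to drain */

        if (enable)
                queue_flag_set(QUEUE_FLAG_IO_STAT, q);
        else
                queue_flag_clear(QUEUE_FLAG_IO_STAT, q);

        elv_quiesce_end(q);             /* let dispatch resume */
        spin_unlock_irq(q->queue_lock);
}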
-
- Apr 06, 2009
-
-
Alan Cox authored
Fix a typo (this was in the original patch but for some reason was not merged along with the code fixes) Signed-off-by:
Alan Cox <alan@redhat.com> Signed-off-by:
Jeff Garzik <jgarzik@redhat.com>
-
Jens Axboe authored
By default, CFQ will anticipate more IO from a given io context if the previously completed IO was sync. This used to be fine, since the only sync IO was reads and O_DIRECT writes. But with more "normal" sync writes being used now, we don't want to anticipate for those. Add a bio/request flag that informs the IO scheduler that this is a sync request that we should not idle for. Introduce WRITE_ODIRECT specifically for O_DIRECT writes, and make sure that the other sync writes set this flag. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
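A hedged illustration of the flag split; the exact macro composition in fs.h may differ slightly, so treat these definitions as a sketch rather than the real header contents:

#define WRITE_SYNC      (WRITE | (1 << BIO_RW_SYNCIO) | \
                         (1 << BIO_RW_UNPLUG) | (1 << BIO_RW_NOIDLE))
#define WRITE_ODIRECT   (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))

Ordinary sync writes carry BIO_RW_NOIDLE so CFQ will not idle waiting for more IO from the same context, while O_DIRECT writers keep the old anticipation behaviour.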
-
Jens Axboe authored
For the older SSD devices that don't do command queuing, we do want to enable plugging to get better merging. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
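A hedged sketch of the resulting check (the helper name is an assumption): plugging is now only skipped when the device is both non-rotational and capable of command queuing, so a non-queuing SSD keeps the plug and still gets merging:

static inline bool queue_should_plug(struct request_queue *q)
{
        /* only queuing-capable SSDs bypass plugging */
        return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
}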
-
Jens Axboe authored
This makes sure that we never wait on async IO for sync requests, instead of doing the split on writes vs reads. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 03, 2009
-
-
Li Zefan authored
Impact: output all of packet commands - not just the first 4 / 8 bytes Since commit d7e3c324 ("block: add large command support"), struct request->cmd has been changed from unsigned char cmd[BLK_MAX_CDB] to unsigned char *cmd. v1 -> v2 (by FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>): make sure rq->cmd_len is always initialized, and then we can use rq->cmd_len instead of BLK_MAX_CDB. Signed-off-by:
Li Zefan <lizf@cn.fujitsu.com> Acked-by:
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49D4507E.2060602@cn.fujitsu.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
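A minimal sketch of the output side of this change; the function name is a hypothetical stand-in and the trace_seq-based formatting is an assumption about the surrounding code, but it shows the switch from a fixed 4/8-byte prefix to rq->cmd_len bytes:

static int blk_log_dump_cdb(struct trace_seq *s,
                            const unsigned char *cdb, unsigned int len)
{
        unsigned int i;

        /* walk the full CDB, not just the first few bytes */
        for (i = 0; i < len; i++)
                if (!trace_seq_printf(s, "%02x ", cdb[i]))
                        return 0;       /* seq buffer full */

        return trace_seq_printf(s, "\n");
}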
-
- Mar 26, 2009
-
-
Boaz Harrosh authored
bsg submits REQ_TYPE_BLOCK_PC so the right check is max_hw_sectors. But I've removed this check because right after, bsg proceeds with calling blk_rq_map_user() which does all the right checks. Signed-off-by:
Boaz Harrosh <bharrosh@panasas.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Boaz Harrosh authored
Put a WARN_ON in __blk_put_request if it is about to leak bio(s). This is a serious bug that can happen in error handling code paths. For this to work I have fixed a couple of places in block/ where request->bio != NULL ownership was not honored. And a small cleanup at sg_io() while at it. Signed-off-by:
Boaz Harrosh <bharrosh@panasas.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
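A hedged sketch of where the check sits; everything around the WARN_ON is trimmed:

void __blk_put_request(struct request_queue *q, struct request *req)
{
        if (--req->ref_count)
                return;

        /*
         * A request that still carries bios at this point would leak
         * them: the caller should have completed or detached them.
         */
        WARN_ON(req->bio != NULL);

        /* ... normal freeing path continues here ... */
}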
-
- Mar 24, 2009
-
-
Boaz Harrosh authored
Currently, as inherited from sg.c, bsg will submit asynchronous requests at the head of the queue (using "at_head" set in the call to blk_execute_rq_nowait()). This is bad in situations where the queues are full: requests will execute out of order, which can cause starvation of the first submitted requests. The sg_io_v4->flags member is used and a bit is allocated to denote Q_AT_TAIL. Zero means queue at_head as before, to stay compatible with old code on the write/read path. The SG_IO code path behavior was changed to match the write/read behavior. SG_IO was very rarely used and breaking compatibility with it is OK at this stage. sg_io_hdr in sg.h also has a flags member and uses 3 bits from the first nibble and one bit from the last nibble. Even though none of these bits are supported by bsg, the second nibble is allocated for use by bsg, just in case. Signed-off-by:
Boaz Harrosh <bharrosh@panasas.com> CC: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
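A hedged sketch of how the new bit is consumed when the request is queued; the wrapper function here is hypothetical and the surrounding call is simplified, but BSG_FLAG_Q_AT_TAIL is the flag added by this change:

static void bsg_queue_rq(struct request_queue *q, struct request *rq,
                         struct sg_io_v4 *hdr, rq_end_io_fn *done)
{
        /* default stays head-of-queue insertion for old callers */
        int at_head = !(hdr->flags & BSG_FLAG_Q_AT_TAIL);

        blk_execute_rq_nowait(q, NULL, rq, at_head, done);
}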
-
Jens Axboe authored
Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Jens Axboe authored
It calls blk_queue_make_request(), which sets the identical set of limits. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
- Mar 12, 2009
-
-
Rusty Russell authored
Impact: cleanup This is presumably what those definitions are for, and while all archs define cpu_core_map/cpu_sibling_map, that's changing (e.g. x86 wants to change it to a pointer). Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au>
-
James Bottomley authored
This allows it to compile and be used on the ps3 platform that wants to use the #define values in scsi.h without actually having CONFIG_SCSI set. Signed-off-by:
James Bottomley <James.Bottomley@HansenPartnership.com>
-
- Mar 06, 2009
-
-
Jens Axboe authored
Commit 1e428079 introduced a bug where we don't get front/back segment sizes in the bio in blk_recount_segments(). Fix this by tracking the back bio as well as the front bio in __blk_recalc_rq_segments(); this also cleans up the interface by getting rid of the segment size pointer passing. Tested-by:
Thomas Gleixner <tglx@linutronix.de> Tested-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
- Feb 26, 2009
-
-
Jens Axboe authored
blk_recalc_rq_segments() requires a request structure passed in, which we don't have from blk_recount_segments(). So the latter allocates one on the stack, using > 400 bytes of stack for that. This can cause us to spill over one page of stack from ext4 at least:

 0)     4560     400   blk_recount_segments+0x43/0x62
 1)     4160      32   bio_phys_segments+0x1c/0x24
 2)     4128      32   blk_rq_bio_prep+0x2a/0xf9
 3)     4096      32   init_request_from_bio+0xf9/0xfe
 4)     4064     112   __make_request+0x33c/0x3f6
 5)     3952     144   generic_make_request+0x2d1/0x321
 6)     3808      64   submit_bio+0xb9/0xc3
 7)     3744      48   submit_bh+0xea/0x10e
 8)     3696     368   ext4_mb_init_cache+0x257/0xa6a [ext4]
 9)     3328     288   ext4_mb_regular_allocator+0x421/0xcd9 [ext4]
10)     3040     160   ext4_mb_new_blocks+0x211/0x4b4 [ext4]
11)     2880     336   ext4_ext_get_blocks+0xb61/0xd45 [ext4]
12)     2544      96   ext4_get_blocks_wrap+0xf2/0x200 [ext4]
13)     2448      80   ext4_da_get_block_write+0x6e/0x16b [ext4]
14)     2368     352   mpage_da_map_blocks+0x7e/0x4b3 [ext4]
15)     2016     352   ext4_da_writepages+0x2ce/0x43c [ext4]
16)     1664      32   do_writepages+0x2d/0x3c
17)     1632     144   __writeback_single_inode+0x162/0x2cd
18)     1488      96   generic_sync_sb_inodes+0x1e3/0x32b
19)     1392      16   sync_sb_inodes+0xe/0x10
20)     1376      48   writeback_inodes+0x69/0xb3
21)     1328     208   balance_dirty_pages_ratelimited_nr+0x187/0x2f9
22)     1120     224   generic_file_buffered_write+0x1d4/0x2c4
23)      896     176   __generic_file_aio_write_nolock+0x35f/0x393
24)      720      80   generic_file_aio_write+0x6c/0xc8
25)      640      80   ext4_file_write+0xa9/0x137 [ext4]
26)      560     320   do_sync_write+0xf0/0x137
27)      240      48   vfs_write+0xb3/0x13c
28)      192      64   sys_write+0x4c/0x74
29)      128     128   system_call_fastpath+0x16/0x1b

Split the segment counting out into a __blk_recalc_rq_segments() helper to avoid allocating an onstack request just for checking the physical segment count. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
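A hedged sketch of the split; the merge and size-limit handling is elided, this only shows why no on-stack struct request is needed any more:

static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
                                             struct bio *bio)
{
        unsigned int nr_phys_segs = 0;
        struct bio_vec *bv;
        int i;

        /* count segments; the real code also applies merge limits */
        bio_for_each_segment(bv, bio, i)
                nr_phys_segs++;

        return nr_phys_segs;
}

void blk_recount_segments(struct request_queue *q, struct bio *bio)
{
        bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
        bio->bi_flags |= (1 << BIO_SEG_VALID);
}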
-
Márton Németh authored
Add documentation for register_blkdev() function and for the parameters. Signed-off-by:
Márton Németh <nm127@freemail.hu> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
- Feb 25, 2009
-
-
Peter Zijlstra authored
Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework the code so that we can use CSD_FLAG_LOCK for both purposes. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Feb 20, 2009
-
-
Rusty Russell authored
This prepares for a real __alloc_percpu by adding an alignment argument. Only one place uses __alloc_percpu directly, and that's for a string. tj: af_inet also uses __alloc_percpu(), update it. Signed-off-by:
Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk>
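A hedged usage example of the new calling convention; the type and function below are illustrative only, but the size-plus-alignment pair is the new API shape:

#include <linux/percpu.h>

struct my_counters {                    /* illustrative type */
        unsigned long hits;
        unsigned long misses;
};

static struct my_counters *alloc_counters(void)
{
        /* size plus explicit alignment, instead of size alone */
        return __alloc_percpu(sizeof(struct my_counters),
                              __alignof__(struct my_counters));
}

The alloc_percpu(type) convenience macro can then expand to exactly this pattern.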
-
- Feb 18, 2009
-
-
Hannes Reinecke authored
blk_abort_queue() iterates the timeout list and aborts each request on the list, but if the driver error handling re-adds a request to the timeout list during this processing, we could be looping forever. Fix this by splicing current entries to a local list and running over that list instead. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
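A sketch of the splice approach described above, close to the shape such a function would take but trimmed, so treat the details as assumptions:

void blk_abort_queue(struct request_queue *q)
{
        unsigned long flags;
        struct request *rq, *tmp;
        LIST_HEAD(list);

        spin_lock_irqsave(q->queue_lock, flags);

        elv_abort_queue(q);

        /*
         * Splice onto a private list first: if a driver's error handler
         * re-adds a request to q->timeout_list while we abort, we will
         * not walk it again and loop forever.
         */
        list_splice_init(&q->timeout_list, &list);

        list_for_each_entry_safe(rq, tmp, &list, timeout_list)
                blk_abort_request(rq);

        spin_unlock_irqrestore(q->queue_lock, flags);
}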
-
Neil Brown authored
Hi Tejun, it looks like your commit "block: don't depend on consecutive minor space" (f331c029) broke a particular case for booting from partitioned md/raid devices. That is the second time this has been broken recently. The previous time was fixed by "block: do_mounts - accept root=<non-existant partition>" (30f2f0eb). Because the data isn't available when an md device is first created (we add disks and set it up after creation), the initial partition scan finds nothing. It is not until the device is opened that another partition scan happens and finds something. So at the point where the kernel parameter "root=/dev/md_d0p1" is being parsed, md_d0 exists, but md_d0p1 does not. However if we let blk_lookup_devt return the correct device number even though the device doesn't exist, then the attempt to mount it will successfully find the partition. I have tried in the past to find a way to get the partition table to be read as soon as the array is assembled but that proved impossible (at the time). I don't remember the details, and could possibly revisit it. However it would be really nice if blk_lookup_devt could be adjusted to again accept non-existent partitions. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Jens Axboe authored
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before 213d9417. Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
Boaz Harrosh authored
When submitting requests via SG_IO, which does sync IO, a bsg_command is not allocated, so an in-kernel sense_buffer was not set. However, when calling blk_execute_rq() with no sense buffer, one is provided from the stack. bsg at blk_complete_sgv4_hdr_rq() would then check rq->sense_len, and if a sense was requested by sg_io_v4, rq->sense was copy_user()'ed back, but by then it is already mangled stack memory. I have fixed that by forcing a sense_buffer when calling bsg_map_hdr(). The bsg_command->sense is provided in the write/read path like before, and an on-stack buffer is provided when doing SG_IO. I have also fixed a dprintk message to print rq->errors in hex, because of the scsi bit-field use of this member. For other block devices it does not matter anyway. Signed-off-by:
Boaz Harrosh <bharrosh@panasas.com> Acked-by:
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
-
- Feb 09, 2009
-
-
Frederic Weisbecker authored
Impact: cleanup Move blktrace.c to kernel/trace, also move its config entry. Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com> Acked-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Jens Axboe <jens.axboe@oracle.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Feb 05, 2009
-
-
Arnaldo Carvalho de Melo authored
Impact: cleanup To make it easy for ftrace plugin writers, as this was open coded in the existing plugins. Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frédéric Weisbecker <fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: new API These new functions do what previously was being open coded, reducing the number of details ftrace plugin writers have to worry about. It also standardizes the handling of stacktrace, userstacktrace and other trace options we may introduce in the future. With this patch, for instance, the blk tracer (and some others already in the tree) can use the "userstacktrace" /d/tracing/trace_options facility.

$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
  trace_vprintk              |   -5
  trace_graph_return         |  -22
  trace_graph_entry          |  -26
  trace_function             |  -45
  __ftrace_trace_stack       |  -27
  ftrace_trace_userstack     |  -29
  tracing_sched_switch_trace |  -66
  tracing_stop               |   +1
  trace_seq_to_user          |   -1
  ftrace_trace_special       |  -63
  ftrace_special             |   +1
  tracing_sched_wakeup_trace |  -70
  tracing_reset_online_cpus  |   -1
 13 functions changed, 2 bytes added, 355 bytes removed, diff: -353

linux-2.6-tip/block/blktrace.c:
  __blk_add_trace |  -58
 1 function changed, 58 bytes removed, diff: -58

linux-2.6-tip/kernel/trace/trace.c:
  trace_buffer_lock_reserve  |  +88
  trace_buffer_unlock_commit |  +86
 2 functions changed, 174 bytes added, diff: +174

/tmp/vmlinux.after:
 16 functions changed, 176 bytes added, 413 bytes removed, diff: -237

Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frédéric Weisbecker <fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: API change, cleanup
From ring_buffer_{lock_reserve,unlock_commit}.

$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
  trace_vprintk              |  -14
  trace_graph_return         |  -14
  trace_graph_entry          |  -10
  trace_function             |   -8
  __ftrace_trace_stack       |   -8
  ftrace_trace_userstack     |   -8
  tracing_sched_switch_trace |   -8
  ftrace_trace_special       |  -12
  tracing_sched_wakeup_trace |   -8
 9 functions changed, 90 bytes removed, diff: -90

linux-2.6-tip/block/blktrace.c:
  __blk_add_trace |   -1
 1 function changed, 1 bytes removed, diff: -1

/tmp/vmlinux.after:
 10 functions changed, 91 bytes removed, diff: -91

Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frédéric Weisbecker <fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: cleanup Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: simplification of tracers As all tracers are doing this we might as well do it in register_ftrace_event and save one branch each time we call these callbacks. Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frederic Weisbecker <fweisbec@gmail.com> Acked-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Feb 04, 2009
-
-
Arnaldo Carvalho de Melo authored
As they actually all return these enumerators. Reported-by:
Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: bugfix and cleanup Some callsites were returning either TRACE_ITER_PARTIAL_LINE if the trace_seq routines (trace_seq_printf, etc) returned 0, meaning its buffer was full, or zero otherwise. But...

/* Return values for print_line callback */
enum print_line_t {
        TRACE_TYPE_PARTIAL_LINE = 0,    /* Retry after flushing the seq */
        TRACE_TYPE_HANDLED      = 1,
        TRACE_TYPE_UNHANDLED    = 2     /* Relay to other output functions */
};

In other cases the return value was not being relayed at all. Most of the time it didn't hurt because the page wasn't getting filled, but for correctness' sake, handle the return values everywhere. Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by:
Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Feb 03, 2009
-
-
Arnaldo Carvalho de Melo authored
Impact: cleanup Reported-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: new feature With this and a blkrawverify modified not to verify the sequence numbers we can start using the userspace tools to verify that the data produced with the ftrace plugin works as expected. Example:

[root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable
[root@f10-1 ~]# echo bin > /d/tracing/trace_options
[root@f10-1 ~]# echo blk > /d/tracing/current_tracer
[root@f10-1 ~]# cat /d/tracing/trace_pipe > sda1.blktrace.0
^C
[root@f10-1 ~]# ./blkrawverify --noseq sda1
Verifying sda1
CPU 0
Wrote output to sda1.verify.out
[root@f10-1 ~]# cat sda1.verify.out
--------------- Verifying sda1 ---------------------
Summary for cpu 0:
        1349 valid + 0 invalid (100.0%) processed
[root@f10-1 ~]#

Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Arnaldo Carvalho de Melo authored
Impact: API change The trace_seq and trace_entry are in trace_iterator, where there are more fields that may be needed by tracers, so just pass the tracer_iterator as is already the case for struct tracer->print_line. Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Feb 02, 2009
-
-
Jens Axboe authored
Some initial probe requests don't have disk->queue mapped yet, so we can't rely on a non-NULL queue in blk_queue_io_stat(). Wrap it in blk_do_io_stat(). Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
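A minimal sketch of the wrapper, assuming it sits alongside the other inline block-layer helpers; the real body may differ, the point is the NULL-queue check:

static inline int blk_do_io_stat(struct request_queue *q)
{
        if (q)
                return blk_queue_io_stat(q);

        return 0;
}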
-
- Jan 30, 2009
-
-
Divyesh Shah authored
This patch adds the ability to pre-empt an ongoing BE timeslice when a RT request is waiting for the current timeslice to complete. This reduces the wait time to disk for RT requests from an upper bound of 4 (current value of cfq_quantum) to 1 disk request. Applied Jens' suggested changes to avoid the rb lookup and use !cfq_class_rt() and retested.

Latency(secs) for the RT task when doing sequential reads from 10G file.

                        | only RT | RT + BE | RT + BE + this patch
 small (512 byte) reads |     143 |     163 |     145
 large (1Mb) reads      |     142 |     158 |     146

Signed-off-by:
Divyesh Shah <dpshah@google.com> Signed-off-by:
Jens Axboe <jens.axboe@oracle.com>
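A hedged sketch of the pre-emption test described above; the busy_rt_queues counter and the helper name are assumptions about the surrounding CFQ code:

static bool cfq_rt_should_preempt(struct cfq_data *cfqd,
                                  struct cfq_queue *cfqq)
{
        /* a BE/idle queue gives up its slice as soon as RT work is pending */
        return cfqd->busy_rt_queues && !cfq_class_rt(cfqq);
}

The dispatch path would then expire the current slice (cfq_slice_expired()) so the waiting RT queue is picked next.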
-