Discussion:
[PATCH v2] block: BFQ default for single queue devices
Linus Walleij
2018-10-15 14:10:59 UTC
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.

I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.

We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.

I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.

My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.

Cc: Pavel Machek <***@ucw.cz>
Cc: Paolo Valente <***@linaro.org>
Cc: Jens Axboe <***@kernel.dk>
Cc: Ulf Hansson <***@linaro.org>
Cc: Richard Weinberger <***@nod.at>
Cc: Adrian Hunter <***@intel.com>
Cc: Bart Van Assche <***@acm.org>
Cc: Jan Kara <***@suse.cz>
Cc: Artem Bityutskiy <***@gmail.com>
Cc: Christoph Hellwig <***@infradead.org>
Cc: Alan Cox <***@lxorguk.ukuu.org.uk>
Cc: Mark Brown <***@kernel.org>
Cc: Damien Le Moal <***@wdc.com>
Cc: Johannes Thumshirn <***@suse.de>
Cc: Oleksandr Natalenko <***@natalenko.name>
Cc: Jonathan Corbet <***@lwn.net>
Signed-off-by: Linus Walleij <***@linaro.org>
---
ChangeLog v1->v2:
- Add a quirk so that devices with zoned writes are forced
to use the deadline scheduler. This is necessary since only
that scheduler supports zoned writes.
- There is a summary article in LWN for subscribers:
https://lwn.net/Articles/767987/
---
block/elevator.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 8fdcd64ae12e..6e6048ca3471 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
 }
 
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices. If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * For blk-mq devices, we default to using:
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
 	struct elevator_type *e;
+	const char *policy;
 	int err = 0;
 
 	if (q->nr_hw_queues != 1)
@@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
 
-	e = elevator_get(q, "mq-deadline", false);
+	/*
+	 * Zoned devices must use a deadline scheduler because currently
+	 * that is the only scheduler respecting zoned writes.
+	 */
+	if (blk_queue_is_zoned(q))
+		policy = "mq-deadline";
+	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		policy = "bfq";
+	else
+		policy = "mq-deadline";
+
+	e = elevator_get(q, policy, false);
 	if (!e)
 		goto out_unlock;
--
2.17.2
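
The selection order in this patch can be modelled outside the kernel. The sketch below is a hypothetical userspace stand-in, not the kernel code: `struct queue_model`, `default_policy()` and the `bfq_built` flag are names invented for illustration, with `bfq_built` playing the role of `IS_ENABLED(CONFIG_IOSCHED_BFQ)` and `zoned` that of `blk_queue_is_zoned(q)`. A NULL return stands for "none".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the queue properties the patch inspects. */
struct queue_model {
	unsigned int nr_hw_queues; /* number of hardware dispatch queues */
	bool zoned;                /* models blk_queue_is_zoned(q) */
};

/*
 * Returns the scheduler name the patch would request by default,
 * or NULL for "none". bfq_built models IS_ENABLED(CONFIG_IOSCHED_BFQ).
 */
static const char *default_policy(const struct queue_model *q, bool bfq_built)
{
	if (q->nr_hw_queues != 1)
		return NULL;          /* multiqueue devices get no scheduler */
	if (q->zoned)
		return "mq-deadline"; /* zoned writes need mq-deadline */
	return bfq_built ? "bfq" : "mq-deadline";
}
```

In the patch itself a failed `elevator_get()` lookup still falls through to "none"; the sketch does not model that last-resort case.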
Paolo Valente
2018-10-15 14:22:25 UTC
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
Unless someone reports (hopefully reproducible) regressions with
common single-queue hardware, then
Acked-by: Paolo Valente <***@linaro.org>

Thanks,
Paolo
Post by Linus Walleij
---
- Add a quirk so that devices with zoned writes are forced
to use the deadline scheduler, this is necessary since only
that scheduler supports zoned writes.
https://lwn.net/Articles/767987/
---
block/elevator.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 8fdcd64ae12e..6e6048ca3471 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
}
/*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices. If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
*/
int elevator_init_mq(struct request_queue *q)
{
struct elevator_type *e;
+ const char *policy;
int err = 0;
if (q->nr_hw_queues != 1)
@@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
if (unlikely(q->elevator))
goto out_unlock;
- e = elevator_get(q, "mq-deadline", false);
+ /*
+ * Zoned devices must use a deadline scheduler because currently
+ * that is the only scheduler respecting zoned writes.
+ */
+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";
+ else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+ policy = "bfq";
+ else
+ policy = "mq-deadline";
+
+ e = elevator_get(q, policy, false);
if (!e)
goto out_unlock;
--
2.17.2
Oleksandr Natalenko
2018-10-15 14:32:53 UTC
Hi.
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
---
- Add a quirk so that devices with zoned writes are forced
to use the deadline scheduler, this is necessary since only
that scheduler supports zoned writes.
https://lwn.net/Articles/767987/
---
block/elevator.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 8fdcd64ae12e..6e6048ca3471 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
}
/*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices. If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
*/
int elevator_init_mq(struct request_queue *q)
{
struct elevator_type *e;
+ const char *policy;
int err = 0;
if (q->nr_hw_queues != 1)
@@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
if (unlikely(q->elevator))
goto out_unlock;
- e = elevator_get(q, "mq-deadline", false);
+ /*
+ * Zoned devices must use a deadline scheduler because currently
+ * that is the only scheduler respecting zoned writes.
+ */
+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";
+ else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+ policy = "bfq";
+ else
+ policy = "mq-deadline";
If more rules are needed in the future, shall we just add extra ifs,
or would it be better to craft some struct/table now, plus a policy
search helper?
Post by Linus Walleij
+
+ e = elevator_get(q, policy, false);
if (!e)
goto out_unlock;
--
Oleksandr Natalenko (post-factum)
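
As a rough illustration of the struct/table idea raised above, here is a minimal userspace sketch. Nothing in it comes from the patch: `struct sq_queue`, `struct policy_rule`, the `rule_*` predicates and `find_policy()` are all hypothetical names, and the `bfq_built` field merely stands in for `IS_ENABLED(CONFIG_IOSCHED_BFQ)`. Rules are tried in order and the first match wins, which reproduces the if/else chain for single queue devices.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a single queue device. */
struct sq_queue {
	bool zoned;     /* models blk_queue_is_zoned(q) */
	bool bfq_built; /* models IS_ENABLED(CONFIG_IOSCHED_BFQ) */
};

/* One table entry: a predicate and the policy it selects. */
struct policy_rule {
	bool (*matches)(const struct sq_queue *q);
	const char *policy;
};

static bool rule_zoned(const struct sq_queue *q) { return q->zoned; }
static bool rule_bfq(const struct sq_queue *q) { return q->bfq_built; }
static bool rule_always(const struct sq_queue *q) { (void)q; return true; }

/* Order matters: the zoned rule must come before the bfq rule. */
static const struct policy_rule sq_rules[] = {
	{ rule_zoned,  "mq-deadline" },
	{ rule_bfq,    "bfq" },
	{ rule_always, "mq-deadline" },
};

/* Returns the policy of the first matching rule, or NULL if none match. */
static const char *find_policy(const struct sq_queue *q)
{
	for (size_t i = 0; i < sizeof(sq_rules) / sizeof(sq_rules[0]); i++)
		if (sq_rules[i].matches(q))
			return sq_rules[i].policy;
	return NULL;
}
```

With only three rules the table is longer than the if/else chain it replaces, so it would probably pay off only once more rules appear.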
Linus Walleij
2018-10-19 08:33:47 UTC
On Mon, Oct 15, 2018 at 4:32 PM Oleksandr Natalenko
Post by Oleksandr Natalenko
+ /*
+ * Zoned devices must use a deadline scheduler because currently
+ * that is the only scheduler respecting zoned writes.
+ */
+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";
+ else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+ policy = "bfq";
+ else
+ policy = "mq-deadline";
If more rules are needed in the future, shall we just add extra ifs,
or would it be better to craft some struct/table now, plus a policy
search helper?
Let's do it when it happens. Premature optimization is the root
of all evil ;)

Yours,
Linus Walleij
Oleksandr Natalenko
2018-10-19 09:26:37 UTC
Hi.
Post by Linus Walleij
Post by Oleksandr Natalenko
+ /*
+ * Zoned devices must use a deadline scheduler because currently
+ * that is the only scheduler respecting zoned writes.
+ */
+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";
+ else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+ policy = "bfq";
+ else
+ policy = "mq-deadline";
If more rules are needed in the future, shall we just add extra ifs,
or would it be better to craft some struct/table now, plus a policy
search helper?
Let's do it when it happens. Premature optimization is the root
of all evil ;)
I'd say this is a matter of code readability, not optimisation. I do
not strongly object to the current approach, though.
--
Oleksandr Natalenko (post-factum)
Bart Van Assche
2018-10-15 15:02:15 UTC
Post by Linus Walleij
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch.

Thanks,

Bart.
Paolo Valente
2018-10-15 18:34:56 UTC
Post by Bart Van Assche
Post by Linus Walleij
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch.
Hi Bart,
as I just wrote to Jens I don't think we need this test any longer.
To save you one hop, I'll paste my reply to Jens below.

Anyway, it is very easy to do the tests you ask:
- take a kernel containing the last bfq commits, such as for-next
- do, e.g.,
git clone https://github.com/Algodev-github/S.git
cd S/run_multiple_benchmarks
sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none"
- compare results

Of course, do not do it for multi-queue devices or for single-queue
devices on steroids that do 400-500 KIOPS.

I'll see if I can convince someone to repeat these tests with a recent
SSD.

And here is again my reply to Jens, which I think holds for your repeated
objection too.

I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS. Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.

In particular, in all my tests with real hardware, bfq performance
- is not even comparable to that of any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than that of the other schedulers for only one test,
on only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.

So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.

Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware for
one kind of test case.

Thanks,
Paolo
Post by Bart Van Assche
Thanks,
Bart.
Paolo Valente
2018-10-17 05:18:44 UTC
Post by Paolo Valente
Post by Bart Van Assche
Post by Linus Walleij
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch.
Hi Bart,
as I just wrote to Jens I don't think we need this test any longer.
To save you one hop, I'll paste my reply to Jens below.
Anyway, it is very easy to do the tests you ask:
- take a kernel containing the last bfq commits, such as for-next
- do, e.g.,
git clone https://github.com/Algodev-github/S.git
cd S/run_multiple_benchmarks
sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq none"
- compare results
Two things:

1) By mistake, I put 'none' in the last command line above, but it should be mq-deadline:

sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline"

2) If you are worried about wearing your device with writes, then just append 'raw' to the last command line. So:

sudo ./run_main_benchmarks.sh "throughput replayed-startup" "bfq mq-deadline" raw

'raw' means: "don't even create files for the background traffic, but just read raw sectors".

Thanks,
Paolo
Post by Paolo Valente
Of course, do not do it for multi-queue devices or for single-queue
devices on steroids that do 400-500 KIOPS.
I'll see if I can convince someone to repeat these tests with a recent
SSD.
And here is again my reply to Jens, which I think holds for your repeated
objection too.
I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS. Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.
In particular, in all my tests with real hardware, bfq performance
- is not even comparable to that of any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than that of the other schedulers for only one test,
on only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.
So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.
Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware for
one kind of test case.
Thanks,
Paolo
Post by Bart Van Assche
Thanks,
Bart.
Federico Motta
2018-10-16 16:14:26 UTC
Post by Bart Van Assche
Post by Linus Walleij
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch.
Thanks,
Bart.
Hi,
although these tests should be run on single-queue devices, I tried to
run them on a high-performance NVMe device. IMHO, if the results are good
on such a "difficult to deal with" multi-queue device, they should also be
good enough on a "simpler" single-queue storage device.

Testbed specs:
kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains
also the commits that will be available from 4.20)
fs = ext4
drive = ssd samsung 960 pro NVMe m.2 512gb

Device data sheet specs state that under random IO:
* QD 1 thread 1
* read = 14 kIOPS
* write = 50 kIOPS
* QD 32 thread 4
* read = write = 330 kIOPS

What follows is a results summary; on request I can provide all
results. The workload notation (e.g. 5r5w-seq) means:
- num_readers (5r)
- num_writers (5w)
- sequential_io or random_io (-seq)


# replayed gnome-terminal startup time (lower is better)
workload   bfq-mq [s]   none [s]   % gain
--------   ----------   --------   ------
10r-seq    0.3725       2.79       86.65
5r5w-seq   0.9725       5.53       82.41

# throughput (higher is better)
workload    bfq-mq [MB/s]   none [MB/s]   % gain
---------   -------------   -----------   -------
10r-rand    394.806         429.735       -8.128
10r-seq     1387.63         1431.81       -3.086
1r-seq      838.13          798.872       4.914
5r5w-rand   1118.12         1297.46       -13.822
5r5w-seq    1187            1313.8        -9.651

Thanks,
Federico

[1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
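
The message does not say how the "% gain" column was computed. The sketch below is an assumption that happens to reproduce the reported figures: the gain is taken relative to the "none" value, with the sign flipped for the lower-is-better startup times. The function names are invented for illustration.

```c
/*
 * Gain for a lower-is-better metric (e.g. startup time in seconds):
 * positive when bfq takes less time than the baseline.
 */
static double gain_lower_is_better(double bfq, double baseline)
{
	return (baseline - bfq) / baseline * 100.0;
}

/*
 * Gain for a higher-is-better metric (e.g. throughput in MB/s):
 * positive when bfq moves more data than the baseline.
 */
static double gain_higher_is_better(double bfq, double baseline)
{
	return (bfq - baseline) / baseline * 100.0;
}
```

For the 10r-seq startup row, `gain_lower_is_better(0.3725, 2.79)` gives roughly 86.65, and for the 5r5w-rand throughput row, `gain_higher_is_better(1118.12, 1297.46)` gives roughly -13.82, matching the tables.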
Paolo Valente
2018-10-16 16:26:09 UTC
Post by Federico Motta
Post by Bart Van Assche
Post by Linus Walleij
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
For SATA SSDs nr_hw_queues == 1 so this patch will also affect these SSDs.
Since this patch is an attempt to improve performance, I'd like to see
measurement data for one or more recent SATA SSDs before a decision is
taken about what to do with this patch.
Thanks,
Bart.
Hi,
although these tests should be run on single-queue devices, I tried to
run them on a high-performance NVMe device. IMHO, if the results are good
on such a "difficult to deal with" multi-queue device, they should also be
good enough on a "simpler" single-queue storage device.
Testbed specs:
kernel = 4.18.0 (from bfq dev branch [1], where bfq already contains
also the commits that will be available from 4.20)
fs = ext4
drive = ssd samsung 960 pro NVMe m.2 512gb
Device data sheet specs state that under random IO:
* QD 1 thread 1
* read = 14 kIOPS
* write = 50 kIOPS
* QD 32 thread 4
* read = write = 330 kIOPS
What follows is a results summary; on request I can provide all
results. The workload notation (e.g. 5r5w-seq) means:
- num_readers (5r)
- num_writers (5w)
- sequential_io or random_io (-seq)
# replayed gnome-terminal startup time (lower is better)
workload   bfq-mq [s]   none [s]   % gain
--------   ----------   --------   ------
10r-seq    0.3725       2.79       86.65
5r5w-seq   0.9725       5.53       82.41
# throughput (higher is better)
workload    bfq-mq [MB/s]   none [MB/s]   % gain
---------   -------------   -----------   -------
10r-rand    394.806         429.735       -8.128
10r-seq     1387.63         1431.81       -3.086
1r-seq      838.13          798.872       4.914
5r5w-rand   1118.12         1297.46       -13.822
5r5w-seq    1187            1313.8        -9.651
Somewhat unexpectedly for me, the throughput loss for random I/O is even
lower than what I obtained with my nasty SATA SSD (and reported
in my public results).

I didn't expect such a small loss with sequential parallel reads either.
Probably, when going multiqueue, there are changes I haven't even
thought about (I have never even tested bfq on a multi-queue device).

Thanks,
Paolo
Post by Federico Motta
Thanks,
Federico
[1] https://github.com/Algodev-github/bfq-mq/commits/bfq-mq
Jens Axboe
2018-10-15 15:39:17 UTC
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.

This kind of policy does not belong in the kernel, at least
not in the current form. If we had some sort of "enable best
options for a desktop" then it could fall under that umbrella.
--
Jens Axboe
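
For context on the userspace alternative: on Linux the per-device scheduler is exposed through `/sys/block/<dev>/queue/scheduler`, which lists the available schedulers and marks the active one in square brackets. The helper below is a minimal sketch of how a small tool might extract the active name from such a line. The function name and buffer handling are illustrative only, and it assumes the bracketed form is present.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copies the bracketed (active) scheduler name from a line such as
 * "mq-deadline kyber [bfq] none" into out.
 * Returns 0 on success, -1 if no bracketed name fits in out.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!lb || !rb)
		return -1;
	len = (size_t)(rb - lb - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

A udev rule or init script would then write the desired name back to the same file. Embedded systems without such tooling are the case the patch description is concerned with.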
Paolo Valente
2018-10-15 18:26:50 UTC
Post by Jens Axboe
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.
Hi Jens,
in my RAID tests bfq performed as well as in non-RAID tests. Probably
you refer to the fact that, in a RAID configuration, IOPS can become
very high. But, if that is the case, then the response to your
objections already emerged in the previous thread. Let me sum it up
again.

I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS. Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.

In particular, in all my tests with real hardware, bfq performance
- is not even comparable to that of any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than that of the other schedulers for only one test,
on only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.

So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.

Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware, for
one kind of test case.
Post by Jens Axboe
This kind of policy does not belong in the kernel, at least
not in the current form. If we had some sort of "enable best
options for a desktop" then it could fall under that umbrella.
I don't think bfq can be considered a scheduler for only desktops any
longer.

Thanks,
Paolo
Post by Jens Axboe
--
Jens Axboe
Jens Axboe
2018-10-15 19:26:53 UTC
Post by Paolo Valente
Post by Jens Axboe
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that makes most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.
Hi Jens,
in my RAID tests bfq performed as well as in non-RAID tests. Probably
you refer to the fact that, in a RAID configuration, IOPS can become
very high. But, if that is the case, then the response to your
objections already emerged in the previous thread. Let me sum it up
again.
I tested bfq on virtually every device in the range from a few hundred
IOPS to 50-100 KIOPS. Then, through the public script I already
mentioned, I found the maximum number of IOPS that bfq can handle:
about 400K with a commodity CPU.
In particular, in all my tests with real hardware, bfq performance
- is not even comparable to that of any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than that of the other schedulers for only one test,
on only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.
So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.
Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware, for
one kind of test case.
I should have been more clear - I'm not worried about IOPS overhead,
I'm worried about scheduling decisions that lower performance on
(for instance) raid composed of many drives (rotational or otherwise).

If you have actual data (on what hardware, and what kind of tests)
to disprove that worry, then that's great, and I'd love to see that.
--
Jens Axboe
Paolo Valente
2018-10-15 19:44:31 UTC
Post by Jens Axboe
Post by Paolo Valente
Post by Jens Axboe
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that make most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.
Hi Jens,
in my RAID tests bfq performed as well as in non-RAID tests. Probably
you refer to the fact that, in a RAID configuration, IOPS can become
very high. But, if that is the case, then the response to your
objections already emerged in the previous thread. Let me sum it up
again.
I tested bfq on virtually every device in the range from few hundred
of IOPS to 50-100KIOPS. Then, through the public script I already
about 400K with a commodity CPU.
In particular, in all my tests with real hardware, bfq
- is not even comparable to any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than the other schedulers for only one test, on
only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.
So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.
Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware, for
one kind of test cases.
I should have been more clear - I'm not worried about IOPS overhead,
I'm worried about scheduling decisions that lower performance on
(for instance) raid composed of many drives (rotational or otherwise).
If you have actual data (on what hardware, and what kind of tests)
to disprove that worry, then that's great, and I'd love to see that.
Here are some old results with a very simple configuration:
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/

Then I stopped repeating tests that always yielded the same good results.

As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.

Anyway, if you have specific issues in mind, I can check more deeply.

Thanks,
Paolo
Post by Jens Axboe
--
Jens Axboe
Jens Axboe
2018-10-16 17:35:59 UTC
Post by Paolo Valente
Post by Jens Axboe
Post by Paolo Valente
Post by Jens Axboe
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that make most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
I still don't like this. There are going to be tons of
cases where the single queue device is some hw raid setup
or similar, where performance is going to be much worse with
BFQ than it is with mq-deadline, for instance. That's just
one case.
Hi Jens,
in my RAID tests bfq performed as well as in non-RAID tests. Probably
you refer to the fact that, in a RAID configuration, IOPS can become
very high. But, if that is the case, then the response to your
objections already emerged in the previous thread. Let me sum it up
again.
I tested bfq on virtually every device in the range from few hundred
of IOPS to 50-100KIOPS. Then, through the public script I already
about 400K with a commodity CPU.
In particular, in all my tests with real hardware, bfq
- is not even comparable to any of the other schedulers, in
terms of responsiveness, latency for real-time applications, ability
to provide strong bandwidth guarantees, ability to boost throughput
while guaranteeing bandwidths;
- is a little worse than the other schedulers for only one test, on
only some hardware: total throughput with random reads, where it may
lose up to 10-15% of throughput. Of course, the schedulers that reach
a higher throughput leave the machine unusable during the test.
So I really cannot see a reason why bfq could do worse than any of
these other schedulers for some single-queue device (conservatively)
below 300KIOPS.
Finally, since, AFAICT, single-queue devices doing 400+ KIOPS are
probably less than 1% of all the single-queue storage around (USB
drives, HDDs, eMMC, standard SSDs, ...), by sticking to mq-deadline we
are sacrificing 99% of the hardware, to help 1% of the hardware, for
one kind of test cases.
I should have been more clear - I'm not worried about IOPS overhead,
I'm worried about scheduling decisions that lower performance on
(for instance) raid composed of many drives (rotational or otherwise).
If you have actual data (on what hardware, and what kind of tests)
to disprove that worry, then that's great, and I'd love to see that.
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
Then I stopped repeating tests that always yielded the same good results.
As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.
Anyway, if you have specific issues in mind, I can check more deeply.
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.

I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
--
Jens Axboe
Jan Kara
2018-10-17 10:05:26 UTC
Post by Jens Axboe
Post by Paolo Valente
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
Then I stopped repeating tests that always yielded the same good results.
As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.
Anyway, if you have specific issues in mind, I can check more deeply.
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.
I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...

Honza
--
Jan Kara <***@suse.com>
SUSE Labs, CR
Bart Van Assche
2018-10-17 14:48:33 UTC
Post by Jan Kara
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
What's missing in this discussion is a definition of "embedded system".
Is that a system like a streaming player for TV channels that neither
has a keyboard nor a display or a system that can run multiple apps
simultaneously like a smartphone? I think the difference matters because
some embedded devices hardly do any background I/O nor load any
executable code from storage after boot. So at least for some embedded
devices the problem discussed in this e-mail thread does not exist.

Bart.
Bryan Gurney
2018-10-17 14:59:25 UTC
Post by Jan Kara
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
What's missing in this discussion is a definition of "embedded system". Is
that a system like a streaming player for TV channels that neither has a
keyboard nor a display or a system that can run multiple apps simultaneously
like a smartphone? I think the difference matters because some embedded
devices hardly do any background I/O nor load any executable code from
storage after boot. So at least for some embedded devices the problem
discussed in this e-mail thread does not exist.
Bart.
There are high-performance embedded systems on the market (NAS, etc.).

I feel strongly about the prevention of users running into errors
because of an incorrect scheduler default, because I encountered that
situation three times in my testing with zoned block devices. The
switch to SCSI_MQ would resolve that, since mq-deadline is the
default, but in my case, I was using Fedora 28, which disables
CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
default scheduler was cfq.

Hopefully there aren't any other cases where choosing the "wrong
default scheduler" leads to errors. Ideally the default scheduler
choice should prevent any errors, leaving it up to the distros to
configure a default via other methods, to optimize for performance.


Thanks,

Bryan
Linus Walleij
2018-10-19 08:42:53 UTC
Post by Bryan Gurney
I feel strongly about the prevention of users running into errors
because of an incorrect scheduler default, because I encountered that
situation three times in my testing with zoned block devices. The
switch to SCSI_MQ would resolve that, since mq-deadline is the
default, but in my case, I was using Fedora 28, which disables
CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
default scheduler was cfq.
I think we should make a patch to the kernel that makes it
impossible (even from sysfs) to choose a non-zone aware
scheduler for these devices.

It's another topic than $SUBJECT patch though. I take this
into account in this version.

Yours,
Linus Walleij
Bryan Gurney
2018-10-19 13:36:49 UTC
Post by Linus Walleij
Post by Bryan Gurney
I feel strongly about the prevention of users running into errors
because of an incorrect scheduler default, because I encountered that
situation three times in my testing with zoned block devices. The
switch to SCSI_MQ would resolve that, since mq-deadline is the
default, but in my case, I was using Fedora 28, which disables
CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
default scheduler was cfq.
I think we should make a patch to the kernel that makes it
impossible (even from sysfs) to choose a non-zone aware
scheduler for these devices.
It's another topic than $SUBJECT patch though. I take this
into account in this version.
I like this idea. I don't have enough experience to write this patch
myself, but I imagine something like adding "bool is_zoned_aware" to
"struct elevator_type", and setting that true only for the schedulers
that are currently zoned-device aware (which is currently deadline on
single queue, mq-deadline on blk-mq).
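
To make the idea concrete, here is a minimal userspace sketch of such a check. It is not kernel code: the struct is a reduced stand-in for "struct elevator_type", the "is_zoned_aware" field is only the proposal from this mail, and elevator_find() and elevator_switch_allowed() are hypothetical helper names. The flag values follow what has been said in this thread, namely that only mq-deadline is zoned-aware on blk-mq.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Reduced userspace stand-in for the kernel's struct elevator_type.
 * Only the scheduler name and the proposed flag are modelled.
 */
struct elevator_type {
	const char *elevator_name;
	bool is_zoned_aware;	/* proposed field, assumed for this sketch */
};

/* Flag values as described in the thread for the blk-mq schedulers. */
static const struct elevator_type elevators[] = {
	{ "mq-deadline", true  },
	{ "kyber",       false },
	{ "bfq",         false },
};

/* Hypothetical lookup helper: find a scheduler by name, or NULL. */
static const struct elevator_type *elevator_find(const char *name)
{
	for (size_t i = 0; i < sizeof(elevators) / sizeof(elevators[0]); i++)
		if (strcmp(elevators[i].elevator_name, name) == 0)
			return &elevators[i];
	return NULL;
}

/*
 * Returns 0 if switching the queue to the named scheduler is allowed,
 * -1 if the scheduler is unknown or the queue is zoned and the
 * scheduler is not zoned-aware.
 */
static int elevator_switch_allowed(bool queue_is_zoned, const char *name)
{
	const struct elevator_type *e = elevator_find(name);

	if (!e)
		return -1;
	if (queue_is_zoned && !e->is_zoned_aware)
		return -1;
	return 0;
}
```

With a check like this in the switch path, a write of "bfq" to the scheduler attribute of a zoned device would be rejected, while the same write on a regular device would still go through.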


Thanks,

Bryan
Johannes Thumshirn
2018-10-19 13:44:38 UTC
Post by Bryan Gurney
I like this idea. I don't have enough experience to write this patch
myself, but I imagine something like adding "bool is_zoned_aware" to
"struct elevator_type", and setting that true only for the schedulers
that are currently zoned-device aware (which is currently deadline on
single queue, mq-deadline on blk-mq).
I don't think this is needed currently as a) Jens is working on getting
rid of the legacy path, which leaves us with mq-deadline only, and b) Linus'
patch has:

+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";

Which chooses mq-deadline on a zoned device.
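
As a rough userspace model of the selection order described in the patch (a sketch only, not the patch itself): "struct queue_info" and default_policy() are hypothetical names, the fallback to mq-deadline when BFQ is not built is taken from the existing single-queue default, and returning NULL for multi-queue devices is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the queue properties that matter here. */
struct queue_info {
	unsigned int nr_hw_queues;
	bool zoned;		/* what blk_queue_is_zoned(q) would report */
	bool bfq_available;	/* BFQ is built in or loadable */
};

/*
 * Pick the default scheduler name for a queue.
 * Returns NULL when no default scheduler is selected.
 */
static const char *default_policy(const struct queue_info *q)
{
	const char *policy;

	/* Assumption: only single hardware queue devices get a default. */
	if (q->nr_hw_queues != 1)
		return NULL;

	/* Prefer BFQ if it is available, else keep mq-deadline. */
	policy = q->bfq_available ? "bfq" : "mq-deadline";

	/* Zoned devices need mq-deadline, so this check comes last. */
	if (q->zoned)
		policy = "mq-deadline";

	return policy;
}
```

Because the zoned check is applied last, it overrides the BFQ preference no matter how the earlier choice was made.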

So nothing to worry about here now.

All this only given Linus' patch actually gets merged.

Byte,
Johannes
--
Johannes Thumshirn SUSE Labs
***@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Bryan Gurney
2018-10-19 14:16:05 UTC
Post by Johannes Thumshirn
Post by Bryan Gurney
I like this idea. I don't have enough experience to write this patch
myself, but I imagine something like adding "bool is_zoned_aware" to
"struct elevator_type", and setting that true only for the schedulers
that are currently zoned-device aware (which is currently deadline on
single queue, mq-deadline on blk-mq).
I don't think this is needed currently as a) Jens is working on getting
rid of the legacy path,
Once the legacy schedulers are gone, the default (prior to Linus'
proposed patch) will be mq-deadline, which is zoned-device-aware. So
the default scheduler will be "safer" for zoned devices.

However, it will still be possible for users (or distro defaults) to
select a non-zoned-aware scheduler, such as "none", "kyber", or "bfq"
(prior to this patch). So there would still be a window for users to
encounter the same problems I found when aborted commands start
occurring during otherwise normal filesystem or storage activity, by
drivers that are otherwise compliant with the handling characteristics
of zoned block devices.
Post by Johannes Thumshirn
which leaves us with mq-deadline only and Linus'
+ if (blk_queue_is_zoned(q))
+ policy = "mq-deadline";
Which chooses mq-deadline on a zoned device.
So nothing to worry about here now.
All this only given Linus' patch actually gets merged.
I hope it does get merged. I keep forgetting to save my "zoned
devices use deadline" udev rule on my SMR drive test machine in
between reinstalls.


Thanks,

Bryan
Jens Axboe
2018-10-22 08:12:15 UTC
Post by Linus Walleij
Post by Bryan Gurney
I feel strongly about the prevention of users running into errors
because of an incorrect scheduler default, because I encountered that
situation three times in my testing with zoned block devices. The
switch to SCSI_MQ would resolve that, since mq-deadline is the
default, but in my case, I was using Fedora 28, which disables
CONFIG_SCSI_MQ_DEFAULT (which is enabled in the 4.18 kernel), so my
default scheduler was cfq.
I think we should make a patch to the kernel that makes it
impossible (even from sysfs) to choose a non-zone aware
scheduler for these devices.
It's another topic than $SUBJECT patch though. I take this
into account in this version.
Yes I agree, and I'd be happy to take such a patch. The only matching we
do now is mq-sched for mq-device, and vice versa. And that will be
going away in 4.21, when there are no more !mq devices that use
scheduling.

If your device is zoned, then you should not be able to switch to a
scheduler that doesn't have support for that. The right approach here
would be to add a capability flag to the IO schedulers.
--
Jens Axboe
Mark Brown
2018-10-17 16:01:57 UTC
Post by Jan Kara
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
What's missing in this discussion is a definition of "embedded system". Is
that a system like a streaming player for TV channels that neither has a
keyboard nor a display or a system that can run multiple apps simultaneously
like a smartphone? I think the difference matters because some embedded
devices hardly do any background I/O nor load any executable code from
storage after boot. So at least for some embedded devices the problem
discussed in this e-mail thread does not exist.
It's a combination of things - smartphones are definitely part of the
target audience but other things can be affected, I'd guess your
streaming TV player example can have issues if it's got local storage
and downloads things in the background for example. There's definitely
systems that never really use storage once they're booted but there's
also things that move data around and/or have interactive apps. Even
with some of the things that don't really use storage at runtime it can
be important to help cut down boot times.
Jens Axboe
2018-10-17 16:29:22 UTC
Post by Jan Kara
Post by Jens Axboe
Post by Paolo Valente
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
Then I stopped repeating tests that always yielded the same good results.
As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.
Anyway, if you have specific issues in mind, I can check more deeply.
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.
I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
Which is also the approach that I've been advocating for here, instead
of a kernel patch...
--
Jens Axboe
Jan Kara
2018-10-18 07:21:10 UTC
Post by Jens Axboe
Post by Jan Kara
Post by Jens Axboe
Post by Paolo Valente
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
Then I stopped repeating tests that always yielded the same good results.
As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.
Anyway, if you have specific issues in mind, I can check more deeply.
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.
I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
Which is also the approach that I've been advocating for here, instead
of a kernel patch...
I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning and I'm not sure what is your opinion on
this.

Honza
--
Jan Kara <***@suse.com>
SUSE Labs, CR
Jens Axboe
2018-10-18 14:35:49 UTC
Post by Jan Kara
Post by Jens Axboe
Post by Jan Kara
Post by Jens Axboe
Post by Paolo Valente
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/4.4.0-v7r11/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.14.0-v7r3/
http://algo.ing.unimo.it/people/paolo/disk_sched/old-results/3.13.0-v7r2/
Then I stopped repeating tests that always yielded the same good results.
As for more professional systems, a well-known company doing
real-time packet-traffic dumping asked me to modify bfq so as to
guarantee lossless data writing also during queries. The involved box
had a RAID reaching a few Gbps, and everything worked well.
Anyway, if you have specific issues in mind, I can check more deeply.
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.
I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Well, the problem with this is that big distro people really don't care
much because they already use udev for tuning the IO scheduler. So whatever
defaults the kernel is going to pick likely won't be seen by distro
customers. Embedded people seem to be driving this effort because they
either don't run udev or they feel not all their teams building new
products have enough expertise to come up with a proper set of rules...
Which is also the approach that I've been advocating for here, instead
of a kernel patch...
I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning and I'm not sure what is your opinion on
this.
I don't want to force everybody to use udev, use whatever you like on
your platform. For most people that is udev, for embedded it's something
else. As you said, distros already do this via udev. When I've had to
do it on my systems, I've added a udev rule to do it.

My opinion is that the kernel makes various schedulers available.
Deciding which one to use is policy that should go into user space.
The default should be something that's solid and works, fancier
setups and tuning should be left to user space.
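
For a platform without udev, the user space side can be very small. The following is a minimal sketch, assuming the usual sysfs attribute /sys/block/<dev>/queue/scheduler, which lists the available schedulers with the active one in square brackets. The helper names set_scheduler() and active_scheduler() are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a scheduler name to a scheduler attribute file.
 * Returns 0 on success, -1 on failure (for example if the file cannot
 * be opened, or the kernel rejects the name when the write is flushed).
 */
static int set_scheduler(const char *attr_path, const char *name)
{
	FILE *f = fopen(attr_path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(name, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Extract the active scheduler from a line such as
 * "mq-deadline kyber [bfq] none". Returns 0 on success, -1 if no
 * bracketed entry is found or the output buffer is too small.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

A boot script replacement would call something like set_scheduler("/sys/block/mmcblk0/queue/scheduler", "bfq") as root, where mmcblk0 is only an example device name. A line without brackets makes active_scheduler() return -1.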
--
Jens Axboe
Pavel Machek
2018-10-19 08:22:03 UTC
Hi!
Post by Jens Axboe
Post by Jan Kara
Post by Jens Axboe
Which is also the approach that I've been advocating for here, instead
of a kernel patch...
I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning and I'm not sure what is your opinion on
this.
I don't want to force everybody to use udev, use whatever you like on
your platform. For most people that is udev, for embedded it's something
else. As you said, distros already do this via udev. When I've had to
do it on my systems, I've added a udev rule to do it.
This is not really helpful.

So you want me and everyone else and everyone on embedded to mess with
udev? No, thanks.

There are people booting with init=/bin/bash, too, running fsck. Wouldn't
it be nice to use reasonable schedulers there?
Post by Jens Axboe
My opinion is that the kernel makes various schedulers available.
Deciding which one to use is policy that should go into user space.
The default should be something that's solid and works, fancier
setups and tuning should be left to user space.
The kernel should do the reasonable thing by default, and it seems to be
easy in this case.

You keep repeating "but someone's super fast raid might get slowed
down". Those 5 people in the world probably already have their udev
rules.

Now, let's do the right thing by default for the rest of the world,
including you.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Jens Axboe
2018-10-22 08:08:42 UTC
Post by Pavel Machek
Hi!
Post by Jens Axboe
Post by Jan Kara
Post by Jens Axboe
Which is also the approach that I've been advocating for here, instead
of a kernel patch...
I know you've been advocating the use of udev for IO scheduler selection.
But do you want to force everybody to use udev? And for people who build
their own (usually small) systems, do you want to force them to think about
IO scheduler selection and writing appropriate rules? These are the
problems people were mentioning and I'm not sure what is your opinion on
this.
I don't want to force everybody to use udev, use whatever you like on
your platform. For most people that is udev, for embedded it's something
else. As you said, distros already do this via udev. When I've had to
do it on my systems, I've added a udev rule to do it.
This is not really helpful.
So you want me and everyone else and everyone on embedded to mess with
udev? No, thanks.
Did you read what I wrote?
Post by Pavel Machek
There are people booting with init=/bin/bash, too, running fsck. Wouldn't
it be nice to use reasonable schedulers there?
I can pretty much guarantee that fsck will run the same speed,
regardless of scheduler. And users generally don't care about
ultimate fairness on the device while running fsck...

If you (or someone else) don't want to use udev, use whatever
you want. You're doing something heavily customized at that
point anyway, surely this isn't a show stopper.
Post by Pavel Machek
Post by Jens Axboe
My opinion is that the kernel makes various schedulers available.
Deciding which one to use is policy that should go into user space.
The default should be something that's solid and works, fancier
setups and tuning should be left to user space.
The kernel should do the reasonable thing by default, and it seems to be
easy in this case.
I agree, we just differ on what we consider the reasonable choice to
be.
--
Jens Axboe
Oleksandr Natalenko
2018-11-02 10:40:42 UTC
Hi.
Post by Jens Axboe
Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in
a plain fast NVMe device, and a big box hardware raid setup with
a ton of drives.
I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.
Some benchmarks here for a non-RAID setup obtained by S suite. This is
from Lenovo T460s with SAMSUNG MZNTY256HDHP-000L7 SSD. v4.19 kernel is
running with all recent BFQ patches applied.

# replayed gnome terminal startup throughput
# Workload     bfq        mq-deadline
0r-raw_seq     13.2617    13.4867
10r-raw_seq    512.507    539.95

# replayed gnome terminal startup time
# Workload     bfq        mq-deadline
0r-raw_seq     0.43       0.4
10r-raw_seq    0.685      4.1625

# replayed lowriter startup throughput
# Workload     bfq        mq-deadline
0r-raw_seq     9.985      10.375
10r-raw_seq    516.62     539.61

# replayed lowriter startup time
# Workload     bfq        mq-deadline
0r-raw_seq     0.4        0.3875
10r-raw_seq    0.535      2.3875

# replayed xterm startup throughput
# Workload     bfq        mq-deadline
0r-raw_seq     5.93833    6.10834
10r-raw_seq    524.447    539.991

# replayed xterm startup time
# Workload     bfq        mq-deadline
0r-raw_seq     0.23       0.23
10r-raw_seq    0.38       1.56

# throughput
# Workload     bfq        mq-deadline
10r-raw_rand   362.446    363.817
10r-raw_seq    537.646    540.609
1r-raw_seq     500.733    502.526

Throughput-wise, BFQ is on par with mq-deadline. Latency-wise, BFQ is
much, much better.
--
Oleksandr Natalenko (post-factum)
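
To put numbers on that summary, the ratios can be worked out from the 10r-raw_seq rows of the tables above. The sketch below is only arithmetic on the posted figures; the function names are made up for illustration, and the values are in whatever units the suite reports. Under 10 sequential readers, start-up time with mq-deadline comes out at roughly 6.1x (gnome terminal), 4.5x (lowriter) and 4.1x (xterm) that of BFQ, while BFQ's throughput during those start-ups is roughly 3 to 5 percent lower.

```c
#include <assert.h>

/* How many times longer the other scheduler takes compared with BFQ. */
static double startup_slowdown(double bfq_time, double other_time)
{
	assert(bfq_time > 0.0);
	return other_time / bfq_time;
}

/* BFQ throughput deficit relative to the other scheduler, in percent. */
static double throughput_deficit_pct(double bfq_tput, double other_tput)
{
	assert(other_tput > 0.0);
	return 100.0 * (other_tput - bfq_tput) / other_tput;
}

/* 10r-raw_seq rows copied from the tables above: { bfq, mq-deadline }. */
static const double gnome_time[2]    = { 0.685,   4.1625  };
static const double lowriter_time[2] = { 0.535,   2.3875  };
static const double xterm_time[2]    = { 0.38,    1.56    };
static const double gnome_tput[2]    = { 512.507, 539.95  };
static const double lowriter_tput[2] = { 516.62,  539.61  };
static const double xterm_tput[2]    = { 524.447, 539.991 };
```

For example, startup_slowdown(gnome_time[0], gnome_time[1]) gives the gnome terminal ratio, and throughput_deficit_pct(gnome_tput[0], gnome_tput[1]) gives the matching throughput cost.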
Paolo Valente
2018-10-19 10:59:04 UTC
Post by Paolo Valente
...
Post by Jens Axboe
This kind of policy does not belong in the kernel, at least
not in the current form. If we had some sort of "enable best
options for a desktop" then it could fall under that umbrella.
I don't think bfq can be considered a scheduler for only desktops any
longer.
Hi Jens,
this reply of mine went on bugging me, until I understood my mistake.

The fact that I consider bfq good also for servers *does not* imply
that having bfq in desktops is to be refused.

As for the option that you are hinting at, I also acknowledge that it
would be trivial for an admin/developer to know whether a given kernel
is meant for a desktop/personal system, while it is more difficult to
choose explicitly among the various I/O schedulers available.

So, I apologize for my shortsighted initial reply, and ask you if you can
elaborate a little more on this. I'm willing to help, if I can.

Thanks,
Paolo
Jens Axboe
2018-10-22 08:21:47 UTC
Post by Paolo Valente
Post by Paolo Valente
...
Post by Jens Axboe
This kind of policy does not belong in the kernel, at least
not in the current form. If we had some sort of "enable best
options for a desktop" then it could fall under that umbrella.
I don't think bfq can be considered a scheduler for only desktops any
longer.
Hi Jens,
this reply of mine went on bugging me, until I understood my mistake.
The fact that I consider bfq good also for servers *does not* imply
that having bfq in desktops is to be refused.
As for the option that you are hinting at, I also acknowledge that it
would be trivial for an admin/developer to know whether a given kernel
is meant for a desktop/personal system, while it is more difficult to
choose explicitly among the various I/O schedulers available.
So, I apologize for my shortsighted initial reply, and ask you if you can
elaborate a little more on this. I'm willing to help, if I can.
I think I've written about this multiple times now, but for me it
really just boils down to sane defaults, and policy in the kernel.
BFQ is very complicated, about 10K lines of code. I'm not comfortable
making that the default right now - as I've mentioned in other
replies, I think something like that should be driven by the distros
as they will ultimately be the ones that usually get complaints
about behavioral changes that impact performance adversely. This isn't
just about running some benchmarks and calling it a day.

Maybe some day we can make it the default on mq for single queue
devices, but I just don't think we are there yet in terms of
coverage.

While I don't work for a distro anymore, I do have my hands dirty
with a fairly substantial deployment at work. There we run mq-deadline
on single queue devices, and kyber on multiqueue capable devices.
--
Jens Axboe
Ulf Hansson
2018-10-16 13:42:19 UTC
Post by Linus Walleij
This sets BFQ as the default scheduler for single queue
block devices (nr_hw_queues == 1) if it is available. This
affects notably MMC/SD-cards but also UBI and the loopback
device.
I have been running it for a while without any negative
effects on my pet systems and I want some wider testing
so let's throw it out there and see what people say.
Admittedly my use cases are limited. I need to keep this
patch around for my personal needs anyway.
We take special care to avoid using BFQ on zoned devices
(in particular SMR, shingled magnetic recording devices)
as these currently require mq-deadline to group writes
together.
I have opted against introducing any default scheduler
through Kconfig as the mq-deadline enforcement for
zoned devices has to be done at runtime anyways and
too many config options will make things confusing.
My argument for setting a default policy in the kernel
as opposed to user space is the "reasonable defaults"
type, analogous to how we have one default CPU scheduling
policy (CFS) that make most sense for most tasks, and
how automatic process group scheduling happens in most
distributions without userspace involvement. The BFQ
scheduling policy makes most sense for single hardware
queue devices and many embedded systems will not have
the clever userspace tools (such as udev) to make an
educated choice of scheduling policy. Defaults should be
those that make most sense for the hardware.
As already stated for v1, this makes perfect sense to me, thanks for posting it!

I do understand there is some pushback from Bart and Jens, around how
to move this forward. However, let's hope they get convinced to try
this out.

When it comes to potential "performance" regressions, I am sure Paolo
is standing by to help out with BFQ changes, if needed. Moreover, we
can always do a simple revert in the worst-case scenario, especially
since the change is really limited.
So FWIW:

Reviewed-by: Ulf Hansson <***@linaro.org>

Kind regards
Uffe
Post by Linus Walleij
---
- Add a quirk so that devices with zoned writes are forced
to use the deadline scheduler, this is necessary since only
that scheduler supports zoned writes.
https://lwn.net/Articles/767987/
---
block/elevator.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 8fdcd64ae12e..6e6048ca3471 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -948,13 +948,16 @@ int elevator_switch_mq(struct request_queue *q,
 }
 /*
- * For blk-mq devices, we default to using mq-deadline, if available, for single
- * queue devices. If deadline isn't available OR we have multiple queues,
- * default to "none".
+ * - "none" for multiqueue devices (nr_hw_queues != 1)
+ * - "bfq", if available, for single queue devices
+ * - "mq-deadline" if "bfq" is not available for single queue devices
+ * - "none" for single queue devices as well as last resort
  */
 int elevator_init_mq(struct request_queue *q)
 {
 	struct elevator_type *e;
+	const char *policy;
 	int err = 0;
 	if (q->nr_hw_queues != 1)
@@ -968,7 +971,18 @@ int elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		goto out_unlock;
-	e = elevator_get(q, "mq-deadline", false);
+	/*
+	 * Zoned devices must use a deadline scheduler because currently
+	 * that is the only scheduler respecting zoned writes.
+	 */
+	if (blk_queue_is_zoned(q))
+		policy = "mq-deadline";
+	else if (IS_ENABLED(CONFIG_IOSCHED_BFQ))
+		policy = "bfq";
+	else
+		policy = "mq-deadline";
+
+	e = elevator_get(q, policy, false);
 	if (!e)
 		goto out_unlock;
--
2.17.2
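
The selection order in the diff above can be read as a small pure function. The sketch below is a user-space model of it for illustration, not kernel code. The name default_sched() and its boolean parameters are hypothetical stand-ins for q->nr_hw_queues, blk_queue_is_zoned(q) and IS_ENABLED(CONFIG_IOSCHED_BFQ). It models only the name that gets chosen, not what happens if elevator_get() then fails for that name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * User-space model of the default chosen by the patched elevator_init_mq().
 * Hypothetical helper: the three parameters stand in for q->nr_hw_queues,
 * blk_queue_is_zoned(q) and IS_ENABLED(CONFIG_IOSCHED_BFQ).
 */
static const char *default_sched(unsigned int nr_hw_queues, bool zoned,
				 bool bfq_enabled)
{
	assert(nr_hw_queues > 0);

	/* Multiqueue devices return early and keep "none". */
	if (nr_hw_queues != 1)
		return "none";

	/* Zoned devices need mq-deadline, whether or not BFQ is enabled. */
	if (zoned)
		return "mq-deadline";

	/* Single queue, not zoned: prefer BFQ when it is configured in. */
	return bfq_enabled ? "bfq" : "mq-deadline";
}
```

With this model, an SD card (one hardware queue, not zoned) on a kernel with BFQ enabled gets "bfq", and an SMR disk with one hardware queue gets "mq-deadline" in either configuration.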