BFS FAQ

BFS FAQ written by Con Kolivas

Thanks to the guys in irc.oftc.net #ck for inspiration to work on this and early testing! Many of them sat idle in that channel for years while nothing happened. The xkcd comic supported_features also helped motivate this work. Yes I know you probably still can't watch full screen videos on youtube, but that's not entirely the scheduler's fault.

Con Kolivas

Why did I write it?
After years of using my old kernel and numerous hardware upgrades, I finally had hardware that needed a newer kernel for drivers and to try out the newer filesystems. Booting the mainline kernel was relatively reassuring in that the scheduler behaviour was much better than what was in earlier kernels. However, it didn't take long before I started being disappointed in that too. Random stalls in mouse movements, keypresses, strange cpu distribution in various workloads and unpredictable behaviour all around were exactly what I was hoping had gone away. So I did what I vowed never to do, looked at the code. After seeing it had grown into a monster of epic proportions I sat down and thought about what was wrong. One of the key features of fairness and interactivity that I always argued for were very simple semantics for how cpu should be distributed, with guaranteed low latencies so that interactivity was assured by design instead of bolted on. CFS in essence does that, but it does something else too. It varies timeslice length to try and preserve some deadline list and it determines cpu distribution based on a run/sleep relationship. It also is designed to scale to monster proportion hardware that the common man will never see. The whole sleep calculation thing is exactly what I found was responsible for making varied behaviour under different loads and relative starvation and unfairness. It's not a profound effect in CFS and that's admirable. It just doesn't behave the way I feel the scheduler should being forward looking only (not calculating sleep) and it doesn't really make the most of a relatively lightly loaded machine without many many cpus. So I threw it all out and wrote exactly the opposite.

What is it?
BFS is the Brain Fuck Scheduler. It was designed to be forward looking only, make the most of lower spec machines, and not scale to massive hardware. ie it is a desktop orientated scheduler, with extremely low latencies for excellent interactivity by design rather than "calculated", with rigid fairness, nice priority distribution and extreme scalability within normal load levels.

Extreme scalability within normal load levels? Isn't that a contradiction?
For years we've been doing our workloads on linux to have more work than we had CPUs because we thought that the "jobservers" were limited in their ability to utilise the CPUs effectively (so we did make -j6 or more on a quad core machine for example). This scheduler proves that the jobservers weren't at fault at all, because make -j4 on a quad core machine with BFS is faster than *any* choice of job numbers on CFS. See reverse scalability graph courtesy of Serge Belyshev showing various job numbers on a kernel build on a quad core machine. The problem has always been that the mainline scheduler can't keep the CPUs busy enough; ie it doesn't make the most of your hardware in the most common situations on a desktop! Note that the reverse scalability graph is old; the scalability has improved since then.

Why "Brain Fuck"?
Because it throws out everything about what we know is good about how to design a modern scheduler in scalability. Because it's so ridiculously simple. Because it performs so ridiculously well on what it's good at despite being that simple. Because it's designed in such a way that mainline would never be interested in adopting it, which is how I like it. Because it will make people sit up and take notice of where the problems are in the current design. Because it throws out the philosophy that one scheduler fits all and shows that you can do a -lot- better with a scheduler designed for a particular purpose. I don't want to use a steamroller to crack nuts. Because it actually means that more CPUs means better latencies. Because I must be fucked in the head to be working on this again. I'll think of some more becauses later.

How scalable is it?
I don't own the sort of hardware that is likely to suffer from using it, so I can't find the upper limit. Based on first principles about the overhead of locking, and the way lookups occur, I'd guess that a machine with 16 CPUS or more would start to have exponentially less performance (thanks Ingo for confirming this). Note that the number of logical CPUs is what affects BFS' scalability, not the physical ones. By that I mean that a hyperthreaded CPU that is a quad core hyperthreaded is 8 EIGHT logical CPUs. So it is NOT the same as a quad core without hyperthreading.

Since version 0.300, scalability improvements have been added that should further improve performance, including NUMA support! No scalability benchmarks on very big machines.have been performed on new versions to compare its performance.

The O(n) lookup of BFS will cause people some concern because of the notation. However, if the actual overhead is very small, then even with large numbers of n, it can be lower overhead than an O(1) design. Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS. That's a load of 1000 on a quad core machine. But note that BFS gets much faster when the loads are lower and approximate the number of CPUs, which is much more what you would experience on a desktop.

What about interbench numbers?
Interbench does too many jobs by default on the burn/compile tests. I've put up interbench results from a quad core where the jobs (4) is equal to the number of CPUs so the test is more meaningful, and added comments. It appears I'm about the only person who understands interbench numbers since I wrote the benchmark, so don't place too much weight on them. The 'latt' test app recently written by Jens Axboe is a better place for simpler to understand and useful numbers. The last set of interbench numbers I posted comparing mainline 2.6.35 and 2.6.35-ck1 (containing BFS) showed BFS outperformed mainline in many areas.

What features does BFS have and not have?
On top of the current scheduler design, it has a SCHED_IDLEPRIO which actually does only schedule tasks when idle, and SCHED_ISO for unprivileged realtime performance. BFS does NOT implement CGROUPS. A desktop user should not need know about CGROUPS, nor should they need to use them. BFS also does not have the feature of "lots of tunables I don't understand".

How do you recommend I use this?
It's designed so that you just patch it in and use it. You shouldn't need to do anything at all. But since people still want to know every last thing...

THESE ARE OPTIONAL FOR LOWEST LATENCY. YOU DO NOT NEED THESE! Configure your kernel with 1000Hz, preempt ON and disable dynamic ticks.

What about other Hz, no preempt and dynamic ticks?
The reason I recommend the above settings is these will have the best possible latencies which is by far the most important feature of a scheduler on a desktop. How much each of these options affects your throughput, or power consumption will vary wildly between hardware and workloads. If your major concern is power consumption, enabling dynamic ticks with 1000Hz should bring you to similar power usage of a 100Hz config without the disadvantage. 100Hz does seriously detriment the best possible latencies on BFS and I'd at least recommend 300. This would be, for example, a suitable android phone setting (300Hz, preempt on, dynticks on). The only reason for disabling preempt is if your workload is never latency sensitive. If you were planning on running a distributed computing client (such as folding@home) and nothing else, then 100Hz, no preempt and no dynticks would be best. Also, turning up the rr_interval would be beneficial (300 seems to be optimal and higher values seem to not derive any further benefit). Running it on a real server I'd recommend the default rr_interval, 100Hz, no preempt and dynticks ON (for power saving).

Other tweaks I can try?
You shouldn't need to tune BFS virtually ever. The only tunable for the scheduler itself is the rr_interval value (see documentation). Try 3ms if latency is everything to you. When compiling software, do not use more jobs than you have CPUs! So make -j2 on dual core, -j4 on quad core and so on. Nice levels are strictly obeyed so if you nice your compiles they'll be virtually unnoticeable. (nice -n 19 make -j2). Run your distributed computing clients SCHED_IDLEPRIO (eg folding at home, mprime etc): schedtool -D -e mprime This will make your distributed computing client *never* cause slowdowns in your other userspace applications, at the cost of slightly slower progress of the client. Run your audio and video apps SCHED_ISO: schedtool -I -e amarok This will run amarok as an unprivileged real-time task. Note that if you start an application that tries to get real-time scheduling (eg jackd) and you are not starting it as root, BFS will automatically elevate it to SCHED_ISO for you to give it the next best thing.

NUMA aware?
It is NOT NUMA aware in the sense that it does any fancy shit on NUMA, but it will work on NUMA hardware just fine. Only the really big NUMA hardware is likely to suffer in performance, and this is theoretically only, since no one has that sort of hardware to prove it to me, but it seems almost certain. v0.300 onwards have NUMA enhancements.

Multicore processors?
This is where BFS shines.

Single processors?
Single processors benefit a lot from BFS too.

Low power machines? Phones?
BFS is a very low overheard CPU scheduler. These can benefit a lot from it.

Realtime tasks?
Realtime tasks work just like they do on mainline on uniprocessor machines. However, BFS has the added advantage of dropping unprivileged tasks to SCHED_ISO as mentioned above. Most people would not be aware, though, that BFS has a major advantage when it comes to running realtime tasks running SCHED_FIFO or SCHED_RR on SMP machines. Because BFS uses one global runqueue for all tasks on the system, when a realtime task wants CPU time, BFS will find the most suitable CPU anywhere in the system to run it on, by kicking off the lower priority task running anywhere. Unless the realtime tasks are programmed specifically binding to discrete CPUs, the separate runqueue designed schedulers (like mainline CFS) can get into a situation where two realtime tasks are running on the same CPU, and one will wait for the other to finish when there may well be a non-realtime task running on another CPU. This can cause MASSIVE latencies even though the task is running realtime. Even the realtime-patchset does not prevent this situation from occurring, whereas BFS avoids this by design.

I found a bug!
Great! Help me debug it. A scheduler is subtle and quick to anger. If you can code then delve away and see what you can find! I'll take help from anyone. It's a major ordeal trying to get this thing working on all sorts of hardware. You can't code? Give me whatever details you've got and I'll see what I can do. As per usual this stuff comes with no guarantee, and I do not have infinite time to spend on it. I do NOT get paid to do this and do it just for the fun of it. I'll do whatever I can to help you but I cannot support this like a paid project. I'd *love* to see people hacking on the code themselves.

Are you looking at getting this into mainline?
No. They would be crazy to use this scheduler anyway since it won't scale to their 4096 cpu machines. The only way is to rewrite it to work that way, or to have more than one scheduler in the kernel. I don't want to do the former, and mainline doesn't want to do the latter. Besides, apparently I'm a bad maintainer, which makes sense since for some reason I seem to want to have a career, a life, raise a family with kids and have hobbies, all of which have nothing to do with linux.

Can it be made to scale to 4096 CPUs?
Sure I guess you could run one runqueue per CPU package instead of a global one and so on, but I have no intention whatsoever at doing that because it will compromise the performance where *I* care.

Is this stable?
Yes.

Currently known problems?
Nil.

GIT repository?
Sorry, it's not the right tool for me so it's not worth me investing the time in setting one up.

Wine games?
Earlier suggestions that wine games would perform poorly on BFS were unfounded. All the reports I've received are that they perform better.

CPU Accounting is different?
The mainline kernel simply samples cpu accounting per tick. This can be very inaccurate for very short lived processes. BFS uses accounting of actual cpu time consumed within the limitations of the current kernel and hardware and may report very different cpu usage. Do not use the reported cpu usage between different kernels as some marker of performance. Compare actual work progress instead (eg. when comparing distributed computing clients).

Will I be maintaining this, even though mainline won't have it?
Yes. For the foreseeable future at least. Once the bugs are ironed out, it shouldn't be too much effort to keep in sync with mainline.

What is the relationship of BFS to SD/RSDL?
BFS implements the same forward-looking rigidly fair timeslicer philosophy that SD used, in a foreground - background design, but is otherwise a completely different scheduler. It is an attempt to fuse all the ideas I have about scheduling so far, without worrying about trying to make it infinitely scalable.