329 lines
13 KiB
HTML
329 lines
13 KiB
HTML
<!--startcut ==============================================-->
|
|
<!-- *** BEGIN HTML header *** -->
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML><HEAD>
|
|
<title>The Linux Scheduler LG #89</title>
|
|
</HEAD>
|
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
|
ALINK="#FF0000">
|
|
<!-- *** END HTML header *** -->
|
|
|
|
<!-- *** BEGIN navbar *** -->
|
|
<!-- *** END navbar *** -->
|
|
|
|
<!--endcut ============================================================-->
|
|
|
|
<TABLE BORDER><TR><TD WIDTH="200">
|
|
<A HREF="http://www.linuxgazette.com/">
|
|
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
|
|
WIDTH="200" HEIGHT="41" border="0"></A>
|
|
<BR CLEAR="all">
|
|
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
|
|
</TD><TD WIDTH="380">
|
|
|
|
|
|
<CENTER>
|
|
<BIG><BIG><STRONG><FONT COLOR="maroon">The Linux Scheduler</FONT></STRONG></BIG></BIG>
|
|
<BR>
|
|
<STRONG>By <A HREF="../authors/vinayak.html">Vinayak Hegde</A></STRONG>
|
|
</CENTER>
|
|
|
|
</TD></TR>
|
|
</TABLE>
|
|
<P>
|
|
|
|
<!-- END header -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<h2> Why the Design of Scheduler is critical </h2>
|
|
<p>
|
|
Two of the most critical parts of a kernel are the memory subsystem and
|
|
the scheduler. This is because they influence the design and affect the
|
|
performance of almost every other part of the kernel and the OS. That is
|
|
also why you would want to get them absolutely right and optimize their
|
|
performance. The Linux kernel is used right from small embedded devices,
|
|
scaling up to large mainframes. <!-- Though the kernel is not used as it is but some
|
|
modifications are made to it, the core more or less remains the same.
|
|
Hence it is imperative to get the scheduling policy and the design of the
|
|
scheduler right. --> Designing is scheduler is at best <i> a black art.</i> No matter
|
|
how good the design is, some people will always feel that some categories
|
|
of processes have got a raw deal.
|
|
</p>
|
|
|
|
<p>
|
|
In the article that follows, I have purposely tried to skip
|
|
quoting any reference code because one can easily get it from the net (see the
|
|
references). The article also looks the challenges developers found when they
|
|
redesigned the scheduler, how those challenges were met, and what could be the
|
|
the future direction the
|
|
scheduler is likely
|
|
to follow. Having said that, there's nothing like reading the source code to get
|
|
an understanding of what's going on under the hood. You will find the implementation
|
|
of the scheduler in kernel/sched.c if you have the kernel source installed.
|
|
</p>
|
|
|
|
<h2> Scheduling Objectives </h2>
|
|
<p>
|
|
<b> The Linux scheduler strives to meet several objectives : </b> <br> <br>
|
|
|
|
<ol>
|
|
|
|
<li>
|
|
<b> <u> Fairness </u> </b> :
|
|
<p>
|
|
|
|
The scheduler should give a fair amount of CPU share to every
|
|
process. Quite a fair amount of work has been done in the new kernel to ensure
|
|
fair sharing of CPU time among processes.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Throughput and CPU utilization </u> </b> :
|
|
<p>
|
|
|
|
The scheduler should try to maximize both
|
|
throughput and CPU utilization. The usual method of increasing CPU
|
|
utilization is by increasing the amount of multi-programming. But this is only
|
|
beneficial up to a point after which it becomes counterproductive and thrashing
|
|
starts.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Minimal Overhead </u> </b> :
|
|
<p>
|
|
|
|
A scheduler itself should run for as small time as
|
|
possible. The scheduler latency should be minimal. But this is the
|
|
tricky part. It is generally seen that scheduling itself is <i> not useful work (??) </i>.
|
|
But if the scheduling is done right even after expending some more time, it may
|
|
be worth the effort. But how do we decide which is the optimal point? Most
|
|
scheduler solve this problem by using some heuristic policies.
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Enforce priority-based scheduling </u> </b> :
|
|
<p>
|
|
|
|
Priority scheduling means that some processes get more preference over others.
|
|
At the very least the scheduler must differentiate between I/O-bound processes
|
|
and CPU-bound processes. Moreover, some kind of aging must be
|
|
be implemented so that starvation of processes does not take place. Linux does
|
|
enforce priorities as well as differentiates between different categories of
|
|
processes. The Linux kernel differentiates between batch scheduled jobs and
|
|
interactive. They get a share of the CPU according to their priorities.
|
|
Probably some people have used the nice command to change the priority of
|
|
a process.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Turnaround time, waiting time </u> </b>:
|
|
<p>
|
|
|
|
Turnaround time is the sum of the service time and amount of time wasted waiting
|
|
in the ready queue. The scheduler tries to reduce both.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Response time and Variance </u> </b> :
|
|
<p>
|
|
|
|
The response from a program should be as fast
|
|
as possible. Also another important factor which is often ignored is the variance
|
|
between the response times. It is not acceptable if the average response time
|
|
is low but some user occasionally experiences say, a 10-second delay from an interactive
|
|
program.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Miscellaneous </u> </b> :
|
|
<p>
|
|
|
|
The scheduler also tries to meet some other goals such as
|
|
predictability. The behavior of the scheduler should be predictable for a given
|
|
set of process with assigned priorities. The scheduler performance must
|
|
degrade gracefully under heavy loads. This is particularly important because
|
|
the Linux is very popular in the server market, and servers tend to get overloaded
|
|
during peak traffic hours.
|
|
|
|
</p> </li>
|
|
</ol>
|
|
|
|
<h2> Improvements in the scheduler </h2>
|
|
<ul>
|
|
|
|
<li> <b> <u> The O(1) complexity Scheduling algorithm </u> </b>
|
|
<p>
|
|
|
|
What's O(1) you may ask? Well, I am going to skim the surface on what the O (known as the Big-Oh) notation means.
|
|
You will find references to the Big-Oh notation in any good algorithms book. What the Big Oh
|
|
notation essentially does is that it tries to estimate the running time of any algorithm
|
|
independent of the machine-related implementation issues. It places a upper bound on the
|
|
running time of the algorithm - that is an upper bound on the algorithm's worst case.
|
|
It is an exceptional technique to compare the efficiency of an algorithm with respect to
|
|
running time.
|
|
|
|
</p>
|
|
<p>
|
|
|
|
Take for example an algorithm which has two nested loops and both of whose
|
|
limits range from 1 to n-1 where (n is number of inputs for the algorithm) then the upper
|
|
bound on it's running time is denoted by <b> <i> O(N<sup>2</sup>) </i></b>. Similarly,
|
|
consider an algorithm to search for an element in an unordered linked list. The worst case
|
|
is that we will have to traverse the list till the last element to get a match or worse still -
|
|
to find out that the element is not in the list. Such an algorithm is said to have
|
|
<b> <i> O(N) </i> </b> complexity as it's running time is directly proportional to the number
|
|
of elements in the list - N.
|
|
|
|
</p>
|
|
<p>
|
|
|
|
The Linux scheduler takes constant time to schedule the processes
|
|
in the ready queue. Hence, it is said to have <b> <i> O(1) </i> </b> complexity. No matter
|
|
what is the number of processes active on the Linux system the scheduler will always take
|
|
constant time to schedule them. All the "parts" - wakeup , selection of the process to be run
|
|
next, context switching and timer interrupt overhead - of the current Linux
|
|
kernel (2.5.49 is the reference here - See resources for details) have O(1) complexity.
|
|
Hence, the new scheduler has O(1) complexity in it's entirety.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Better Support for SMP and more scalable </u> </b>
|
|
<p>
|
|
|
|
As mentioned in the introduction, the Linux kernel runs on almost anything from wristwatches
|
|
to supercomputers. With the earlier schedulers there were some problems with the scalability.
|
|
Some of these problems were solved by modifying the kernel for the application and the target architecture. However, the core design of the scheduler was
|
|
not very scalable. The new scheduler is much more scalable
|
|
and SMP (Symmetric Multi-Processing) aware. The performance of the scheduler is much better for
|
|
SMP systems now. One on the goals stated by Ingo Molnar who has written the O(1) scheduler is
|
|
that in SMP, processors should not be idle when there is work to do. Also, care should be taken
|
|
that processes do not get scheduled on different processors form time to time. This is to avoid
|
|
the overhead of filling in the cache with the required data every time.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Batch scheduling of jobs </u> </b>
|
|
<p>
|
|
|
|
This is not exactly a new feature, but there are some patches that can be applied to support
|
|
batch scheduling. Earlier kernels also had some support for batch scheduling. As of now, the
|
|
batch scheduling of task is done using priority levels. The Linux kernel uses about 40 nice
|
|
levels (though they are all mapped to about 5 levels of priority). Batch scheduled processes
|
|
generally get the CPU when there are not many interactive and CPU-bound processes around, which have more priority. Batch scheduled processes get bigger time-slices to run than
|
|
normal processes. This also minimizes the effect of swapping data in and out of the cache
|
|
frequently, thus improving the performance of batch jobs.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Better performance of Interactive Jobs </u> </b>
|
|
<p>
|
|
|
|
One of the major improvements in the new scheduler is detection and boosting the performance of
|
|
interactive tasks. Tweaking the old scheduler code was a bit cumbersome. In the new scheduler,
|
|
detection of interactive jobs is decoupled from other scheduler tasks such as time-slice management.
|
|
Interactive jobs in the system are detected in the system with the help of usage-related statistics.
|
|
That means interactive tasks have good response times under heavy loads, and CPU-hogging processes
|
|
cannot monopolize CPU time. The new scheduler actively detects interactive tasks and give them more
|
|
precedence over other tasks. Even then, a interactive task is scheduled with other interactive tasks
|
|
by using Round-Robin scheduling. This makes a lot of sense for desktop users as they will not see
|
|
response time increase when they start a CPU intensive job such as encoding songs into the ogg format.
|
|
Plans are on to merge O(1) and pre-emption code to give better response times for interactive tasks.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Better Scalability and support for more architectures </u> </b>
|
|
<p>
|
|
|
|
Because of the redesign of the new scheduler, it scales more easily to other architectures such
|
|
as NUMA (Non-Uniform Memory Access) and SMT. NUMA architecture is used on some high-end servers
|
|
and supercomputers. Also there is some ongoing work on the SMT (Symmetric Multi-Threading).
|
|
SMT is also known by a more popular term: HyperThreading. One of the reasons is that
|
|
now every CPU has it's own run queue. Only the load balancing sub-part has a "global" view of the
|
|
system. So any changes for some architecture are to be made mainly in the load balancing sub-part.
|
|
Also recently a patch for NUMA was released. In fact it has been incorporated in Linux 2.5.59.
|
|
SMT processors implement two (or more) virtual CPUs on the same physical processor - one "logical"
|
|
processor can be running while the other waits for memory access. SMT can certainly be seen as a
|
|
sort of NUMA system, since the sibling processors share a cache and thus have faster access to
|
|
memory that either one has accessed recently. Some work is going also going on SMT, but the
|
|
new O(1) scheduler handles SMT pretty well even without any changes. Recently a patch was released
|
|
for SMT. Though the NUMA architecture bears some resemblance to the SMT architecture, the Linux
|
|
scheduler handles them differently.
|
|
|
|
</p> </li>
|
|
|
|
<li> <b> <u> Miscellaneous </u> </b>
|
|
<p>
|
|
|
|
The scheduler gives more priority to fork()ed children than to the parent itself. This could
|
|
potentially be useful for servers which use the fork call to service requests. It could also be
|
|
useful for GUI applications. There are also some real-time scheduling (based on priority)
|
|
available in the kernel.
|
|
|
|
</p> </li>
|
|
</ul>
|
|
|
|
<h2> Resources </h2>
|
|
<ul>
|
|
<li> <a href="http://people.redhat.com/~mingo/O(1)-scheduler/"> Ingo Molnar's Homepage O(1) patches </a> </li>
|
|
<li> <a href="http://www.tldp.org/LDP/lki/lki-2.html#ss2.3"> Linux Kernel Internals - Tigran Aivazian</a> </li>
|
|
<li> Browse the Linux Kernel Source Code @ lxr.linux.no
|
|
<ol>
|
|
<li> <a href="http://lxr.linux.no/source/Documentation/sched-design.txt?v=2.5.49"> Scheduler Design </a> </li>
|
|
<li> <a href="http://lxr.linux.no/source/Documentation/sched-coding.txt?v=2.5.49"> Some notes about scheduler functions </a> </li>
|
|
</ol>
|
|
</li>
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<!-- *** BEGIN author bio *** -->
|
|
<P>
|
|
<P>
|
|
<!-- *** BEGIN bio *** -->
|
|
<P>
|
|
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png">
|
|
<em>
|
|
Vinayak is currently pursuing the APGDST course
|
|
at NCST. His areas of interest are networking, parallel
|
|
computing systems and programming languages. He
|
|
believes that Linux will do to the software industry
|
|
what the invention of printing press did to the world
|
|
of science and literature. In his non-existent free
|
|
time he likes listening to music and reading books. He
|
|
is currently working on Project LIberatioN-UX where he
|
|
makes affordable computing on Linux accessible for
|
|
academia/corporates by configuring remote boot stations
|
|
(Thin Clients).
|
|
</em>
|
|
<br CLEAR="all">
|
|
<!-- *** END bio *** -->
|
|
|
|
<!-- *** END author bio *** -->
|
|
|
|
|
|
<!-- *** BEGIN copyright *** -->
|
|
<hr>
|
|
<CENTER><SMALL><STRONG>
|
|
Copyright © 2003, Vinayak Hegde.
|
|
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
|
|
Published in Issue 89 of <i>Linux Gazette</i>, April 2003
|
|
</STRONG></SMALL></CENTER>
|
|
<!-- *** END copyright *** -->
|
|
<HR>
|
|
|
|
<!--startcut ==========================================================-->
|
|
<CENTER>
|
|
<!-- *** BEGIN navbar *** -->
|
|
<!-- *** END navbar *** -->
|
|
</CENTER>
|
|
</BODY></HTML>
|
|
<!--endcut ============================================================-->
|