1241 lines
55 KiB
HTML
1241 lines
55 KiB
HTML
<HTML>
|
||
<center>
|
||
<A HREF="../tlk-toc.html"> Table of Contents</A>,
|
||
<A href="../tlk.html" target="_top"> Show Frames</A>,
|
||
<A href="../mm/memory.html" target="_top"> No Frames</A>
|
||
</center>
|
||
<hr>
|
||
<META NAME="TtH" CONTENT="1.03">
|
||
|
||
<p>
|
||
<H1><A NAME="tth_chAp3">Chapter 3 <br>Memory Management</H1>
|
||
<A NAME="mm-chapter"></A>
|
||
<p>
|
||
<img src="../logos/sit3-bw-tran.1.gif"><br> <tt><b></tt></b>
|
||
The memory management subsystem is one of the most important parts of
|
||
the operating system.
|
||
Since the early days of computing, there has been a need
|
||
for more memory than exists physically in a system.
|
||
Strategies have been developed to overcome this limitation
|
||
and the most successful of these is virtual memory.
|
||
Virtual memory makes the system appear to have more memory than it actually
|
||
has by sharing it between competing processes as they need it.
|
||
|
||
<p>
|
||
Virtual memory does more than just make your computer's memory go further.
|
||
The memory management subsystem provides:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt><b>Large Address Spaces</b></dt><dd> The operating system makes the system appear
|
||
as if it has a larger amount of memory than it actually has.
|
||
The virtual memory can be many times larger than the physical memory
|
||
in the system,
|
||
|
||
<p>
|
||
<dt><b>Protection</b></dt><dd> Each process in the system has its own virtual address
|
||
space. These virtual address spaces are completely separate
|
||
from each other and so a process running one application cannot
|
||
affect another. Also, the hardware virtual memory mechanisms allow
|
||
areas of memory to be protected against writing. This protects
|
||
code and data from being overwritten by rogue applications.
|
||
|
||
<p>
|
||
<dt><b>Memory Mapping</b></dt><dd> Memory mapping is used to map image and data files into a
|
||
processes address space.
|
||
In memory mapping, the contents of a file are linked
|
||
directly into the virtual address space of a process.
|
||
|
||
<p>
|
||
<dt><b>Fair Physical Memory Allocation</b></dt><dd> The memory management subsystem
|
||
allows each running process in the system a fair share
|
||
of the physical memory of the system,
|
||
|
||
<p>
|
||
<dt><b>Shared Virtual Memory</b></dt><dd> Although virtual memory allows processes
|
||
to have separate (virtual) address spaces, there are times
|
||
when you need processes to share memory.
|
||
For example there could be several processes in the system
|
||
running the <font face="helvetica">bash</font> command shell.
|
||
Rather than have several copies of <font face="helvetica">bash</font>, one in each
|
||
processes virtual address space, it is better to have only one copy in
|
||
physical memory and all of the processes running <font face="helvetica">bash</font>
|
||
share it.
|
||
Dynamic libraries are another common example of executing code
|
||
shared between several processes.
|
||
|
||
<p>
|
||
Shared memory can also be used as an Inter Process Communication
|
||
(IPC) mechanism, with two or more processes exchanging information
|
||
via memory common to all of them.
|
||
Linux supports the Unix <sup><font size=-4><tt>T</tt>M</font></sup> System V shared memory IPC.
|
||
</DL>
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.1">3.1 </A> An Abstract Model of Virtual Memory</H2>
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.1"></A>
|
||
<center><center> <img src="vm.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.1: Abstract model of Virtual to Physical address mapping</center>
|
||
<A NAME="abstract-mm-model"></A>
|
||
<p>
|
||
<p>Before considering the methods that Linux uses to support virtual
|
||
memory it is useful to consider an abstract model that is not cluttered
|
||
by too much detail.
|
||
|
||
<p>
|
||
As the processor executes a program it reads an instruction from
|
||
memory and decodes it.
|
||
In decoding the instruction it may need to fetch or store the
|
||
contents of a location in memory.
|
||
The processor then executes the instruction and moves onto the
|
||
next instruction in the program.
|
||
In this way the processor is always accessing memory either to fetch
|
||
instructions or to fetch and store data.
|
||
|
||
<p>
|
||
In a virtual memory system all of these addresses are virtual addresses and not
|
||
physical addresses.
|
||
These virtual addresses are converted into physical addresses by the
|
||
processor based on information held in a set of tables maintained
|
||
by the operating system.
|
||
|
||
<p>
|
||
To make this translation easier, virtual and physical memory are divided into handy sized chunks called
|
||
<em>pages</em>.
|
||
These pages are all the same size, they need not be but if they were not, the system would be very hard
|
||
to administer.
|
||
Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages.
|
||
Each of these pages is given a unique number; the page frame number (PFN).
|
||
|
||
|
||
<p>
|
||
In this paged model, a virtual address is composed of two parts; an offset and a virtual
|
||
page frame number.
|
||
If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above
|
||
are the virtual page frame number.
|
||
Each time the processor encounters a virtual address it must extract the offset and the virtual
|
||
page frame number.
|
||
The processor must translate the virtual page frame number into a physical one and then access the
|
||
location at the correct offset into that physical page.
|
||
To do this the processor uses <em>page tables</em>.
|
||
|
||
<p>
|
||
Figure <A href="#abstract-mm-model"
|
||
> 3.1</A> shows the virtual address
|
||
spaces of two processes, process <em>X</em> and process <em>Y</em>,
|
||
each with their own page tables.
|
||
These page tables map each processes virtual pages into physical
|
||
pages in memory.
|
||
This shows that process <em>X's</em> virtual page frame number 0 is mapped into memory
|
||
in physical page frame number 1 and that process <em>Y's</em> virtual page frame number 1 is mapped
|
||
into physical page frame number 4.
|
||
Each entry in the theoretical page table contains the following information:
|
||
|
||
<UL>
|
||
<p>
|
||
|
||
<li> Valid flag. This indicates if this page table entry is valid,
|
||
|
||
<li> The physical page frame number that this entry is describing,
|
||
|
||
<li> Access control information. This describes how the page may be used.
|
||
Can it be written to? Does it contain executable code?
|
||
</UL>
|
||
<p>
|
||
The page table is accessed using the virtual page frame number as an offset.
|
||
Virtual page frame 5 would be the 6th element of the table (0 is the first element).
|
||
|
||
<p>
|
||
To translate a virtual address into a physical one, the processor must first
|
||
work out the virtual addresses page frame number and the offset within that virtual page.
|
||
By making the page size a power of
|
||
2 this can be easily done by masking and shifting.
|
||
Looking again at Figures <A href="#abstract-mm-model"
|
||
> 3.1</A> and assuming a page
|
||
size of <em>0x2000</em> bytes (which is decimal 8192) and an address
|
||
of <em>0x2194</em> in process <em>Y's</em> virtual address space then the processor
|
||
would translate that address into offset <em>0x194</em> into virtual page frame number 1.
|
||
|
||
<p>
|
||
The processor uses the virtual page frame number as an index into the processes
|
||
page table to retrieve its page table entry.
|
||
If the page table entry at that offset is valid, the processor takes the physical
|
||
page frame number from this entry.
|
||
If the entry is invalid, the process has accessed a non-existent area of its virtual memory.
|
||
In this case, the processor cannot resolve the address and must pass control to the
|
||
operating system so that it can fix things up.
|
||
|
||
<p>
|
||
Just how the processor notifies the operating system that the correct process has
|
||
attempted to access a virtual address for which there is no valid translation is
|
||
specific to the processor.
|
||
However the processor delivers it, this is known as a <em>page fault</em> and the operating
|
||
system is notified of the faulting virtual address and the reason for the page fault.
|
||
|
||
<p>
|
||
Assuming that this is a valid page table entry,
|
||
the processor takes that physical page frame number and multiplies it by the page size
|
||
to get the address of the base of the page in physical memory.
|
||
Finally, the processor adds in the offset to the instruction or data that it needs.
|
||
|
||
<p>
|
||
Using the above example again, process <em>Y's</em> virtual page frame number 1 is
|
||
mapped to physical page frame number 4 which starts at <em>0x8000</em> (4 x <em>0x2000</em>).
|
||
Adding in the <em>0x194</em> byte offset gives us a final physical address
|
||
of <em>0x8194</em>.
|
||
|
||
<p>
|
||
By mapping virtual to physical addresses this way,
|
||
the virtual memory can be mapped into the system's physical pages
|
||
in any order.
|
||
For example, in Figure <A href="#abstract-mm-model"
|
||
> 3.1</A> process <em>X's</em>
|
||
virtual page frame number 0 is mapped to
|
||
physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0
|
||
even though it is higher in virtual memory than virtual page frame number 0.
|
||
This demonstrates an interesting byproduct of virtual memory; the pages of
|
||
virtual memory do not have to be present in physical memory in any
|
||
particular order.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.1.1">3.1.1 </A> Demand Paging</H3>
|
||
|
||
<p>
|
||
As there is much less physical memory than virtual memory the operating
|
||
system must be careful that it does not use the physical memory
|
||
inefficiently.
|
||
One way to save physical memory is to only load virtual pages that are
|
||
currently being used by the executing program.
|
||
For example, a database program may be run to query a database.
|
||
In this case not all of the database needs to be loaded into memory, just
|
||
those data records that are being examined.
|
||
If the database query is a search query then it does not make
|
||
sense to load the code from the database program that deals with
|
||
adding new records.
|
||
This technique of only loading virtual pages into memory as they are
|
||
accessed is known as demand paging.
|
||
|
||
<p>
|
||
When a process attempts to access a virtual address that is not currently
|
||
in memory the processor cannot find a page table entry for the virtual
|
||
page referenced.
|
||
For example, in Figure <A href="#abstract-mm-model"
|
||
> 3.1</A> there is
|
||
no entry in process <em>X's</em> page table for virtual page frame number 2 and
|
||
so if process <em>X</em> attempts to read from an address within
|
||
virtual page frame number 2 the processor cannot translate the address into a physical one.
|
||
At this point the processor notifies the operating system that a page fault has occurred.
|
||
|
||
<p>
|
||
If the faulting virtual address is invalid this means that the process has attempted to access
|
||
a virtual address that it should not have.
|
||
Maybe the application has gone wrong in some way, for example writing to random addresses in memory.
|
||
In this case the operating system will terminate it, protecting the other processes in the
|
||
system from this rogue process.
|
||
|
||
<p>
|
||
If the faulting virtual address was valid but the page that it refers to is not currently in
|
||
memory, the operating system must bring the appropriate page into memory from the image on disk.
|
||
Disk access takes a long time, relatively speaking, and so the process must
|
||
wait quite a while until the page has been fetched.
|
||
If there are other processes that could run then the operating
|
||
system will select one of them to run.
|
||
The fetched page is written into a free physical page frame and an entry
|
||
for the virtual page frame number is added to the processes page table.
|
||
The process is then restarted at the machine instruction where the memory fault
|
||
occurred.
|
||
This time the virtual memory access is made, the processor can make the
|
||
virtual to physical address translation and so the process continues to run.
|
||
|
||
<p>
|
||
Linux uses demand paging to load executable images into a processes virtual memory.
|
||
Whenever a command is executed, the file containing it is opened
|
||
and its contents are mapped into the processes virtual memory.
|
||
This is done by modifying the data structures describing this
|
||
processes memory map and is known as <em>memory mapping</em>.
|
||
However, only the first part of the image is actually brought into
|
||
physical memory.
|
||
The rest of the image is left on disk.
|
||
As the image executes, it generates page faults and Linux uses the
|
||
processes memory map in order to determine which parts of the image to
|
||
bring into memory for execution.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.1.2">3.1.2 </A> Swapping</H3>
|
||
|
||
<p>
|
||
If a process needs to bring a virtual page into physical memory
|
||
and there are no free physical pages available, the operating
|
||
system must make room for this page by discarding another page
|
||
from physical memory.
|
||
|
||
<p>
|
||
If the page to be discarded from physical memory came from an image or data file and has
|
||
not been written to then the page does not need to be saved.
|
||
Instead it can be discarded and if the process needs that page
|
||
again it can be brought back into memory from the image or data file.
|
||
|
||
<p>
|
||
However, if the page has been modified, the operating system must
|
||
preserve the contents of that page so that it can be accessed at a later
|
||
time.
|
||
This type of page is known as a <i>dirty</i> page and when it is removed from
|
||
memory it is saved in a special sort of file called the swap file.
|
||
Accesses to the swap file are very long relative to the speed of the processor and physical
|
||
memory and the operating system must juggle the need to write pages to disk with the
|
||
need to retain them in memory to be used again.
|
||
|
||
<p>
|
||
If the algorithm used to decide which pages to discard or swap (the <em>swap algorithm</em>
|
||
is not efficient then a condition known as <i>thrashing</i> occurs.
|
||
In this case, pages are constantly being written to disk and then
|
||
being read back and the operating system is too busy to allow much
|
||
real work to be performed.
|
||
If, for example, physical page frame number 1 in Figure <A href="#abstract-mm-model"
|
||
> 3.1</A> is
|
||
being regularly
|
||
accessed then it is not a good candidate for swapping to hard disk.
|
||
The set of pages that a process is currently using is called the <em>working set</em>.
|
||
An efficient swap scheme would make sure that all processes have their working set in
|
||
physical memory.
|
||
|
||
<p>
|
||
Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might
|
||
be removed from the system.
|
||
This scheme involves every page in the system having an age which
|
||
changes as the page is accessed.
|
||
The more that a page is accessed, the younger it is; the less that
|
||
it is accessed the older and more stale it becomes.
|
||
Old pages are good candidates for swapping.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.1.3">3.1.3 </A> Shared Virtual Memory</H3>
|
||
|
||
<p>
|
||
Virtual memory makes it easy for several processes to share memory.
|
||
All memory access are made via page tables and each process has its own separate page table.
|
||
For two processes sharing a physical page of memory, its physical page frame number must appear
|
||
in a page table entry in both of their page tables.
|
||
|
||
<p>
|
||
Figure <A href="#abstract-mm-model"
|
||
> 3.1</A> shows two processes that each share physical page frame number 4.
|
||
For process <em>X</em> this is virtual page frame number 4 whereas for process <em>Y</em> this is virtual page frame number
|
||
6.
|
||
This illustrates an interesting point about sharing pages: the shared physical page does not have
|
||
to exist at the same place in virtual memory for any or all of the processes sharing
|
||
it.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.1.4">3.1.4 </A> Physical and Virtual Addressing Modes</H3>
|
||
It does not make much sense for the operating system itself to run in virtual
|
||
memory.
|
||
This would be a nightmare situation where the operating system must
|
||
maintain page tables for itself.
|
||
Most multi-purpose processors support the notion of a physical address mode as well
|
||
as a virtual address mode.
|
||
Physical addressing mode requires no page tables and the processor does not attempt
|
||
to perform any address translations in this mode.
|
||
The Linux kernel is linked to run in physical address space.
|
||
|
||
<p>
|
||
The Alpha AXP processor does not have a special physical addressing mode.
|
||
Instead, it divides up the memory space into several areas and designates
|
||
two of them as physically mapped addresses.
|
||
This kernel address space is known as KSEG address
|
||
space and it encompasses all addresses upwards from <em>0xfffffc0000000000</em>.
|
||
In order to execute from code linked in KSEG (by definition, kernel code)
|
||
or access data there, the code must be executing in kernel mode.
|
||
The Linux kernel on Alpha is linked to execute from address
|
||
<em>0xfffffc0000310000</em>.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.1.5">3.1.5 </A> Access Control</H3>
|
||
|
||
<p>
|
||
The page table entries also contain access control information.
|
||
As the processor is already using the page table entry to map a processes virtual address
|
||
to a physical one, it can easily use the access control information to check that the
|
||
process is not accessing memory in a way that it should not.
|
||
|
||
<p>
|
||
There are many reasons why you would want to restrict access to areas of memory.
|
||
Some memory, such as that containing executable code, is naturally read only memory; the
|
||
operating system should not allow a process to write data over its executable code.
|
||
By contrast, pages containing data can be written to but attempts to execute that memory
|
||
as instructions should fail.
|
||
Most processors have at least two modes of execution: <em>kernel</em> and <em>user</em>.
|
||
You would not want kernel code executing by a user or kernel data structures to be accessible
|
||
except when the processor is running in kernel mode.
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.2"></A>
|
||
<center><center> <img src="pte.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.2: Alpha AXP Page Table Entry</center>
|
||
<A NAME="axp-pte-figure"></A>
|
||
<p>
|
||
<p>The access control information is held in the PTE and
|
||
is processor specific; figure <A href="#axp-pte-figure"
|
||
> 3.2</A> shows the PTE for Alpha AXP.
|
||
The bit fields have the following meanings:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt><b>V</b></dt><dd> Valid, if set this PTE is valid,
|
||
<dt><b>FOE</b></dt><dd> ``Fault on Execute'', Whenever an attempt to execute instructions in this
|
||
page occurs, the processor reports a page fault and passes control
|
||
to the operating system,
|
||
<dt><b>FOW</b></dt><dd> ``Fault on Write'', as above but page fault on an attempt
|
||
to write to this page,
|
||
<dt><b>FOR</b></dt><dd> ``Fault on Read'', as above but page fault on an attempt
|
||
to read from this page,
|
||
<dt><b>ASM</b></dt><dd> Address Space Match. This is used when the operating
|
||
system wishes to clear only some of the entries from the
|
||
Translation Buffer,
|
||
<dt><b>KRE</b></dt><dd> Code running in kernel mode can read this page,
|
||
<dt><b>URE</b></dt><dd> Code running in user mode can read this page,
|
||
<dt><b>GH</b></dt><dd> Granularity hint used when mapping an entire block with
|
||
a single Translation Buffer entry rather than many,
|
||
<dt><b>KWE</b></dt><dd> Code running in kernel mode can write to this page,
|
||
<dt><b>UWE</b></dt><dd> Code running in user mode can write to this page,
|
||
<dt><b>page frame number</b></dt><dd> For PTEs with the <b>V</b> bit set, this field contains
|
||
the physical Page Frame Number (page frame number) for this PTE. For invalid
|
||
PTEs, if this field is not zero, it contains information about
|
||
where the page is in the swap file.
|
||
|
||
</DL>
|
||
<p>
|
||
The following two bits are defined and used by Linux:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt><b>_PAGE_DIRTY</b></dt><dd> if set, the page needs to
|
||
be written out to the swap file,
|
||
|
||
<dt><b>_PAGE_ACCESSED</b></dt><dd> Used by Linux to mark a page as having been
|
||
accessed.
|
||
|
||
</DL>
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.2">3.2 </A> Caches</H2>
|
||
If you were to implement a system using the above theoretical model then
|
||
it would work, but not particularly efficiently.
|
||
Both operating system and processor designers try hard to extract more
|
||
performance from the system.
|
||
Apart from making the processors, memory and so on faster the best
|
||
approach is to maintain caches of useful information and data that
|
||
make some operations faster.
|
||
Linux uses a number of memory management related caches:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt><b>Buffer Cache</b></dt><dd>
|
||
|
||
<p>
|
||
The buffer cache contains data buffers that are used by the
|
||
block device drivers.
|
||
|
||
|
||
<p>
|
||
These buffers are of fixed sizes (for
|
||
example 512 bytes) and contain blocks of information that have
|
||
either been read from a block device or are being written to
|
||
it.
|
||
A block device is one that can only be accessed by reading and
|
||
writing fixed sized blocks of data.
|
||
All hard disks are block devices.
|
||
|
||
<p>
|
||
The buffer cache is indexed via the device identifier and
|
||
the desired block number and is used to quickly find a block
|
||
of data. Block devices are only ever accessed via the buffer
|
||
cache. If data can be found in the buffer cache then it does not
|
||
need to be read from the physical block device, for example a
|
||
hard disk, and access to it is much faster.
|
||
|
||
<p>
|
||
<dt><b>Page Cache</b></dt><dd>
|
||
|
||
<p>
|
||
This is used to speed up access to images and
|
||
data on disk.
|
||
|
||
<p>
|
||
It is used to cache the logical contents of a file a page at a time and
|
||
is accessed via the file and offset within the file.
|
||
As pages are read into memory from disk, they are cached in the
|
||
page cache.
|
||
|
||
<p>
|
||
<dt><b>Swap Cache</b></dt><dd>
|
||
|
||
<p>
|
||
Only modified (or <i>dirty</i>) pages are saved in the swap
|
||
file.
|
||
|
||
|
||
<p>
|
||
So long as these pages are not modified after they
|
||
have been written to the swap file then the next time the
|
||
page is swapped out there is no
|
||
need to write it to the swap file as the page is already in
|
||
the swap file.
|
||
Instead the page can simply be discarded.
|
||
In a heavily swapping system this saves many unnecessary and
|
||
costly disk operations.
|
||
|
||
<p>
|
||
<dt><b>Hardware Caches</b></dt><dd>
|
||
|
||
<p>
|
||
One commonly implemented hardware cache is in the processor; a cache of Page
|
||
Table Entries.
|
||
In this case, the processor does not always read the page table directly
|
||
but instead caches translations for pages as it needs them.
|
||
These are the Translation Look-aside Buffers and contain cached copies of the
|
||
page table entries from one or more processes in the system.
|
||
|
||
<p>
|
||
When the reference to the virtual address is made, the processor will attempt
|
||
to find a matching TLB entry.
|
||
If it finds one, it can directly translate the virtual address into a
|
||
physical one and perform the correct operation on the data.
|
||
If the processor cannot find a matching TLB entry then it must get the
|
||
operating system to help.
|
||
It does this by signalling the operating system that
|
||
a TLB miss has occurred.
|
||
A system specific mechanism is used to deliver that exception to the
|
||
operating system code that can fix things up.
|
||
The operating system generates a new TLB entry for the address
|
||
mapping.
|
||
When the exception has been cleared, the processor will make another attempt
|
||
to translate the virtual address. This time it will work because there
|
||
is now a valid entry in the TLB for that address.
|
||
</DL>
|
||
<p>
|
||
The drawback of using caches, hardware or otherwise, is that in order
|
||
to save effort Linux must use more time and space maintaining these
|
||
caches and, if the caches become corrupted, the system will crash.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.3">3.3 </A> Linux Page Tables</H2>
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.3"></A>
|
||
<center><center> <img src="page-tables.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.3: Three Level Page Tables</center>
|
||
<A NAME="linux-page-tables-figure"></A>
|
||
<p>
|
||
<p>Linux assumes that there are three levels of page tables.
|
||
Each Page Table accessed contains the page frame number of the next level of Page Table.
|
||
Figure <A href="#linux-page-tables-figure"
|
||
> 3.3</A> shows how a virtual address can
|
||
be broken into a number of fields; each field providing an offset into a
|
||
particular Page Table.
|
||
To translate a virtual address into a physical one, the processor must take the
|
||
contents of each level field, convert it into an offset into the physical
|
||
page containing the Page Table and read the page frame number of the next level of Page
|
||
Table.
|
||
This is repeated three times until the page frame number of the physical page containing
|
||
the virtual address is found.
|
||
Now the final field in the virtual address, the byte offset, is used to
|
||
find the data inside the page.
|
||
|
||
<p>
|
||
Each platform that Linux runs on must provide translation macros that allow
|
||
the kernel to traverse the page tables for a particular process.
|
||
This way, the kernel does not need to know the format of the page table entries or
|
||
how they are arranged.
|
||
|
||
<p>
|
||
This is so successful that Linux uses the same page table manipulation code for
|
||
the Alpha processor, which has three levels of page tables, and for Intel x86 processors,
|
||
which have two levels of page tables.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.4">3.4 </A> Page Allocation and Deallocation</H2>
|
||
There are many demands on the physical pages in the system.
|
||
For example, when an image is loaded into memory the operating system
|
||
needs to allocate pages.
|
||
These will be freed when the image has finished executing and is
|
||
unloaded.
|
||
Another use for physical pages is to hold kernel specific data structures
|
||
such as the page tables themselves.
|
||
The mechanisms and data structures used for page allocation and
|
||
deallocation are perhaps the most critical in maintaining the efficiency
|
||
of the virtual memory subsystem.
|
||
|
||
<p>
|
||
All of the physical pages in the system are described by the <tt>mem_map</tt>
|
||
data structure which is a list of <tt>mem_map_t</tt>
|
||
|
||
<p>
|
||
<a href="#tthFtNtAAB" name=tthFrefAAB><sup>1</sup></a>
|
||
structures which is initialized at boot time.
|
||
Each <tt>mem_map_t</tt> describes a single physical page in the system.
|
||
Important fields (so far as memory management is concerned) are:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt><b>count</b></dt><dd> This is a count of the number of users of this page.
|
||
The count is greater than one when the page is shared between many
|
||
processes,
|
||
<dt><b>age</b></dt><dd> This field describes the age of the page and is used
|
||
to decide if the page is a good candidate for discarding or
|
||
swapping,
|
||
<dt><b>map_nr</b></dt><dd> This is the physical page frame number that this <tt>mem_map_t</tt> describes.
|
||
</DL>
|
||
<p>
|
||
The <tt>free_area</tt> vector is used by the page allocation code
|
||
to find and free pages.
|
||
The whole buffer management scheme is supported by this mechanism and so
|
||
far as the code is concerned, the size of the page and physical paging
|
||
mechanisms used by the processor are irrelevant.
|
||
|
||
<p>
|
||
Each element of <tt>free_area</tt> contains information about blocks of
|
||
pages. The first element in the array describes single pages, the next blocks of 2 pages, the next
|
||
blocks of 4 pages and so on upwards in powers of two.
|
||
The <tt>list</tt> element is used as a queue head and has pointers to the
|
||
<tt>page</tt> data structures in the <tt>mem_map</tt> array. Free blocks of pages are queued here.
|
||
<tt>map</tt> is a pointer to a bitmap which keeps track of allocated groups of
|
||
pages of this size.
|
||
Bit N of the bitmap is set if the Nth block of pages is free.
|
||
|
||
<p>
|
||
Figure <A href="#free-area-figure"
|
||
> free-area-figure</A> shows the <tt>free_area</tt> structure.
|
||
Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the
|
||
first starting at page frame number 4 and the second at page frame number 56.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.4.1">3.4.1 </A> Page Allocation</H3>
|
||
|
||
<p>
|
||
Linux uses the Buddy algorithm
|
||
<a href="#tthFtNtAAC" name=tthFrefAAC><sup>2</sup></a>
|
||
to effectively allocate and deallocate blocks of pages.
|
||
The page allocation code
|
||
|
||
<p>
|
||
attempts to allocate a block of one or more physical pages.
|
||
Pages are allocated in blocks which are powers of 2 in size.
|
||
That means that it can allocate a block 1 page, 2 pages, 4 pages and so on.
|
||
So long as there are enough free pages in the system to grant this request
|
||
(nr_free_pages <font face="symbol"> > </font
|
||
> min_free_pages) the allocation code will search the
|
||
<tt>free_area</tt> for a block of pages of the size requested.
|
||
Each element of the <tt>free_area</tt> has a map of the allocated and free blocks
|
||
of pages for that sized block.
|
||
For example, element 2 of the array has a memory map that describes free and allocated
|
||
blocks each of 4 pages long.
|
||
|
||
<p>
|
||
The allocation algorithm first searches for blocks of pages of the size
|
||
requested.
|
||
It follows the chain of free pages that is queued on the <i>list</i>
|
||
element of the <tt>free_area</tt> data structure.
|
||
If no blocks of pages of the requested size are free, blocks of the next
|
||
size (which is twice that of the size requested) are looked for.
|
||
This process continues until all of the <tt>free_area</tt> has been searched or
|
||
until a block of pages has been found.
|
||
If the block of pages found is larger than that requested it must be
|
||
broken down until there is a block of the right size.
|
||
Because the blocks are each a power of 2 pages big then this breaking
|
||
down process is easy as you simply break the blocks in half.
|
||
The free blocks are queued on the appropriate queue and the allocated
|
||
block of pages is returned to the caller.
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.4"></A>
|
||
<center><center> <img src="free-area.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.4: The <tt>free_area</tt> data structure</center>
|
||
<A NAME="free-area-figure"></A>
|
||
<p>
|
||
<p>For example, in Figure <A href="#free-area-figure"
|
||
> 3.4</A> if a block of 2 pages
|
||
was requested, the first block of 4 pages (starting at page frame number 4) would
|
||
be broken into two 2 page blocks.
|
||
The first, starting at page frame number 4 would be returned to the caller as the
|
||
allocated pages and the second block, starting at page frame number 6 would be queued
|
||
as a free block of 2 pages onto element 1 of the <tt>free_area</tt> array.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.4.2">3.4.2 </A> Page Deallocation</H3>
|
||
|
||
<p>
|
||
Allocating blocks of pages tends to fragment memory with larger
|
||
blocks of free pages being broken down into smaller ones.
|
||
The page deallocation code
|
||
|
||
<p>
|
||
recombines pages into larger blocks of free pages whenever it can.
|
||
In fact the page block size is important as it allows for easy
|
||
combination of blocks into larger blocks.
|
||
|
||
<p>
|
||
Whenever a block of pages is freed, the adjacent or buddy block of the
|
||
same size is checked to see if it is free.
|
||
If it is, then it is combined with the newly freed block of pages to form
|
||
a new free block of pages for the next size block of pages.
|
||
Each time two blocks of pages are recombined into a bigger block of free
|
||
pages the page deallocation code attempts to recombine that block into a yet
|
||
larger one.
|
||
In this way the blocks of free pages are as large as memory usage will allow.
|
||
|
||
<p>
|
||
For example, in Figure <A href="#free-area-figure"
|
||
> 3.4</A>, if page frame number 1 were to be
|
||
freed, then that would be combined with the already free page frame number 0 and
|
||
queued onto element 1 of the <tt>free_area</tt> as a free block of size 2 pages.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.5">3.5 </A> Memory Mapping</H2>
|
||
|
||
<p>
|
||
When an image is executed, the contents of the executable image must be brought into
|
||
the processes virtual address space.
|
||
The same is also true of any shared libraries that the executable image has been linked to use.
|
||
The executable file is not actually brought into physical memory, instead it is merely
|
||
linked into the processes virtual memory.
|
||
Then, as the parts of the program are referenced by the running application, the
|
||
image is brought into memory from the executable image.
|
||
This linking of an image into a processes virtual address space is known as memory mapping.
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.5"></A>
|
||
<center><center> <img src="vm_area.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.5: Areas of Virtual Memory</center>
|
||
<A NAME="vm-area-struct"></A>
|
||
<p>
|
||
<p>Every processes virtual memory is represented by an <tt>mm_struct</tt> data structure.
|
||
This contains information about the image that it is currently executing (for example <font face="helvetica">bash</font>)
|
||
and also has pointers to a number of <tt>vm_area_struct</tt> data structures.
|
||
Each <tt>vm_area_struct</tt> data structure describes the start and end of the area of virtual memory,
|
||
the processes access rights to that memory and a set of operations for that memory.
|
||
These operations are a set of routines that Linux must use when manipulating this area of
|
||
virtual memory.
|
||
For example, one of the virtual memory operations performs the correct actions when the
|
||
process has attempted to access this virtual memory but finds (via a page fault) that the
|
||
memory is not actually in physical memory.
|
||
This operation is the <em>nopage</em> operation.
|
||
The <em>nopage</em> operation is used when Linux demand pages the pages of an executable
|
||
image into memory.
|
||
|
||
<p>
|
||
When an executable image is mapped into a processes virtual address a set of <tt>vm_area_struct</tt>
|
||
data structures is generated.
|
||
Each <tt>vm_area_struct</tt> data structure represents a part of the executable image; the executable
|
||
code, initialized data (variables), unitialized data and so on.
|
||
Linux supports a number of standard virtual memory operations and as
|
||
the <tt>vm_area_struct</tt> data structures are created, the correct set of virtual memory operations
|
||
are associated with them.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.6">3.6 </A> Demand Paging</H2>
|
||
|
||
<p>
|
||
Once an executable image has been memory mapped into a processes virtual memory it can start
|
||
to execute.
|
||
As only the very start of the image is physically pulled into memory it will soon access an area
|
||
of virtual memory that is not yet in physical memory.
|
||
When a process accesses a virtual address that does not have a valid page table
|
||
entry, the processor will report a page fault to Linux.
|
||
|
||
<p>
|
||
The page fault describes the virtual address where the page fault occurred and the type
|
||
of memory access that caused.
|
||
|
||
<p>
|
||
Linux must find the <tt>vm_area_struct</tt> that represents the area of memory that the
|
||
page fault occurred in.
|
||
As searching through the <tt>vm_area_struct</tt> data structures is critical to the efficient
|
||
handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis)
|
||
tree structure.
|
||
If there is no <tt>vm_area_struct</tt> data structure for this faulting virtual address,
|
||
this process has accessed an illegal virtual address.
|
||
Linux will signal the process, sending a <tt>SIGSEGV</tt> signal, and if the process does not
|
||
have a handler for that signal it will be terminated.
|
||
|
||
<p>
|
||
Linux next checks the type of page fault that occurred against the types of accesses
|
||
allowed for this area of virtual memory.
|
||
If the process is accessing the memory in an illegal way, say writing to an area that it
|
||
is only allowed to read from, it is also signalled with a memory error.
|
||
|
||
<p>
|
||
Now that Linux has determined that the page fault is legal, it must deal with it.
|
||
|
||
<p>
|
||
Linux must differentiate between pages that are in the swap file and those that are part
|
||
of an executable image on a disk somewhere.
|
||
It does this by using the page table entry for this faulting virtual address.
|
||
|
||
<p>
|
||
If the page's page table entry is invalid but not empty, the page fault is for a page
|
||
currently being held in the swap file.
|
||
For Alpha AXP page table entries, these are entries which do not have their valid bit set
|
||
but which have a non-zero value in their PFN field.
|
||
In this case the PFN field holds information about where in the swap (and which swap
|
||
file) the page is being held.
|
||
How pages in the swap file are handled is described later in this chapter.
|
||
|
||
<p>
|
||
Not all <tt>vm_area_struct</tt> data structures have a set of virtual memory operations and even
|
||
those that do may not have a <em>nopage</em> operation.
|
||
This is because by default Linux will fix up the access by allocating a new physical page and
|
||
creating a valid page table entry for it.
|
||
If there is a <em>nopage</em> operation for this area of virtual memory, Linux will use it.
|
||
|
||
<p>
|
||
The generic Linux <em>nopage</em> operation
|
||
is used for
|
||
memory mapped executable images and it uses the page cache to bring the required image
|
||
page into physical memory.
|
||
|
||
<p>
|
||
However the required page is brought into physical memory, the processes page tables are updated.
|
||
It may be necessary for hardware specific actions to update those entries, particularly if
|
||
the processor uses translation look aside buffers.
|
||
Now that the page fault has been handled it can be dismissed and the process is restarted
|
||
at the instruction that made the faulting virtual memory access.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.7">3.7 </A> The Linux Page Cache</H2>
|
||
|
||
<p>
|
||
|
||
<p><A NAME="tth_fIg3.6"></A>
|
||
<center><center> <img src="page-cache.gif"><br>
|
||
<p>
|
||
</center></center><center> Figure 3.6: The Linux Page Cache</center>
|
||
<A NAME="page-cache-figure"></A>
|
||
<p>
|
||
<p>The role of the Linux page cache is to speed up access to files on disk.
|
||
Memory mapped files are read a page at a time and these pages are stored in the page cache.
|
||
Figure <A href="#page-cache-figure"
|
||
> 3.6</A> shows that
|
||
the page cache consists of the <tt>page_hash_table</tt>, a vector
|
||
of pointers to <tt>mem_map_t</tt> data structures.
|
||
|
||
<p>
|
||
Each file in Linux is identified by a VFS <tt>inode</tt> data structure (described in
|
||
Chapter <A href="../fs/filesystem.html"
|
||
> filesystem-chapter</A>) and each VFS <tt>inode</tt> is unique and fully describes
|
||
one and only one file.
|
||
The index into the page table is derived from the file's VFS <tt>inode</tt> and the offset into
|
||
the file.
|
||
|
||
<p>
|
||
Whenever a page is read from a memory mapped file, for example when it needs to be brought
|
||
back into memory during demand paging, the page is read through the page cache.
|
||
If the page is present in the cache, a pointer to the <tt>mem_map_t</tt> data structure
|
||
representing it is returned to the page fault handling code.
|
||
Otherwise the page must be brought into memory from the file system that holds the image.
|
||
Linux allocates a physical page and reads the page from the file on disk.
|
||
|
||
<p>
|
||
If it is possible, Linux will initiate a read of the next page in the file.
|
||
This single page read ahead means that if the process is accessing the pages in the file
|
||
serially, the next page will be waiting in memory for the process.
|
||
|
||
<p>
|
||
Over time the page cache grows as images are read and executed.
|
||
Pages will be removed from the cache as they are no longer needed, say as an image
|
||
is no longer being used by any process.
|
||
As Linux uses memory it can start to run low on physical pages.
|
||
In this case Linux will reduce the size of the page cache.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.8">3.8 </A> Swapping Out and Discarding Pages</H2>
|
||
|
||
<p>
|
||
When physical memory becomes scarce the Linux memory management subsystem
|
||
must attempt to free physical pages.
|
||
This task falls to the kernel swap daemon (<i>kswapd</i>).
|
||
|
||
<p>
|
||
The kernel swap daemon is a special type of process, a kernel thread.
|
||
Kernel threads are processes have no virtual memory, instead they run in kernel mode in the physical address space.
|
||
The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's
|
||
swap files.
|
||
Its role is make sure that there are enough free pages in the system to keep the
|
||
memory management system operating efficiently.
|
||
|
||
<p>
|
||
The Kernel swap daemon (<i>kswapd</i>) is started by the
|
||
kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire.
|
||
|
||
<p>
|
||
Every time the timer expires, the swap daemon looks to see if the number of
|
||
free pages in the system is getting too low.
|
||
It uses two variables, <i>free_pages_high</i> and <i>free_pages_low</i> to decide if it should free some pages.
|
||
So long as the number of free pages in the system remains above <i>free_pages_high</i>, the kernel
|
||
swap daemon does nothing; it sleeps again until its timer next expires.
|
||
For the purposes of this check the kernel swap daemon takes into account the number of pages currently
|
||
being written out to the swap file.
|
||
It keeps a count of these in <i>nr_async_pages</i>; this is incremented each time a page is queued waiting to
|
||
be written out to the swap file and decremented when the write to the swap device has completed.
|
||
<i>free_pages_low</i> and <i>free_pages_high</i> are set at
|
||
system startup time and are related to the number of physical pages in the
|
||
system.
|
||
If the number of free pages in the system has fallen below <i>free_pages_high</i> or worse still
|
||
<i>free_pages_low</i>, the kernel swap daemon will try three ways to reduce the number of physical pages being
|
||
used by the system:
|
||
|
||
<DL compact>
|
||
<p>
|
||
<dt></dt><dd> Reducing the size of the buffer and page caches,
|
||
<dt></dt><dd> Swapping out System V shared memory pages,
|
||
<dt></dt><dd> Swapping out and discarding pages.
|
||
</DL>
|
||
<p>
|
||
If the number of free pages in the system has fallen below <i>free_pages_low</i>, the kernel swap daemon
|
||
will try to free 6 pages before it next runs.
|
||
Otherwise it will try to free 3 pages.
|
||
Each of the above methods are tried in turn until enough pages have been
|
||
freed.
|
||
The kernel swap daemon remembers which method it was using the last time that it attempted to free
|
||
physical pages.
|
||
Each time it runs it will start trying to free pages using this last successful method.
|
||
|
||
<p>
|
||
After it has free sufficient pages, the swap daemon sleeps again until its timer expires.
|
||
If the reason that the kernel swap daemon freed pages was that the number of free pages in the
|
||
system had fallen below <i>free_pages_low</i>, it only sleeps for half its usual time.
|
||
Once the number of free pages is more than <i>free_pages_low</i> the kernel swap daemon goes
|
||
back to sleeping longer between checks.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.8.1">3.8.1 </A> Reducing the Size of the Page and Buffer Caches</H3>
|
||
|
||
<p>
|
||
The pages held in the page and buffer caches are good candidates for being freed into the <tt>free_area</tt>
|
||
vector.
|
||
The Page Cache, which contains pages of memory mapped files, may contain unneccessary pages that are filling
|
||
up the system's memory.
|
||
Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain
|
||
unneeded buffers.
|
||
When the physical pages in the system start to run out, discarding pages from these caches is relatively easy
|
||
as it requires no writing to physical devices (unlike swapping pages out of memory).
|
||
Discarding these pages does not have too many harmful side effects other than making access to physical devices and
|
||
memory mapped files slower.
|
||
However, if the discarding of pages from these caches is done fairly, all processes will suffer equally.
|
||
|
||
<p>
|
||
Every time the Kernel swap daemon tries to shrink these caches
|
||
|
||
<p>
|
||
it examines a block of pages in the <tt>mem_map</tt> page vector to see if any
|
||
can be discarded from physical memory.
|
||
The size of the block of pages examined is higher if the kernel swap daemon is intensively swapping; that is if
|
||
the number of free pages in the system has fallen dangerously low.
|
||
The blocks of pages are examined in a cyclical manner; a different block of pages
|
||
is examined each time an attempt is made to shrink the memory map.
|
||
This is known as the <em>clock</em> algorithm as, rather like the minute hand of a
|
||
clock, the whole <tt>mem_map</tt> page vector is examined
|
||
a few pages at a time.
|
||
|
||
<p>
|
||
Each page being examined is checked to see if it is cached in either
|
||
the page cache or the buffer cache.
|
||
You should note that shared pages are not considered for discarding
|
||
at this time and that a page cannot be in both caches at the
|
||
same time.
|
||
If the page is not in either cache then the next page in the
|
||
<tt>mem_map</tt> page vector is examined.
|
||
|
||
<p>
|
||
Pages are cached in the buffer cache (or rather the buffers within the pages
|
||
are cached) to make buffer allocation and deallocation more efficient.
|
||
The memory map shrinking code tries to free the buffers that are
|
||
contained within the page being examined.
|
||
|
||
<p>
|
||
If all the buffers are freed, then the pages that contain them are
|
||
also be freed.
|
||
If the examined page is in the Linux page cache, it is removed from
|
||
the page cache and freed.
|
||
|
||
<p>
|
||
When enough pages have been freed on this attempt then the kernel
|
||
swap daemon will wait until the next time it is periodically woken.
|
||
As none of the freed pages were part of any process's virtual memory (they
|
||
were cached pages), then no page tables need updating.
|
||
If there were not enough cached pages discarded then the swap daemon
|
||
will try to swap out some shared pages.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.8.2">3.8.2 </A> Swapping Out System V Shared Memory Pages</H3>
|
||
System V shared memory is an inter-process communication mechanism which allows two or more
|
||
processes to share virtual memory in order to pass information amongst themselves.
|
||
How processes share memory in this way is described in more detail in Chapter <A href="../ipc/ipc.html"
|
||
> IPC-chapter</A>.
|
||
For now it is enough to say that each area of System V shared memory is described by
|
||
a <tt>shmid_ds</tt> data structure.
|
||
This contains a pointer to a list of <tt>vm_area_struct</tt> data structures, one for each process
|
||
sharing this area of virtual memory.
|
||
The <tt>vm_area_struct</tt> data structures describe where in each processes virtual memory this area
|
||
of System V shared memory goes.
|
||
Each <tt>vm_area_struct</tt> data structure for this System V shared memory is linked together using
|
||
the <tt>vm_next_shared</tt> and <tt>vm_prev_shared</tt> pointers.
|
||
Each <tt>shmid_ds</tt> data structure also contains a list of page table entries each of which describes the
|
||
physical page that a shared virtual page maps to.
|
||
|
||
<p>
|
||
The kernel swap daemon also uses a <em>clock</em> algorithm when swapping out System V shared memory pages.
|
||
|
||
<p>
|
||
.
|
||
Each time it runs it remembers which page of which shared virtual memory area it last swapped out.
|
||
It does this by keeping two indices, the first is an index into the set of <tt>shmid_ds</tt> data structures,
|
||
the second into the list of page table entries for this area of System V shared memory. This makes
|
||
sure that it fairly victimizes the areas of System V shared memory.
|
||
|
||
<p>
|
||
As the physical page frame number for a given virtual page of System V shared memory is contained in the
|
||
page tables of all of the processes sharing this area of virtual memory,
|
||
the kernel swap daemon must modify all of these page tables to show that the
|
||
page is no longer in memory but is now held in the swap file.
|
||
For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the
|
||
sharing processes page tables (by following a pointer from each <tt>vm_area_struct</tt> data structure).
|
||
If this processes page table entry for this page of System V shared memory is valid, it converts it into
|
||
an invalid but swapped out page table entry and reduces this (shared) page's count of users by one.
|
||
The format of a swapped out System V shared page table entry contains an index into the set of <tt>shmid_ds</tt>
|
||
data structures and an index into the page table entries for this area of System V shared memory.
|
||
|
||
<p>
|
||
If the page's count is zero after the page tables of the sharing processes have all been modified, the
|
||
shared page can be written out to the swap file.
|
||
The page table entry in the list pointed at by the
|
||
<tt>shmid_ds</tt> data structure for this area of System V shared memory is replaced by a swapped out page
|
||
table entry.
|
||
A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset
|
||
in that file where the swapped out page can be found.
|
||
This information will be used when the page has to be brought back into physical memory.
|
||
|
||
<p>
|
||
|
||
<H3><A NAME="tth_sEc3.8.3">3.8.3 </A> Swapping Out and Discarding Pages</H3>
|
||
|
||
<p>
|
||
The swap daemon looks at each process in the system
|
||
in turn to see if it is a good candidate for swapping.
|
||
|
||
<p>
|
||
Good candidates are processes that can be swapped (some cannot) and that have
|
||
one or more pages which can be swapped or discarded from memory.
|
||
Pages are swapped out of physical memory into the system's swap files only if
|
||
the data in them cannot be retrieved another way.
|
||
|
||
<p>
|
||
A lot of the contents of an executable image come from the image's file and
|
||
can easily be re-read from that file.
|
||
For example, the executable instructions of an image will never be modified
|
||
by the image and so will never be written to the swap file.
|
||
These pages can simply be discarded; when they are again referenced
|
||
by the process, they will be brought back into memory from the
|
||
executable image.
|
||
|
||
<p>
|
||
Once the process to swap has been located, the swap daemon looks through
|
||
all of its virtual memory regions looking for areas which
|
||
are not shared or locked.
|
||
|
||
<p>
|
||
Linux does not swap out all of the swappable pages of the process that
|
||
it has selected; instead it removes only a small number of pages.
|
||
|
||
<p>
|
||
Pages cannot be swapped or discarded if they are locked in memory.
|
||
|
||
<p>
|
||
The Linux swap algorithm uses page aging.
|
||
Each page has a counter (held in the <tt>mem_map_t</tt> data structure) that
|
||
gives the Kernel swap daemon some idea whether or not a page is worth swapping.
|
||
Pages age when they are unused and rejuvinate on access;
|
||
the swap daemon only swaps out old pages.
|
||
The default action when a page is first allocated, is to give it an initial age of 3.
|
||
Each time it is touched, it's age is increased by 3 to a maximum of 20.
|
||
Every time the Kernel swap daemon runs it ages pages, decrementing their
|
||
age by 1.
|
||
These default actions can be changed and for this reason they (and other
|
||
swap related information) are stored in the <tt>swap_control</tt> data structure.
|
||
|
||
<p>
|
||
If the page is old (age = 0), the swap daemon will process it further.
|
||
<i>Dirty</i> pages are pages which can be swapped out.
|
||
Linux uses an architecture specific bit in the PTE to describe pages this way
|
||
(see Figure <A href="#axp-pte-figure"
|
||
> 3.2</A>).
|
||
However, not all <i>dirty</i> pages are necessarily written to the swap file.
|
||
Every virtual memory region of a process may have its own swap operation
|
||
(pointed at by the <tt>vm_ops</tt> pointer in the <tt>vm_area_struct</tt>) and that method is used.
|
||
Otherwise, the swap daemon will allocate a page in the swap file and write the
|
||
page out to that device.
|
||
|
||
<p>
|
||
The page's page table entry is replaced by one which
|
||
is marked as invalid but which contains information about where the page is in
|
||
the swap file.
|
||
This is an offset into the swap file
|
||
where the page is held and an indication of which swap file is being used.
|
||
Whatever the swap method used, the original physical page
|
||
is made free by putting it back into the <tt>free_area</tt>.
|
||
Clean (or rather not <i>dirty</i>) pages can be discarded and put back into the <tt>free_area</tt>
|
||
for re-use.
|
||
|
||
<p>
|
||
If enough of the swappable processes pages have been swapped out or discarded, the swap daemon
|
||
will again sleep.
|
||
The next time it wakes it will consider the next process in the system.
|
||
In this way, the swap daemon nibbles away at each processes physical pages until the system
|
||
is again in balance. This is much fairer than swapping out whole processes.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.9">3.9 </A> The Swap Cache</H2>
|
||
|
||
<p>
|
||
When swapping pages out to the swap files, Linux avoids writing pages if it does not have to.
|
||
There are times when a page is both in a swap file and in physical memory.
|
||
This happens when a page that was swapped out of memory was then brought back into memory when it was
|
||
again accessed by a process.
|
||
So long as the page in memory is not written to, the copy in the swap file remains valid.
|
||
|
||
<p>
|
||
Linux uses the swap cache to track these pages.
|
||
The swap cache is a list of page table entries, one per physical page in the system.
|
||
This is a page table entry for a swapped out page and describes which swap file the page is being
|
||
held in together with its location in the swap file.
|
||
If a swap cache entry is non-zero, it represents a page which is being held in a swap
|
||
file that has not been modified.
|
||
If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
|
||
|
||
<p>
|
||
When Linux needs to swap a physical page out to a swap file it consults the swap cache and,
|
||
if there is a valid entry for this page, it does not need to write the page out to the
|
||
swap file.
|
||
This is because the page in memory has not been modified since it was last read from the
|
||
swap file.
|
||
|
||
<p>
|
||
The entries in the swap cache are page table entries for swapped out pages.
|
||
They are marked as invalid but contain information which allow Linux to find the right
|
||
swap file and the right page within that swap file.
|
||
|
||
<p>
|
||
|
||
<H2><A NAME="tth_sEc3.10">3.10 </A> Swapping Pages In</H2>
|
||
The dirty pages saved in the swap files may be needed again, for example when an application
|
||
writes to an area of virtual memory whose contents are held in a swapped out physical page.
|
||
Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur.
|
||
The page fault is the processor signalling the operating system that it cannot translate
|
||
a virtual address into a physical one.
|
||
In this case this is because the page table entry describing this page of virtual memory was
|
||
marked as invalid when the page was swapped out.
|
||
The processor cannot handle the virtual to physical address translation and so hands control
|
||
back to the operating system describing as it does so the virtual address that
|
||
faulted and the reason for the fault.
|
||
The format of this information and how the processor passes control to the operating system
|
||
is processor specific.
|
||
|
||
<p>
|
||
The processor specific page fault handling code must locate the <tt>vm_area_struct</tt> data structure
|
||
that describes the area of virtual memory that contains the faulting virtual address.
|
||
It does this by searching the <tt>vm_area_struct</tt> data structures for this process until it finds the
|
||
one containing the faulting virtual address.
|
||
This is very time critical code and a processes <tt>vm_area_struct</tt> data structures are so arranged
|
||
as to make this search take as little time as possible.
|
||
|
||
<p>
|
||
Having carried out the appropriate processor specific actions and found that the faulting virtual
|
||
address is for a valid area of virtual memory, the page fault processing becomes generic and applicable
|
||
to all processors that Linux runs on.
|
||
|
||
<p>
|
||
The generic page fault handling code looks for the page table entry for the faulting virtual address.
|
||
If the page table entry it finds is for a swapped out page, Linux must swap the page back into
|
||
physical memory.
|
||
The format of the page table entry for a swapped out page is processor specific but all processors
|
||
mark these pages as invalid and put the information neccessary to locate the page within the swap file
|
||
into the page table entry.
|
||
Linux needs this information in order to bring the page back into physical memory.
|
||
|
||
<p>
|
||
At this point,
|
||
Linux knows the faulting virtual address and has a page table entry containing
|
||
information about where this page has been swapped to.
|
||
The <tt>vm_area_struct</tt> data structure may contain a pointer to a routine which will swap any page
|
||
of the area of virtual memory that it describes back into physical memory.
|
||
This is its <em>swapin</em> operation.
|
||
If there is a <em>swapin</em> operation for this area of virtual memory then Linux will use it.
|
||
This
|
||
is, in fact, how swapped out System V shared memory pages are
|
||
handled as it requires special handling because the format of a swapped out System V shared page is a little different
|
||
from that of an ordinairy swapped out page.
|
||
There may not be a <em>swapin</em> operation, in which case Linux will assume that this is an
|
||
ordinairy page that does not need to be specially handled.
|
||
|
||
<p>
|
||
It allocates a free physical page and reads the swapped out page back from the swap file.
|
||
Information telling it where in the swap file (and which swap file) is taken from the
|
||
the invalid page table entry.
|
||
|
||
<p>
|
||
If the access that caused the page fault was not a write access then the page is left in the swap
|
||
cache and its page table entry is not marked as writable.
|
||
If the page is subsequently written to, another page fault will occur and, at that point, the page is
|
||
marked as dirty and its entry is removed from the swap cache.
|
||
If the page is not written to and it needs to be swapped out again, Linux can avoid the write
|
||
of the page to its swap file because the page is already in the swap file.
|
||
|
||
<p>
|
||
If the access that caused the page to be brought in from
|
||
the swap file was a write operation, this page is removed from the swap cache and its page
|
||
table entry is marked as both dirty and writable.
|
||
|
||
<p>
|
||
<hr><H3>Footnotes:</H3>
|
||
|
||
<p><a name=tthFtNtAAB></a><a href="#tthFrefAAB"><sup>1</sup></a> Confusingly the structure is also known as the <i>page</i>
|
||
structure.
|
||
<p><a name=tthFtNtAAC></a><a href="#tthFrefAAC"><sup>2</sup></a> Bibliography reference here
|
||
<p><hr><small>File translated from T<sub><font size=-1>E</font></sub>X by <a href="http://hutchinson.belmont.ma.us/tth/tth.html">T<sub><font size=-1>T</font></sub>H</a>, version 1.0.</small>
|
||
<hr>
|
||
<center>
|
||
<A HREF="../mm/memory.html"> Top of Chapter</A>,
|
||
<A HREF="../tlk-toc.html"> Table of Contents</A>,
|
||
<A href="../tlk.html" target="_top"> Show Frames</A>,
|
||
<A href="../mm/memory.html" target="_top"> No Frames</A><br>
|
||
<EFBFBD> 1996-1999 David A Rusling <A HREF="../misc/copyright.html">copyright notice</a>.
|
||
</center>
|
||
</HTML> |