diff --git a/LDP/howto/docbook/HOWTO-INDEX/adminSect.sgml b/LDP/howto/docbook/HOWTO-INDEX/adminSect.sgml
index d6e980a8..47944d0e 100644
--- a/LDP/howto/docbook/HOWTO-INDEX/adminSect.sgml
+++ b/LDP/howto/docbook/HOWTO-INDEX/adminSect.sgml
@@ -160,7 +160,7 @@ troubleshooting for ix86-based systems.
KernelAnalysis-HOWTO,
KernelAnalysis-HOWTO
-Updated: July 2002.
+Updated: March 2003.
Explains some things about the Linux Kernel,
such as the most important components, how they work, and so on.
diff --git a/LDP/howto/docbook/HOWTO-INDEX/howtoChap.sgml b/LDP/howto/docbook/HOWTO-INDEX/howtoChap.sgml
index 543337d2..c60493ac 100644
--- a/LDP/howto/docbook/HOWTO-INDEX/howtoChap.sgml
+++ b/LDP/howto/docbook/HOWTO-INDEX/howtoChap.sgml
@@ -1394,7 +1394,7 @@ troubleshooting for ix86-based systems.
KernelAnalysis-HOWTO,
KernelAnalysis-HOWTO
-Updated: July 2002.
+Updated: March 2003.
Explains some things about the Linux Kernel,
such as the most important components, how they work, and so on.
diff --git a/LDP/howto/docbook/HOWTO-INDEX/osSect.sgml b/LDP/howto/docbook/HOWTO-INDEX/osSect.sgml
index 6f15d3bd..0af6e9f4 100644
--- a/LDP/howto/docbook/HOWTO-INDEX/osSect.sgml
+++ b/LDP/howto/docbook/HOWTO-INDEX/osSect.sgml
@@ -499,7 +499,7 @@ troubleshooting for ix86-based systems.
KernelAnalysis-HOWTO,
KernelAnalysis-HOWTO
-Updated: July 2002.
+Updated: March 2003.
Explains some things about the Linux Kernel,
such as the most important components, how they work, and so on.
diff --git a/LDP/howto/linuxdoc/KernelAnalysis-HOWTO.sgml b/LDP/howto/linuxdoc/KernelAnalysis-HOWTO.sgml
index 3dc19a97..bc5184a1 100644
--- a/LDP/howto/linuxdoc/KernelAnalysis-HOWTO.sgml
+++ b/LDP/howto/linuxdoc/KernelAnalysis-HOWTO.sgml
@@ -1,80 +1,101 @@
-
+
KernelAnalysis-HOWTO
+
-Roberto Arcomano
+Roberto Arcomano berto@bertolinux.com
+
-v0.63 - July 31, 2002
+v0.7, March 26, 2003
+
-This document tries to explain some things about the Linux Kernel, such
- as the most important components, how they work, and so on. This HOWTO should
- help prevent the reader from needing to browse all the kernel source files
- searching for the"right function," declaration, and definition, and then linking
- each to the other. You can find the latest version of this document at If
- you have suggestions to help make this document better, please submit your
- ideas to me at the following address:
+This document tries to explain some things about the Linux Kernel,
+ such as the most important components, how they work, and so on.
+ This HOWTO should help prevent the reader from needing to browse
+ all the kernel source files searching for the "right function," declaration,
+ and definition, and then linking each to the other. You can find
+ the latest version of this document at If you have suggestions to
+ help make this document better, please submit your ideas to me at
+ the following address:
+
Introduction
Introduction
-This HOWTO tries to define how parts of the Linux Kernel work, what are
- the main functions and data structures used, and how the "wheel spins". You can
- find the latest version of this document at If you have suggestions to help
- make this document better, please submit your ideas to me at the following
- address: Code used within this document refers to the Linux Kernel version
- 2.4.x, which is the last stable kernel version at time of writing this HOWTO.
+This HOWTO tries to define how parts of the Linux Kernel work,
+ what are the main functions and data structures used, and how the
+ "wheel spins". You can find the latest version of this document at
+ If you have suggestions to help make this document better, please
+ submit your ideas to me at the following address: Code used within
+ this document refers to the Linux Kernel version 2.4.x, which is
+ the latest stable kernel version at the time of writing this HOWTO.
+
Copyright
-Copyright (C) 2000,2001,2002 Roberto Arcomano. This document is free; you
- can redistribute it and/or modify it under the terms of the GNU General Public
- License as published by the Free Software Foundation; either version 2 of the
- License, or (at your option) any later version. This document is distributed
- in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
- the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
- See the GNU General Public License for more details. You can get a copy of
- the GNU GPL
+Copyright (C) 2000,2001,2002 Roberto Arcomano. This document
+ is free; you can redistribute it and/or modify it under the terms
+ of the GNU General Public License as published by the Free Software
+ Foundation; either version 2 of the License, or (at your option)
+ any later version. This document is distributed in the hope that
+ it will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ See the GNU General Public License for more details. You can get
+ a copy of the GNU GPL
+
Translations
-If you want to translate this document you are free to do so. However,
- you will need to do the following:
+If you want to translate this document you are free to do so.
+ However, you will need to do the following:
+
+
-
-Check that another version of the document doesn't already exist at your
- local LDP
+Check that another version of the document doesn't already exist
+ at your local LDP
-
-Maintain all 'Introduction' sections (including 'Introduction', 'Copyright',
- 'Translations' , 'Credits').
+Maintain all 'Introduction' sections (including 'Introduction',
+ 'Copyright', 'Translations', 'Credits').
+
-Warning! You don't have to translate TXT or HTML file, you have to modify
- LYX file, so that it is possible to convert it all other formats (TXT, HTML,
- RIFF, etc.): to do that you can use "LyX" application you download from .
+Warning! Do not translate the TXT or HTML file; modify the LYX
+ file instead, so that it can be converted to all other formats
+ (TXT, HTML, RIFF, etc.). To do that you can use the "LyX" application,
+ which you can download from .
+
-No need to ask me to translate! You just have to let me know (if you want)
- about your translation.
+No need to ask me to translate! You just have to let me know
+ (if you want) about your translation.
+
Thank you for your translation!
+
Credits
-Thanks to for publishing and uploading my document quickly.
+Thanks to for publishing and uploading my document quickly.
+
+
+
+Thanks to Klaas de Waal for his suggestions.
+
Syntax used
@@ -82,49 +103,64 @@ Syntax used
Function Syntax
When speaking about a function, we write:
+
+
"function_name [ file location . extension ]"
+
For example:
+
+
"schedule [kernel/sched.c]"
+
tells us that we talk about
+
"schedule"
+
function retrievable from file
+
[ kernel/sched.c ]
+
Note: We also assume /usr/src/linux as the starting directory.
+
Indentation
Indentation in source code is 3 blank characters.
+
InterCallings Analysis
Overview
-We use the"InterCallings Analysis "(ICA) to see (in an indented fashion)
- how kernel functions call each other.
+We use the "InterCallings Analysis" (ICA) to see (in an indented
+ fashion) how kernel functions call each other.
+
For example, the sleep_on command is described in ICA below:
+
+
|sleep_on
@@ -139,10 +175,13 @@ For example, the sleep_on command is described in ICA below:
sleep_on ICA
+
The indented ICA is followed by functions' locations:
+
+
-
@@ -163,116 +202,148 @@ __remove_wait_queue [include/linux/wait.h]
list_del [include/linux/list.h]
-
__list_del
+
-Note: We don't specify anymore file location, if specified just before.
+Note: We don't repeat a file location if it was specified just
+ before.
+
Details
In an ICA a line like looks like the following
+
+
function1 -> function2
+
-means that < function1 > is a generic pointer to another function.
- In this case < function1 > points to < function2 >.
+means that < function1 > is a generic pointer to another
+ function. In this case < function1 > points to < function2
+ >.
+
When we write:
+
+
function:
+
-it means that < function > is not a real function. It is a label
- (typically assembler label).
+it means that < function > is not a real function. It is
+ a label (typically an assembler label).
+
-In many sections we may report a ''C'' code or a ''pseudo-code''. In real
- source files, you could use ''assembler'' or ''not structured'' code. This
- difference is for learning purposes.
+In many sections we may show ''C'' code or ''pseudo-code''.
+ The real source files may instead use ''assembler'' or ''not structured''
+ code. This difference is for learning purposes.
+
PROs of using ICA
The advantages of using ICA (InterCallings Analysis) are many:
+
+
-
-You get an overview of what happens when you call a kernel function
+You get an overview of what happens when you call a kernel function
+
-
-Function locations are indicated after the function, so ICA could also
- be considered as a little ''function reference''
+Function locations are indicated after the function, so ICA could
+ also be considered as a little ''function reference''
-
-InterCallings Analysis (ICA) is useful in sleep/awake mechanisms, where
- we can view what we do before sleeping, the proper sleeping action, and what
- we'll do after waking up (after schedule).
+InterCallings Analysis (ICA) is useful in sleep/awake mechanisms,
+ where we can view what we do before sleeping, the proper sleeping
+ action, and what we'll do after waking up (after schedule).
+
CONTROs of using ICA
+
-
Some of the disadvantages of using ICA are listed below:
+
-As all theoretical models, we simplify reality avoiding many details, such
- as real source code and special conditions.
+As with all theoretical models, we simplify reality, avoiding many
+ details, such as real source code and special conditions.
+
+
-
-Additional diagrams should be added to better represent stack conditions,
- data values, and so on.
+Additional diagrams should be added to better represent stack
+ conditions, data values, and so on.
+
Fundamentals
What is the kernel?
-The kernel is the "core" of any computer system: it is the "software" which
- allows users to share computer resources.
+The kernel is the "core" of any computer system: it is the "software"
+ which allows users to share computer resources.
+
-The kernel can be thought ofas the main software of the OS (Operating System),
- which may also include graphics management.
+The kernel can be thought of as the main software of the OS (Operating
+ System), which may also include graphics management.
+
-For example, under Linux (like other Unix-like OSs), the XWindow environment
- doesn't belong to the Linux Kernel, because it manages only graphical operations
- (it uses user mode I/O to access video card devices).
+For example, under Linux (like other Unix-like OSs), the XWindow
+ environment doesn't belong to the Linux Kernel, because it manages
+ only graphical operations (it uses user mode I/O to access video
+ card devices).
+
-By contrast, Windows environments (Win9x, WinME, WinNT, Win2K, WinXP, and
- so on) are a mix between a graphical environment and kernel.
+By contrast, Windows environments (Win9x, WinME, WinNT, Win2K,
+ WinXP, and so on) are a mix between a graphical environment and kernel.
+
What is the difference between User Mode and Kernel Mode?
Overview
-Many years ago, when computers were as big as a room, users ran their applications
- with much difficulty and, sometimes, their applications crashed the computer.
-
+Many years ago, when computers were as big as a room, users ran
+ their applications with much difficulty and, sometimes, their applications
+ crashed the computer.
+
Operative modes
-To avoid having applications that constantly crashed, newer OSs were designed
- with 2 different operative modes:
+To avoid having applications that constantly crashed, newer OSs
+ were designed with 2 different operative modes:
+
+
-
-Kernel Mode: the machine operates with critical data structure, direct
- hardware (IN/OUT or memory mapped), direct memory, IRQ, DMA, and so on.
+Kernel Mode: the machine operates with critical data structure,
+ direct hardware (IN/OUT or memory mapped), direct memory, IRQ, DMA,
+ and so on.
-
User Mode: users can run applications.
+
@@ -289,63 +360,76 @@ Implementation | _______ _______ | Abstraction
| | |
| | |
\|/ Hardware |
+
-Kernel Mode "prevents" User Mode applications from damaging the system or
- its features.
+Kernel Mode "prevents" User Mode applications from damaging the
+ system or its features.
+
-Modern microprocessors implement in hardware at least 2 different states.
- For example under Intel, 4 states determine the PL (Privilege Level). It is
- possible to use 0,1,2,3 states, with 0 used in Kernel Mode.
+Modern microprocessors implement in hardware at least 2 different
+ states. For example under Intel, 4 states determine the PL (Privilege
+ Level). It is possible to use 0,1,2,3 states, with 0 used in Kernel
+ Mode.
+
-Unix OS requires only 2 privilege levels, and we will use such a paradigm
- as point of reference.
+A Unix OS requires only 2 privilege levels, and we will use such
+ a paradigm as our point of reference.
+
Switching from User Mode to Kernel Mode
When do we switch?
-Once we understand that there are 2 different modes, we have to know when
- we switch from one to the other.
+Once we understand that there are 2 different modes, we have
+ to know when we switch from one to the other.
+
Typically, there are 2 points of switching:
+
+
-
-When calling a System Call: after calling a System Call, the task voluntary
- calls pieces of code living in Kernel Mode
+When calling a System Call: after calling a System Call, the
+ task voluntarily calls pieces of code living in Kernel Mode
-
-When an IRQ (or exception) comes: after the IRQ an IRQ handler (or exception
- handler) is called, then control returns back to the task that was interrupted
- like nothing was happened.
+When an IRQ (or exception) comes: after the IRQ an IRQ handler
+ (or exception handler) is called, then control returns to the
+ task that was interrupted, as if nothing had happened.
+
System Calls
-System calls are like special functions that manage OS routines which live
- in Kernel Mode.
+System calls are like special functions that manage OS routines
+ which live in Kernel Mode.
+
A system call can be called when we:
+
+
-
access an I/O device or a file (like read or write)
-
-need to access privileged information (like pid, changing scheduling policy
- or other information)
+need to access privileged information (like pid, changing scheduling
+ policy or other information)
-
-need to change execution context (like forking or executing some other
- application)
+need to change execution context (like forking or executing some
+ other application)
-
-need to execute a particular command (like ''chdir'', ''kill", ''brk'',
- or ''signal'')
+need to execute a particular command (like ''chdir'', ''kill'',
+ ''brk'', or ''signal'')
+
@@ -366,20 +450,27 @@ need to execute a particular command (like ''chdir'', ''kill", ''brk'',
Unix System Calls Working
+
-System calls are almost the only interface used by User Mode to talk with
- low level resources (hardware). The only exception to this statement is when
- a process uses ''ioperm'' system call. In this case a device can be accessed
- directly by User Mode process (IRQs cannot be used).
+System calls are almost the only interface used by User Mode
+ to talk with low level resources (hardware). The only exception to
+ this statement is when a process uses the ''ioperm'' system call. In
+ this case a device can be accessed directly by the User Mode process
+ (IRQs cannot be used).
+
-NOTE: Not every ''C'' function is a system call, only some of them.
+NOTE: Not every ''C'' function is a system call, only some of
+ them.
+
-Below is a list of System Calls under Linux Kernel 2.4.17, from [
- arch/i386/kernel/entry.S ]
+Below is a list of System Calls under Linux Kernel 2.4.17, from
+ [ arch/i386/kernel/entry.S ]
+
+
.long SYMBOL_NAME(sys_ni_syscall) /* 0 - old "setup()" system call*/
@@ -610,17 +701,21 @@ Below is a list of System Calls under Linux Kernel 2.4.17, from [
.long SYMBOL_NAME(sys_readahead) /* 225 */
+
IRQ Event
-When an IRQ comes, the task that is running is interrupted in order to
- service the IRQ Handler.
+When an IRQ comes, the task that is running is interrupted in
+ order to service the IRQ Handler.
+
-After the IRQ is handled, control returns backs exactly to point of interrupt,
- like nothing happened.
+After the IRQ is handled, control returns exactly to the point
+ of interruption, as if nothing had happened.
+
+
@@ -640,11 +735,14 @@ EXECUTION |___________| [return to code]
User->Kernel Mode Transition caused by IRQ event
+
-The numbered steps below refer to the sequence of events in the diagram
- above:
+The numbered steps below refer to the sequence of events in the
+ diagram above:
+
+
-
@@ -659,73 +757,92 @@ The "Interrupt handler" code is executed.
Control returns back to task user mode (as if nothing happened)
-
Process returns back to normal execution
+
-Special interest has the Timer IRQ, coming every TIMER ms to manage:
+Of special interest is the Timer IRQ, which comes every TIMER ms to
+ manage:
+
+
-
Alarms
-
-System and task counters (used by schedule to decide when stop a process
- or for accounting)
+System and task counters (used by schedule to decide when to stop
+ a process or for accounting)
-
Multitasking based on wake up mechanism after TIMESLICE time.
+
Multitasking
Mechanism
-The key point of modern OSs is the "Task". The Task is an application running
- in memory sharing all resources (included CPU and Memory) with other Tasks.
+The key point of modern OSs is the "Task". The Task is an application
+ running in memory, sharing all resources (including CPU and Memory)
+ with other Tasks.
+
-This "resource sharing" is managed by the "Multitasking Mechanism". The Multitasking
- Mechanism switches from one task to another after a "timeslice" time. Users have
- the "illusion" that they own all resources. We can also imagine a single user
- scenario, where a user can have the "illusion" of running many tasks at the same
- time.
+This "resource sharing" is managed by the "Multitasking Mechanism".
+ The Multitasking Mechanism switches from one task to another after
+ a "timeslice" time. Users have the "illusion" that they own all resources.
+ We can also imagine a single user scenario, where a user can have
+ the "illusion" of running many tasks at the same time.
+
-To implement this multitasking, the task uses "the state" variable, which
- can be:
+To implement this multitasking, each task uses a "state" variable,
+ which can be:
+
+
-
READY, ready for execution
-
BLOCKED, waiting for a resource
+
-The task state is managed by its presence in a relative list: READY list
- and BLOCKED list.
+The task state is managed by its presence in the corresponding list:
+ READY list and BLOCKED list.
+
Task Switching
-The movement from one task to another is called ''Task Switching''. many
- computers have a hardware instruction which automatically performs this operation.
- Task Switching occurs in the following cases:
+The movement from one task to another is called ''Task Switching''.
+ Many computers have a hardware instruction which automatically performs
+ this operation. Task Switching occurs in the following cases:
+
+
-
-After Timeslice ends: we need to schedule a "Ready for execution" task and
- give it access.
+After Timeslice ends: we need to schedule a "Ready for execution"
+ task and give it access.
-
-When a Task has to wait for a device: we need to schedule a new task and
- switch to it *
+When a Task has to wait for a device: we need to schedule a new
+ task and switch to it *
+
-* We schedule another task to prevent "Busy Form Waiting", which occurs
- when we are waiting for a device instead performing other work.
+* We schedule another task to prevent "Busy Form Waiting", which
+ occurs when we are waiting for a device instead of performing other
+ work.
+
Task Switching is managed by the "Schedule" entity.
+
+
@@ -754,10 +871,13 @@ Timer | |
Task Switching based on TimeSlice
+
A typical Timeslice for Linux is about 10 ms.
+
+
@@ -781,56 +901,70 @@ A typical Timeslice for Linux is about 10 ms.
Task Switching based on Waiting for a Resource
+
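The two switching cases above (timeslice expiry and waiting for a resource) can be modelled in user space with two queues. What follows is only a minimal sketch, not kernel code: the function name `simulate`, the task names, and the assumption that a device always finishes while one other slice runs are all invented for illustration. Only the 10 ms timeslice comes from the text.

```python
from collections import deque

TIMESLICE_MS = 10  # "about 10 ms" in the text; all other values are invented


def simulate(cpu_needed, blocks_once=()):
    """Return the order in which tasks get the CPU.

    cpu_needed  -- dict: task name -> total CPU time it needs, in ms
    blocks_once -- names of tasks that wait for a device after their first slice
    """
    ready = deque(cpu_needed)        # READY list: tasks that can run now
    blocked = deque()                # BLOCKED list: tasks waiting for a resource
    remaining = dict(cpu_needed)
    must_block = set(blocks_once)
    trace = []

    while ready or blocked:
        if not ready:
            # Nothing runnable: the "device" completes and wakes the waiter.
            ready.append(blocked.popleft())
            continue
        task = ready.popleft()       # task switch: pick the next READY task
        trace.append(task)
        remaining[task] -= min(TIMESLICE_MS, remaining[task])
        # Assumption: devices finish while one slice of other work runs.
        while blocked:
            ready.append(blocked.popleft())
        if remaining[task] == 0:
            continue                 # task finished, it leaves both lists
        if task in must_block:
            must_block.discard(task)
            blocked.append(task)     # waits for a device instead of busy waiting
        else:
            ready.append(task)       # timeslice ended, back to the READY list
    return trace


print(simulate({"A": 20, "B": 10, "C": 30}, blocks_once=("A",)))
```

In this model a task that blocks gives up the CPU immediately, so the other READY tasks keep running while it waits. That is the point of scheduling another task rather than polling the device.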
Microkernel vs Monolithic OS
Overview
-Until now we viewed so called Monolithic OS, but there is also another
- kind of OS: ''Microkernel''.
+Until now we have looked at the so-called Monolithic OS, but there is
+ also another kind of OS: ''Microkernel''.
+
-A Microkernel OS uses Tasks, not only for user mode processes, but also
- as a real kernel manager, like Floppy-Task, HDD-Task, Net-Task and so on. Some
- examples are Amoeba, and Mach.
+A Microkernel OS uses Tasks not only for user mode processes,
+ but also as real kernel managers, like Floppy-Task, HDD-Task, Net-Task
+ and so on. Some examples are Amoeba and Mach.
+
PROs and CONTROs of Microkernel OS
PROS:
+
+
-
-OS is simpler to maintain because each Task manages a single kind of operation.
- So if you want to modify networking, you modify Net-Task (ideally, if it is
- not needed a structural update).
+The OS is simpler to maintain because each Task manages a single
+ kind of operation. So if you want to modify networking, you modify
+ Net-Task (ideally, if no structural update is needed).
+
CONS:
+
+
-
-Performances are worse than Monolithic OS, because you have to add 2*TASK_SWITCH
- times (the first to enter the specific Task, the second to go out from it).
+Performance is worse than in a Monolithic OS, because you have to
+ add 2*TASK_SWITCH times (the first to enter the specific Task, the
+ second to leave it).
+
-My personal opinion is that, Microkernels are a good didactic example (like
- Minix) but they are not ''optimal'', so not really suitable. Linux uses a few
- Tasks, called "Kernel Threads" to implement a little microkernel structure (like
- kswapd, which is used to retrieve memory pages from mass storage). In this
- case there are no problems with perfomance because swapping is a very slow
- job.
+My personal opinion is that Microkernels are a good didactic
+ example (like Minix) but they are not ''optimal'', so not really
+ suitable. Linux uses a few Tasks, called "Kernel Threads", to implement
+ a little microkernel structure (like kswapd, which is used to retrieve
+ memory pages from mass storage). In this case there are no problems
+ with performance because swapping is a very slow job.
+
Networking
ISO OSI levels
-Standard ISO-OSI describes a network architecture with the following levels:
+The ISO-OSI standard describes a network architecture with the following
+ levels:
+
+
-
@@ -847,36 +981,44 @@ Session level (SSL)
Presentation level (FTP binary-ascii coding)
-
Application level (applications like Netscape)
+
-The first 2 levels listed above are often implemented in hardware. Next
- levels are in software (or firmware for routers).
+The first 2 levels listed above are often implemented in hardware.
+ The next levels are implemented in software (or firmware for routers).
+
-Many protocols are used by an OS: one of these is TCP/IP (the most important
- living on 3-4 levels).
+Many protocols are used by an OS: one of these is TCP/IP (the
+ most important one, living on levels 3-4).
+
What does the kernel?
-The kernel doesn't know anything (only addresses) about first 2 levels
- of ISO-OSI.
+The kernel doesn't know anything (only addresses) about the first
+ 2 levels of ISO-OSI.
+
In RX it:
+
+
-
-Manages handshake with low levels devices (like ethernet card or modem)
- receiving "frames" from them.
+Manages the handshake with low-level devices (like an ethernet card
+ or modem), receiving "frames" from them.
-
-Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones),
+Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones),
+
-
-Convers ''packets'' in ''sockets'' passing them to the right application
- (using port number) or
+Converts ''packets'' into ''sockets'', passing them to the right
+ application (using port number) or
-
Forwards packets to the right queue
+
@@ -885,10 +1027,13 @@ NIC ---------> Kernel ----------> Application
| packets
--------------> Forward
- RX -
+
In TX stage it:
+
+
-
@@ -899,6 +1044,7 @@ Queues datas into TCP/IP ''packets''
Splits ''packets" into "frames" (like Ethernet or PPP ones)
-
Sends ''frames'' using HW drivers
+
@@ -909,17 +1055,21 @@ Forward -------------------
- TX -
+
Virtual Memory
Segmentation
-Segmentation is the first method to solve memory allocation problems: it
- allows you to compile source code without caring where the application will
- be placed in memory. As a matter of fact, this feature helps applications developers
- to develop in a independent fashion from the OS e also from the hardware.
+Segmentation is the first method to solve memory allocation problems:
+ it allows you to compile source code without caring where the application
+ will be placed in memory. As a matter of fact, this feature helps
+ application developers to work independently of the OS and also
+ of the hardware.
+
+
@@ -937,21 +1087,27 @@ Segmentation is the first method to solve memory allocation problems: it
Segment
+
-We can say that a segment is the logical entity of an application, or the
- image of the application in memory.
+We can say that a segment is the logical entity of an application,
+ or the image of the application in memory.
+
-When programming, we don't care where our data is put in memory, we only
- care about the offset inside our segment (our application).
+When programming, we don't care where our data is put in memory,
+ we only care about the offset inside our segment (our application).
+
-We use to assign a Segment to each Process and vice versa. In Linux this
- is not true. Linux uses only 4 segments for either Kernel and all Processes.
+Usually a Segment is assigned to each Process and vice versa. In
+ Linux this is not true. Linux uses only 4 segments for both the Kernel
+ and all Processes.
+
Problems of Segmentation
+
@@ -971,18 +1127,22 @@ Problems of Segmentation
Segmentation problem
+
-In the diagram above, we want to get exit processes A, and D and enter
- process B. As we can see there is enough space for B, but we cannot split it
- in 2 pieces, so we CANNOT load it (memory out).
+In the diagram above, we want to unload processes A and D
+ and load process B. As we can see there is enough space for B, but
+ we cannot split it in 2 pieces, so we CANNOT load it (out of memory).
+
The reason this problem occurs is because pure segments are continuous
areas (because they are logical areas) and cannot be split.
+
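The segmentation problem can be checked with a few lines of arithmetic. This is only an illustrative sketch: the memory size, the segment positions, and the helper names `free_holes` and `can_load_segment` are invented and do not reproduce the diagram's real layout.

```python
def free_holes(memory_size, loaded):
    """Return the sizes of the free gaps between loaded segments.

    loaded -- list of (start, length) pairs, one per segment still in memory
    """
    holes = []
    cursor = 0
    for start, length in sorted(loaded):
        if start > cursor:
            holes.append(start - cursor)
        cursor = max(cursor, start + length)
    if cursor < memory_size:
        holes.append(memory_size - cursor)
    return holes


def can_load_segment(holes, size):
    # A pure segment must fit in ONE hole; it cannot be split.
    return any(hole >= size for hole in holes)


# Invented layout: 100 units of memory. A (0..29) and D (60..99) have
# exited; only C (30..59) is still loaded.
holes = free_holes(100, [(30, 30)])
print(holes, sum(holes), can_load_segment(holes, 50))
```

With this layout 70 units are free in total, yet a 50-unit segment B does not fit because the free space is split into two separate holes.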
Pagination
+
@@ -1002,23 +1162,29 @@ Pagination
Segment
+
-Pagination splits memory in "n" pieces, each one with a fixed
- length.
+Pagination splits memory into "n" pieces, each one with
+ a fixed length.
+
-A process may be loaded in one or more Pages. When memory is freed, all
- pages are freed (see Segmentation Problem, before).
+A process may be loaded in one or more Pages. When memory is
+ freed, all pages are freed (see Segmentation Problem, before).
+
-Pagination is also used for another important purpose, "Swapping". If a page
- is not present in physical memory then it generates an EXCEPTION, that will
- make the Kernel search for a new page in storage memory. This mechanism allow
- OS to load more applications than the ones allowed by physical memory only.
+Pagination is also used for another important purpose, "Swapping".
+ If a page is not present in physical memory then it generates an
+ EXCEPTION, which makes the Kernel search for the page in storage
+ memory. This mechanism allows the OS to load more applications than
+ physical memory alone would allow.
+
Pagination Problem
+
____________________
@@ -1031,17 +1197,22 @@ Pagination Problem
Pagination Problem
+
-In the diagram above, we can see what is wrong with the pagination policy:
- when a Process Y loads into Page X, ALL memory space of the Page is allocated,
- so the remaining space at the end of Page is wasted.
+In the diagram above, we can see what is wrong with the pagination
+ policy: when a Process Y loads into Page X, ALL memory space of the
+ Page is allocated, so the remaining space at the end of Page is wasted.
+
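The wasted space at the end of the last page is easy to quantify. The sketch below assumes 4096-byte pages; the helper names `pages_needed` and `wasted_bytes` and the process sizes are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for this illustration


def pages_needed(size):
    """Number of whole pages required to hold `size` bytes (rounded up)."""
    return -(-size // PAGE_SIZE)


def wasted_bytes(size):
    """Bytes allocated but unused at the end of the last page."""
    return pages_needed(size) * PAGE_SIZE - size


for size in (1, 4096, 10000):
    print(size, pages_needed(size), wasted_bytes(size))
```

In this model a 10000-byte process occupies 3 pages and wastes 2288 bytes, and the waste per process always stays below one page.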
Segmentation and Pagination
-How can we solve segmentation and pagination problems? Using either 2 policies.
+How can we solve segmentation and pagination problems? By using
+ both policies.
+
+
@@ -1061,23 +1232,29 @@ How can we solve segmentation and pagination problems? Using either 2 policies.
|____________________|
| .. |
+
-Process X, identified by Segment X, is split in 3 pieces and each of one
- is loaded in a page.
+Process X, identified by Segment X, is split in 3 pieces and
+ each one is loaded in a page.
+
We do not have:
+
+
-
-Segmentation problem: we allocate per Pages, so we also free Pages and
- we manage free space in an optimized way.
+Segmentation problem: we allocate per Pages, so we also free
+ Pages and we manage free space in an optimized way.
-
-Pagination problem: only last page wastes space, but we can decide to use
- very small pages, for example 4096 bytes length (losing at maximum 4096*N_Tasks
- bytes) and manage hierarchical paging (using 2 or 3 levels of paging)
+Pagination problem: only the last page wastes space, but we can decide
+ to use very small pages, for example 4096 bytes long (losing at
+ most 4096*N_Tasks bytes) and manage hierarchical paging (using
+ 2 or 3 levels of paging)
+
@@ -1097,13 +1274,16 @@ Pagination problem: only last page wastes space, but we can decide to use
| | | |
Hierarchical Paging
+
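A two-level hierarchical lookup can be sketched with nested dictionaries. The 10/10/12 bit split below is an assumption matching classic 32-bit x86 paging with 4096-byte pages. The names `split` and `translate`, and the example addresses, are hypothetical and are not taken from the kernel source.

```python
PAGE_SHIFT = 12                 # 4096-byte pages
INDEX_BITS = 10                 # assumed: 10 bits per level, as on 32-bit x86
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def split(linear):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    directory = (linear >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    table = (linear >> PAGE_SHIFT) & INDEX_MASK
    offset = linear & OFFSET_MASK
    return directory, table, offset


def translate(page_directory, linear):
    """Walk a two-level table made of dicts; raise LookupError on a missing page."""
    directory, table, offset = split(linear)
    try:
        frame = page_directory[directory][table]
    except KeyError:
        # Stands in for the page-fault exception described in the text.
        raise LookupError("page not present: %#x" % linear)
    return (frame << PAGE_SHIFT) | offset


# Invented mapping: one page of one task, backed by physical frame 0x1234.
page_directory = {32: {72: 0x1234}}
print(hex(translate(page_directory, 0x08048123)))
```

Only the page tables that are actually used need to exist, which is why the hierarchy keeps small pages affordable. A missing entry at either level plays the role of the "page not present" exception.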
Linux Startup
We start the Linux kernel first from C code executed from ''startup_32:''
asm label:
+
+
|startup_32:
@@ -1142,6 +1322,7 @@ We start the Linux kernel first from C code executed from ''startup_32:''
|kernel_thread
|unlock_kernel
|cpu_idle
+
@@ -1205,10 +1386,13 @@ kernel_thread [arch/i386/kernel/process.c]
unlock_kernel [include/asm/smplock.h]
-
cpu_idle [arch/i386/kernel/process.c]
+
The last function ''rest_init'' does the following:
+
+
-
@@ -1216,16 +1400,20 @@ launches the kernel thread ''init''
-
calls unlock_kernel
-
-makes the kernel run cpu_idle routine, that will be the idle loop executing
- when nothing is scheduled
+makes the kernel run cpu_idle routine, that will be the idle
+ loop executing when nothing is scheduled
+
-In fact the start_kernel procedure never ends. It will execute cpu_idle
- routine endlessly.
+In fact the start_kernel procedure never ends. It will execute
+ cpu_idle routine endlessly.
+
Follows ''init'' description, which is the first Kernel Thread:
+
+
|init
@@ -1242,15 +1430,18 @@ Follows ''init'' description, which is the first Kernel Thread:
|free_initmem
|unlock_kernel
|execve
+
Linux Peculiarities
Overview
-Linux has some peculiarities that distinguish it from other OSs. These
- peculiarities include:
+Linux has some peculiarities that distinguish it from other OSs.
+ These peculiarities include:
+
+
-
@@ -1263,33 +1454,43 @@ Kernel threads
Kernel modules
-
''Proc'' directory
+
Flexibility Elements
-Points 4 and 5 give system administrators an enormous flexibility on system
- configuration from user mode allowing them to solve also critical kernel bugs
- or specific problems without have to reboot the machine. For example, if you
- needed to change something on a big server and you didn't want to make a reboot,
- you could prepare the kernel to talk with a module, that you'll write.
+Points 4 and 5 give system administrators enormous flexibility
+ in system configuration from user mode, allowing them to solve even
+ critical kernel bugs or specific problems without having to reboot
+ the machine. For example, if you needed to change something on a
+ big server and you didn't want to reboot, you could prepare
+ the kernel to talk with a module that you'll write.
+
Pagination only
-Linux doesn't use segmentation to distinguish Tasks from each other; it
- uses pagination. (Only 2 segments are used for all Tasks, CODE and DATA/STACK)
-
+Linux doesn't use segmentation to distinguish Tasks from each
+ other; it uses pagination. (Only 2 segments are used for all Tasks,
+ CODE and DATA/STACK)
+
-We can also say that an interTask page fault never occurs, because each
- Task uses a set of Page Tables that are different for each Task. These tables
- cannot point to the same physical addresses.
+We can also say that an interTask page fault never occurs, because
+ each Task uses its own set of Page Tables.
+ There are some cases where different Tasks point to the same Page Tables,
+ as with shared libraries: this is needed to reduce memory usage; remember
+ that shared libraries are CODE only, because all data is stored in
+ the actual Task's stack.
+
Linux segments
Under the Linux kernel only 4 segments exist:
+
+
-
@@ -1300,13 +1501,17 @@ Kernel Data / Stack [0x18]
User Code [0x23]
-
User Data / Stack [0x2b]
+
[syntax is ''Purpose [Segment]'']
+
Under Intel architecture, the segment registers used are:
+
+
-
@@ -1316,85 +1521,106 @@ DS for Data Segment
-
SS for Stack Segment
-
-ES for Alternative Segment (for example used to make a memory copy between
- 2 different segments)
+ES for Alternative Segment (for example used to make a memory
+ copy between 2 different segments)
+
So, every Task uses 0x23 for code and 0x2b for data/stack.
+
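The selector values listed above are not arbitrary: on x86 a segment selector packs a descriptor table index, a table indicator bit and a requested privilege level. The following sketch decodes them; ''decode_selector'' and ''struct selector_fields'' are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
struct selector_fields {
    unsigned index; /* bits 15..3: descriptor table entry number */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: requested privilege level (ring) */
};

/* Split a 16-bit selector into its three fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = sel >> 3;
    f.ti = (sel >> 2) & 1u;
    f.rpl = sel & 3u;
    assert(f.index < 8192); /* 13 index bits, so at most 8192 entries */
    return f;
}
```

Decoding 0x23 and 0x2b gives entries 4 and 5 with privilege level 3 (User), while 0x10 and 0x18 give entries 2 and 3 with privilege level 0 (Kernel).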
Linux pagination
Under Linux 3 levels of pages are used, depending on the architecture.
- Under Intel only 2 levels are supported. Linux also supports Copy on Write
- mechanisms (please see Cap.10 for more information).
+ Under Intel only 2 levels are supported. Linux also supports Copy
+ on Write mechanisms (please see Cap.10 for more information).
+
Why don't interTasks address conflicts exist?
-The answer is very very simple: interTask address conflicts cannot exist
- because they are impossible. Linear -> physical mapping is done by "Pagination",
- so it just needs to assign physical pages in an univocal fashion.
+The answer is very simple: interTask address conflicts
+ cannot exist because every Task has its own Linear -> physical
+ mapping, done by "Pagination", so the kernel only needs to assign
+ physical pages in a unique fashion.
+
Do we need to defragment memory?
-No. Page assigning is a dynamic process. We need a page only when a Task
- asks for it, so we choose it from free memory paging in an ordered fashion.
- When we want to release the page, we only have to add it to the free pages
- list.
+No. Page assignment is a dynamic process. We need a page only
+ when a Task asks for it, so we take it from the free pages
+ in an ordered fashion. When we want to release the page, we only
+ have to add it to the free pages list.
+
What about Kernel Pages?
-Kernel pages have a problem: they can be allocated in a dynamic fashion
- but we cannot have a guarantee that they are in contiguous area allocation,
- because linear kernel space is equivalent to physical kernel space.
+Kernel pages have a problem: they can be allocated dynamically,
+ but we have no guarantee that they will be allocated in a contiguous
+ area, because linear kernel space is equivalent to physical
+ kernel space.
+
-For Code Segment there is no problem. Boot code is allocated at boot time
- (so we have a fixed amount of memory to allocate), and on modules we only have
- to allocate a memory area which could contain module code.
+For Code Segment there is no problem. Boot code is allocated
+ at boot time (so we have a fixed amount of memory to allocate), and
+ on modules we only have to allocate a memory area which could contain
+ module code.
+
-The real problem is the stack segment because each Task uses some kernel
- stack pages. Stack segments must be contiguous (according to stack definition),
- so we have to establish a maximum limit for each Task's stack dimension. If
- we exceed this limit bad things happen. We overwrite kernel mode process data
- structures.
+The real problem is the stack segment because each Task uses
+ some kernel stack pages. Stack segments must be contiguous (according
+ to stack definition), so we have to establish a maximum limit for
+ each Task's stack size. If we exceed this limit, bad things happen:
+ we overwrite kernel mode process data structures.
+
-The structure of the Kernel helps us, because kernel functions are never:
+The structure of the Kernel helps us, because kernel functions
+ are never:
+
+
-
recursive
-
intercalling more than N times.
+
-Once we know N, and we know the average of static variables for all kernel
- functions, we can estimate a stack limit.
+Once we know N and the average size of the local variables of
+ the kernel functions, we can estimate a stack limit.
+
-If you want to try the problem out, you can create a module with a function
- inside calling itself many times. After a fixed number of times, the kernel
- module will hang because of a page fault exception handler (typically write
- to a read-only page).
+If you want to try the problem out, you can create a module with
+ a function inside calling itself many times. After a fixed number
+ of calls, the kernel module will hang in the page fault exception
+ handler (typically on a write to a read-only page).
+
Softirq
-When an IRQ comes, task switching is deferred until later to get better
- performance. Some Task jobs (that could have to be done just after the IRQ
- and that could take much CPU in interrupt time, like building up a TCP/IP packet)
- are queued and will be done at scheduling time (once a time-slice will end).
+When an IRQ comes, task switching is deferred until later to
+ get better performance. Some Task jobs (that might otherwise have to be done
+ just after the IRQ and that could take much CPU in interrupt time,
+ like building up a TCP/IP packet) are queued and will be done at
+ scheduling time (once a time-slice ends).
+
-In recent kernels (2.4.x) the softirq mechanisms are given to a kernel_thread:
- ''ksoftirqd_CPUn''. n stands for the number of CPU executing kernel_thread
- (in a monoprocessor system ''ksoftirqd_CPU0'' uses PID 3).
+In recent kernels (2.4.x) the softirq mechanisms are given to
+ a kernel_thread: ''ksoftirqd_CPUn'', where n is the number of the CPU
+ executing the kernel_thread (in a monoprocessor system ''ksoftirqd_CPU0''
+ uses PID 3).
+
Preparing Softirq
@@ -1403,13 +1629,16 @@ Enabling Softirq
''cpu_raise_softirq'' is a routine that will wake_up ''ksoftirqd_CPU0''
kernel thread, to let it manage the enqueued job.
+
+
|cpu_raise_softirq
|__cpu_raise_softirq
|wakeup_softirqd
|wake_up_process
+
@@ -1421,27 +1650,34 @@ __cpu_raise_softirq [include/linux/interrupt.h]
wakeup_softirqd [kernel/softirq.c]
-
wake_up_process [kernel/sched.c]
+
-''__cpu_raise_softirq'' routine will set right bit in the vector describing
- softirq pending.
+The ''__cpu_raise_softirq'' routine sets the right bit in the vector
+ describing pending softirqs.
+
''wakeup_softirqd'' uses ''wake_up_process'' to wake up the ''ksoftirqd_CPU0''
kernel thread.
+
Executing Softirq
TODO: describing data structures involved in softirq mechanism.
+
-When kernel thread ''ksoftirqd_CPU0'' has been woken up, it will execute
- queued jobs
+When kernel thread ''ksoftirqd_CPU0'' has been woken up, it will
+ execute the queued jobs.
+
The code of ''ksoftirqd_CPU0'' is (main endless loop):
+
+
for (;;) {
@@ -1455,31 +1691,34 @@ for (;;) {
}
__set_current_state(TASK_INTERRUPTIBLE)
}
+
-
ksoftirqd [kernel/softirq.c]
-
-
-
-
+
+
Kernel Threads
-Even though Linux is a monolithic OS, a few ''kernel threads'' exist to
- do housekeeping work.
+Even though Linux is a monolithic OS, a few ''kernel threads''
+ exist to do housekeeping work.
+
-These Tasks don't utilize USER memory; they share KERNEL memory. They also
- operate at the highest privilege (RING 0 on a i386 architecture) like any other
- kernel mode piece of code.
+These Tasks don't utilize USER memory; they share KERNEL memory.
+ They also operate at the highest privilege (RING 0 on an i386 architecture)
+ like any other kernel mode piece of code.
+
Kernel threads are created by ''kernel_thread [arch/i386/kernel/process]''
- function, which calls ''clone'' [arch/i386/kernel/process.c] system
- call from assembler (which is a ''fork'' like system call):
+ function, which calls ''clone'' [arch/i386/kernel/process.c]
+ system call from assembler (which is a ''fork'' like system call):
+
+
int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
@@ -1507,15 +1746,21 @@ int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
: "memory");
return retval;
}
+
-Once called, we have a new Task (usually with very low PID number, like
- 2,3, etc.) waiting for a very slow resource, like swap or usb event. A very
- slow resource is used because we would have a task switching overhead otherwise.
+Once called, we have a new Task (usually with a very low PID number,
+ like 2, 3, etc.) waiting for a very slow resource, like a swap or usb
+ event. A very slow resource is used because we would have a task
+ switching overhead otherwise.
+
-Below is a list of most common kernel threads (from ''ps x'' command):
+Below is a list of the most common kernel threads (from the ''ps x''
+ command):
+
+
PID COMMAND
@@ -1528,104 +1773,133 @@ PID COMMAND
7 kacpid
67 khubd
+
-'init' kernel thread is the first process created, at boot time. It will
- call all other User Mode Tasks (from file /etc/inittab) like console daemons,
- tty daemons and network daemons (''rc'' scripts).
+'init' kernel thread is the first process created, at boot time.
+ It will call all other User Mode Tasks (from file /etc/inittab) like
+ console daemons, tty daemons and network daemons (''rc'' scripts).
+
Example of Kernel Threads: kswapd [mm/vmscan.c].
''kswapd'' is created by ''clone() [arch/i386/kernel/process.c]''
+
Initialisation routines:
+
+
|do_initcalls
|kswapd_init
|kernel_thread
|syscall fork (in assembler)
+
do_initcalls [init/main.c]
+
kswapd_init [mm/vmscan.c]
+
kernel_thread [arch/i386/kernel/process.c]
+
Kernel Modules
Overview
-Linux Kernel modules are pieces of code (examples: fs, net, and hw driver)
- running in kernel mode that you can add at runtime.
+Linux Kernel modules are pieces of code (examples: fs, net, and
+ hw driver) running in kernel mode that you can add at runtime.
+
-The Linux core cannot be modularized: scheduling and interrupt management
- or core network, and so on.
+The core of Linux cannot be modularized: scheduling, interrupt
+ management, core networking, and so on.
+
-Under "/lib/modules/KERNEL_VERSION/" you can find all the modules installed
- on your system.
+Under "/lib/modules/KERNEL_VERSION/" you can find all the modules
+ installed on your system.
+
Module loading and unloading
To load a module, type the following:
+
+
insmod MODULE_NAME parameters
example: insmod ne io=0x300 irq=9
+
-NOTE: You can use modprobe in place of insmod if you want the kernel automatically
- search some parameter (for example when using PCI driver, or if you have specified
- parameter under /etc/conf.modules file).
+NOTE: You can use modprobe in place of insmod if you want
+ some parameters to be found automatically (for example when using
+ a PCI driver, or if you have specified parameters in the /etc/conf.modules
+ file).
+
To unload a module, type the following:
+
+
rmmod MODULE_NAME
+
Module definition
A module always contains:
+
+
-
-"init_module" function, executed at insmod (or modprobe) command
+"init_module" function, executed at insmod (or modprobe) command
+
-
"cleanup_module" function, executed at rmmod command
+
-If these functions are not in the module, you need to add 2 macros to specify
- what functions will act as init and exit module:
+If these functions are not in the module, you need to add 2 macros
+ to specify which functions will act as module init and exit:
+
+
-
module_init(FUNCTION_NAME)
-
module_exit(FUNCTION_NAME)
+
-NOTE: a module can "see" a kernel variable only if it has been exported (with
- macro EXPORT_SYMBOL).
+NOTE: a module can "see" a kernel variable only if it has been
+ exported (with macro EXPORT_SYMBOL).
+
A useful trick for adding flexibility to your kernel
+
// kernel sources side
@@ -1651,36 +1925,44 @@ int init_module() {
int cleanup_module() {
foo_function_pointer = NULL;
}
+
-This simple trick allows you to have very high flexibility in your Kernel,
- because only when you load the module you'll make "my_function" routine execute.
- This routine will do everything you want to do: for example ''rshaper'' module,
- which controls bandwidth input traffic from the network, works in this kind
- of matter.
+This simple trick gives you very high flexibility in
+ your Kernel, because the "my_function" routine executes only when
+ you load the module. This routine can do everything you want to do:
+ for example the ''rshaper'' module, which controls the bandwidth of input
+ traffic from the network, works in this manner.
+
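The same idea can be tried out in user space. What follows is only a sketch of the pattern, not kernel code: ''kernel_code_path'', ''init_module_sketch'' and ''cleanup_module_sketch'' are hypothetical names chosen for illustration, while ''foo_function_pointer'' and ''my_function'' mirror the names used above. The doubling done by ''my_function'' is an arbitrary placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* "Kernel side": a hook that stays NULL until a module fills it in. */
static int (*foo_function_pointer)(int) = NULL;

/* Hypothetical kernel code path: it leaves the value unchanged
 * while no module has registered a routine. */
static int kernel_code_path(int value)
{
    if (foo_function_pointer != NULL)
        return foo_function_pointer(value);
    return value;
}

/* "Module side": the routine we want the kernel to run. */
static int my_function(int value)
{
    return value * 2; /* placeholder behaviour for the sketch */
}

/* Stand-in for init_module(): plug the routine into the hook. */
static int init_module_sketch(void)
{
    assert(foo_function_pointer == NULL); /* one hook user at a time */
    foo_function_pointer = my_function;
    return 0;
}

/* Stand-in for cleanup_module(): unplug the routine again. */
static void cleanup_module_sketch(void)
{
    foo_function_pointer = NULL;
}
```

Before ''init_module_sketch'' runs, ''kernel_code_path(21)'' gives back 21; after it, the registered routine is executed instead, and ''cleanup_module_sketch'' restores the original behaviour.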
-Notice that the whole module mechanism is possible thanks to some global
- variables exported to modules, such as head list (allowing you to extend the
- list as much as you want). Typical examples are fs, generic devices (char,
- block, net, telephony). You have to prepare the kernel to accept your new module;
- in some cases you have to create an infrastructure (like telephony one, that
- was recently created) to be as standard as possible.
+Notice that the whole module mechanism is possible thanks to
+ some global variables exported to modules, such as list heads (allowing
+ you to extend the list as much as you want). Typical examples are
+ fs, generic devices (char, block, net, telephony). You have to prepare
+ the kernel to accept your new module; in some cases you have to create
+ an infrastructure (like the telephony one, which was recently created)
+ to be as standard as possible.
+
Proc directory
-Proc fs is located in the /proc directory, which is a special directory
- allowing you to talk directly with kernel.
+Proc fs is located in the /proc directory, which is a special
+ directory allowing you to talk directly with the kernel.
+
Linux uses ''proc'' directory to support direct kernel communications:
- this is necessary in many cases, for example when you want see main processes
- data structures or enable ''proxy-arp'' feature on one interface and not in
- others, you want to change max number of threads, or if you want to debug some
- bus state, like ISA or PCI, to know what cards are installed and what I/O addresses
- and IRQs are assigned to them.
+ this is necessary in many cases, for example when you want to see the main
+ process data structures, enable the ''proxy-arp'' feature on one
+ interface and not on others, change the max number of threads,
+ or debug some bus state, like ISA or PCI, to know
+ what cards are installed and what I/O addresses and IRQs are assigned
+ to them.
+
+
|-- bus
@@ -2139,18 +2421,22 @@ Linux uses ''proc'' directory to support direct kernel communications:
|-- uptime
`-- version
+
-In the directory there are also all the tasks using PID as file names (you
- have access to all Task information, like path of binary file, memory used,
- and so on).
+The directory also contains an entry for each task, using its PID as the
+ name (you have access to all Task information, like path of binary
+ file, memory used, and so on).
+
-The interesting point is that you cannot only see kernel values (for example,
- see info about any task or about network options enabled of your TCP/IP stack)
- but you are also able to modify some of it, typically that ones under /proc/sys
- directory:
+The interesting point is that you can not only see kernel values
+ (for example, see info about any task or about the network options enabled
+ in your TCP/IP stack) but you are also able to modify some of them,
+ typically the ones under the /proc/sys directory:
+
+
/proc/sys/
@@ -2162,12 +2448,16 @@ The interesting point is that you cannot only see kernel values (for example,
net
vm
kernel
+
/proc/sys/kernel
-Below are very important and well-know kernel values, ready to be modified:
+Below are very important and well-known kernel values, ready to
+ be modified:
+
+
overflowgid
@@ -2194,13 +2484,17 @@ hostname // host name of your Linux box
version // date info about kernel compilation
osrelease // kernel version (i.e. 2.4.5)
ostype // Linux!
+
/proc/sys/net
-This can be considered the most useful proc subdirectory. It allows you
- to change very important settings for your network kernel configuration.
+This can be considered the most useful proc subdirectory. It
+ allows you to change very important settings for your network kernel
+ configuration.
+
+
core
@@ -2209,15 +2503,19 @@ ipv6
unix
ethernet
802
+
/proc/sys/net/core
-Listed below are general net settings, like "netdev_max_backlog" (typically
- 300), the length of all your network packets. This value can limit your network
- bandwidth when receiving packets, Linux has to wait up to scheduling time to
- flush buffers (due to bottom half mechanism), about 1000/HZ ms
+Listed below are general net settings, like "netdev_max_backlog"
+ (typically 300), the maximum length of the queue of received packets. This value
+ can limit your network bandwidth: when receiving packets, Linux has
+ to wait up to scheduling time to flush buffers (due to bottom half
+ mechanism), about 1000/HZ ms.
+
+
300 * 100 = 30 000
@@ -2225,58 +2523,75 @@ packets HZ(Timeslice freq) packets/s
30 000 * 1000 = 30 M
packets average (Bytes/packet) throughput Bytes/s
+
If you want to get higher throughput, you need to increase netdev_max_backlog,
by typing:
+
+
echo 4000 > /proc/sys/net/core/netdev_max_backlog
+
-Note: Warning for some HZ values: under some architecture (like alpha or
- arm-tbox) it is 1000, so you can have 300 MBytes/s of average throughput.
+Note: Beware of some HZ values: under some architectures (like
+ alpha or arm-tbox) HZ is 1000, so you can have 300 MBytes/s of average
+ throughput.
+
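The arithmetic above can be written down as a small helper. This is only a sketch of the estimate made in the text (queue length times HZ times average packet size); ''max_rx_bytes_per_second'' is a hypothetical name, not a kernel function, and the 1000 Bytes/packet average is the figure assumed in the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on receive throughput when at most backlog_packets can be
 * queued between two flushes and the queue is flushed hz times a second. */
static uint64_t max_rx_bytes_per_second(uint64_t backlog_packets,
                                        uint64_t hz,
                                        uint64_t avg_packet_bytes)
{
    assert(hz > 0); /* a tick frequency of 0 makes no sense */
    return backlog_packets * hz * avg_packet_bytes;
}
```

With a backlog of 300, HZ=100 and 1000 Bytes/packet the helper gives 30 000 000 Bytes/s; with HZ=1000 it gives 300 000 000 Bytes/s, and raising the backlog to 4000 at HZ=100 gives 400 000 000 Bytes/s.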
/proc/sys/net/ipv4
-"ip_forward", enables or disables ip forwarding in your Linux box. This is
- a generic setting for all devices, you can specify each device you choose.
+"ip_forward" enables or disables ip forwarding in your Linux box.
+ This is a generic setting for all devices; you can also set it for
+ each device you choose.
+
/proc/sys/net/ipv4/conf/interface
-I think this is the most useful /proc entry, because it allows you to change
- some net settings to support wireless networks (see for more information).
+I think this is the most useful /proc entry, because it allows
+ you to change some net settings to support wireless networks (see
+ for more information).
+
Here are some examples of when you could use this setting:
+
+
-
"forwarding", to enable ip forwarding for your interface
-
-"proxy_arp", to enable proxy arp feature. For more see Proxy arp HOWTO under
- and for proxy arp use in Wireless networks.
+"proxy_arp", to enable proxy arp feature. For more see Proxy arp
+ HOWTO under and for proxy arp use in Wireless networks.
-
-"send_redirects" to avoid interface to send ICMP_REDIRECT (as before, see
- for more).
+"send_redirects" to prevent the interface from sending ICMP_REDIRECT (as before,
+ see for more).
+
Linux Multitasking
Overview
-This section will analyze data structures--the mechanism used to manage
- multitasking environment under Linux.
+This section will analyze the data structures and mechanisms used
+ to manage the multitasking environment under Linux.
+
Task States
-A Linux Task can be one of the following states (according to [include/linux.h]):
+A Linux Task can be in one of the following states (according to
+ [include/linux.h]):
+
+
-
@@ -2284,15 +2599,17 @@ TASK_RUNNING, it means that it is in the "Ready List"
-
TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping)
-
-TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is in
- same "Wait Queue"
+TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping),
+ it is in some "Wait Queue"
-
TASK_ZOMBIE, task child without father
-
TASK_STOPPED, task being debugged
+
Graphical Interaction
+
______________ CPU Available ______________
@@ -2311,16 +2628,20 @@ Waiting for | | Resource
|______________________|
Main Multitasking Flow
+
Timeslice
PIT 8253 Programming
-Each 10 ms (depending on HZ value) an IRQ0 comes, which helps us in a multitasking
- environment. This signal comes from PIC 8259 (in arch 386+) which is connected
- to PIT 8253 with a clock of 1.19318 MHz.
+Every 10 ms (depending on HZ value) an IRQ0 comes, which helps
+ us in a multitasking environment. This signal comes from the PIC 8259
+ (in arch 386+), which is connected to the PIT 8253 with a clock of 1.19318
+ MHz.
+
+
_____ ______ ______
@@ -2344,25 +2665,32 @@ outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */
outb_p(LATCH & 0xff , 0x40); /* LSB */
outb(LATCH >> 8 , 0x40); /* MSB */
+
-So we program 8253 (PIT, Programmable Interval Timer) with LATCH = (1193180/HZ)
- = 11931.8 when HZ=100 (default). LATCH indicates the frequency divisor factor.
+So we program 8253 (PIT, Programmable Interval Timer) with LATCH
+ = (1193180/HZ) = 11931.8 when HZ=100 (default). LATCH indicates the
+ frequency divisor factor.
+
-LATCH = 11931.8 gives to 8253 (in output) a frequency of 1193180 / 11931.8
- = 100 Hz, so period = 10ms
+LATCH = 11931.8 gives to 8253 (in output) a frequency of 1193180
+ / 11931.8 = 100 Hz, so period = 10ms
+
So Timeslice = 1/HZ.
+
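As a quick check of these numbers, here is a small sketch that computes the divisor and the resulting tick period for a given HZ. The helper names ''pit_latch'' and ''tick_period_us'' are hypothetical. The divisor written to the chip has to be an integer, so this sketch assumes rounding to the nearest integer, which gives 11932 rather than 11931.8 for HZ=100.

```c
#include <assert.h>

#define CLOCK_TICK_RATE 1193180UL /* PIT input clock in Hz, as in the text */

/* Divisor to program into the PIT for a given tick frequency,
 * rounded to the nearest integer (an assumption of this sketch). */
static unsigned long pit_latch(unsigned long hz)
{
    assert(hz > 0);
    return (CLOCK_TICK_RATE + hz / 2) / hz;
}

/* Length of one Timeslice (1/HZ) expressed in microseconds. */
static unsigned long tick_period_us(unsigned long hz)
{
    assert(hz > 0);
    return 1000000UL / hz;
}
```

For HZ=100 this yields a divisor of 11932 and a period of 10000 microseconds (10 ms), matching the figures above.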
-With each Timeslice we temporarily interrupt current process execution
- (without task switching), and we do some housekeeping work, after which we'll
- return back to our previous process.
+With each Timeslice we temporarily interrupt current process
+ execution (without task switching), and we do some housekeeping work,
+ after which we return to our previous process.
+
Linux Timer IRQ ICA
+
Linux Timer IRQ
@@ -2389,10 +2717,13 @@ IRQ 0 [Timer]
|}
|RESTORE_ALL
+
Functions can be found under:
+
+
-
@@ -2407,51 +2738,61 @@ do_timer, update_process_times [kernel/timer.c]
do_softirq [kernel/soft_irq.c]
-
RESTORE_ALL, while loop [arch/i386/kernel/entry.S]
+
Notes:
+
+
-
-Function "IRQ0x00_interrupt" (like others IRQ0xXY_interrupt) is directly
- pointed by IDT (Interrupt Descriptor Table, similar to Real Mode Interrupt
- Vector Table, see Cap 11 for more), so EVERY interrupt coming to the processor
- is managed by "IRQ0x#NR_interrupt" routine, where #NR is the interrupt
- number. We refer to it as "wrapper irq handler".
+Function "IRQ0x00_interrupt" (like the other IRQ0xXY_interrupt routines) is
+ pointed to directly by the IDT (Interrupt Descriptor Table, similar to the Real
+ Mode Interrupt Vector Table, see Cap 11 for more), so EVERY interrupt
+ coming to the processor is managed by the "IRQ0x#NR_interrupt" routine,
+ where #NR is the interrupt number. We refer to it as the "wrapper
+ irq handler".
-
wrapper routines are executed, like "do_IRQ","handle_IRQ_event" [arch/i386/kernel/irq.c].
-
-After this, control is passed to official IRQ routine (pointed by "handler()"),
- previously registered with "request_irq" [arch/i386/kernel/irq.c],
+After this, control is passed to the official IRQ routine (pointed to
+ by "handler()"), previously registered with "request_irq" [arch/i386/kernel/irq.c],
in this case "timer_interrupt" [arch/i386/kernel/time.c].
-
-"timer_interrupt" [arch/i386/kernel/time.c] routine is executed
- and, when it ends,
+"timer_interrupt" [arch/i386/kernel/time.c] routine is
+ executed and, when it ends,
-
control returns to some assembler routines [arch/i386/kernel/entry.S].
+
Description:
+
-To manage Multitasking, Linux (like every other Unix) uses a ''counter''
- variable to keep track of how much CPU was used by the task. So, on each IRQ
- 0, the counter is decremented (point 4) and, when it reaches 0, we need to
- switch task to manage timesharing (point 4 "need_resched" variable is set to
- 1, then, in point 5 assembler routines control "need_resched" and call, if needed,
- "schedule" [kernel/sched.c]).
+To manage Multitasking, Linux (like every other Unix) uses a
+ ''counter'' variable to keep track of how much CPU was used by the
+ task. So, on each IRQ 0, the counter is decremented (point 4) and,
+ when it reaches 0, we need to switch task to manage timesharing (in point
+ 4 the "need_resched" variable is set to 1, then, in point 5, assembler routines
+ check "need_resched" and call, if needed, "schedule" [kernel/sched.c]).
+
Scheduler
-The scheduler is the piece of code that chooses what Task has to be executed
- at a given time.
+The scheduler is the piece of code that chooses what Task has
+ to be executed at a given time.
+
-Any time you need to change running task, select a candidate. Below is
- the ''schedule [kernel/sched.c]'' function.
+Any time the running task has to change, the scheduler selects a candidate.
+ Below is the ''schedule [kernel/sched.c]'' function.
+
+
|schedule
@@ -2471,30 +2812,38 @@ Any time you need to change running task, select a candidate. Below is
|ret *** ret from call using future_EIP in place of call address
new_task
+
Bottom Half, Task Queues, and Tasklets
Overview
-In classic Unix, when an IRQ comes (from a device), Unix makes "task switching"
- to interrogate the task that requested the device.
+In classic Unix, when an IRQ comes (from a device), Unix performs
+ a "task switch" to interrogate the task that requested the device.
+
-To improve performance, Linux can postpone the non-urgent work until later,
- to better manage high speed event.
+To improve performance, Linux can postpone the non-urgent work
+ until later, to better manage high speed events.
+
-This feature is managed since kernel 1.x by the "bottom half" (BH). The irq
- handler "marks" a bottom half, to be executed later, in scheduling time.
+This feature has been managed since kernel 1.x by the "bottom half" (BH).
+ The irq handler "marks" a bottom half, to be executed later, at scheduling
+ time.
+
-In the latest kernels there is a "task queue"that is more dynamic than BH
- and there is also a "tasklet" to manage multiprocessor environments.
+In the latest kernels there is a "task queue" that is more dynamic
+ than BH and there is also a "tasklet" to manage multiprocessor environments.
+
BH schema is:
+
+
-
@@ -2503,9 +2852,11 @@ Declaration
Mark
-
Execution
+
Declaration
+
#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
@@ -2517,17 +2868,21 @@ struct list_head {
#define LIST_HEAD_INIT(name) { &(name), &(name) }
''DECLARE_TASK_QUEUE'' [include/linux/tqueue.h, include/linux/list.h]
+
-"DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named "q" managing
- task queue.
+"DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named
+ "q" managing task queue.
+
Mark
Here is the ICA schema for "mark_bh" [include/linux/interrupt.h]
function:
+
+
|mark_bh(NUMBER)
@@ -2537,17 +2892,22 @@ Here is the ICA schema for "mark_bh" [include/linux/interrupt.h]
|soft_active |= (1 << HI_SOFTIRQ)
''mark_bh''[include/linux/interrupt.h]
+
-For example, when an IRQ handler wants to "postpone" some work, it would
- "mark_bh(NUMBER)", where NUMBER is a BH declarated (see section before).
+For example, when an IRQ handler wants to "postpone" some work,
+ it would "mark_bh(NUMBER)", where NUMBER is a declared BH (see the previous
+ section).
+
Execution
We can see this calling from "do_IRQ" [arch/i386/kernel/irq.c]
function:
+
+
|do_softirq
@@ -2555,27 +2915,34 @@ We can see this calling from "do_IRQ" [arch/i386/kernel/irq.c]
|tasklet_vec[0].list->func
+
"h->action(h);" is the function has been previously queued.
+
Very low level routines
set_intr_gate
+
set_trap_gate
+
set_task_gate (not used).
+
-(*interrupt)[NR_IRQS](void) = { IRQ0x00_interrupt, IRQ0x01_interrupt,
- ..}
+(*interrupt)[NR_IRQS](void) = { IRQ0x00_interrupt,
+ IRQ0x01_interrupt, ..}
+
NR_IRQS = 224 [kernel 2.4.2]
+
Task Switching
@@ -2583,23 +2950,28 @@ Task Switching
When does Task switching occur?
Now we'll see how the Linux Kernel switches from one task to another.
+
Task Switching is needed in many cases, such as the following:
+
+
-
when TimeSlice ends, we need to give access to some other task
-
-when a task decide to access a resource, it sleeps for it, so we have to
- choose another task
+when a task decides to access a resource, it sleeps for it, so
+ we have to choose another task
-
-when a task waits for a pipe, we have to give access to other task, which
- would write to pipe
+when a task waits for a pipe, we have to give access to another
+ task, which would write to the pipe
+
Task Switching
+
TASK SWITCHING TRICK
@@ -2622,17 +2994,22 @@ Task Switching
"a" (prev), "d" (next), \
"b" (prev)); \
} while (0)
+
Trick is here:
+
+
-
''pushl %4'' which puts future_EIP into the stack
-
-''jmp __switch_to'' which execute ''__switch_to'' function, but in opposite
- of ''call'' we will return to valued pushed in point 1 (so new Task!)
+''jmp __switch_to'' which executes the ''__switch_to'' function, but
+ unlike ''call'', we will return to the value pushed in point
+ 1 (so new Task!)
+
@@ -2660,15 +3037,18 @@ Task1 Data/Stack Task1 Code | | |w | |
|__________| |__________| |__________| |__________|
Task2 Data/Stack Task2 Code Kernel Code Kernel Data/Stack
+
Fork
Overview
-Fork is used to create another task. We start from a Task Parent, and we
- copy many data structures to Task Child.
+Fork is used to create another task. We start from a Task Parent,
+ and we copy many data structures to Task Child.
+
+
@@ -2690,28 +3070,35 @@ Fork is used to create another task. We start from a Task Parent, and we
|_________|
Fork SysCall
+
What is not copied
-New Task just created (''Task Child'') is almost equal to Parent (''Task
- Parent''), there are only few differences:
+The New Task just created (''Task Child'') is almost equal to the Parent
+ (''Task Parent''); there are only a few differences:
+
+
-
obviously PID
-
-child ''fork()'' will return 0, while parent ''fork()'' will return PID
- of Task Child, to distinguish them each other in User Mode
+child ''fork()'' will return 0, while parent ''fork()'' will
+ return PID of Task Child, to distinguish them from each other in User
+ Mode
-
-All child data pages are marked ''READ + EXECUTE'', no "WRITE'' (while parent
- has WRITE right for its own pages) so, when a write request comes, a ''Page
- Fault'' exception is generated which will create a new independent page: this
- mechanism is called ''Copy on Write'' (see Cap.10 for more).
+All child data pages are marked ''READ + EXECUTE'', no ''WRITE''
+ (while parent has WRITE right for its own pages) so, when a write
+ request comes, a ''Page Fault'' exception is generated which will
+ create a new independent page: this mechanism is called ''Copy on
+ Write'' (see Cap.10 for more).
+
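Both the different return values of ''fork()'' and the effect of Copy on Write can be observed from User Mode. The following is a minimal sketch, assuming a POSIX system; ''parent_view_after_child_write'' is a hypothetical helper name. The child writes to a variable it inherited, and the parent then checks that its own copy is untouched.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the child overwrite an inherited variable, waits for the
 * child to finish and returns the value the parent still sees. */
static int parent_view_after_child_write(void)
{
    volatile int value = 1;
    pid_t pid = fork();

    assert(pid >= 0); /* a negative value means fork() failed */

    if (pid == 0) {
        /* Child: fork() returned 0. The write lands on the
         * child's own private copy of the page. */
        value = 99;
        _exit(0);
    }

    /* Parent: fork() returned the PID of the child. */
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);

    return value; /* still the parent's original value */
}
```

The parent gets back 1, not 99: the two Tasks stopped sharing that page as soon as one of them wrote to it.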
Fork ICA
+
|sys_fork
@@ -2746,6 +3133,7 @@ Fork ICA
fork ICA
+
@@ -2791,19 +3179,23 @@ copy_thread
SET_LINKS [include/linux/sched.h]
-
wake_up_process [kernel/sched.c]
+
Copy on Write
To implement Copy on Write for Linux:
+
+
-
-Mark all copied pages as read-only, causing a Page Fault when a child tries
- to write to them.
+Mark all copied pages as read-only, causing a Page Fault when
+ a Task tries to write to them.
-
-Page Fault handler creates a new page for the Task caused exception.
+Page Fault handler creates a new page.
+
@@ -2825,6 +3217,7 @@ Page Fault handler creates a new page for the Task caused exception.
Page Fault ICA
+
@@ -2846,27 +3239,33 @@ copy_cow_page
establish_pte
-
set_pte [include/asm/pgtable-3level.h]
+
Linux Memory Management
Overview
-Linux uses segmentation + pagination, which simplifies notation.
+Linux uses segmentation + pagination, which simplifies notation.
+
+
Segments
Linux uses only 4 segments:
+
+
-
-2 segments (code and data/stack) for KERNEL SPACE from [0xC000 0000]
- (3 GB) to [0xFFFF FFFF] (4 GB)
+2 segments (code and data/stack) for KERNEL SPACE from [0xC000
+ 0000] (3 GB) to [0xFFFF FFFF] (4 GB)
-
-2 segments (code and data/stack) for USER SPACE from [0] (0 GB)
- to [0xBFFF FFFF] (3 GB)
+2 segments (code and data/stack) for USER SPACE from [0]
+ (0 GB) to [0xBFFF FFFF] (3 GB)
+
@@ -2886,13 +3285,16 @@ Linux uses only 4 segments:
0x00000000
Kernel/User Linear addresses
+
Specific i386 implementation
-Again, Linux implements Pagination using 3 Levels of Paging, but in i386
- architecture only 2 of them are really used:
+Again, Linux implements Pagination using 3 Levels of Paging,
+ but on the i386 architecture only 2 of them are really used:
+
+
@@ -2930,27 +3332,36 @@ Again, Linux implements Pagination using 3 Levels of Paging, but in i386
+
Memory Mapping
-Linux manages Access Control with Pagination only, so different Tasks will
- have the same segment addresses, but different CR3 (register used to store
- Directory Page Address), pointing to different Page Entries.
+Linux manages Access Control with Pagination only, so different
+ Tasks will have the same segment addresses, but a different CR3 (the
+ register that stores the Page Directory address), pointing to different
+ Page Entries.
+
-In User mode a task cannot overcome 3 GB limit (0 x C0 00 00 00), so only
- the first 768 page directory entries are meaningful (768*4MB = 3GB).
+In User mode a task cannot go beyond the 3 GB limit (0xC000
+ 0000), so only the first 768 page directory entries are meaningful
+ (768*4MB = 3GB).
+
-When a Task goes in Kernel Mode (by System call or by IRQ) the other 256
- pages directory entries become important, and they point to the same page files
- as all other Tasks (which are the same as the Kernel).
+When a Task goes into Kernel Mode (by System call or by IRQ) the
+ other 256 page directory entries become important, and they point
+ to the same page tables as in all other Tasks (which are those of
+ the Kernel).
+
-Note that Kernel (and only kernel) Linear Space is equal to Kernel Physical
- Space, so:
+Note that Kernel (and only kernel) Linear Space is equal to Kernel
+ Physical Space, so:
+
+
@@ -2970,16 +3381,20 @@ Note that Kernel (and only kernel) Linear Space is equal to Kernel Physical
Logical Addresses Physical Addresses
+
-Linear Kernel Space corresponds to Physical Kernel Space translated 3
- GB down (in fact page tables are something like { "00000000", "00000001" },
- so they operate no virtualization, they only report physical addresses they
- take from linear ones).
+Linear Kernel Space corresponds to Physical Kernel Space translated
+ 3 GB down (in fact page tables are something like { "00000000",
+ "00000001" }, so they do no real virtualization: they only return
+ the physical address obtained from the linear one).
+
-Notice that you'll not have an "addresses conflict" between Kernel and User
- spaces because we can manage physical addresses with Page Tables.
+Notice that there is no "address conflict" between Kernel
+ and User spaces because we can manage physical addresses with Page
+ Tables.
+
Low level memory allocation
@@ -2988,43 +3403,56 @@ Boot Initialization
We start from kmem_cache_init (launched by start_kernel [init/main.c]
at boot up).
+
+
|kmem_cache_init
|kmem_cache_estimate
+
kmem_cache_init [mm/slab.c]
+
kmem_cache_estimate
+
Now we continue with mem_init (also launched by start_kernel[init/main.c])
+
+
|mem_init
|free_all_bootmem
|free_all_bootmem_core
+
mem_init [arch/i386/mm/init.c]
+
free_all_bootmem [mm/bootmem.c]
+
free_all_bootmem_core
+
Run-time allocation
-Under Linux, when we want to allocate memory, for example during "copy_on_write"
- mechanism (see Cap.10), we call:
+Under Linux, when we want to allocate memory, for example during the
+ "copy_on_write" mechanism (see Chap.10), we call:
+
+
|copy_mm
@@ -3041,10 +3469,13 @@ Under Linux, when we want to allocate memory, for example during "copy_on_write"
|rmqueue
|reclaim_pages
+
Functions can be found under:
+
+
-
@@ -3075,9 +3506,11 @@ __alloc_pages [mm/page_alloc.c]
rm_queue
-
reclaim_pages [mm/vmscan.c]
+
TODO: Understand Zones
+
Swap
@@ -3085,12 +3518,16 @@ Swap
Overview
Swap is managed by the kswapd daemon (kernel thread).
+
kswapd
-As other kernel threads, kswapd has a main loop that wait to wake up.
+Like other kernel threads, kswapd has a main loop that waits to
+ be woken up.
+
+
|kswapd
@@ -3102,6 +3539,7 @@ As other kernel threads, kswapd has a main loop that wait to wake up.
|run_task_queue
|interruptible_sleep_on_timeout // we sleep for a new swap request
|}
+
@@ -3117,17 +3555,21 @@ refill_inactive_scan [mm/vmswap.c]
run_task_queue [kernel/softirq.c]
-
interruptible_sleep_on_timeout [kernel/sched.c]
+
When do we need swapping?
-Swapping is needed when we have to access a page that is not in physical
- memory.
+Swapping is needed when we have to access a page that is not
+ in physical memory.
+
-Linux uses ''kswapd'' kernel thread to carry out this purpose. When the
- Task receives a page fault exception we do the following:
+Linux uses the ''kswapd'' kernel thread for this purpose.
+ When the Task receives a page fault exception we do the following:
+
+
@@ -3150,6 +3592,7 @@ Linux uses ''kswapd'' kernel thread to carry out this purpose. When the
Page Fault ICA
+
@@ -3173,26 +3616,30 @@ alloc_pages_pgdat
__alloc_pages
-
wakeup_kswapd [mm/vmscan.c]
+
Linux Networking
How Linux networking is managed?
-There exists a device driver for each kind of NIC. Inside it, Linux will
- ALWAYS call a standard high level routing: "netif_rx [net/core/dev.c]",
- which will controls what 3 level protocol the frame belong to, and it will
- call the right 3 level function (so we'll use a pointer to the function to
- determine which is right).
+There exists a device driver for each kind of NIC. Inside it,
+ Linux will ALWAYS call a standard high level routine: "netif_rx [net/core/dev.c]",
+ which checks which level 3 protocol the frame belongs to, and
+ then calls the right level 3 function (so we'll use a pointer to
+ the function to determine which is right).
+
TCP example
-We'll see now an example of what happens when we send a TCP packet to Linux,
- starting from ''netif_rx [net/core/dev.c]'' call.
+We'll now see an example of what happens when we send a TCP packet
+ to Linux, starting from the ''netif_rx [net/core/dev.c]'' call.
+
Interrupt management: "netif_rx"
+
|netif_rx
@@ -3202,27 +3649,34 @@ Interrupt management: "netif_rx"
|cpu_raise_softirq
|softirq_active(cpu) |= (1 << NET_RX_SOFTIRQ) // set bit NET_RX_SOFTIRQ in the BH vector
+
Functions:
+
+
-
__skb_queue_tail [include/linux/skbuff.h]
-
cpu_raise_softirq [kernel/softirq.c]
+
Post Interrupt management: "net_rx_action"
-Once IRQ interaction is ended, we need to follow the next part of the frame
- life and examine what NET_RX_SOFTIRQ does.
+Once IRQ interaction has ended, we need to follow the next part
+ of the frame's life and examine what NET_RX_SOFTIRQ does.
+
-We will next call ''net_rx_action [net/core/dev.c]'' according
- to "net_dev_init [net/core/dev.c]".
+We will next call ''net_rx_action [net/core/dev.c]''
+ according to "net_dev_init [net/core/dev.c]".
+
+
|net_rx_action
@@ -3281,10 +3735,13 @@ We will next call ''net_rx_action [net/core/dev.c]'' according
|if (ACK)
|tcp_set_state(TCP_ESTABLISHED)
+
Functions can be found under:
+
+
-
@@ -3343,19 +3800,23 @@ tcp_rcv_synsent_state_process [net/ipv4/tcp_input.c]
tcp_set_state [include/net/tcp.h]
-
tcp_send_ack [net/ipv4/tcp_output.c]
+
Description:
+
+
-
First we determine protocol type (IP, then TCP)
-
-NF_HOOK (function) is a wrapper routine that first manages the network
- filter (for example firewall), then it calls ''function''.
+NF_HOOK (function) is a wrapper routine that first manages the
+ network filter (for example firewall), then it calls ''function''.
-
After we manage 3-way TCP Handshake which consists of:
+
@@ -3373,17 +3834,20 @@ SERVER (LISTENING) CLIENT (CONNECTING)
3-Way TCP handshake
+
-
In the end we only have to launch "tcp_rcv_established [net/ipv4/tcp_input.c]"
which gives the packet to the user socket and wakes it up.
+
Linux File System
TODO
+
Useful Tips
@@ -3393,9 +3857,11 @@ Stack and Heap
Overview
Here we view how "stack" and "heap" are allocated in memory
+
Memory allocation
+
@@ -3412,17 +3878,22 @@ XX.. | | <-- top of the stack [Stack Pointer&rsqb
Stack
+
-Memory address values start from 00.. (which is also where Stack Segment
- begins) and they grow going toward FF.. value.
+Memory address values start from 00.. (which is also where Stack
+ Segment begins) and they grow going toward FF.. value.
+
XX.. is the actual value of the Stack Pointer.
+
Stack is used by functions for:
+
+
-
@@ -3431,10 +3902,13 @@ global variables
local variables
-
return address
+
For example, for a classical function:
+
+
@@ -3480,6 +3954,7 @@ we have
Note: variables order can be different depending on hardware architecture.
+
Application vs Process
@@ -3487,17 +3962,21 @@ Application vs Process
Base definition
We have to distinguish 2 concepts:
+
+
-
Application: that is the useful code we want to execute
-
-Process: that is the IMAGE on memory of the application (it depends on
- memory strategy used, segmentation and/or Pagination).
+Process: that is the IMAGE in memory of the application (it depends
+ on the memory strategy used, segmentation and/or Pagination).
+
Often Process is also called Task or Thread.
+
Locks
@@ -3505,56 +3984,62 @@ Locks
Overview
2 kind of locks:
+
+
-
intraCPU
-
interCPU
+
Copy_on_write
-Copy_on_write is a mechanism used to reduce memory usage. It postpones
- memory allocation until the memory is really needed.
+Copy_on_write is a mechanism used to reduce memory usage. It
+ postpones memory allocation until the memory is really needed.
+
-For example, when a task executes the "fork()" system call (to create another
- task), we still use the same memory pages as the parent, in read only mode.
- When the new task WRITES into the old page, it causes an exception and the
- page is copied and marked "rw" (read, write).
+For example, when a task executes the "fork()" system call (to
+ create another task), we still use the same memory pages as the
+ parent, in read only mode. When a task WRITES into the page, it causes
+ an exception and the page is copied and marked "rw" (read, write).
+
+
1-) Page X is shared between Task Parent and Task Child
Task Parent
- | | RW Access ______
+ | | RO Access ______
| |---------->|Page X|
|_________| |______|
/|\
|
Task Child |
- | | R Access |
+ | | RO Access |
| |----------------
|_________|
-2-) Write request from Task Child
+2-) Write request
Task Parent
- | | RW Access ______
- | |---------->|Page X|
+ | | RO Access ______
+ | |---------->|Page X| Trying to write
|_________| |______|
/|\
|
Task Child |
- | | W Access |
+ | | RO Access |
| |----------------
|_________|
-3-) Final Configuration: Task Parent and Task Child have an independent copy of the Page, X and Y
+3-) Final Configuration: Both Task Parent and Task Child have an independent copy of the Page, X and Y
Task Parent
| | RW Access ______
| |---------->|Page X|
@@ -3565,28 +4050,33 @@ For example, when a task executes the "fork()" system call (to create another
| | RW Access ______
| |---------->|Page Y|
|_________| |______|
+
80386 specific details
Boot procedure
+
bbootsect.s [arch/i386/boot]
setup.S (+video.S)
head.S (+misc.c) [arch/i386/boot/compressed]
start_kernel [init/main.c]
+
80386 (and more) Descriptors
Overview
-Descriptors are data structure used by Intel microprocessor i386+ to virtualize
- memory.
+Descriptors are data structures used by Intel i386+ microprocessors
+ to virtualize memory.
+
Kind of descriptors
+
-
@@ -3595,17 +4085,20 @@ GDT (Global Descriptor Table)
LDT (Local Descriptor Table)
-
IDT (Interrupt Descriptor Table)
+
IRQ
Overview
-IRQ is an asyncronous signal sent to microprocessor to advertise a requested
- work is completed
+An IRQ is an asynchronous signal sent to the microprocessor to notify it
+ that requested work is completed
+
Interaction schema
+
|<--> IRQ(0) [Timer]
@@ -3623,13 +4116,16 @@ Interaction schema
IRQ - Tasks Interaction Schema
+
What happens?
-A typical O.S. uses many IRQ signals to interrupt normal process execution
- and does some housekeeping work. So:
+A typical O.S. uses many IRQ signals to interrupt normal process
+ execution and do some housekeeping work. So:
+
+
-
@@ -3638,11 +4134,13 @@ IRQ (i) occurs and Task(j) is interrupted
IRQ(i)_handler is executed
-
control backs to Task(j) interrupted
+
-Under Linux, when an IRQ comes, first the IRQ wrapper routine (named "interrupt0x??")
- is called, then the "official" IRQ(i)_handler will be executed. This allows some
- duties like timeslice preemption.
+Under Linux, when an IRQ comes, first the IRQ wrapper routine
+ (named "interrupt0x??") is called, then the "official" IRQ(i)_handler
+ will be executed. This allows some duties like timeslice preemption.
+
Utility functions
@@ -3650,22 +4148,29 @@ Utility functions
list_entry [include/linux/list.h]
Definition:
+
+
#define list_entry(ptr, type, member) \
((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
+
Meaning:
+
-"list_entry" macro is used to retrieve a parent struct pointer, by using
- only one of internal struct pointer.
+The "list_entry" macro is used to retrieve a pointer to the parent struct,
+ by using only a pointer to one of its internal members.
+
Example:
+
+
struct __wait_queue {
@@ -3684,11 +4189,14 @@ typedef struct __wait_queue wait_queue_t;
wait_queue_t *out list_entry(tmp, wait_queue_t, task_list);
// where tmp point to list_head
+
-So, in this case, by means of *tmp pointer [list_head] we retrieve
- an *out pointer [wait_queue_t].
+So, in this case, by means of *tmp pointer [list_head]
+ we retrieve an *out pointer [wait_queue_t].
+
+
@@ -3700,6 +4208,7 @@ So, in this case, by means of *tmp pointer [list_head] we retrieve
| next * -->| | |
|____________| ----- *tmp [we have this]
+
Sleep
@@ -3707,7 +4216,9 @@ Sleep
Sleep code
Files:
+
+
-
@@ -3718,10 +4229,13 @@ include/linux/sched.h
include/linux/wait.h
-
include/linux/list.h
+
Functions:
+
+
-
@@ -3732,10 +4246,13 @@ interruptible_sleep_on_timeout
sleep_on
-
sleep_on_timeout
+
Called functions:
+
+
-
@@ -3748,10 +4265,13 @@ list_add
__list_add
-
__remove_wait_queue
+
InterCallings Analysis:
+
+
|sleep_on
@@ -3765,18 +4285,24 @@ InterCallings Analysis:
|__list_del --
+
Description:
+
-Under Linux each resource (ideally an object shared between many users
- and many processes), , has a queue to manage ALL tasks requesting it.
+Under Linux each resource (ideally an object shared between many
+ users and many processes) has a queue to manage ALL tasks requesting
+ it.
+
-This queue is called "wait queue" and it consists of many items we'll call
- the"wait queue element":
+This queue is called "wait queue" and it consists of many items
+ we'll call the "wait queue element":
+
+
*** wait queue structure [include/linux/wait.h] ***
@@ -3790,10 +4316,13 @@ struct __wait_queue {
struct list_head {
struct list_head *next, *prev;
};
+
Graphic working:
+
+
*** wait queue element ***
@@ -3820,16 +4349,20 @@ Graphic working:
task1 <--[prev *, lock, next *]--> taskN
+
-"wait queue head" point to first (with next *) and last (with prev *) elements
- of the "wait queue list".
+"wait queue head" points to the first (with next *) and last (with prev
+ *) elements of the "wait queue list".
+
When a new element has to be added, "__add_wait_queue" [include/linux/wait.h]
is called, after which the generic routine "list_add" [include/linux/wait.h],
will be executed:
+
+
*** function list_add [include/linux/list.h] ***
@@ -3843,12 +4376,15 @@ static __inline__ void __list_add (struct list_head * new, \
new->prev = prev;
prev->next = new;
}
+
To complete the description, we see also "__list_del" [include/linux/list.h]
- function called by "list_del" [include/linux/list.h] inside "remove_wait_queue"
- [include/linux/wait.h]:
+ function called by "list_del" [include/linux/list.h] inside
+ "remove_wait_queue" [include/linux/wait.h]:
+
+
*** function list_del [include/linux/list.h] ***
@@ -3859,16 +4395,20 @@ static __inline__ void __list_del (struct list_head * prev, struct list_head * n
next->prev = prev;
prev->next = next;
}
+
Stack consideration
-A typical list (or queue) is usually managed allocating it into the Heap
- (see Cap.10 for Heap and Stack definition and about where variables are allocated).
- Otherwise here, we statically allocate Wait Queue data in a local variable
- (Stack), then function is interrupted by scheduling, in the end, (returning
- from scheduling) we'll erase local variable.
+A typical list (or queue) is usually managed by allocating it in
+ the Heap (see Chap.10 for Heap and Stack definition and about where
+ variables are allocated). Here, instead, we statically allocate
+ Wait Queue data in a local variable (Stack), then the function is interrupted
+ by scheduling and, in the end (returning from scheduling), we erase the
+ local variable.
+
+
new task <----| task1 <------| task2 <------|
@@ -3885,32 +4425,42 @@ A typical list (or queue) is usually managed allocating it into the Heap
|__________| |__________| |__________|
Stack Stack Stack
+
Static variables
Overview
-Linux is written in ''C'' language, and as every application has:
+Linux is written in the ''C'' language and, like every application,
+ has:
+
+
-
Local variables
-
-Module variables (inside the source file and relative only to that module)
+Module variables (inside the source file and relative only to
+ that module)
-
-Global/Static variables present in only 1 copy (the same for all modules)
+Global/Static variables present in only 1 copy (the same for
+ all modules)
+
-When a Static variable is modified by a module, all other modules will
- see the new value.
+When a Static variable is modified by a module, all other modules
+ will see the new value.
+
-Static variables under Linux are very important, cause they are the only
- kind to add new support to kernel: they typically are pointers to the head
- of a list of registered elements, which can be:
+Static variables under Linux are very important, because they are
+ the only way to add new support to the kernel: they typically are pointers
+ to the head of a list of registered elements, which can be:
+
+
-
@@ -3919,37 +4469,47 @@ added
deleted
-
maybe modified
+
_______ _______ _______
Global variable -------> |Item(1)| -> |Item(2)| -> |Item(3)| ..
|_______| |_______| |_______|
+
Main variables
Current
+
________________
Current ----------------> | Actual process |
|________________|
+
-Current points to ''task_struct'' structure, which contains all data about
- a process like:
+Current points to ''task_struct'' structure, which contains all
+ data about a process like:
+
+
-
pid, name, state, counter, policy of scheduling
-
-pointers to many data structures like: files, vfs, other processes, signals...
+pointers to many data structures like: files, vfs, other processes,
+ signals...
+
Current is not a real variable, it is
+
+
static inline struct task_struct * get_current(void) {
@@ -3958,103 +4518,129 @@ static inline struct task_struct * get_current(void) {
return current;
}
#define current get_current()
+
-Above lines just takes value of ''esp'' register (stack pointer) and get
- it available like a variable, from which we can point to our task_struct structure.
-
-
-From ''current'' element we can access directly to any other process (ready,
- stopped or in any other state) kernel data structure, for example changing
- STATE (like a I/O driver does), PID, presence in ready list or blocked list,
- etc.
+The lines above just take the value of the ''esp'' register (stack pointer)
+ and make it available like a variable, from which we can point to
+ our task_struct structure.
+
+From the ''current'' element we can directly access the kernel data
+ structure of any other process (ready, stopped or in any other state),
+ for example changing STATE (as an I/O driver does), PID, presence
+ in ready list or blocked list, etc.
Registered filesystems
+
______ _______ ______
file_systems ------> | ext2 | -> | msdos | -> | ntfs |
[fs/super.c] |______| |_______| |______|
+
-When you use command like ''modprobe some_fs'' you will add a new entry
- to file systems list, while removing it (by using ''rmmod'') will delete it.
+When you use a command like ''modprobe some_fs'' you will add a
+ new entry to the file systems list, while removing it (by using ''rmmod'')
+ will delete it.
+
Mounted filesystems
+
______ _______ ______
mount_hash_table ---->| / | -> | /usr | -> | /var |
[fs/namespace.c] |______| |_______| |______|
+
-When you use ''mount'' command to add a fs, the new entry will be inserted
- in the list, while an ''umount'' command will delete the entry.
+When you use the ''mount'' command to add a fs, the new entry will
+ be inserted in the list, while a ''umount'' command will delete
+ the entry.
+
Registered Network Packet Type
+
______ _______ ______
ptype_all ------>| ip | -> | x25 | -> | ipv6 |
[net/core/dev.c] |______| |_______| |______|
+
-For example, if you add support for IPv6 (loading relative module) a new
- entry will be added in the list.
+For example, if you add support for IPv6 (by loading the relevant module)
+ a new entry will be added in the list.
+
Registered Network Internet Protocol
+
______ _______ _______
inet_protocol_base ----->| icmp | -> | tcp | -> | udp |
[net/ipv4/protocol.c] |______| |_______| |_______|
+
-Also others packet type have many internal protocols in each list (like
- IPv6).
+Other packet types also have many internal protocols in their own
+ list (like IPv6).
+
+
______ _______ _______
inet6_protos ----------->|icmpv6| -> | tcpv6 | -> | udpv6 |
[net/ipv6/protocol.c] |______| |_______| |_______|
+
Registered Network Device
+
______ _______ _______
dev_base --------------->| lo | -> | eth0 | -> | ppp0 |
[drivers/core/Space.c] |______| |_______| |_______|
+
Registered Char Device
+
______ _______ ________
chrdevs ---------------->| lp | -> | keyb | -> | serial |
[fs/devices.c] |______| |_______| |________|
+
-''chrdevs'' is not a pointer to a real list, but it is a standard vector.
+''chrdevs'' is not a pointer to a real list, but it is a standard
+ vector.
+
Registered Block Device
+
______ ______ ________
bdev_hashtable --------->| fd | -> | hd | -> | scsi |
[fs/block_dev.c] |______| |______| |________|
+
''bdev_hashtable'' is an hash vector.
+
Glossary
@@ -4062,16 +4648,22 @@ Glossary
Links
+
+
+
-
+
+
+
+