  KernelAnalysis-HOWTO
  Roberto Arcomano berto@bertolinux.com
  v0.7, March 26, 2003

  This document tries to explain some things about the Linux Kernel,
  such as the most important components, how they work, and so on.  This
  HOWTO should help prevent the reader from needing to browse all the
  kernel source files searching for the "right function," declaration,
  and definition, and then linking each to the other. You can find the
  latest version of this document at http://www.bertolinux.com
  <http://www.bertolinux.com>. If you have suggestions to help make this
  document better, please submit your ideas to me at the following
  address: berto@bertolinux.com <mailto:berto@bertolinux.com>

  1.  Introduction

  1.1.  Introduction

  This HOWTO tries to define how parts of the Linux Kernel work, what
  the main functions and data structures are, and how the "wheel
  spins". You can find the latest version of this document at
  http://www.bertolinux.com <http://www.bertolinux.com>. If you have
  suggestions to help make this document better, please submit your
  ideas to me at the following address: berto@bertolinux.com
  <mailto:berto@bertolinux.com>. Code used within this document refers
  to the Linux Kernel version 2.4.x, which is the last stable kernel
  version at the time of writing this HOWTO.


  1.2.  Copyright

  Copyright (C) 2000,2001,2002 Roberto Arcomano. This document is free;
  you can redistribute it and/or modify it under the terms of the GNU
  General Public License as published by the Free Software Foundation;
  either version 2 of the License, or (at your option) any later
  version. This document is distributed in the hope that it will be
  useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  General Public License for more details. You can get a copy of the GNU
  GPL here <http://www.gnu.org/copyleft/gpl.html>


  1.3.  Translations

  If you want to translate this document you are free to do so.
  However, you will need to do the following:


  1. Check that another version of the document doesn't already exist at
     your local LDP

  2. Maintain all 'Introduction' sections (including 'Introduction',

  Warning! Do not translate the TXT or HTML file. Modify the LyX file
  instead, so that it can be converted to all other formats (TXT, HTML,
  RIFF, etc.). To do that you can use the "LyX" application, which you
  can download from http://www.lyx.org <http://www.lyx.org>.


  There is no need to ask me for permission to translate! Just let me
  know (if you want) about your translation.


  Thank you for your translation!

  1.4.  Credits

  Thanks to Linux Documentation Project <http://www.tldp.org> for
  publishing and uploading my document quickly.


  Thanks to Klaas de Waal for his suggestions.


  2.  Syntax used

  2.1.  Function Syntax

  When speaking about a function, we write:


  "function_name  [ file location . extension ]"


  For example:


  "schedule [kernel/sched.c]"


  tells us that we are talking about the function "schedule", which can
  be found in the file [ kernel/sched.c ].


  Note: We also assume /usr/src/linux as the starting directory.


  2.2.  Indentation

  Indentation in source code is 3 blank characters.


  2.3.  InterCallings Analysis

  2.3.1.  Overview

  We use the "InterCallings Analysis" (ICA) to see (in an indented
  fashion) how kernel functions call each other.


  For example, the sleep_on function is described in the ICA below:


  |sleep_on
  |init_waitqueue_entry      --
  |__add_wait_queue            |   enqueuing request
     |list_add                 |
        |__list_add          --
     |schedule              ---     waiting for request to be executed
        |__remove_wait_queue --
        |list_del              |   dequeuing request
           |__list_del       --

                            sleep_on ICA


  The indented ICA is followed by functions' locations:


  o  sleep_on [kernel/sched.c]

  o  init_waitqueue_entry [include/linux/wait.h]

  o  __add_wait_queue

  o  list_add [include/linux/list.h]

  o  __list_add

  o  schedule [kernel/sched.c]

  o  __remove_wait_queue [include/linux/wait.h]

  o  list_del [include/linux/list.h]

  o  __list_del

  Note: We omit the file location when it is the same as the one given
  for the preceding function.
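

  The enqueuing and dequeuing steps in the sleep_on ICA come down to
  linked list operations. The following is a minimal user-space sketch
  of that idea, modeled on the circular doubly linked list found in
  [include/linux/list.h]. It is not a copy of the kernel header:
  init_list_head and list_is_empty are helper names assumed here for
  illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified list node: every element points to its two neighbours. */
struct list_head {
   struct list_head *next, *prev;
};

/* An empty list is a head that points to itself (assumed helper name). */
static void init_list_head(struct list_head *head)
{
   assert(head != NULL);
   head->next = head;
   head->prev = head;
}

/* Insert 'new_node' between two known consecutive entries. */
static void __list_add(struct list_head *new_node,
                       struct list_head *prev,
                       struct list_head *next)
{
   next->prev = new_node;
   new_node->next = next;
   new_node->prev = prev;
   prev->next = new_node;
}

/* Enqueue: add 'new_node' right after 'head'. */
static void list_add(struct list_head *new_node, struct list_head *head)
{
   __list_add(new_node, head, head->next);
}

/* Unlink whatever sits between 'prev' and 'next'. */
static void __list_del(struct list_head *prev, struct list_head *next)
{
   next->prev = prev;
   prev->next = next;
}

/* Dequeue: remove 'entry' from the list it belongs to. */
static void list_del(struct list_head *entry)
{
   __list_del(entry->prev, entry->next);
}

/* Illustration-only helper: true when nothing is queued. */
static int list_is_empty(const struct list_head *head)
{
   return head->next == head;
}
```

  In this sketch, list_add plays the "enqueuing request" role and
  list_del the "dequeuing request" role shown in the ICA.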


  2.3.2.  Details

  In an ICA, a line like the following


   function1 -> function2


  means that < function1 > is a generic pointer to another function. In
  this case < function1 >  points to < function2 >.
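

  In C terms, the arrow notation can be read as a function pointer
  assignment. A tiny sketch, where function1, function2 and
  call_through_pointer are placeholder names taken from (or added to)
  the notation above and not real kernel symbols:

```c
/* The function that really does the work. */
static int function2(int x)
{
   return x + 1;
}

/* function1 -> function2: function1 is only a pointer to function2. */
static int (*function1)(int) = function2;

/* Calling through the pointer ends up executing function2. */
static int call_through_pointer(int x)
{
   return function1(x);
}
```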


  When we write:


    function:


  it means that < function > is not a real function. It is a label
  (typically assembler label).

  In many sections we show ''C'' code or ''pseudo-code'', whereas the
  real source files may use ''assembler'' or ''unstructured'' code.
  This difference is for learning purposes.


  2.3.3.  PROs of using ICA

  The advantages of using ICA (InterCallings Analysis) are many:


  o  You get an overview of what happens when you call a kernel function


  o  Function locations are indicated after the function, so ICA could
     also be considered as a little ''function reference''

  o  InterCallings Analysis (ICA) is useful in sleep/awake mechanisms,
     where we can view what we do before sleeping, the proper sleeping
     action, and what we'll do after waking up (after schedule).

  2.3.4.  CONTROs of using ICA


  Some of the disadvantages of using ICA are listed below:

  o  As with all theoretical models, we simplify reality, avoiding many
     details such as real source code and special conditions.


  o  Additional diagrams should be added to better represent stack
     conditions, data values, and so on.

  3.  Fundamentals

  3.1.  What is the kernel?

  The kernel is the "core" of any computer system: it is the  "software"
  which allows users to share computer resources.


  The kernel can be thought of as the main software of the OS
  (Operating System), which may also include graphics management.


  For example, under Linux (like other Unix-like OSs), the XWindow
  environment doesn't belong to the Linux Kernel, because it manages
  only graphical operations (it uses user mode I/O to access video card
  devices).


  By contrast, Windows environments (Win9x, WinME, WinNT, Win2K, WinXP,
  and so on) are a mix between a graphical environment and kernel.


  3.2.  What is the difference between User Mode and Kernel Mode?

  3.2.1.  Overview

  Many years ago, when computers were as big as a room, users ran their
  applications with much difficulty and, sometimes, their applications
  crashed the computer.


  3.2.2.  Operative modes

  To avoid having applications that constantly crashed, newer OSs were
  designed with 2 different operative modes:


  1. Kernel Mode: the machine operates with critical data structures,
     direct hardware (IN/OUT or memory mapped), direct memory, IRQ, DMA,
     and so on.

  2. User Mode: users can run applications.


                 |          Applications           /|\
                 |         ______________           |
                 |         | User Mode  |           |
                 |         ______________           |
                 |               |                  |
  Implementation |        _______________           |   Abstraction
      Detail     |        | Kernel Mode |           |
                 |        _______________           |
                 |               |                  |
                 |               |                  |
                 |               |                  |
                \|/          Hardware               |


  Kernel Mode "prevents" User Mode applications from damaging the system
  or its features.


  Modern microprocessors implement at least 2 different states in
  hardware. For example, Intel processors provide 4 states that
  determine the PL (Privilege Level), numbered 0, 1, 2 and 3, with 0
  used for Kernel Mode.


  A Unix OS requires only 2 privilege levels, and we will use such a
  paradigm as our point of reference.


  3.3.  Switching from User Mode to Kernel Mode

  3.3.1.  When do we switch?

  Once we understand that there are 2 different modes, we have to know
  when we switch from one to the other.


  Typically, there are 2 points of switching:


  1. When calling a System Call: after calling a System Call, the task
     voluntarily calls pieces of code living in Kernel Mode.

  2. When an IRQ (or exception) comes: after the IRQ, an IRQ handler (or
     exception handler) is called, then control returns to the task
     that was interrupted as if nothing had happened.


  3.3.2.  System Calls

  System calls are like special functions that manage OS routines which
  live in Kernel Mode.


  A system call can be called when we:


  o  access an I/O device or a file (like read or write)

  o  need to access privileged information (like pid, changing
     scheduling policy or other information)

  o  need to change execution context (like forking or executing some
     other application)

  o  need to execute a particular command (like ''chdir'', ''kill'',


                                   |                |
                           ------->| System Call i  | (Accessing Devices)
  |                |       |       |  [sys_read()]  |
  | ...            |       |       |                |
  | system_call(i) |--------       |                |
  |   [read()]     |               |                |
  | ...            |               |                |
  | system_call(j) |--------       |                |
  |   [get_pid()]  |       |       |                |
  | ...            |       ------->| System Call j  | (Accessing kernel data structures)
  |                |               |  [sys_getpid()]|
                                   |                |

      USER MODE                        KERNEL MODE


                          Unix System Calls Working


  System calls are almost the only interface used by User Mode to talk
  with low level resources (hardware). The only exception to this
  statement is when a process uses the ''ioperm'' system call. In this
  case a device can be accessed directly by a User Mode process (IRQs
  cannot be used).


  NOTE: Not every ''C'' function is a system call, only some of them.
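

  To see the difference between a ''C'' library function and the
  system call behind it, the same request can be made in two ways. The
  sketch below assumes a Linux system with glibc, where syscall() and
  SYS_getpid are available. The helper names pid_via_wrapper and
  pid_via_raw_syscall are invented for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for the process ID through the ordinary C library wrapper. */
static pid_t pid_via_wrapper(void)
{
   return getpid();
}

/* Ask for the same value by system call number, bypassing the wrapper. */
static pid_t pid_via_raw_syscall(void)
{
   return (pid_t)syscall(SYS_getpid);
}
```

  Both helpers should report the same process ID, because the library
  wrapper is only a convenient way to enter Kernel Mode for that
  system call.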


  Below is a list of System Calls under Linux Kernel 2.4.17, from [
  arch/i386/kernel/entry.S ]


          .long SYMBOL_NAME(sys_ni_syscall)       /* 0  -  old "setup()" system call*/
          .long SYMBOL_NAME(sys_exit)
          .long SYMBOL_NAME(sys_fork)
          .long SYMBOL_NAME(sys_read)
          .long SYMBOL_NAME(sys_write)
          .long SYMBOL_NAME(sys_open)             /* 5 */
          .long SYMBOL_NAME(sys_close)
          .long SYMBOL_NAME(sys_waitpid)
          .long SYMBOL_NAME(sys_creat)
          .long SYMBOL_NAME(sys_link)
          .long SYMBOL_NAME(sys_unlink)           /* 10 */
          .long SYMBOL_NAME(sys_execve)
          .long SYMBOL_NAME(sys_chdir)
          .long SYMBOL_NAME(sys_time)
          .long SYMBOL_NAME(sys_mknod)
          .long SYMBOL_NAME(sys_chmod)            /* 15 */
          .long SYMBOL_NAME(sys_lchown16)
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old break syscall holder */
          .long SYMBOL_NAME(sys_stat)
          .long SYMBOL_NAME(sys_lseek)
          .long SYMBOL_NAME(sys_getpid)           /* 20 */
          .long SYMBOL_NAME(sys_mount)
          .long SYMBOL_NAME(sys_oldumount)
          .long SYMBOL_NAME(sys_setuid16)
          .long SYMBOL_NAME(sys_getuid16)
          .long SYMBOL_NAME(sys_stime)            /* 25 */
          .long SYMBOL_NAME(sys_ptrace)
          .long SYMBOL_NAME(sys_alarm)
          .long SYMBOL_NAME(sys_fstat)
          .long SYMBOL_NAME(sys_pause)
          .long SYMBOL_NAME(sys_utime)            /* 30 */
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old stty syscall holder */
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old gtty syscall holder */
          .long SYMBOL_NAME(sys_access)
          .long SYMBOL_NAME(sys_nice)
          .long SYMBOL_NAME(sys_ni_syscall)       /* 35 */                /* old ftime syscall holder */
          .long SYMBOL_NAME(sys_sync)
          .long SYMBOL_NAME(sys_kill)
          .long SYMBOL_NAME(sys_rename)
          .long SYMBOL_NAME(sys_mkdir)
          .long SYMBOL_NAME(sys_rmdir)            /* 40 */
          .long SYMBOL_NAME(sys_dup)
          .long SYMBOL_NAME(sys_pipe)
          .long SYMBOL_NAME(sys_times)
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old prof syscall holder */
          .long SYMBOL_NAME(sys_brk)              /* 45 */
          .long SYMBOL_NAME(sys_setgid16)
          .long SYMBOL_NAME(sys_getgid16)
          .long SYMBOL_NAME(sys_signal)
          .long SYMBOL_NAME(sys_geteuid16)
          .long SYMBOL_NAME(sys_getegid16)        /* 50 */
          .long SYMBOL_NAME(sys_acct)
          .long SYMBOL_NAME(sys_umount)                                   /* recycled never used phys() */
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old lock syscall holder */
          .long SYMBOL_NAME(sys_ioctl)
          .long SYMBOL_NAME(sys_fcntl)            /* 55 */
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old mpx syscall holder */
          .long SYMBOL_NAME(sys_setpgid)
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old ulimit syscall holder */
          .long SYMBOL_NAME(sys_olduname)
          .long SYMBOL_NAME(sys_umask)            /* 60 */
          .long SYMBOL_NAME(sys_chroot)
          .long SYMBOL_NAME(sys_ustat)
          .long SYMBOL_NAME(sys_dup2)
          .long SYMBOL_NAME(sys_getppid)
          .long SYMBOL_NAME(sys_getpgrp)          /* 65 */
          .long SYMBOL_NAME(sys_setsid)
          .long SYMBOL_NAME(sys_sigaction)
          .long SYMBOL_NAME(sys_sgetmask)
          .long SYMBOL_NAME(sys_ssetmask)
          .long SYMBOL_NAME(sys_setreuid16)       /* 70 */
          .long SYMBOL_NAME(sys_setregid16)
          .long SYMBOL_NAME(sys_sigsuspend)
          .long SYMBOL_NAME(sys_sigpending)
          .long SYMBOL_NAME(sys_sethostname)
          .long SYMBOL_NAME(sys_setrlimit)        /* 75 */
          .long SYMBOL_NAME(sys_old_getrlimit)
          .long SYMBOL_NAME(sys_getrusage)
          .long SYMBOL_NAME(sys_gettimeofday)
          .long SYMBOL_NAME(sys_settimeofday)
          .long SYMBOL_NAME(sys_getgroups16)      /* 80 */
          .long SYMBOL_NAME(sys_setgroups16)
          .long SYMBOL_NAME(old_select)
          .long SYMBOL_NAME(sys_symlink)
          .long SYMBOL_NAME(sys_lstat)
          .long SYMBOL_NAME(sys_readlink)         /* 85 */
          .long SYMBOL_NAME(sys_uselib)
          .long SYMBOL_NAME(sys_swapon)
          .long SYMBOL_NAME(sys_reboot)
          .long SYMBOL_NAME(old_readdir)
          .long SYMBOL_NAME(old_mmap)             /* 90 */
          .long SYMBOL_NAME(sys_munmap)
          .long SYMBOL_NAME(sys_truncate)
          .long SYMBOL_NAME(sys_ftruncate)
          .long SYMBOL_NAME(sys_fchmod)
          .long SYMBOL_NAME(sys_fchown16)         /* 95 */
          .long SYMBOL_NAME(sys_getpriority)
          .long SYMBOL_NAME(sys_setpriority)
          .long SYMBOL_NAME(sys_ni_syscall)                               /* old profil syscall holder */
          .long SYMBOL_NAME(sys_statfs)
          .long SYMBOL_NAME(sys_fstatfs)          /* 100 */
          .long SYMBOL_NAME(sys_ioperm)
          .long SYMBOL_NAME(sys_socketcall)
          .long SYMBOL_NAME(sys_syslog)
          .long SYMBOL_NAME(sys_setitimer)
          .long SYMBOL_NAME(sys_getitimer)        /* 105 */
          .long SYMBOL_NAME(sys_newstat)
          .long SYMBOL_NAME(sys_newlstat)
          .long SYMBOL_NAME(sys_newfstat)
          .long SYMBOL_NAME(sys_uname)
          .long SYMBOL_NAME(sys_iopl)             /* 110 */
          .long SYMBOL_NAME(sys_vhangup)
          .long SYMBOL_NAME(sys_ni_syscall)       /* old "idle" system call */
          .long SYMBOL_NAME(sys_vm86old)
          .long SYMBOL_NAME(sys_wait4)
          .long SYMBOL_NAME(sys_swapoff)          /* 115 */
          .long SYMBOL_NAME(sys_sysinfo)
          .long SYMBOL_NAME(sys_ipc)
          .long SYMBOL_NAME(sys_fsync)
          .long SYMBOL_NAME(sys_sigreturn)
          .long SYMBOL_NAME(sys_clone)            /* 120 */
          .long SYMBOL_NAME(sys_setdomainname)
          .long SYMBOL_NAME(sys_newuname)
          .long SYMBOL_NAME(sys_modify_ldt)
          .long SYMBOL_NAME(sys_adjtimex)
          .long SYMBOL_NAME(sys_mprotect)         /* 125 */
          .long SYMBOL_NAME(sys_sigprocmask)
          .long SYMBOL_NAME(sys_create_module)
          .long SYMBOL_NAME(sys_init_module)
          .long SYMBOL_NAME(sys_delete_module)
          .long SYMBOL_NAME(sys_get_kernel_syms)  /* 130 */
          .long SYMBOL_NAME(sys_quotactl)
          .long SYMBOL_NAME(sys_getpgid)
          .long SYMBOL_NAME(sys_fchdir)
          .long SYMBOL_NAME(sys_bdflush)
          .long SYMBOL_NAME(sys_sysfs)            /* 135 */
          .long SYMBOL_NAME(sys_personality)
          .long SYMBOL_NAME(sys_ni_syscall)       /* for afs_syscall */
          .long SYMBOL_NAME(sys_setfsuid16)
          .long SYMBOL_NAME(sys_setfsgid16)
          .long SYMBOL_NAME(sys_llseek)           /* 140 */
          .long SYMBOL_NAME(sys_getdents)
          .long SYMBOL_NAME(sys_select)
          .long SYMBOL_NAME(sys_flock)
          .long SYMBOL_NAME(sys_msync)
          .long SYMBOL_NAME(sys_readv)            /* 145 */
          .long SYMBOL_NAME(sys_writev)
          .long SYMBOL_NAME(sys_getsid)
          .long SYMBOL_NAME(sys_fdatasync)
          .long SYMBOL_NAME(sys_sysctl)
          .long SYMBOL_NAME(sys_mlock)            /* 150 */
          .long SYMBOL_NAME(sys_munlock)
          .long SYMBOL_NAME(sys_mlockall)
          .long SYMBOL_NAME(sys_munlockall)
          .long SYMBOL_NAME(sys_sched_setparam)
          .long SYMBOL_NAME(sys_sched_getparam)   /* 155 */
          .long SYMBOL_NAME(sys_sched_setscheduler)
          .long SYMBOL_NAME(sys_sched_getscheduler)
          .long SYMBOL_NAME(sys_sched_yield)
          .long SYMBOL_NAME(sys_sched_get_priority_max)
          .long SYMBOL_NAME(sys_sched_get_priority_min)  /* 160 */
          .long SYMBOL_NAME(sys_sched_rr_get_interval)
          .long SYMBOL_NAME(sys_nanosleep)
          .long SYMBOL_NAME(sys_mremap)
          .long SYMBOL_NAME(sys_setresuid16)
          .long SYMBOL_NAME(sys_getresuid16)      /* 165 */
          .long SYMBOL_NAME(sys_vm86)
          .long SYMBOL_NAME(sys_query_module)
          .long SYMBOL_NAME(sys_poll)
          .long SYMBOL_NAME(sys_nfsservctl)
          .long SYMBOL_NAME(sys_setresgid16)      /* 170 */
          .long SYMBOL_NAME(sys_getresgid16)
          .long SYMBOL_NAME(sys_prctl)
          .long SYMBOL_NAME(sys_rt_sigreturn)
          .long SYMBOL_NAME(sys_rt_sigaction)
          .long SYMBOL_NAME(sys_rt_sigprocmask)   /* 175 */
          .long SYMBOL_NAME(sys_rt_sigpending)
          .long SYMBOL_NAME(sys_rt_sigtimedwait)
          .long SYMBOL_NAME(sys_rt_sigqueueinfo)
          .long SYMBOL_NAME(sys_rt_sigsuspend)
          .long SYMBOL_NAME(sys_pread)            /* 180 */
          .long SYMBOL_NAME(sys_pwrite)
          .long SYMBOL_NAME(sys_chown16)
          .long SYMBOL_NAME(sys_getcwd)
          .long SYMBOL_NAME(sys_capget)
          .long SYMBOL_NAME(sys_capset)           /* 185 */
          .long SYMBOL_NAME(sys_sigaltstack)
          .long SYMBOL_NAME(sys_sendfile)
          .long SYMBOL_NAME(sys_ni_syscall)               /* streams1 */
          .long SYMBOL_NAME(sys_ni_syscall)               /* streams2 */
          .long SYMBOL_NAME(sys_vfork)            /* 190 */
          .long SYMBOL_NAME(sys_getrlimit)
          .long SYMBOL_NAME(sys_mmap2)
          .long SYMBOL_NAME(sys_truncate64)
          .long SYMBOL_NAME(sys_ftruncate64)
          .long SYMBOL_NAME(sys_stat64)           /* 195 */
          .long SYMBOL_NAME(sys_lstat64)
          .long SYMBOL_NAME(sys_fstat64)
          .long SYMBOL_NAME(sys_lchown)
          .long SYMBOL_NAME(sys_getuid)
          .long SYMBOL_NAME(sys_getgid)           /* 200 */
          .long SYMBOL_NAME(sys_geteuid)
          .long SYMBOL_NAME(sys_getegid)
          .long SYMBOL_NAME(sys_setreuid)
          .long SYMBOL_NAME(sys_setregid)
          .long SYMBOL_NAME(sys_getgroups)        /* 205 */
          .long SYMBOL_NAME(sys_setgroups)
          .long SYMBOL_NAME(sys_fchown)
          .long SYMBOL_NAME(sys_setresuid)
          .long SYMBOL_NAME(sys_getresuid)
          .long SYMBOL_NAME(sys_setresgid)        /* 210 */
          .long SYMBOL_NAME(sys_getresgid)
          .long SYMBOL_NAME(sys_chown)
          .long SYMBOL_NAME(sys_setuid)
          .long SYMBOL_NAME(sys_setgid)
          .long SYMBOL_NAME(sys_setfsuid)         /* 215 */
          .long SYMBOL_NAME(sys_setfsgid)
          .long SYMBOL_NAME(sys_pivot_root)
          .long SYMBOL_NAME(sys_mincore)
          .long SYMBOL_NAME(sys_madvise)
          .long SYMBOL_NAME(sys_getdents64)       /* 220 */
          .long SYMBOL_NAME(sys_fcntl64)
          .long SYMBOL_NAME(sys_ni_syscall)       /* reserved for TUX */
          .long SYMBOL_NAME(sys_ni_syscall)       /* Reserved for Security */
          .long SYMBOL_NAME(sys_gettid)
          .long SYMBOL_NAME(sys_readahead)        /* 225 */


  3.3.3.  IRQ Event

  When an IRQ comes, the task that is running is interrupted in order to
  service the IRQ Handler.


  After the IRQ is handled, control returns exactly to the point of
  interruption, as if nothing had happened.


                Running Task
               |-----------|          (3)
  NORMAL       |   |       | [break execution] IRQ Handler
  EXECUTION (1)|   |       |     ------------->|---------|
               |  \|/      |     |             |  does   |
   IRQ (2)---->| ..        |----->             |  some   |
               |   |       |<-----             |  work   |
  BACK TO      |   |       |     |             |  ..(4). |
  NORMAL    (6)|  \|/      |     <-------------|_________|
  EXECUTION    |___________|  [return to code]
                                      (5)
                 USER MODE                     KERNEL MODE

           User->Kernel Mode Transition caused by IRQ event


  The numbered steps below refer to the sequence of events in the
  diagram above:

  1. Process is executing

  2. IRQ comes while the task is running.

  3. Task is interrupted to call an "Interrupt handler".

  4. The "Interrupt handler" code is executed.

  5. Control returns back to task user mode (as if nothing happened)

  6. Process returns back to normal execution

  Of special interest is the Timer IRQ, which comes every TIMER ms to
  manage:


  1. Alarms

  2. System and task counters (used by schedule to decide when to stop
     a process, or for accounting)

  3. Multitasking based on a wake up mechanism after TIMESLICE time.
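

  The counter handling done on each Timer IRQ can be sketched as
  follows. This is a simplified model and not the kernel's timer code:
  struct tick_task, toy_timer_tick and the value of
  TOY_TIMESLICE_TICKS are assumptions made for this example.

```c
/* Per-task bookkeeping touched on every timer tick. */
struct tick_task {
   int counter;                /* ticks left in the current timeslice */
   int need_resched;           /* set when the scheduler should run   */
   unsigned long ticks_used;   /* accounting: total ticks consumed    */
};

#define TOY_TIMESLICE_TICKS 5   /* hypothetical timeslice length */

/* Called once per Timer IRQ for the task that was running. */
static void toy_timer_tick(struct tick_task *current_task)
{
   current_task->ticks_used++;

   if (current_task->counter > 0)
      current_task->counter--;

   /* Timeslice is over: ask for a task switch. */
   if (current_task->counter == 0)
      current_task->need_resched = 1;
}
```

  A task that starts with counter set to TOY_TIMESLICE_TICKS runs for
  that many ticks before need_resched is raised.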

  3.4.  Multitasking

  3.4.1.  Mechanism

  The key point of modern OSs is the "Task". The Task is an application
  running in memory, sharing all resources (including CPU and Memory)
  with other Tasks.


  This "resource sharing" is managed by the  "Multitasking Mechanism".
  The Multitasking Mechanism switches from one task to another after a
  "timeslice" time. Users have the "illusion" that they own all
  resources.  We can also imagine a single user scenario, where a user
  can have the "illusion" of running many tasks at the same time.


  To implement this multitasking, each task has a "state" variable,
  which can be:


  1. READY, ready for execution

  2. BLOCKED, waiting for a resource

  The task state is managed by its presence in the corresponding list:
  the READY list or the BLOCKED list.


  3.4.2.  Task Switching

  The movement from one task to another is called ''Task Switching''.
  Many computers have a hardware instruction which automatically
  performs this operation. Task Switching occurs in the following cases:


  1. After Timeslice ends: we need to schedule a "Ready for execution"
     task and give it access.

  2. When a Task has to wait for a device: we need to schedule a new
     task and switch to it *

  * We schedule another task to prevent "Busy Waiting", which occurs
  when we keep waiting for a device instead of performing other work.


  Task Switching is managed by the "Schedule" entity.


  Timer    |           |
   IRQ     |           |                            Schedule
    |      |           |                     ________________________
    |----->|   Task 1  |<------------------>|(1)Chooses a Ready Task |
    |      |           |                    |(2)Task Switching       |
    |      |___________|                    |________________________|
    |      |           |                               /|\
    |      |           |                                |
    |      |           |                                |
    |      |           |                                |
    |      |           |                                |
    |----->|   Task 2  |<-------------------------------|
    |      |           |                                |
    |      |___________|                                |
    .      .     .     .                                .
    .      .     .     .                                .
    .      .     .     .                                .
    |      |           |                                |
    |      |           |                                |
    ------>|   Task N  |<--------------------------------
           |           |
           |___________|

              Task Switching based on TimeSlice


  A typical Timeslice for Linux is about 10 ms.


   |           |
   |           | Resource    _____________________________
   |   Task 1  |----------->|(1) Enqueue Resource request |
   |           |  Access    |(2)  Mark Task as blocked    |
   |           |            |(3)  Choose a Ready Task     |
   |___________|            |(4)    Task Switching        |
                            |_____________________________|
                                         |
                                         |
   |           |                         |
   |           |                         |
   |   Task 2  |<-------------------------
   |           |
   |           |
   |___________|

       Task Switching based on Waiting for a Resource
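

  The two task switching cases can be modelled with a small round-robin
  sketch. It is an illustration under simplifying assumptions, not the
  Linux scheduler: tasks live in a plain array, the resource queue of
  step (1) is left out, and all names (sched_task, toy_pick_next,
  toy_wait_for_resource, toy_wake_up) are invented.

```c
enum toy_state { TOY_READY, TOY_BLOCKED };

struct sched_task {
   int id;
   enum toy_state state;
};

/* Round-robin: index of the next READY task after 'current',
   or -1 when every task is BLOCKED. */
static int toy_pick_next(const struct sched_task *tasks, int ntasks,
                         int current)
{
   int step;

   for (step = 1; step <= ntasks; step++) {
      int candidate = (current + step) % ntasks;

      if (tasks[candidate].state == TOY_READY)
         return candidate;
   }
   return -1;
}

/* Steps (2) to (4) of the diagram above: mark the caller as blocked,
   then choose another task to switch to. */
static int toy_wait_for_resource(struct sched_task *tasks, int ntasks,
                                 int current)
{
   tasks[current].state = TOY_BLOCKED;
   return toy_pick_next(tasks, ntasks, current);
}

/* The resource became available: the task may run again. */
static void toy_wake_up(struct sched_task *task)
{
   task->state = TOY_READY;
}
```

  When a timeslice ends, only toy_pick_next is needed. When a task has
  to wait, toy_wait_for_resource removes it from the set of candidates
  until toy_wake_up is called.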


  3.5.  Microkernel vs Monolithic OS

  3.5.1.  Overview

  Until now we have looked at the so-called Monolithic OS, but there is
  also another kind of OS: the ''Microkernel''.


  A Microkernel OS uses Tasks, not only for user mode processes, but
  also as real kernel managers, like Floppy-Task, HDD-Task, Net-Task
  and so on. Some examples are Amoeba and Mach.


  3.5.2.  PROs and CONTROs of Microkernel OS

  PROS:


  o  The OS is simpler to maintain because each Task manages a single
     kind of operation. So if you want to modify networking, you modify
     the Net-Task (ideally, as long as no structural update is needed).

  CONS:


  o  Performance is worse than in a Monolithic OS, because you have to
     add 2*TASK_SWITCH times (the first to enter the specific Task, the
     second to leave it).

  My personal opinion is that Microkernels are a good didactic example
  (like Minix) but they are not ''optimal'', so not really suitable.
  Linux uses a few Tasks, called "Kernel Threads", to implement a little
  microkernel structure (like kswapd, which reclaims memory by moving
  pages out to mass storage). In this case there are no problems with
  performance because swapping is a very slow job.


  3.6.  Networking

  3.6.1.  ISO OSI levels

  Standard ISO-OSI describes a network architecture with the following
  levels:


  1. Physical level (examples: PPP and Ethernet)

  2. Data-link level (examples: PPP and Ethernet)

  3. Network level (examples: IP, and X.25)

  4. Transport level (examples: TCP, UDP)

  5. Session level (SSL)

  6. Presentation level (FTP binary-ascii coding)

  7. Application level (applications like Netscape)

  The first 2 levels listed above are often implemented in hardware.
  The next levels are in software (or firmware for routers).


  Many protocols are used by an OS: one of these is TCP/IP (the most
  important, living on levels 3-4).


  3.6.2.  What does the kernel do?

  The kernel doesn't know anything (only addresses) about the first 2
  levels of ISO-OSI.


  In the RX stage it:


  1. Manages the handshake with low level devices (like an Ethernet
     card or modem), receiving "frames" from them.

  2. Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones).

  3. Converts ''packets'' into ''sockets'', passing them to the right
     application (using the port number), or

  4. Forwards packets to the right queue.


  frames         packets              sockets
  NIC ---------> Kernel ----------> Application
                    |    packets
                    --------------> Forward
                          - RX -
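

  The "right application (using port number)" step can be pictured as
  a lookup from destination port to socket. This is only a sketch with
  invented names (toy_socket, toy_demux) and a linear search; it makes
  no claim about how the kernel really stores its sockets.

```c
#include <stddef.h>

/* One listening endpoint: a port and the application that owns it. */
struct toy_socket {
   unsigned short port;
   int app_id;
};

/* Find the socket bound to 'dst_port'. NULL means nobody is
   listening, so the packet has to be forwarded or dropped instead. */
static const struct toy_socket *toy_demux(const struct toy_socket *socks,
                                          size_t nsocks,
                                          unsigned short dst_port)
{
   size_t i;

   for (i = 0; i < nsocks; i++)
      if (socks[i].port == dst_port)
         return &socks[i];
   return NULL;
}
```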


  In the TX stage it:


  1. Converts sockets or

  2. Queues data into TCP/IP ''packets''

  3. Splits ''packets'' into ''frames'' (like Ethernet or PPP ones)

  4. Sends ''frames'' using HW drivers


  sockets       packets                     frames
  Application ---------> Kernel ----------> NIC
                packets     /|\
  Forward  -------------------
                          - TX -


  3.7.  Virtual Memory

  3.7.1.  Segmentation

  Segmentation is the first method to solve memory allocation problems:
  it allows you to compile source code without caring where the
  application will be placed in memory. As a matter of fact, this
  feature helps application developers to work independently of the OS
  and also of the hardware.


              |       Stack        |
              |          |         |
              |         \|/        |
              |        Free        |
              |         /|\        |     Segment <---> Process
              |          |         |
              |        Heap        |
              | Data uninitialized |
              |  Data initialized  |
              |       Code         |
              |____________________|

                     Segment


  We can say that a segment is the logical entity of an application, or
  the image of the application in memory.


  When programming, we don't care where our data is put in memory, we
  only care about the offset inside our segment (our application).
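

  Working with offsets inside a segment means that the final address
  is computed as base plus offset, after checking the offset against
  the segment size. A minimal sketch of that translation, where
  toy_segment and toy_segment_translate are hypothetical names and the
  addresses are plain 32 bit numbers:

```c
#include <stdint.h>

/* A segment is described by where it starts and how big it is. */
struct toy_segment {
   uint32_t base;    /* physical address of the first byte */
   uint32_t limit;   /* size of the segment in bytes       */
};

/* Translate an offset inside the segment into a physical address.
   Returns 0 on success, -1 if the offset falls outside the segment. */
static int toy_segment_translate(const struct toy_segment *seg,
                                 uint32_t offset, uint32_t *physical)
{
   if (offset >= seg->limit)
      return -1;
   *physical = seg->base + offset;
   return 0;
}
```

  Moving the application elsewhere in memory only changes base: the
  offsets used by the program stay the same.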


  Traditionally, a Segment is assigned to each Process and vice versa.
  In Linux this is not true: Linux uses only 4 segments for both the
  Kernel and all Processes.


  3.7.1.1.  Problems of Segmentation


                                   ____________________
                            ----->|                    |----->
                            | IN  |     Segment A      | OUT
   ____________________     |     |____________________|
  |                    |____|     |                    |
  |     Segment B      |          |     Segment B      |
  |                    |____      |                    |
  |____________________|    |     |____________________|
                            |     |     Segment C      |
                            |     |____________________|
                            ----->|     Segment D      |----->
                              IN  |____________________| OUT

                       Segmentation problem


  In the diagram above, we want to remove processes A and D and load
  process B. As we can see, there is enough total space for B, but we
  cannot split it into 2 pieces, so we CANNOT load it (out of memory).


  This problem occurs because pure segments are contiguous areas
  (because they are logical areas) and cannot be split.
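

  The problem can be restated numerically: the sum of the free holes
  may be large enough while no single hole is. A small sketch, with
  hole sizes given in arbitrary units and with invented function names
  (toy_total_free, toy_fits_contiguously):

```c
#include <stddef.h>

/* Sum of all free holes, regardless of where they are. */
static size_t toy_total_free(const size_t *holes, size_t nholes)
{
   size_t i, total = 0;

   for (i = 0; i < nholes; i++)
      total += holes[i];
   return total;
}

/* A pure segment needs ONE hole that is big enough on its own. */
static int toy_fits_contiguously(const size_t *holes, size_t nholes,
                                 size_t need)
{
   size_t i;

   for (i = 0; i < nholes; i++)
      if (holes[i] >= need)
         return 1;
   return 0;
}
```

  With two holes of 30 and 40 units, a segment of 60 units does not
  fit even though 70 units are free in total.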


  3.7.2.  Pagination


               ____________________
              |     Page 1         |
              |____________________|
              |     Page 2         |
              |____________________|
              |      ..            |     Segment <---> Process
              |____________________|
              |     Page n         |
              |____________________|
              |                    |
              |____________________|
              |                    |
              |____________________|

                     Segment


  Pagination splits memory into "n" pieces, each one with a fixed length.


  A process may be loaded in one or more Pages. When memory is freed,
  all pages are freed (see Segmentation Problem, before).


  Pagination is also used for another important purpose, "Swapping".  If
  a page is not present in physical memory, accessing it generates an
  EXCEPTION that makes the Kernel look for the page in storage memory.
  This mechanism allows the OS to load more applications than physical
  memory alone would permit.


  3.7.2.1.  Pagination Problem


               ____________________
     Page   X |     Process Y      |
              |____________________|
              |                    |
              |       WASTE        |
              |       SPACE        |
              |____________________|

                Pagination Problem


  In the diagram above, we can see what is wrong with the pagination
  policy: when a Process Y loads into Page X, ALL memory space of the
  Page is allocated, so the remaining space at the end of Page is
  wasted.
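

  The wasted space can be computed directly. The following is a minimal
  sketch; the function names are hypothetical and the page size is passed
  in as a parameter rather than taken from any kernel constant.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fixed-size pages needed to hold `size` bytes (rounded up). */
size_t pages_needed(size_t size, size_t page_size)
{
    assert(page_size > 0);
    return (size + page_size - 1) / page_size;
}

/* Bytes allocated but unused at the end of the last page. */
size_t wasted_bytes(size_t size, size_t page_size)
{
    return pages_needed(size, page_size) * page_size - size;
}
```

  For example, a process of 5000 bytes with 4096-byte pages needs 2
  pages and wastes 3192 bytes; a process of exactly 8192 bytes wastes
  nothing.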


  3.7.3.  Segmentation and Pagination

  How can we solve the segmentation and pagination problems? By using
  both policies together.


                                    |      ..            |
                                    |____________________|
                              ----->|      Page 1        |
                              |     |____________________|
                              |     |      ..            |
   ____________________       |     |____________________|
  |                    |      |---->|      Page 2        |
  |      Segment X     |  ----|     |____________________|
  |                    |      |     |       ..           |
  |____________________|      |     |____________________|
                              |     |       ..           |
                              |     |____________________|
                              |---->|      Page 3        |
                                    |____________________|
                                    |       ..           |


  Process X, identified by Segment X, is split into 3 pieces and each
  one is loaded in a page.


  We do not have:


  1. Segmentation problem: we allocate per Pages, so we also free Pages
     and we manage free space in an optimized way.

  2. Pagination problem: only the last page wastes space, but we can
     decide to use very small pages, for example 4096 bytes long (losing
     at most 4096*N_Tasks bytes), and manage hierarchical paging (using
     2 or 3 levels of paging).


                            |         |           |         |
                            |         |   Offset2 |  Value  |
                            |         |        /|\|         |
                    Offset1 |         |-----    | |         |
                        /|\ |         |    |    | |         |
                         |  |         |    |   \|/|         |
                         |  |         |    ------>|         |
                        \|/ |         |           |         |
   Base Paging Address ---->|         |           |         |
                            | ....... |           | ....... |
                            |         |           |         |

                       Hierarchical Paging
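

  As a sketch of the diagram above, the code below splits a 32-bit
  linear address into Offset1, Offset2 and the offset inside the page.
  It assumes the 2-level layout used on i386 with 4096-byte pages (10 +
  10 + 12 bits); the structure and function names are invented for the
  example.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32-bit linear address with 2-level paging
 * and 4096-byte pages (10 + 10 + 12 bits). */
struct linear_addr {
    uint32_t dir_index;    /* Offset1: entry in the first-level table  */
    uint32_t table_index;  /* Offset2: entry in the second-level table */
    uint32_t offset;       /* byte offset inside the 4096-byte page    */
};

/* Split a linear address into its three fields. */
struct linear_addr split_linear(uint32_t addr)
{
    struct linear_addr la;
    la.dir_index   = addr >> 22;             /* top 10 bits    */
    la.table_index = (addr >> 12) & 0x3FFu;  /* middle 10 bits */
    la.offset      = addr & 0xFFFu;          /* low 12 bits    */
    return la;
}

/* Rebuild the linear address from its three fields. */
uint32_t join_linear(struct linear_addr la)
{
    assert(la.dir_index < 1024 && la.table_index < 1024 && la.offset < 4096);
    return (la.dir_index << 22) | (la.table_index << 12) | la.offset;
}
```

  With this layout the address 0xC0101234 gives Offset1 = 0x300,
  Offset2 = 0x101 and a page offset of 0x234.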


  4.  Linux Startup

  The first C code of the Linux kernel is reached from the
  ''startup_32:'' asm label:


  |startup_32:
     |start_kernel
        |lock_kernel
        |trap_init
        |init_IRQ
        |sched_init
        |softirq_init
        |time_init
        |console_init
        |#ifdef CONFIG_MODULES
           |init_modules
        |#endif
        |kmem_cache_init
        |sti
        |calibrate_delay
        |mem_init
        |kmem_cache_sizes_init
        |pgtable_cache_init
        |fork_init
        |proc_caches_init
        |vfs_caches_init
        |buffer_init
        |page_cache_init
        |signals_init
        |#ifdef CONFIG_PROC_FS
          |proc_root_init
        |#endif
        |#if defined(CONFIG_SYSVIPC)
           |ipc_init
        |#endif
        |check_bugs
        |smp_init
        |rest_init
           |kernel_thread
           |unlock_kernel
           |cpu_idle


  o  startup_32 [arch/i386/kernel/head.S]

  o  start_kernel [init/main.c]

  o  lock_kernel [include/asm/smplock.h]

  o  trap_init [arch/i386/kernel/traps.c]

  o  init_IRQ [arch/i386/kernel/i8259.c]

  o  sched_init [kernel/sched.c]

  o  softirq_init [kernel/softirq.c]

  o  time_init [arch/i386/kernel/time.c]

  o  console_init [drivers/char/tty_io.c]

  o  init_modules [kernel/module.c]

  o  kmem_cache_init [mm/slab.c]

  o  sti [include/asm/system.h]

  o  calibrate_delay [init/main.c]

  o  mem_init [arch/i386/mm/init.c]

  o  kmem_cache_sizes_init [mm/slab.c]

  o  pgtable_cache_init [arch/i386/mm/init.c]

  o  fork_init [kernel/fork.c]

  o  proc_caches_init

  o  vfs_caches_init [fs/dcache.c]

  o  buffer_init [fs/buffer.c]

  o  page_cache_init [mm/filemap.c]

  o  signals_init [kernel/signal.c]

  o  proc_root_init [fs/proc/root.c]

  o  ipc_init [ipc/util.c]

  o  check_bugs [include/asm/bugs.h]

  o  smp_init [init/main.c]

  o  rest_init

  o  kernel_thread [arch/i386/kernel/process.c]

  o  unlock_kernel [include/asm/smplock.h]

  o  cpu_idle [arch/i386/kernel/process.c]

  The last function, ''rest_init'', does the following:


  1. launches the kernel thread ''init''

  2. calls unlock_kernel

  3. makes the kernel run the cpu_idle routine, which is the idle loop
     executed when nothing is scheduled

  In fact the start_kernel procedure never ends. It will execute the
  cpu_idle routine endlessly.


  Below is the call tree of ''init'', which is the first Kernel Thread:


  |init
     |lock_kernel
     |do_basic_setup
        |mtrr_init
        |sysctl_init
        |pci_init
        |sock_init
        |start_context_thread
        |do_initcalls
           |(*call())-> kswapd_init
     |prepare_namespace
     |free_initmem
     |unlock_kernel
     |execve


  5.  Linux Peculiarities

  5.1.  Overview

  Linux has some peculiarities that distinguish it from other OSs.
  These peculiarities include:


  1. Pagination only

  2. Softirq

  3. Kernel threads

  4. Kernel modules

  5. Proc directory

  5.1.1.  Flexibility Elements

  Points 4 and 5 give system administrators enormous flexibility in
  configuring the system from user mode, even allowing them to fix
  critical kernel bugs or specific problems without having to reboot
  the machine. For example, if you need to change something on a big
  server and you don't want to reboot, you can prepare the kernel to
  talk with a module that you will write.


  5.2.  Pagination only

  Linux doesn't use segmentation to distinguish Tasks from each other;
  it uses pagination. (Only 2 segments are used for all Tasks: CODE and
  DATA/STACK.)


  We can also say that an interTask page fault never occurs, because
  each Task uses its own set of Page Tables. There are some cases where
  different Tasks point to the same Page Tables, like shared libraries:
  this is needed to reduce memory usage; remember that shared libraries
  are CODE only, because all data is stored in the stack of the actual
  Task.


  5.2.1.  Linux segments

  Under the Linux kernel only 4 segments exist:


  1. Kernel Code [0x10]

  2. Kernel Data / Stack [0x18]

  3. User Code [0x23]

  4. User Data / Stack [0x2b]

  [syntax is ''Purpose [Segment]'']


  Under Intel architecture, the segment registers used are:


  o  CS for Code Segment

  o  DS for Data Segment

  o  SS for Stack Segment

  o  ES for Alternative Segment (for example used to make a memory copy
     between 2 different segments)

  So, every Task uses 0x23 for code and 0x2b for data/stack.
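

  The four values listed above are x86 segment selectors: the upper 13
  bits are an index into a descriptor table, bit 2 selects the table
  (0 = GDT, 1 = LDT) and the low 2 bits are the requested privilege
  level. The sketch below decodes them; the structure and function names
  are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
struct selector {
    unsigned index;  /* entry number in the descriptor table */
    unsigned table;  /* 0 = GDT, 1 = LDT                      */
    unsigned rpl;    /* requested privilege level (ring 0..3) */
};

/* Split a 16-bit selector into its three fields. */
struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.index = sel >> 3;
    s.table = (sel >> 2) & 1u;
    s.rpl   = sel & 3u;
    return s;
}

/* Build a selector value from its three fields. */
uint16_t encode_selector(unsigned index, unsigned table, unsigned rpl)
{
    assert(index < 8192 && table < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (table << 2) | rpl);
}
```

  Decoding shows why the kernel values end in 0 or 8 and the user values
  in 3 or b: 0x10 and 0x18 are GDT entries 2 and 3 at ring 0, while 0x23
  and 0x2b are GDT entries 4 and 5 at ring 3.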


  5.2.2.  Linux pagination

  Under Linux up to 3 levels of page tables are used, depending on the
  architecture. Under Intel only 2 levels are supported. Linux also
  supports Copy on Write mechanisms (please see Chapter 10 for more
  information).


  5.2.3.  Why don't interTasks address conflicts exist?

  The answer is very simple: interTask address conflicts cannot exist
  because each Task has its own linear -> physical mapping. This
  mapping is done by "Pagination", so the kernel just needs to assign
  physical pages in a unique fashion.


  5.2.4.  Do we need to defragment memory?

  No. Page assignment is a dynamic process. We need a page only when a
  Task asks for it, so we pick it from the free pages in an ordered
  fashion. When we want to release the page, we only have to add it to
  the free pages list.


  5.2.5.  What about Kernel Pages?

  Kernel pages have a problem: they can be allocated in a dynamic
  fashion, but we have no guarantee that they are allocated in a
  contiguous area, because linear kernel space is equivalent to physical
  kernel space.


  For Code Segment there is no problem. Boot code is allocated at boot
  time (so we have a fixed amount of memory to allocate), and on modules
  we only have to allocate a memory area which could contain module
  code.


  The real problem is the stack segment, because each Task uses some
  kernel stack pages. Stack segments must be contiguous (according to
  the stack definition), so we have to establish a maximum limit for
  each Task's stack size. If we exceed this limit bad things happen: we
  overwrite kernel mode process data structures.


  The structure of the Kernel helps us, because kernel functions are
  never:


  o  recursive

  o  intercalling more than N times.

  Once we know N, and we know the average size of the local variables
  of all kernel functions, we can estimate a stack limit.


  If you want to try the problem out, you can create a module with a
  function inside calling itself many times. After a fixed number of
  calls, the kernel module will hang because of a page fault exception
  (typically a write to a read-only page).


  5.3.  Softirq

  When an IRQ comes, task switching is deferred until later to get
  better performance. Some Task jobs (that would have to be done just
  after the IRQ and that could take much CPU time in interrupt context,
  like building up a TCP/IP packet) are queued and will be done at
  scheduling time (once a time-slice ends).


  In recent kernels (2.4.x) the softirq mechanisms are given to a
  kernel_thread: ''ksoftirqd_CPUn''. n stands for the number of the CPU
  executing the kernel_thread (in a monoprocessor system
  ''ksoftirqd_CPU0'' uses PID 3).


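  5.3.1.  Preparing Softirq

  5.3.2.  Enabling Softirq

  ''cpu_raise_softirq'' is the routine that wakes up the
  ''ksoftirqd_CPU0'' kernel thread, to let it manage the enqueued job.


  |cpu_raise_softirq
     |__cpu_raise_softirq
     |wakeup_softirqd
        |wake_up_process


  o  cpu_raise_softirq [kernel/softirq.c]

  o  __cpu_raise_softirq [include/linux/interrupt.h]

  o  wakeup_softirqd [kernel/softirq.c]

  o  wake_up_process [kernel/sched.c]

  The ''__cpu_raise_softirq'' routine sets the right bit in the vector
  describing softirq pending.


  ''wakeup_softirqd'' uses ''wake_up_process'' to wake up the
  ''ksoftirqd_CPU0'' kernel thread.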


  5.3.3.  Executing Softirq

  TODO: describing data structures involved in softirq mechanism.


  When the kernel thread ''ksoftirqd_CPU0'' has been woken up, it
  executes the queued jobs.
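

  The raise / pending / execute cycle can be imitated in user space with
  a bit mask and a table of handlers. This is only a sketch of the idea:
  every name starting with ''sketch_'' is hypothetical, the limit of 32
  entries is an assumption of the example, and no real kernel API is
  used.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_SOFTIRQS 32   /* assumption: one bit per softirq number */

typedef void (*sketch_handler)(void);

static uint32_t sketch_pending;                       /* pending bit mask  */
static sketch_handler sketch_vec[SKETCH_NR_SOFTIRQS]; /* registered action */

/* Register the action to run for softirq number nr. */
void sketch_open_softirq(unsigned nr, sketch_handler fn)
{
    assert(nr < SKETCH_NR_SOFTIRQS);
    sketch_vec[nr] = fn;
}

/* Mark softirq nr as pending (what an interrupt handler would do). */
void sketch_raise_softirq(unsigned nr)
{
    assert(nr < SKETCH_NR_SOFTIRQS);
    sketch_pending |= (uint32_t)1 << nr;
}

int sketch_softirq_pending(void)
{
    return sketch_pending != 0;
}

/* Run each pending softirq that has a handler; returns how many ran. */
int sketch_do_softirq(void)
{
    uint32_t todo = sketch_pending;
    int ran = 0;

    sketch_pending = 0;   /* cleared first: handlers may raise again */
    for (unsigned nr = 0; nr < SKETCH_NR_SOFTIRQS; nr++) {
        if ((todo & ((uint32_t)1 << nr)) && sketch_vec[nr]) {
            sketch_vec[nr]();
            ran++;
        }
    }
    return ran;
}

/* Sample action used to try the sketch: it only counts its runs. */
int sample_runs;
void sample_handler(void) { sample_runs++; }
```

  A caller registers sample_handler with sketch_open_softirq, raises the
  softirq, and later calls sketch_do_softirq in a loop while
  sketch_softirq_pending() is true, much like the endless loop of the
  kernel thread.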


  The code of ''ksoftirqd_CPU0'' is (main endless loop):


  for (;;) {
     /* nothing pending: give up the CPU until we are woken up */
     if (!softirq_pending(cpu))
        schedule();
     __set_current_state(TASK_RUNNING);
     while (softirq_pending(cpu)) {
        do_softirq();
        if (current->need_resched)
           schedule();
     }
     __set_current_state(TASK_INTERRUPTIBLE);
  }


  o  ksoftirqd [kernel/softirq.c]

  5.4.  Kernel Threads

  Even though Linux is a monolithic OS, a few ''kernel threads'' exist
  to do  housekeeping work.


  These Tasks don't utilize USER memory; they share KERNEL memory.  They
  also operate at the highest privilege (RING 0 on an i386 architecture)
  like any other kernel mode piece of code.


  Kernel threads are created by the ''kernel_thread''
  [arch/i386/kernel/process.c] function, which calls the ''clone''
  [arch/i386/kernel/process.c] system call from assembler (which is a
  ''fork''-like system call):


  int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
  {
          long retval, d0;

          __asm__ __volatile__(
                  "movl %%esp,%%esi\n\t"
                  "int $0x80\n\t"         /* Linux/i386 system call */
                  "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                  "je 1f\n\t"             /* parent - jump */
                  /* Load the argument into eax, and push it.  That way, it does
                   * not matter whether the called function is compiled with
                   * -mregparm or not.  */
                  "movl %4,%%eax\n\t"
                  "pushl %%eax\n\t"
                  "call *%5\n\t"          /* call fn */
                  "movl %3,%0\n\t"        /* exit */
                  "int $0x80\n"
                  "1:\t"
                  :"=&a" (retval), "=&S" (d0)
                  :"0" (__NR_clone), "i" (__NR_exit),
                   "r" (arg), "r" (fn),
                   "b" (flags | CLONE_VM)
                  : "memory");
          return retval;
  }


  Once called, we have a new Task (usually with a very low PID number,
  like 2, 3, etc.) waiting for a very slow resource, like a swap or USB
  event. A very slow resource is used because otherwise we would have a
  task switching overhead.


  Below is a list of most common kernel threads (from ''ps x'' command):


  PID      COMMAND
   1        init
   2        keventd
   3        kswapd
   4        kreclaimd
   5        bdflush
   6        kupdated
   7        kacpid
  67        khubd


  The ''init'' Task (PID 1) starts all other User Mode Tasks (from the
  file /etc/inittab), like console daemons, tty daemons and network
  daemons (''rc'' scripts).


  5.4.1.  Example of Kernel Threads: kswapd [mm/vmscan.c].


  Initialisation routines:


  |do_initcalls
     |kswapd_init
        |kernel_thread
           |syscall clone (in assembler)


  do_initcalls [init/main.c]


  kswapd_init [mm/vmscan.c]


  kernel_thread [arch/i386/kernel/process.c]


  5.5.  Kernel Modules

  5.5.1.  Overview

  Linux Kernel modules are pieces of code (examples: fs, net, and hw
  drivers) running in kernel mode that you can add at runtime.


  The core of Linux (scheduling, interrupt management, core networking,
  and so on) cannot be modularized.


  Under "/lib/modules/KERNEL_VERSION/" you can find all the modules
  installed on your system.


  5.5.2.  Module loading and unloading

  To load a module, type the following:


  insmod MODULE_NAME parameters

  example: insmod ne io=0x300 irq=9


  NOTE: You can use modprobe in place of insmod if you want some
  parameters to be found automatically (for example when using a PCI
  driver, or if you have specified parameters in the /etc/conf.modules
  file).


  To unload a module, type the following:


   rmmod MODULE_NAME


  5.5.3.  Module definition

  A module always contains:


  1. an "init_module" function, executed at insmod (or modprobe) time

  2. a "cleanup_module" function, executed at rmmod time

  If the module does not use these names, you need to add 2 macros to
  specify which functions will act as the module's init and exit:


  1. module_init(FUNCTION_NAME)

  2. module_exit(FUNCTION_NAME)

  NOTE: a module can "see" a kernel variable only if it has been
  exported (with macro EXPORT_SYMBOL).


  5.5.4.  A useful trick for adding flexibility to your kernel


  // kernel sources side
  void (*foo_function_pointer)(void *);   // NULL until a module sets it

  if (foo_function_pointer)               // call the hook only if present
    (foo_function_pointer)(parameter);


  // module side
  extern void (*foo_function_pointer)(void *);

  void my_function(void *parameter) {
    //My code
  }

  int init_module(void) {
    foo_function_pointer = &my_function;  // install the hook
    return 0;                             // 0 means successful load
  }

  void cleanup_module(void) {
    foo_function_pointer = NULL;          // remove the hook before unload
  }


  This simple trick gives you very high flexibility in your Kernel,
  because the "my_function" routine executes only when you load the
  module. This routine can do everything you want: for example the
  ''rshaper'' module, which controls the bandwidth of input traffic from
  the network, works in this kind of manner.


  Notice that the whole module mechanism is possible thanks to some
  global variables exported to modules, such as list heads (allowing you
  to extend the list as much as you want). Typical examples are fs and
  generic devices (char, block, net, telephony). You have to prepare the
  kernel to accept your new module; in some cases you have to create an
  infrastructure (like the telephony one, which was recently created) to
  be as standard as possible.
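

  The function pointer trick can be tried out in an ordinary user space
  program before touching the kernel. In the sketch below the "kernel
  side" and the "module side" live in the same file; kernel_side_event,
  sketch_init_module and sketch_cleanup_module are hypothetical names
  standing in for the real call site and the module entry points.

```c
#include <assert.h>
#include <stddef.h>

/* "kernel side": the hook, NULL while no "module" is loaded */
void (*foo_function_pointer)(void *) = NULL;

/* "kernel side": the call site; returns 1 if the hook was called */
int kernel_side_event(void *parameter)
{
    if (foo_function_pointer) {
        foo_function_pointer(parameter);
        return 1;
    }
    return 0;
}

/* "module side": the routine we want the kernel to run */
static void my_function(void *parameter)
{
    int *counter = parameter;

    assert(counter != NULL);
    (*counter)++;
}

/* "module side": stand-ins for init_module and cleanup_module */
int sketch_init_module(void)
{
    foo_function_pointer = my_function;  /* install the hook */
    return 0;
}

void sketch_cleanup_module(void)
{
    foo_function_pointer = NULL;         /* remove the hook */
}
```

  Before sketch_init_module is called the event does nothing; between
  sketch_init_module and sketch_cleanup_module every event runs
  my_function; afterwards the event is ignored again.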


  5.6.  Proc directory

  Proc fs is located in the /proc directory, which is a special
  directory allowing you to talk directly with the kernel.


  Linux uses the ''proc'' directory to support direct kernel
  communications. This is necessary in many cases: for example, when you
  want to see the main process data structures, enable the
  ''proxy-arp'' feature on one interface and not on others, change the
  max number of threads, or debug some bus state, like ISA or PCI, to
  know what cards are installed and what I/O addresses and IRQs are
  assigned to them.


  |-- bus
  |   |-- pci
  |   |   |-- 00
  |   |   |   |-- 00.0
  |   |   |   |-- 01.0
  |   |   |   |-- 07.0
  |   |   |   |-- 07.1
  |   |   |   |-- 07.2
  |   |   |   |-- 07.3
  |   |   |   |-- 07.4
  |   |   |   |-- 07.5
  |   |   |   |-- 09.0
  |   |   |   |-- 0a.0
  |   |   |   `-- 0f.0
  |   |   |-- 01
  |   |   |   `-- 00.0
  |   |   `-- devices
  |   `-- usb
  |-- cmdline
  |-- cpuinfo
  |-- devices
  |-- dma
  |-- dri
  |   `-- 0
  |       |-- bufs
  |       |-- clients
  |       |-- mem
  |       |-- name
  |       |-- queues
  |       |-- vm
  |       `-- vma
  |-- driver
  |-- execdomains
  |-- filesystems
  |-- fs
  |-- ide
  |   |-- drivers
  |   |-- hda -> ide0/hda
  |   |-- hdc -> ide1/hdc
  |   |-- ide0
  |   |   |-- channel
  |   |   |-- config
  |   |   |-- hda
  |   |   |   |-- cache
  |   |   |   |-- capacity
  |   |   |   |-- driver
  |   |   |   |-- geometry
  |   |   |   |-- identify
  |   |   |   |-- media
  |   |   |   |-- model
  |   |   |   |-- settings
  |   |   |   |-- smart_thresholds
  |   |   |   `-- smart_values
  |   |   |-- mate
  |   |   `-- model
  |   |-- ide1
  |   |   |-- channel
  |   |   |-- config
  |   |   |-- hdc
  |   |   |   |-- capacity
  |   |   |   |-- driver
  |   |   |   |-- identify
  |   |   |   |-- media
  |   |   |   |-- model
  |   |   |   `-- settings
  |   |   |-- mate
  |   |   `-- model
  |   `-- via
  |-- interrupts
  |-- iomem
  |-- ioports
  |-- irq
  |   |-- 0
  |   |-- 1
  |   |-- 10
  |   |-- 11
  |   |-- 12
  |   |-- 13
  |   |-- 14
  |   |-- 15
  |   |-- 2
  |   |-- 3
  |   |-- 4
  |   |-- 5
  |   |-- 6
  |   |-- 7
  |   |-- 8
  |   |-- 9
  |   `-- prof_cpu_mask
  |-- kcore
  |-- kmsg
  |-- ksyms
  |-- loadavg
  |-- locks
  |-- meminfo
  |-- misc
  |-- modules
  |-- mounts
  |-- mtrr
  |-- net
  |   |-- arp
  |   |-- dev
  |   |-- dev_mcast
  |   |-- ip_fwchains
  |   |-- ip_fwnames
  |   |-- ip_masquerade
  |   |-- netlink
  |   |-- netstat
  |   |-- packet
  |   |-- psched
  |   |-- raw
  |   |-- route
  |   |-- rt_acct
  |   |-- rt_cache
  |   |-- rt_cache_stat
  |   |-- snmp
  |   |-- sockstat
  |   |-- softnet_stat
  |   |-- tcp
  |   |-- udp
  |   |-- unix
  |   `-- wireless
  |-- partitions
  |-- pci
  |-- scsi
  |   |-- ide-scsi
  |   |   `-- 0
  |   `-- scsi
  |-- self -> 2069
  |-- slabinfo
  |-- stat
  |-- swaps
  |-- sys
  |   |-- abi
  |   |   |-- defhandler_coff
  |   |   |-- defhandler_elf
  |   |   |-- defhandler_lcall7
  |   |   |-- defhandler_libcso
  |   |   |-- fake_utsname
  |   |   `-- trace
  |   |-- debug
  |   |-- dev
  |   |   |-- cdrom
  |   |   |   |-- autoclose
  |   |   |   |-- autoeject
  |   |   |   |-- check_media
  |   |   |   |-- debug
  |   |   |   |-- info
  |   |   |   `-- lock
  |   |   `-- parport
  |   |       |-- default
  |   |       |   |-- spintime
  |   |       |   `-- timeslice
  |   |       `-- parport0
  |   |           |-- autoprobe
  |   |           |-- autoprobe0
  |   |           |-- autoprobe1
  |   |           |-- autoprobe2
  |   |           |-- autoprobe3
  |   |           |-- base-addr
  |   |           |-- devices
  |   |           |   |-- active
  |   |           |   `-- lp
  |   |           |       `-- timeslice
  |   |           |-- dma
  |   |           |-- irq
  |   |           |-- modes
  |   |           `-- spintime
  |   |-- fs
  |   |   |-- binfmt_misc
  |   |   |-- dentry-state
  |   |   |-- dir-notify-enable
  |   |   |-- dquot-nr
  |   |   |-- file-max
  |   |   |-- file-nr
  |   |   |-- inode-nr
  |   |   |-- inode-state
  |   |   |-- jbd-debug
  |   |   |-- lease-break-time
  |   |   |-- leases-enable
  |   |   |-- overflowgid
  |   |   `-- overflowuid
  |   |-- kernel
  |   |   |-- acct
  |   |   |-- cad_pid
  |   |   |-- cap-bound
  |   |   |-- core_uses_pid
  |   |   |-- ctrl-alt-del
  |   |   |-- domainname
  |   |   |-- hostname
  |   |   |-- modprobe
  |   |   |-- msgmax
  |   |   |-- msgmnb
  |   |   |-- msgmni
  |   |   |-- osrelease
  |   |   |-- ostype
  |   |   |-- overflowgid
  |   |   |-- overflowuid
  |   |   |-- panic
  |   |   |-- printk
  |   |   |-- random
  |   |   |   |-- boot_id
  |   |   |   |-- entropy_avail
  |   |   |   |-- poolsize
  |   |   |   |-- read_wakeup_threshold
  |   |   |   |-- uuid
  |   |   |   `-- write_wakeup_threshold
  |   |   |-- rtsig-max
  |   |   |-- rtsig-nr
  |   |   |-- sem
  |   |   |-- shmall
  |   |   |-- shmmax
  |   |   |-- shmmni
  |   |   |-- sysrq
  |   |   |-- tainted
  |   |   |-- threads-max
  |   |   `-- version
  |   |-- net
  |   |   |-- 802
  |   |   |-- core
  |   |   |   |-- hot_list_length
  |   |   |   |-- lo_cong
  |   |   |   |-- message_burst
  |   |   |   |-- message_cost
  |   |   |   |-- mod_cong
  |   |   |   |-- netdev_max_backlog
  |   |   |   |-- no_cong
  |   |   |   |-- no_cong_thresh
  |   |   |   |-- optmem_max
  |   |   |   |-- rmem_default
  |   |   |   |-- rmem_max
  |   |   |   |-- wmem_default
  |   |   |   `-- wmem_max
  |   |   |-- ethernet
  |   |   |-- ipv4
  |   |   |   |-- conf
  |   |   |   |   |-- all
  |   |   |   |   |   |-- accept_redirects
  |   |   |   |   |   |-- accept_source_route
  |   |   |   |   |   |-- arp_filter
  |   |   |   |   |   |-- bootp_relay
  |   |   |   |   |   |-- forwarding
  |   |   |   |   |   |-- log_martians
  |   |   |   |   |   |-- mc_forwarding
  |   |   |   |   |   |-- proxy_arp
  |   |   |   |   |   |-- rp_filter
  |   |   |   |   |   |-- secure_redirects
  |   |   |   |   |   |-- send_redirects
  |   |   |   |   |   |-- shared_media
  |   |   |   |   |   `-- tag
  |   |   |   |   |-- default
  |   |   |   |   |   |-- accept_redirects
  |   |   |   |   |   |-- accept_source_route
  |   |   |   |   |   |-- arp_filter
  |   |   |   |   |   |-- bootp_relay
  |   |   |   |   |   |-- forwarding
  |   |   |   |   |   |-- log_martians
  |   |   |   |   |   |-- mc_forwarding
  |   |   |   |   |   |-- proxy_arp
  |   |   |   |   |   |-- rp_filter
  |   |   |   |   |   |-- secure_redirects
  |   |   |   |   |   |-- send_redirects
  |   |   |   |   |   |-- shared_media
  |   |   |   |   |   `-- tag
  |   |   |   |   |-- eth0
  |   |   |   |   |   |-- accept_redirects
  |   |   |   |   |   |-- accept_source_route
  |   |   |   |   |   |-- arp_filter
  |   |   |   |   |   |-- bootp_relay
  |   |   |   |   |   |-- forwarding
  |   |   |   |   |   |-- log_martians
  |   |   |   |   |   |-- mc_forwarding
  |   |   |   |   |   |-- proxy_arp
  |   |   |   |   |   |-- rp_filter
  |   |   |   |   |   |-- secure_redirects
  |   |   |   |   |   |-- send_redirects
  |   |   |   |   |   |-- shared_media
  |   |   |   |   |   `-- tag
  |   |   |   |   |-- eth1
  |   |   |   |   |   |-- accept_redirects
  |   |   |   |   |   |-- accept_source_route
  |   |   |   |   |   |-- arp_filter
  |   |   |   |   |   |-- bootp_relay
  |   |   |   |   |   |-- forwarding
  |   |   |   |   |   |-- log_martians
  |   |   |   |   |   |-- mc_forwarding
  |   |   |   |   |   |-- proxy_arp
  |   |   |   |   |   |-- rp_filter
  |   |   |   |   |   |-- secure_redirects
  |   |   |   |   |   |-- send_redirects
  |   |   |   |   |   |-- shared_media
  |   |   |   |   |   `-- tag
  |   |   |   |   `-- lo
  |   |   |   |       |-- accept_redirects
  |   |   |   |       |-- accept_source_route
  |   |   |   |       |-- arp_filter
  |   |   |   |       |-- bootp_relay
  |   |   |   |       |-- forwarding
  |   |   |   |       |-- log_martians
  |   |   |   |       |-- mc_forwarding
  |   |   |   |       |-- proxy_arp
  |   |   |   |       |-- rp_filter
  |   |   |   |       |-- secure_redirects
  |   |   |   |       |-- send_redirects
  |   |   |   |       |-- shared_media
  |   |   |   |       `-- tag
  |   |   |   |-- icmp_echo_ignore_all
  |   |   |   |-- icmp_echo_ignore_broadcasts
  |   |   |   |-- icmp_ignore_bogus_error_responses
  |   |   |   |-- icmp_ratelimit
  |   |   |   |-- icmp_ratemask
  |   |   |   |-- inet_peer_gc_maxtime
  |   |   |   |-- inet_peer_gc_mintime
  |   |   |   |-- inet_peer_maxttl
  |   |   |   |-- inet_peer_minttl
  |   |   |   |-- inet_peer_threshold
  |   |   |   |-- ip_autoconfig
  |   |   |   |-- ip_conntrack_max
  |   |   |   |-- ip_default_ttl
  |   |   |   |-- ip_dynaddr
  |   |   |   |-- ip_forward
  |   |   |   |-- ip_local_port_range
  |   |   |   |-- ip_no_pmtu_disc
  |   |   |   |-- ip_nonlocal_bind
  |   |   |   |-- ipfrag_high_thresh
  |   |   |   |-- ipfrag_low_thresh
  |   |   |   |-- ipfrag_time
  |   |   |   |-- neigh
  |   |   |   |   |-- default
  |   |   |   |   |   |-- anycast_delay
  |   |   |   |   |   |-- app_solicit
  |   |   |   |   |   |-- base_reachable_time
  |   |   |   |   |   |-- delay_first_probe_time
  |   |   |   |   |   |-- gc_interval
  |   |   |   |   |   |-- gc_stale_time
  |   |   |   |   |   |-- gc_thresh1
  |   |   |   |   |   |-- gc_thresh2
  |   |   |   |   |   |-- gc_thresh3
  |   |   |   |   |   |-- locktime
  |   |   |   |   |   |-- mcast_solicit
  |   |   |   |   |   |-- proxy_delay
  |   |   |   |   |   |-- proxy_qlen
  |   |   |   |   |   |-- retrans_time
  |   |   |   |   |   |-- ucast_solicit
  |   |   |   |   |   `-- unres_qlen
  |   |   |   |   |-- eth0
  |   |   |   |   |   |-- anycast_delay
  |   |   |   |   |   |-- app_solicit
  |   |   |   |   |   |-- base_reachable_time
  |   |   |   |   |   |-- delay_first_probe_time
  |   |   |   |   |   |-- gc_stale_time
  |   |   |   |   |   |-- locktime
  |   |   |   |   |   |-- mcast_solicit
  |   |   |   |   |   |-- proxy_delay
  |   |   |   |   |   |-- proxy_qlen
  |   |   |   |   |   |-- retrans_time
  |   |   |   |   |   |-- ucast_solicit
  |   |   |   |   |   `-- unres_qlen
  |   |   |   |   |-- eth1
  |   |   |   |   |   |-- anycast_delay
  |   |   |   |   |   |-- app_solicit
  |   |   |   |   |   |-- base_reachable_time
  |   |   |   |   |   |-- delay_first_probe_time
  |   |   |   |   |   |-- gc_stale_time
  |   |   |   |   |   |-- locktime
  |   |   |   |   |   |-- mcast_solicit
  |   |   |   |   |   |-- proxy_delay
  |   |   |   |   |   |-- proxy_qlen
  |   |   |   |   |   |-- retrans_time
  |   |   |   |   |   |-- ucast_solicit
  |   |   |   |   |   `-- unres_qlen
  |   |   |   |   `-- lo
  |   |   |   |       |-- anycast_delay
  |   |   |   |       |-- app_solicit
  |   |   |   |       |-- base_reachable_time
  |   |   |   |       |-- delay_first_probe_time
  |   |   |   |       |-- gc_stale_time
  |   |   |   |       |-- locktime
  |   |   |   |       |-- mcast_solicit
  |   |   |   |       |-- proxy_delay
  |   |   |   |       |-- proxy_qlen
  |   |   |   |       |-- retrans_time
  |   |   |   |       |-- ucast_solicit
  |   |   |   |       `-- unres_qlen
  |   |   |   |-- route
  |   |   |   |   |-- error_burst
  |   |   |   |   |-- error_cost
  |   |   |   |   |-- flush
  |   |   |   |   |-- gc_elasticity
  |   |   |   |   |-- gc_interval
  |   |   |   |   |-- gc_min_interval
  |   |   |   |   |-- gc_thresh
  |   |   |   |   |-- gc_timeout
  |   |   |   |   |-- max_delay
  |   |   |   |   |-- max_size
  |   |   |   |   |-- min_adv_mss
  |   |   |   |   |-- min_delay
  |   |   |   |   |-- min_pmtu
  |   |   |   |   |-- mtu_expires
  |   |   |   |   |-- redirect_load
  |   |   |   |   |-- redirect_number
  |   |   |   |   `-- redirect_silence
  |   |   |   |-- tcp_abort_on_overflow
  |   |   |   |-- tcp_adv_win_scale
  |   |   |   |-- tcp_app_win
  |   |   |   |-- tcp_dsack
  |   |   |   |-- tcp_ecn
  |   |   |   |-- tcp_fack
  |   |   |   |-- tcp_fin_timeout
  |   |   |   |-- tcp_keepalive_intvl
  |   |   |   |-- tcp_keepalive_probes
  |   |   |   |-- tcp_keepalive_time
  |   |   |   |-- tcp_max_orphans
  |   |   |   |-- tcp_max_syn_backlog
  |   |   |   |-- tcp_max_tw_buckets
  |   |   |   |-- tcp_mem
  |   |   |   |-- tcp_orphan_retries
  |   |   |   |-- tcp_reordering
  |   |   |   |-- tcp_retrans_collapse
  |   |   |   |-- tcp_retries1
  |   |   |   |-- tcp_retries2
  |   |   |   |-- tcp_rfc1337
  |   |   |   |-- tcp_rmem
  |   |   |   |-- tcp_sack
  |   |   |   |-- tcp_stdurg
  |   |   |   |-- tcp_syn_retries
  |   |   |   |-- tcp_synack_retries
  |   |   |   |-- tcp_syncookies
  |   |   |   |-- tcp_timestamps
  |   |   |   |-- tcp_tw_recycle
  |   |   |   |-- tcp_window_scaling
  |   |   |   `-- tcp_wmem
  |   |   `-- unix
  |   |       `-- max_dgram_qlen
  |   |-- proc
  |   `-- vm
  |       |-- bdflush
  |       |-- kswapd
  |       |-- max-readahead
  |       |-- min-readahead
  |       |-- overcommit_memory
  |       |-- page-cluster
  |       `-- pagetable_cache
  |-- sysvipc
  |   |-- msg
  |   |-- sem
  |   `-- shm
  |-- tty
  |   |-- driver
  |   |   `-- serial
  |   |-- drivers
  |   |-- ldisc
  |   `-- ldiscs
  |-- uptime
  `-- version


  In the directory there are also all the tasks, using the PID as file
  name (you have access to all Task information, like the path of the
  binary file, memory used, and so on).


  The interesting point is that you can not only see kernel values (for
  example, info about any task or about the network options enabled in
  your TCP/IP stack) but you are also able to modify some of them,
  typically those under the /proc/sys directory:


  /proc/sys/
            acpi
            dev
            debug
            fs
            proc
            net
            vm
            kernel


  5.6.1.  /proc/sys/kernel

  Below are very important and well-known kernel values, ready to be
  modified:


  overflowgid
  overflowuid
  random
  threads-max // Max number of threads, typically 16384
  sysrq // kernel hack: you can view instant register values and more
  sem
  msgmnb
  msgmni
  msgmax
  shmmni
  shmall
  shmmax
  rtsig-max
  rtsig-nr
  modprobe // modprobe file location
  printk
  ctrl-alt-del
  cap-bound
  panic
  domainname // domain name of your Linux box
  hostname // host name of your Linux box
  version // date info about kernel compilation
  osrelease // kernel version (i.e. 2.4.5)
  ostype // Linux!
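

  From a program, these entries are read like ordinary text files. The
  helper below is a minimal sketch (the function name is invented for
  the example) that reads a single decimal value, such as the one found
  in /proc/sys/kernel/threads-max on systems that provide it.

```c
#include <assert.h>
#include <stdio.h>

/* Read one decimal value from a text file such as
 * /proc/sys/kernel/threads-max.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_long_from_file(const char *path, long *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

  A caller would pass the path of the entry and check the return value
  before using the number, because an entry may be missing on a given
  kernel.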


  5.6.2.  /proc/sys/net

  This can be considered the most useful proc subdirectory. It allows
  you to change very important settings for your network kernel
  configuration.


  core
  ipv4
  ipv6
  unix
  ethernet
  802


  5.6.2.1.  /proc/sys/net/core

  Listed below are general net settings, like "netdev_max_backlog"
  (typically 300), the maximum length of the queue of received network
  packets. This value can limit your network bandwidth when receiving
  packets: Linux has to wait until scheduling time to flush the buffers
  (due to the bottom half mechanism), about 1000/HZ ms.


    300    *        100             =     30 000
  packets     HZ(Timeslice freq)         packets/s

  30 000   *       1000             =      30 M
  packets     average (Bytes/packet)   throughput Bytes/s


  If you want to get higher throughput, you need to increase
  netdev_max_backlog, by typing:


  echo 4000 > /proc/sys/net/core/netdev_max_backlog


  Note: Be careful with the HZ value: under some architectures (like
  alpha or arm-tbox) it is 1000, so the same backlog of 300 packets
  gives 300 MBytes/s of average throughput.
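
  The arithmetic above can be written as a small C helper. This is only
  a sketch of the model used in this section (at most one full backlog
  is drained per timer tick); the name max_rx_throughput is ours and is
  not a kernel symbol.

```c
/* Upper bound on receive throughput in bytes per second, under the
 * model used in the text: at most `backlog` packets are processed per
 * timer tick, and there are `hz` ticks per second. */
unsigned long max_rx_throughput(unsigned long backlog,
                                unsigned long hz,
                                unsigned long avg_packet_bytes)
{
    return backlog * hz * avg_packet_bytes;
}
```

  With backlog = 300, hz = 100 and 1000 bytes per packet this gives the
  30 MBytes/s computed above; raising the backlog to 4000 raises the
  bound to 400 MBytes/s.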


  5.6.2.2.  /proc/sys/net/ipv4

  "ip_forward" enables or disables IP forwarding in your Linux box.
  This is a generic setting for all devices; you can also set it for
  each single device you choose.


  5.6.2.2.1.  /proc/sys/net/ipv4/conf/interface

  I think this is the most useful /proc entry, because it allows you to
  change some net settings to support wireless networks (see Wireless-
  HOWTO <http://www.bertolinux.com> for more information).


  Here are some examples of when you could use this setting:


  o  "forwarding", to enable IP forwarding for your interface

  o  "proxy_arp", to enable the proxy ARP feature. For more, see the
     Proxy ARP HOWTO under Linux Documentation Project
     <http://www.tldp.org> and Wireless-HOWTO
     <http://www.bertolinux.com> for proxy ARP use in wireless networks.

  o  "send_redirects", to control whether the interface sends
     ICMP_REDIRECT (as before, see Wireless-HOWTO
     <http://www.bertolinux.com> for more).
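
  These per-interface entries can also be written from a program
  instead of using "echo". The following is a minimal sketch, not a
  complete tool: conf_path and write_int are hypothetical helper names,
  and writing under /proc/sys normally requires root.

```c
#include <stdio.h>
#include <string.h>

/* Build "/proc/sys/net/ipv4/conf/<iface>/<key>" into buf.
 * Returns 0 on success, -1 if a name contains '/' or buf is too small. */
int conf_path(char *buf, size_t len, const char *iface, const char *key)
{
    int n;

    if (strchr(iface, '/') != NULL || strchr(key, '/') != NULL)
        return -1;
    n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/%s", iface, key);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write an integer followed by a newline to the given file.
 * Returns 0 on success, -1 on failure (no such entry, no permission). */
int write_int(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

  For example, building the path for ("eth0", "proxy_arp") and passing
  it to write_int with value 1 has the same effect as echoing 1 into
  that file, where "eth0" is just an example interface name.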

  6.  Linux Multitasking

  6.1.  Overview

  This section analyzes the data structures and the mechanisms used to
  manage the multitasking environment under Linux.


  6.1.1.  Task States

  A Linux Task can be in one of the following states (according to
  [include/linux/sched.h]):


  1. TASK_RUNNING, it means that it is in the "Ready List"

  2. TASK_INTERRUPTIBLE, task waiting for a signal or a resource
     (sleeping)

  3. TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is
     in some "Wait Queue"

  4. TASK_ZOMBIE, task that has terminated but has not yet been waited
     for by its father

  5. TASK_STOPPED, task stopped by a signal or being debugged
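
  As a small illustration, the states can be handled in C like this.
  The numeric values are the ones used by the 2.4 kernel headers; the
  two helper functions are ours and only serve as an example.

```c
/* Task state values as defined for kernel 2.4. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_ZOMBIE           4
#define TASK_STOPPED          8

/* Hypothetical helper: short human-readable name of a state. */
const char *task_state_name(long state)
{
    switch (state) {
    case TASK_RUNNING:         return "running";
    case TASK_INTERRUPTIBLE:   return "sleeping (interruptible)";
    case TASK_UNINTERRUPTIBLE: return "sleeping (uninterruptible)";
    case TASK_ZOMBIE:          return "zombie";
    case TASK_STOPPED:         return "stopped";
    default:                   return "unknown";
    }
}

/* Hypothetical helper: of the two sleeping states, only
 * TASK_INTERRUPTIBLE can be ended early by a signal. */
int sleep_is_interruptible(long state)
{
    return state == TASK_INTERRUPTIBLE;
}
```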

  6.1.2.  Graphical Interaction


         ______________     CPU Available     ______________
        |              |  ---------------->  |              |
        | TASK_RUNNING |                     | Real Running |
        |______________|  <----------------  |______________|
                             CPU Busy
              |   /|\
  Waiting for |    | Resource
   Resource   |    | Available
             \|/   |
      ______________________
     |                      |
     | TASK_INTERRUPTIBLE / |
     | TASK_UNINTERRUPTIBLE |
     |______________________|

                       Main Multitasking Flow


  6.2.  Timeslice

  6.2.1.  PIT 8253 Programming

  Every 10 ms (depending on the HZ value) an IRQ0 arrives, which helps
  us in a multitasking environment. This signal comes from the PIC 8259
  (in arch 386+), which is connected to the PIT 8253 with a clock of
  1.19318 MHz.


      _____         ______        ______
     | CPU |<------| 8259 |------| 8253 |
     |_____| IRQ0  |______|      |___/|\|
                                      |_____ CLK 1.19318 MHz

  // From include/asm/param.h
  #ifndef HZ
  #define HZ 100
  #endif

  // From include/asm/timex.h
  #define CLOCK_TICK_RATE 1193180 /* Underlying HZ */

  // From include/linux/timex.h
  #define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */

  // From arch/i386/kernel/i8259.c
  outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */
  outb_p(LATCH & 0xff , 0x40); /* LSB */
  outb(LATCH >> 8 , 0x40); /* MSB */


  So we program 8253 (PIT, Programmable Interval Timer) with LATCH =
  (1193180 + HZ/2) / HZ = 11932 when HZ=100 (default); adding HZ/2
  before the integer division rounds the exact quotient 11931.8 to the
  nearest integer. LATCH is the frequency divisor.


  LATCH = 11932 gives to 8253 (in output) a frequency of 1193180 /
  11932 = 99.998 Hz, that is about 100 Hz, so period = 10 ms
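
  The divisor computation can be checked in user space with a few lines
  of C. CLOCK_TICK_RATE and the LATCH formula are taken from the
  listing above; the function names are ours.

```c
#define CLOCK_TICK_RATE 1193180UL  /* PIT input clock in Hz */

/* Same formula as the LATCH macro: divide and round to nearest. */
unsigned long pit_latch(unsigned long hz)
{
    return (CLOCK_TICK_RATE + hz / 2) / hz;
}

/* The two bytes written to port 0x40, low byte first. */
unsigned char pit_latch_lsb(unsigned long latch)
{
    return (unsigned char)(latch & 0xff);
}

unsigned char pit_latch_msb(unsigned long latch)
{
    return (unsigned char)((latch >> 8) & 0xff);
}

/* Frequency actually produced by the PIT for a given divisor. */
double pit_output_hz(unsigned long latch)
{
    return (double)CLOCK_TICK_RATE / (double)latch;
}
```

  For HZ=100 the divisor is 11932 (0x2E9C), so the bytes sent are 0x9C
  and then 0x2E.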


  So Timeslice = 1/HZ.


  With each Timeslice we temporarily interrupt current process execution
  (without task switching), and we do some housekeeping work, after
  which we'll return back to our previous process.


  6.2.2.  Linux Timer IRQ ICA


  Linux Timer IRQ
  IRQ 0 [Timer]
   |
  \|/
  |IRQ0x00_interrupt        //   wrapper IRQ handler
     |SAVE_ALL              ---
        |do_IRQ                |   wrapper routines
           |handle_IRQ_event  ---
              |handler() -> timer_interrupt  // registered IRQ 0 handler
                 |do_timer_interrupt
                    |do_timer
                       |jiffies++;
                       |update_process_times
                       |if (--counter <= 0) { // if time slice ended then
                          |counter = 0;        //   reset counter
                          |need_resched = 1;   //   prepare to reschedule
                       |}
           |do_softirq
           |while (need_resched) { // if necessary
              |schedule             //   reschedule
              |handle_softirq
           |}
     |RESTORE_ALL


  Functions can be found under:


  o  IRQ0x00_interrupt, SAVE_ALL [include/asm/hw_irq.h]

  o  do_IRQ, handle_IRQ_event [arch/i386/kernel/irq.c]

  o  timer_interrupt, do_timer_interrupt [arch/i386/kernel/time.c]

  o  do_timer, update_process_times [kernel/timer.c]

  o  do_softirq [kernel/softirq.c]

  o  RESTORE_ALL, while loop [arch/i386/kernel/entry.S]

  Notes:


  1. Function "IRQ0x00_interrupt" (like the other IRQ0xXY_interrupt
     routines) is directly pointed to by the IDT (Interrupt Descriptor
     Table, similar to the Real Mode Interrupt Vector Table, see Chapter
     11 for more), so EVERY interrupt coming to the processor is managed
     by an "IRQ0x#NR_interrupt" routine, where #NR is the interrupt
     number. We refer to it as the "wrapper irq handler".

  2. Wrapper routines are executed, like "do_IRQ" and "handle_IRQ_event"
     [arch/i386/kernel/irq.c].

  3. After this, control is passed to the official IRQ routine (pointed
     to by "handler()"), previously registered with "request_irq"
     [arch/i386/kernel/irq.c], in this case "timer_interrupt"
     [arch/i386/kernel/time.c].

  4. The "timer_interrupt" [arch/i386/kernel/time.c] routine is
     executed.

  5. When it ends, control goes back to some assembler routines
     [arch/i386/kernel/entry.S].

  Description:


  To manage Multitasking, Linux (like every other Unix) uses a per-task
  "counter" variable to keep track of how much CPU time the task has
  left. So, on each IRQ 0, the counter is decremented (point 4) and,
  when it reaches 0, we need to switch task to manage timesharing (in
  point 4 the "need_resched" variable is set to 1, then, in point 5, the
  assembler routines check "need_resched" and call, if needed,
  "schedule" [kernel/sched.c]).
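
  The counter logic can be sketched in user space as follows. The
  struct and function names are ours; only the decrement and the
  "need_resched" flag mirror the trace above.

```c
/* Minimal stand-in for the two task fields used by the timer tick. */
struct tick_task {
    int counter;       /* ticks left in the current time slice */
    int need_resched;  /* set when the scheduler must run      */
};

/* What happens to the running task on each timer interrupt. */
void timer_tick(struct tick_task *current)
{
    if (--current->counter <= 0) {
        current->counter = 0;        /* time slice is over         */
        current->need_resched = 1;   /* ask for a reschedule later */
    }
}
```

  A task that starts with counter = 2 keeps running after the first
  tick and is flagged for rescheduling on the second one.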


  6.3.  Scheduler

  The scheduler is the piece of code that chooses what Task has to be
  executed at a given time.


  Any time the running task has to change, the scheduler selects a
  candidate.  Below is the ''schedule [kernel/sched.c]'' function.


  |schedule
     |do_softirq // manages post-IRQ work
     |for each task
        |calculate counter
     |prepare_to_switch // does nothing on i386
     |switch_mm // change Memory context (change CR3 value)
     |switch_to (assembler)
        |SAVE ESP
        |RESTORE future_ESP
        |SAVE EIP
        |push future_EIP *** push parameter as we did a call
           |jmp __switch_to (it does some TSS work)
           |__switch_to()
            ..
           |ret *** ret from call using future_EIP in place of call address
        new_task
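
  The "for each task, calculate counter" step can be illustrated with a
  heavily simplified sketch. It assumes that the candidate is simply
  the runnable task with the largest counter and that, when every
  runnable task has used up its slice, all counters are recalculated as
  (counter >> 1) + base_ticks. The real function also weighs priority
  and other factors; sched_task, base_ticks and pick_next are
  hypothetical names.

```c
#include <stddef.h>

struct sched_task {
    int runnable;    /* non-zero if the task is in the ready list   */
    int counter;     /* ticks left in its time slice                */
    int base_ticks;  /* slice granted at each recalculation, > 0    */
};

/* Return the task to run next, or NULL if no task is runnable. */
struct sched_task *pick_next(struct sched_task *tasks, size_t n)
{
    for (;;) {
        struct sched_task *best = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
            if (!tasks[i].runnable)
                continue;
            if (best == NULL || tasks[i].counter > best->counter)
                best = &tasks[i];
        }
        if (best == NULL)
            return NULL;              /* nothing to run */
        if (best->counter > 0)
            return best;

        /* Every runnable task is out of ticks: recalculate all
         * counters (sleeping tasks keep half of what they had). */
        for (i = 0; i < n; i++)
            tasks[i].counter = (tasks[i].counter >> 1)
                               + tasks[i].base_ticks;
    }
}
```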


  6.4.  Bottom Half, Task Queues, and Tasklets

  6.4.1.  Overview

  In classic Unix, when an IRQ comes (from a device), Unix makes a "task
  switch" to serve the task that requested the device.


  To improve performance, Linux can postpone the non-urgent work until
  later, to better manage high speed events.


  This feature has been managed since kernel 1.x by the "bottom half"
  (BH). The irq handler "marks" a bottom half, to be executed later, at
  scheduling time.


  In the latest kernels there is a "task queue" that is more dynamic
  than BH and there is also a "tasklet" to manage multiprocessor
  environments.

  BH schema is:


  1. Declaration

  2. Mark

  3. Execution

  6.4.2.  Declaration


  #define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
  #define LIST_HEAD(name) \
     struct list_head name = LIST_HEAD_INIT(name)
  struct list_head {
     struct list_head *next, *prev;
  };
  #define LIST_HEAD_INIT(name) { &(name), &(name) }

        ''DECLARE_TASK_QUEUE'' [include/linux/tqueue.h, include/linux/list.h]


  "DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named "q"
  managing task queue.
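
  To see what the macro expands to, the same definitions can be
  compiled in user space. The macros below are copied from the listing
  above; list_empty and list_add_tail are simplified re-implementations
  written for this example.

```c
struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)
#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)

/* An empty queue is a head that points to itself. */
int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Insert entry just before head, that is at the end of the queue. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}
```

  So "DECLARE_TASK_QUEUE(q)" simply creates an empty circular list
  named "q", to which queued work is appended later.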


  6.4.3.  Mark

  Here is the ICA schema for "mark_bh" [include/linux/interrupt.h]
  function:


  |mark_bh(NUMBER)
     |tasklet_hi_schedule(bh_task_vec + NUMBER)
        |insert into tasklet_hi_vec
           |__cpu_raise_softirq(HI_SOFTIRQ)
              |soft_active |= (1 << HI_SOFTIRQ)

                     ''mark_bh''[include/linux/interrupt.h]


  For example, when an IRQ handler wants to "postpone" some work, it
  would call "mark_bh(NUMBER)", where NUMBER is a previously declared BH
  (see the section before).


  6.4.4.  Execution

  We can see this being called from the "do_IRQ"
  [arch/i386/kernel/irq.c] function:


  |do_softirq
     |h->action(h)-> softirq_vec[TASKLET_SOFTIRQ]->action -> tasklet_action
        |tasklet_vec[0].list->func


  "h->action(h);" is the function that has been previously queued.


  6.5.  Very low level routines

  set_intr_gate


  set_trap_gate


  set_task_gate (not used).


  void (*interrupt[NR_IRQS])(void) = { IRQ0x00_interrupt,
  IRQ0x01_interrupt, ..}


  NR_IRQS = 224 [kernel 2.4.2]


  6.6.  Task Switching

  6.6.1.  When does Task switching occur?

  Now we'll see how the Linux Kernel switches from one task to another.


  Task Switching is needed in many cases, such as the following:


  o  when the TimeSlice ends, we need to give access to some other task

  o  when a task decides to access a resource, it sleeps for it, so we
     have to choose another task

  o  when a task waits for a pipe, we have to give access to another
     task, which would write to the pipe

  6.6.2.  Task Switching


                             TASK SWITCHING TRICK
  #define switch_to(prev,next,last) do {                                  \
          asm volatile("pushl %%esi\n\t"                                  \
                       "pushl %%edi\n\t"                                  \
                       "pushl %%ebp\n\t"                                  \
                       "movl %%esp,%0\n\t"        /* save ESP */          \
                       "movl %3,%%esp\n\t"        /* restore ESP */       \
                       "movl $1f,%1\n\t"          /* save EIP */          \
                       "pushl %4\n\t"             /* restore EIP */       \
                       "jmp __switch_to\n"                                \
                       "1:\t"                                             \
                       "popl %%ebp\n\t"                                   \
                       "popl %%edi\n\t"                                   \
                       "popl %%esi\n\t"                                   \
                       :"=m" (prev->thread.esp),"=m" (prev->thread.eip),  \
                        "=b" (last)                                       \
                       :"m" (next->thread.esp),"m" (next->thread.eip),    \
                        "a" (prev), "d" (next),                           \
                        "b" (prev));                                      \
  } while (0)


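  The trick is here:


  1. ''pushl %4'' puts future_EIP (next->thread.eip) on the stack

  2. ''jmp __switch_to'' executes the ''__switch_to'' function but, as
     opposed to a ''call'', its ''ret'' returns to the value pushed in
     point 1 (so the new Task!)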


        U S E R   M O D E                 K E R N E L     M O D E

   |          |     |          |       |          |     |          |
   |          |     |          | Timer |          |     |          |
   |          |     |  Normal  |  IRQ  |          |     |          |
   |          |     |   Exec   |------>|Timer_Int.|     |          |
   |          |     |     |    |       | ..       |     |          |
   |          |     |    \|/   |       |schedule()|     | Task1 Ret|
   |          |     |          |       |_switch_to|<--  |  Address |
   |__________|     |__________|       |          |  |  |          |
                                       |          |  |S |          |
  Task1 Data/Stack   Task1 Code        |          |  |w |          |
                                       |          | T|i |          |
                                       |          | a|t |          |
   |          |     |          |       |          | s|c |          |
   |          |     |          | Timer |          | k|h |          |
   |          |     |  Normal  |  IRQ  |          |  |i |          |
   |          |     |   Exec   |------>|Timer_Int.|  |n |          |
   |          |     |     |    |       | ..       |  |g |          |
   |          |     |    \|/   |       |schedule()|  |  | Task2 Ret|
   |          |     |          |       |_switch_to|<--  |  Address |
   |__________|     |__________|       |__________|     |__________|

  Task2 Data/Stack   Task2 Code        Kernel Code  Kernel Data/Stack


  6.7.  Fork

  6.7.1.  Overview

  Fork is used to create another task. We start from a Task Parent, and
  we copy many data structures to Task Child.


                                 |         |
                                 | ..      |
           Task Parent           |         |
           |         |           |         |
           |  fork   |---------->|  CREATE |
           |         |          /|   NEW   |
           |_________|         / |   TASK  |
                              /  |         |
               ---           /   |         |
               ---          /    | ..      |
                           /     |         |
           Task Child     /
           |         |   /
           |  fork   |<-/
           |         |
           |_________|

                         Fork SysCall

  6.7.2.  What is not copied

  The newly created Task (''Task Child'') is almost equal to its Parent
  (''Task Parent''); there are only a few differences:


  1. obviously the PID

  2. the child ''fork()'' will return 0, while the parent ''fork()''
     will return the PID of the Task Child, to distinguish them from
     each other in User Mode

  3. All child data pages are marked ''READ + EXECUTE'', no ''WRITE''
     (the parent's writable pages are write-protected in the same way by
     "copy_page_range") so, when a write request comes, a ''Page Fault''
     exception is generated which will create a new independent page:
     this mechanism is called ''Copy on Write'' (see Chapter 10 for
     more).
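
  From User Mode these differences can be observed with a short
  program. The sketch below uses only standard POSIX calls; the
  function name fork_cow_demo is ours. The child changes a variable and
  exits, and the parent then checks that its own copy is untouched.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that overwrites its own copy of a variable.
 * Returns the parent's value of that variable (still 1), or -1 on
 * error; *child_exit_code receives the value the child saw. */
int fork_cow_demo(int *child_exit_code)
{
    int value = 1;
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: fork() returned 0. This write goes to a private
         * copy of the page, the parent never sees it. */
        value = 42;
        _exit(value);
    }

    /* Parent: fork() returned the PID of the child. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    *child_exit_code = WEXITSTATUS(status);
    return value;
}
```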

  6.7.3.  Fork ICA


  |sys_fork
     |do_fork
        |alloc_task_struct
           |__get_free_pages
         |p->state = TASK_UNINTERRUPTIBLE
         |copy_flags
         |p->pid = get_pid
         |copy_files
         |copy_fs
         |copy_sighand
         |copy_mm // should manage CopyOnWrite (I part)
            |allocate_mm
            |mm_init
               |pgd_alloc -> get_pgd_fast
                  |get_pgd_slow
            |dup_mmap
               |copy_page_range
                  |ptep_set_wrprotect
                     |clear_bit // set page to read-only
            |copy_segments // For LDT
         |copy_thread
            |childregs->eax = 0
            |p->thread.esp = childregs // child fork returns 0
            |p->thread.eip = ret_from_fork // child starts from fork exit
         |retval = p->pid // parent fork returns child pid
         |SET_LINKS // insertion of task into the list pointers
         |nr_threads++ // Global variable
         |wake_up_process(p) // Now we can wake up just created child
         |return retval

                 fork ICA


  o  sys_fork [arch/i386/kernel/process.c]

  o  do_fork [kernel/fork.c]

  o  alloc_task_struct [include/asm/processor.h]

  o  __get_free_pages [mm/page_alloc.c]

  o  get_pid [kernel/fork.c]

  o  copy_files

  o  copy_fs

  o  copy_sighand

  o  copy_mm

  o  allocate_mm

  o  mm_init

  o  pgd_alloc -> get_pgd_fast [include/asm/pgalloc.h]

  o  get_pgd_slow

  o  dup_mmap [kernel/fork.c]

  o  copy_page_range [mm/memory.c]

  o  ptep_set_wrprotect [include/asm/pgtable.h]

  o  clear_bit [include/asm/bitops.h]

  o  copy_segments [arch/i386/kernel/process.c]

  o  copy_thread

  o  SET_LINKS [include/linux/sched.h]

  o  wake_up_process [kernel/sched.c]

  6.7.4.  Copy on Write

  To implement Copy on Write for Linux:


  1. Mark all copied pages as read-only, causing a Page Fault when a
     Task tries to write to them.

  2. Page Fault handler creates a new page.


   | Page
   | Fault
   | Exception
   |
   |
   -----------> |do_page_fault
                   |handle_mm_fault
                      |handle_pte_fault
                         |do_wp_page
                            |alloc_page      // Allocate a new page
                            |break_cow
                               |copy_cow_page // Copy old page to new one
                               |establish_pte // reconfig Page Table pointers
                                  |set_pte

                Page Fault ICA


  o  do_page_fault [arch/i386/mm/fault.c]

  o  handle_mm_fault [mm/memory.c]

  o  handle_pte_fault

  o  do_wp_page

  o  alloc_page [include/linux/mm.h]

  o  break_cow [mm/memory.c]

  o  copy_cow_page

  o  establish_pte

  o  set_pte [include/asm/pgtable-3level.h]

  7.  Linux Memory Management

  7.1.  Overview

  Linux uses segmentation + pagination, which simplifies notation.


  7.1.1.  Segments

  Linux uses only 4 segments:


  o  2 segments (code and data/stack) for KERNEL SPACE from [0xC000
     0000] (3 GB) to [0xFFFF FFFF] (4 GB)

  o  2 segments (code and data/stack) for USER SPACE from [0] (0 GB) to
     [0xBFFF FFFF] (3 GB)


                                 __
     4 GB--->|                |    |
             |     Kernel     |    |  Kernel Space (Code + Data/Stack)
             |                |  __|
     3 GB--->|----------------|  __
             |                |    |
             |                |    |
     2 GB--->|                |    |
             |     Tasks      |    |  User Space (Code + Data/Stack)
             |                |    |
     1 GB--->|                |    |
             |                |    |
             |________________|  __|
   0x00000000
            Kernel/User Linear addresses


  7.2.  Specific i386 implementation

  Again, Linux implements Pagination using 3 Levels of Paging, but in
  i386 architecture only 2 of them are really used:


     ------------------------------------------------------------------
     L    I    N    E    A    R         A    D    D    R    E    S    S
     ------------------------------------------------------------------
          \___/                 \___/                     \_____/

       PD offset              PF offset                 Frame offset
       [10 bits]              [10 bits]                 [12 bits]
            |                     |                          |
            |                     |     -----------          |
            |                     |     |  Value  |----------|---------
            |     |         |     |     |---------|   /|\    |        |
            |     |         |     |     |         |    |     |        |
            |     |         |     |     |         |    | Frame offset |
            |     |         |     |     |         |   \|/             |
            |     |         |     |     |---------|<------            |
            |     |         |     |     |         |      |            |
            |     |         |     |     |         |      | x 4096     |
            |     |         |  PF offset|_________|-------            |
            |     |         |       /|\ |         |                   |
        PD offset |_________|-----   |  |         |          _________|
              /|\ |         |    |   |  |         |          |
               |  |         |    |  \|/ |         |         \|/
   _____       |  |         |    ------>|_________|   PHYSICAL ADDRESS
  |     |     \|/ |         |    x 4096 |         |
  | CR3 |-------->|         |           |         |
  |_____|         | ....... |           | ....... |
                  |         |           |         |

                 Page Directory          Page File

                         Linux i386 Paging
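
  The split of a 32-bit linear address shown in the figure (10 + 10 +
  12 bits) can be written in C as below. The function names are ours;
  they follow the labels used in the figure.

```c
#include <stdint.h>

/* Top 10 bits: index into the Page Directory. */
uint32_t pd_index(uint32_t linear)
{
    return linear >> 22;
}

/* Middle 10 bits: index into the Page File selected above. */
uint32_t pf_index(uint32_t linear)
{
    return (linear >> 12) & 0x3ff;
}

/* Low 12 bits: offset inside the 4096-byte frame. */
uint32_t frame_offset(uint32_t linear)
{
    return linear & 0xfff;
}
```

  For example, the linear address 0xC0000000 (3 GB) has Page Directory
  index 768, which is where the kernel entries start.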


  7.3.  Memory Mapping

  Linux manages Access Control with Pagination only, so different Tasks
  will have the same segment addresses, but a different CR3 (the
  register used to store the Page Directory address), pointing to
  different Page Entries.


  In User mode a task cannot go beyond the 3 GB limit (0xC0000000), so
  only the first 768 page directory entries are meaningful (768*4MB =
  3GB).


  When a Task goes into Kernel Mode (by System call or by IRQ) the other
  256 page directory entries become important, and they point to the
  same page files as in all other Tasks (which are the same as the
  Kernel's).


  Note that Kernel (and only kernel) Linear Space is equal to Kernel
  Physical Space, so:


              ________________ _____
             |Other KernelData|___  |  |                |
             |----------------|   | |__|                |
             |     Kernel     |\  |____|   Real Other   |
    3 GB --->|----------------| \      |   Kernel Data  |
             |                |\ \     |                |
             |              __|_\_\____|__   Real       |
             |      Tasks     |  \ \   |     Tasks      |
             |              __|___\_\__|__   Space      |
             |                |    \ \ |                |
             |                |     \ \|----------------|
             |                |      \ |Real KernelSpace|
             |________________|       \|________________|

             Logical Addresses          Physical Addresses


  Linear Kernel Space corresponds to Physical Kernel Space translated 3
  GB down (in fact page tables are something like { "00000000",
  "00000001" }, so they operate no virtualization, they only report the
  physical addresses they take from the linear ones).
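
  The "3 GB down" translation is plain arithmetic, as the following
  sketch shows. PAGE_OFFSET is the name the kernel uses for the 3 GB
  constant on i386; the two helper names are ours, and the conversion
  is only valid for the directly mapped kernel area.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u  /* start of Kernel Space (3 GB) */

/* Kernel linear address -> physical address. */
uint32_t kernel_linear_to_phys(uint32_t linear)
{
    return linear - PAGE_OFFSET;
}

/* Physical address -> kernel linear address. */
uint32_t phys_to_kernel_linear(uint32_t phys)
{
    return phys + PAGE_OFFSET;
}
```

  So a kernel linear address such as 0xC0100000 refers to physical
  address 0x00100000 (1 MB).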


  Notice that you'll not have an "addresses conflict" between Kernel and
  User spaces because we can manage physical addresses with Page Tables.


  7.4.  Low level memory allocation

  7.4.1.  Boot Initialization

  We start from kmem_cache_init (launched by start_kernel [init/main.c]
  at boot up).


  |kmem_cache_init
     |kmem_cache_estimate


  kmem_cache_init [mm/slab.c]


  kmem_cache_estimate


  Now we continue with mem_init (also launched by
  start_kernel[init/main.c])


  |mem_init
     |free_all_bootmem
        |free_all_bootmem_core


  mem_init [arch/i386/mm/init.c]


  free_all_bootmem [mm/bootmem.c]

  free_all_bootmem_core


  7.4.2.  Run-time allocation

  Under Linux, when we want to allocate memory, for example during the
  "copy_on_write" mechanism (see Chapter 10), we call:


  |copy_mm
     |allocate_mm = kmem_cache_alloc
        |__kmem_cache_alloc
           |kmem_cache_alloc_one
              |alloc_new_slab
                 |kmem_cache_grow
                    |kmem_getpages
                       |__get_free_pages
                          |alloc_pages
                             |alloc_pages_pgdat
                                |__alloc_pages
                                   |rmqueue
                                   |reclaim_pages


  Functions can be found under:


  o  copy_mm [kernel/fork.c]

  o  allocate_mm [kernel/fork.c]

  o  kmem_cache_alloc [mm/slab.c]

  o  __kmem_cache_alloc

  o  kmem_cache_alloc_one

  o  alloc_new_slab

  o  kmem_cache_grow

  o  kmem_getpages

  o  __get_free_pages [mm/page_alloc.c]

  o  alloc_pages [mm/numa.c]

  o  alloc_pages_pgdat

  o  __alloc_pages [mm/page_alloc.c]

  o  rmqueue

  o  reclaim_pages [mm/vmscan.c]

  TODO: Understand Zones


  7.5.  Swap


  7.5.1.  Overview

  Swap is managed by the kswapd daemon (kernel thread).


  7.5.2.  kswapd

  Like other kernel threads, kswapd has a main loop that sleeps until it
  is woken up.


  |kswapd
     |// initialization routines
     |for (;;) { // Main loop
        |do_try_to_free_pages
        |recalculate_vm_stats
        |refill_inactive_scan
        |run_task_queue
        |interruptible_sleep_on_timeout // we sleep for a new swap request
     |}


  o  kswapd [mm/vmscan.c]

  o  do_try_to_free_pages

  o  recalculate_vm_stats [mm/swap.c]

  o  refill_inactive_scan [mm/vmscan.c]

  o  run_task_queue [kernel/softirq.c]

  o  interruptible_sleep_on_timeout [kernel/sched.c]

  7.5.3.  When do we need swapping?

  Swapping is needed when we have to access a page that is not in
  physical memory.


  Linux uses ''kswapd'' kernel thread to carry out this purpose.  When
  the Task receives a page fault exception we do the following:


   | Page Fault Exception
   | caused by all these conditions:
   |   a-) User page
   |   b-) Read or write access
   |   c-) Page not present
   |
   |
   -----------> |do_page_fault
                   |handle_mm_fault
                      |pte_alloc
                         |pte_alloc_one
                            |__get_free_page = __get_free_pages
                               |alloc_pages
                                  |alloc_pages_pgdat
                                     |__alloc_pages
                                        |wakeup_kswapd // We wake up kernel thread kswapd

                     Page Fault ICA


  o  do_page_fault [arch/i386/mm/fault.c]

  o  handle_mm_fault [mm/memory.c]

  o  pte_alloc

  o  pte_alloc_one [include/asm/pgalloc.h]

  o  __get_free_page [include/linux/mm.h]

  o  __get_free_pages [mm/page_alloc.c]

  o  alloc_pages [mm/numa.c]

  o  alloc_pages_pgdat

  o  __alloc_pages

  o  wakeup_kswapd [mm/vmscan.c]

  8.  Linux Networking

  8.1.  How Linux networking is managed?

  There exists a device driver for each kind of NIC. Inside it, Linux
  will ALWAYS call a standard high level routine: "netif_rx
  [net/core/dev.c]", which will check what level 3 protocol the frame
  belongs to, and it will call the right level 3 function (so we'll use
  a pointer to the function to determine which is right).
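
  Choosing the level 3 function through a pointer can be sketched like
  this. ETH_P_IP and ETH_P_ARP are the standard Ethernet protocol
  numbers; the structures are reduced to what the example needs, and
  demo_ip_rcv, demo_arp_rcv, demo_types and deliver are hypothetical
  names.

```c
#include <stddef.h>

#define ETH_P_IP   0x0800
#define ETH_P_ARP  0x0806

struct frame {
    unsigned short protocol;   /* level 3 protocol of the frame */
};

struct packet_type {
    unsigned short type;            /* protocol number handled */
    int (*func)(struct frame *);    /* receive function        */
};

/* Stand-ins for the real receive functions. */
static int demo_ip_rcv(struct frame *f)  { (void)f; return 1; }
static int demo_arp_rcv(struct frame *f) { (void)f; return 2; }

static const struct packet_type demo_types[] = {
    { ETH_P_IP,  demo_ip_rcv  },
    { ETH_P_ARP, demo_arp_rcv },
};

/* Call the handler registered for the frame's protocol.
 * Returns the handler's result, or -1 if nobody handles it. */
int deliver(const struct packet_type *types, size_t n, struct frame *f)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (types[i].type == f->protocol)
            return types[i].func(f);
    return -1;
}
```

  Adding a new protocol then only means adding one more entry to the
  table, without touching the dispatch loop.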


  8.2.  TCP example

  We'll now see an example of what happens when a TCP packet is sent to
  Linux, starting from the ''netif_rx [net/core/dev.c]'' call.


  8.2.1.  Interrupt management: "netif_rx"


  |netif_rx
     |__skb_queue_tail
        |qlen++
        |* simple pointer insertion *
     |cpu_raise_softirq
        |softirq_active(cpu) |= (1 << NET_RX_SOFTIRQ) // set bit NET_RX_SOFTIRQ in the BH vector


  Functions:


  o  __skb_queue_tail [include/linux/skbuff.h]

  o  cpu_raise_softirq [kernel/softirq.c]

  8.2.2.  Post Interrupt management: "net_rx_action"

  Once IRQ interaction is ended, we need to follow the next part of the
  frame life and examine what NET_RX_SOFTIRQ does.


  We will next call ''net_rx_action [net/core/dev.c]'' according to
  "net_dev_init [net/core/dev.c]".


  |net_rx_action
     |skb = __skb_dequeue (the exact opposite of __skb_queue_tail)
     |for (ptype = first_protocol; ptype < max_protocol; ptype++) // Determine
        |if (skb->protocol == ptype)                               // what is the network protocol
           |ptype->func -> ip_rcv // according to ''struct ip_packet_type [net/ipv4/ip_output.c]''

      **** NOW WE KNOW THAT PACKET IS IP ****
           |ip_rcv
              |NF_HOOK (ip_rcv_finish)
                 |ip_route_input // search from routing table to determine function to call
                    |skb->dst->input -> ip_local_deliver // according to previous routing table check, destination is local machine
                       |ip_defrag // reassembles IP fragments
                          |NF_HOOK (ip_local_deliver_finish)
                             |ipprot->handler -> tcp_v4_rcv // according to ''tcp_protocol [net/ipv4/protocol.c]''

       **** NOW WE KNOW THAT PACKET IS TCP ****
                             |tcp_v4_rcv
                                |sk = __tcp_v4_lookup
                                |tcp_v4_do_rcv
                                   |switch(sk->state)

       *** Packet can be sent to the task which uses relative socket ***
                                   |case TCP_ESTABLISHED:
                                      |tcp_rcv_established
                                         |__skb_queue_tail // enqueue packet to socket
                                         |sk->data_ready -> sock_def_readable
                                            |wake_up_interruptible


       *** Packet still has to complete the 3-way TCP handshake ***
                                   |case TCP_LISTEN:
                                      |tcp_v4_hnd_req
                                         |tcp_v4_search_req
                                         |tcp_check_req
                                            |syn_recv_sock -> tcp_v4_syn_recv_sock
                                         |__tcp_v4_lookup_established
                                   |tcp_rcv_state_process

                      *** 3-Way TCP Handshake ***
                                      |switch(sk->state)
                                      |case TCP_LISTEN: // We received SYN
                                         |conn_request -> tcp_v4_conn_request
                                            |tcp_v4_send_synack // Send SYN + ACK
                                               |tcp_v4_synq_add // set SYN state
                                      |case TCP_SYN_SENT: // we received SYN + ACK
                                         |tcp_rcv_synsent_state_process
                                            tcp_set_state(TCP_ESTABLISHED)
                                               |tcp_send_ack
                                                  |tcp_transmit_skb
                                                     |queue_xmit -> ip_queue_xmit
                                                        |ip_queue_xmit2
                                                           |skb->dst->output
                                      |case TCP_SYN_RECV: // We received ACK
                                         |if (ACK)
                                            |tcp_set_state(TCP_ESTABLISHED)


  Functions can be found under:


  o  net_rx_action [net/core/dev.c]

  o  __skb_dequeue [include/linux/skbuff.h]

  o  ip_rcv [net/ipv4/ip_input.c]

  o  NF_HOOK -> nf_hook_slow [net/core/netfilter.c]

  o  ip_rcv_finish [net/ipv4/ip_input.c]

  o  ip_route_input [net/ipv4/route.c]

  o  ip_local_deliver [net/ipv4/ip_input.c]

  o  ip_defrag [net/ipv4/ip_fragment.c]

  o  ip_local_deliver_finish [net/ipv4/ip_input.c]

  o  tcp_v4_rcv [net/ipv4/tcp_ipv4.c]

  o  __tcp_v4_lookup

  o  tcp_v4_do_rcv

  o  tcp_rcv_established [net/ipv4/tcp_input.c]

  o  __skb_queue_tail [include/linux/skbuff.h]

  o  sock_def_readable [net/core/sock.c]

  o  wake_up_interruptible [include/linux/sched.h]

  o  tcp_v4_hnd_req [net/ipv4/tcp_ipv4.c]

  o  tcp_v4_search_req

  o  tcp_check_req

  o  tcp_v4_syn_recv_sock

  o  __tcp_v4_lookup_established

  o  tcp_rcv_state_process [net/ipv4/tcp_input.c]

  o  tcp_v4_conn_request [net/ipv4/tcp_ipv4.c]

  o  tcp_v4_send_synack

  o  tcp_v4_synq_add

  o  tcp_rcv_synsent_state_process [net/ipv4/tcp_input.c]

  o  tcp_set_state [include/net/tcp.h]

  o  tcp_send_ack [net/ipv4/tcp_output.c]

  Description:


  o  First we determine protocol type (IP, then TCP)

  o  NF_HOOK (function) is a wrapper routine that first manages the
     network filter (for example firewall), then it calls ''function''.

  o  After that, we manage the 3-way TCP Handshake, which consists of:


  SERVER (LISTENING)                       CLIENT (CONNECTING)
                             SYN
                     <-------------------


                          SYN + ACK
                     ------------------->


                             ACK
                     <-------------------

                      3-Way TCP handshake


  o  In the end we only have to call "tcp_rcv_established
     [net/ipv4/tcp_input.c]", which gives the packet to the user socket
     and wakes it up.
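
  The state changes of the handshake can be sketched as a tiny state
  machine. This is only an illustration under simplifying assumptions:
  the hs_* names and flag values are hypothetical, one state variable
  stands for each endpoint, and none of this is the kernel's code (the
  kernel tracks pending connections separately from the listening
  socket).

```c
#include <assert.h>

/* Hypothetical, simplified states; names echo the kernel's TCP_* constants. */
enum hs_state { HS_CLOSED, HS_LISTEN, HS_SYN_SENT, HS_SYN_RECV, HS_ESTABLISHED };

/* Flags carried by an incoming segment (illustrative bit values). */
enum { HS_FLAG_SYN = 1, HS_FLAG_ACK = 2 };

/* Return the next state of one endpoint after it receives `flags`. */
enum hs_state hs_receive(enum hs_state s, int flags)
{
    switch (s) {
    case HS_LISTEN:     /* server: SYN arrives, SYN+ACK is sent back */
        return (flags == HS_FLAG_SYN) ? HS_SYN_RECV : s;
    case HS_SYN_SENT:   /* client: SYN+ACK arrives, ACK is sent back */
        return (flags == (HS_FLAG_SYN | HS_FLAG_ACK)) ? HS_ESTABLISHED : s;
    case HS_SYN_RECV:   /* server: the final ACK arrives */
        return (flags & HS_FLAG_ACK) ? HS_ESTABLISHED : s;
    default:
        return s;
    }
}

/* Walk both endpoints through the exchange; returns 1 if both are ESTABLISHED. */
int hs_demo(void)
{
    enum hs_state server = HS_LISTEN;
    enum hs_state client = HS_SYN_SENT;     /* the client has already sent SYN */

    server = hs_receive(server, HS_FLAG_SYN);               /* SYN     */
    assert(server == HS_SYN_RECV);
    client = hs_receive(client, HS_FLAG_SYN | HS_FLAG_ACK); /* SYN+ACK */
    server = hs_receive(server, HS_FLAG_ACK);               /* ACK     */
    return server == HS_ESTABLISHED && client == HS_ESTABLISHED;
}
```

  Calling hs_demo() replays the three segments of the diagram in order;
  a segment that does not fit the current state leaves it unchanged.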

  9.  Linux File System

  TODO


  10.  Useful Tips

  10.1.  Stack and Heap

  10.1.1.  Overview

  Here we look at how the "stack" and the "heap" are allocated in memory.


  10.1.2.  Memory allocation


  FF..        |                 | <-- bottom of the stack
         /|\  |                 |   |
   higher |   |                 |   |   stack
   values |   |                 |  \|/  growing
              |                 |
  XX..        |                 | <-- top of the stack [Stack Pointer]
              |                 |
              |                 |
              |                 |
  00..        |_________________| <-- end of stack [Stack Segment]

                     Stack


  Memory address values start from 00.. (which is also where the Stack
  Segment begins) and grow toward the FF.. value.


  XX.. is the current value of the Stack Pointer.


  The stack is used by functions for:


  1. function parameters

  2. local variables

  3. return address

  For example, for a classical function:


   |int foo_function (parameter_1, parameter_2, ..., parameter_n) {
      |variable_1 declaration;
      |variable_2 declaration;
        ..
      |variable_n declaration;

      |// Body function
      |dynamic variable_1 declaration;
      |dynamic variable_2 declaration;
       ..
      |dynamic variable_n declaration;

      |// Code is inside Code Segment, not Data/Stack segment!

      |return (ret-type) value; // often it is inside some register, for i386 eax register is used.
   |}
  we have

            |                       |
            | 1. parameter_1 pushed | \
      S     | 2. parameter_2 pushed |  | Before
      T     | ...................   |  | the calling
      A     | n. parameter_n pushed | /
      C     | ** Return address **  | -- Calling
      K     | 1. local variable_1   | \
            | 2. local variable_2   |  | After
            | .................     |  | the calling
            | n. local variable_n   | /
            |                       |
           ...                     ...   Free
           ...                     ...   stack
            |                       |
      H     | n. dynamic variable_n | \
      E     | ...................   |  | Allocated by
      A     | 2. dynamic variable_2 |  | malloc & kmalloc
      P     | 1. dynamic variable_1 | /
            |_______________________|

              Typical stack usage

  Note: variables order can be different depending on hardware architecture.
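
  A minimal user-space sketch of the two kinds of storage shown above.
  The helper names heap_copy and stack_sum are hypothetical and chosen
  only for this illustration; malloc stands in for kmalloc, which is
  available only inside the kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of `src`; the caller must free() it. */
char *heap_copy(const char *src)
{
    size_t len;                 /* local variable: lives in this stack frame */
    char *copy;

    assert(src != NULL);
    len = strlen(src) + 1;
    copy = malloc(len);         /* the buffer itself lives on the heap */
    if (copy != NULL)
        memcpy(copy, src, len);
    return copy;                /* the heap buffer outlives this frame */
}

/* Sums 1..n recursively: every active call has its own frame on the
   stack holding its parameter, return address and local variable. */
unsigned long stack_sum(unsigned int n)
{
    unsigned long partial;      /* one copy per active call */

    if (n == 0)
        return 0;
    partial = stack_sum(n - 1);
    return partial + n;
}
```

  The pointer returned by heap_copy stays valid after the function
  returns, while each `partial` disappears as soon as its call returns.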


  10.2.  Application vs Process

  10.2.1.  Base definition

  We have to distinguish 2 concepts:


  o  Application: the useful code we want to execute

  o  Process: the IMAGE in memory of the application (it depends on the
     memory strategy used, segmentation and/or paging).

  Often a Process is also called a Task or a Thread.


  10.3.  Locks

  10.3.1.  Overview

  There are 2 kinds of locks:


  1. intraCPU

  2. interCPU

  10.4.  Copy_on_write

  Copy_on_write is a mechanism used to reduce memory usage. It postpones
  memory allocation until the memory is really needed.


  For example, when a task executes the "fork()" system call (to create
  another task), we still use the same memory pages as the parent, in
  read only mode. When a task WRITES into the page, it causes an
  exception and the page is copied and marked "rw" (read, write).


  1-) Page X is shared between Task Parent and Task Child
   Task Parent
   |         | RO Access  ______
   |         |---------->|Page X|
   |_________|           |______|
                            /|\
                             |
   Task Child                |
   |         | RO Access     |
   |         |----------------
   |_________|


  2-) Write request
   Task Parent
   |         | RO Access  ______
   |         |---------->|Page X|    Trying to write
   |_________|           |______|
                            /|\
                             |
   Task Child                |
   |         | RO Access     |
   |         |----------------
   |_________|


  3-) Final Configuration: Both Task Parent and Task Child have an independent copy of the Page, X and Y
   Task Parent
   |         | RW Access  ______
   |         |---------->|Page X|
   |_________|           |______|


   Task Child
   |         | RW Access  ______
   |         |---------->|Page Y|
   |_________|           |______|
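
  The visible effect of the three steps above can be checked from user
  space with a small sketch. It cannot observe the page fault or the
  copy itself, only the result: after the child writes, the parent
  still sees its own value. The function name is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's value of a variable after the child has
   overwritten its own copy, or -1 if fork() or waitpid() fails. */
int cow_parent_value_after_child_write(void)
{
    volatile int shared = 1;    /* its page is shared right after fork() */
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {             /* child: the write makes the page private */
        shared = 2;
        _exit(shared == 2 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    return shared;              /* parent: untouched by the child's write */
}
```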


  11.  80386 specific details

  11.1.  Boot procedure


  bbootsect.s [arch/i386/boot]
  setup.S (+video.S)
  head.S (+misc.c) [arch/i386/boot/compressed]
  start_kernel [init/main.c]


  11.2.  80386 (and more) Descriptors

  11.2.1.  Overview

  Descriptors are data structures used by Intel i386+ microprocessors to
  virtualize memory.


  11.2.2.  Kinds of descriptors


  o  GDT (Global Descriptor Table)

  o  LDT (Local Descriptor Table)

  o  IDT (Interrupt Descriptor Table)

  12.  IRQ

  12.1.  Overview

  An IRQ is an asynchronous signal sent to the microprocessor to notify
  it that a requested piece of work has been completed.


  12.2.  Interaction schema


                                   |<-->  IRQ(0) [Timer]
                                   |<-->  IRQ(1) [Device 1]
                                   | ..
                                   |<-->  IRQ(n) [Device n]
      _____________________________|
       /|\      /|\          /|\
        |        |            |
       \|/      \|/          \|/

      Task(1)  Task(2) ..   Task(N)


               IRQ - Tasks Interaction Schema


  12.2.1.  What happens?

  A typical O.S. uses many IRQ signals to interrupt normal process
  execution and do some housekeeping work. So:


  1. IRQ(i) occurs and Task(j) is interrupted

  2. IRQ(i)_handler is executed

  3. control returns to the interrupted Task(j)

  Under Linux, when an IRQ arrives, first the IRQ wrapper routine (named
  "interrupt0x??") is called, then the "official" IRQ(i)_handler is
  executed. This allows some duties like timeslice preemption.
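
  The wrapper-then-handler idea can be sketched in plain C as follows.
  Everything here is hypothetical and simplified (the demo_* names, the
  table size, the 10-tick timeslice); it only illustrates how a common
  wrapper lets the kernel decide about preemption after the specific
  handler has run.

```c
#include <assert.h>
#include <stddef.h>

#define NR_DEMO_IRQS 16             /* hypothetical number of IRQ lines */

typedef void (*irq_handler_fn)(int irq);

static irq_handler_fn demo_handlers[NR_DEMO_IRQS]; /* one handler per line */
int demo_need_resched;              /* set when the timeslice has expired */
int demo_ticks;                     /* counts timer interrupts */

/* Install `fn` as the handler for line `irq`. */
void demo_request_irq(int irq, irq_handler_fn fn)
{
    assert(irq >= 0 && irq < NR_DEMO_IRQS);
    demo_handlers[irq] = fn;
}

/* Example handler for the timer: ask for a reschedule every 10 ticks. */
void demo_timer_handler(int irq)
{
    (void)irq;
    if (++demo_ticks % 10 == 0)
        demo_need_resched = 1;
}

/* The wrapper: runs the specific handler, then does the common duty.
   Returns 1 if the interrupted task should be preempted. */
int demo_irq_wrapper(int irq)
{
    assert(irq >= 0 && irq < NR_DEMO_IRQS);
    if (demo_handlers[irq] != NULL)
        demo_handlers[irq](irq);
    if (demo_need_resched) {
        demo_need_resched = 0;
        return 1;
    }
    return 0;
}
```

  Ten timer interrupts delivered through demo_irq_wrapper produce
  exactly one preemption request; an IRQ line with no handler installed
  simply returns to the interrupted task.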


  13.  Utility functions

  13.1.  list_entry [include/linux/list.h]

  Definition:


  #define list_entry(ptr, type, member) \
  ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))


  Meaning:

  The "list_entry" macro is used to retrieve a pointer to the parent
  struct, given only a pointer to one of its internal members.


  Example:


  struct list_head {
     struct list_head *next, *prev;
  };
  struct __wait_queue {
     unsigned int flags;
     struct task_struct * task;
     struct list_head task_list;
  };

  // and with type definition:
  typedef struct __wait_queue wait_queue_t;

  // we'll have
  wait_queue_t *out = list_entry(tmp, wait_queue_t, task_list);

  // where tmp points to the task_list member (a list_head)


  So, in this case, by means of *tmp pointer [list_head] we retrieve an
  *out pointer [wait_queue_t].


   ____________ <---- *out [we calculate that]
  |flags       |             /|\
  |task *-->   |              |
  |task_list   |<----    list_entry
  |  prev * -->|    |         |
  |  next * -->|    |         |
  |____________|    ----- *tmp [we have this]
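
  The same pointer arithmetic can be tried in user space. The sketch
  below makes two assumptions for portability: it uses the standard
  offsetof macro in place of the null-pointer expression of the kernel
  definition, and it uses a hypothetical struct demo_wait_queue as a
  stand-in for wait_queue_t.

```c
#include <assert.h>
#include <stddef.h>

/* User-space variant of list_entry, built on offsetof. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head {
    struct list_head *next, *prev;
};

struct demo_wait_queue {            /* hypothetical stand-in for wait_queue_t */
    unsigned int flags;
    void *task;
    struct list_head task_list;
};

/* Recover the enclosing element from a pointer to its task_list member. */
struct demo_wait_queue *demo_owner(struct list_head *tmp)
{
    assert(tmp != NULL);
    return list_entry(tmp, struct demo_wait_queue, task_list);
}
```

  Given an element w, demo_owner(&w.task_list) yields &w again: the
  macro subtracts the offset of task_list inside the struct from the
  member's address.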


  13.2.  Sleep

  13.2.1.  Sleep code

  Files:


  o  kernel/sched.c

  o  include/linux/sched.h

  o  include/linux/wait.h

  o  include/linux/list.h

  Functions:


  o  interruptible_sleep_on

  o  interruptible_sleep_on_timeout

  o  sleep_on

  o  sleep_on_timeout

  Called functions:


  o  init_waitqueue_entry

  o  __add_wait_queue

  o  list_add

  o  __list_add

  o  __remove_wait_queue

  InterCallings Analysis:


  |sleep_on
     |init_waitqueue_entry  --
     |__add_wait_queue        |   enqueuing request to resource list
        |list_add             |
           |__list_add      --
     |schedule              ---     waiting for request to be executed
     |__remove_wait_queue   --
        |list_del             |   dequeuing request from resource list
           |__list_del      --


  Description:


  Under Linux each resource (ideally an object shared between many users
  and many processes) has a queue to manage ALL tasks requesting it.


  This queue is called a "wait queue" and it consists of many items that
  we'll call "wait queue elements":


  ***   wait queue structure [include/linux/wait.h]  ***


  struct __wait_queue {
     unsigned int flags;
     struct task_struct * task;
     struct list_head task_list;
  };
  struct list_head {
     struct list_head *next, *prev;
  };


  Graphic working:

          ***  wait queue element  ***

                               /|\
                                |
         <--[prev *, flags, task *, next *]-->


                   ***  wait queue list ***

            /|\           /|\           /|\                /|\
             |             |             |                  |
  --> <--[task1]--> <--[task2]--> <--[task3]--> .... <--[taskN]--> <--
  |                                                                  |
  |__________________________________________________________________|


                ***   wait queue head ***

         task1 <--[prev *, lock, next *]--> taskN


  The "wait queue head" points to the first (with next *) and last (with
  prev *) elements of the "wait queue list".


  When a new element has to be added, "__add_wait_queue"
  [include/linux/wait.h] is called, after which the generic routine
  "list_add" [include/linux/list.h] will be executed:


  ***   function __list_add [include/linux/list.h]  ***

  // classic doubly linked list insert
  static __inline__ void __list_add (struct list_head * new,
                                     struct list_head * prev,
                                     struct list_head * next) {
     next->prev = new;
     new->next = next;
     new->prev = prev;
     prev->next = new;
  }


  To complete the description, we also look at the "__list_del"
  [include/linux/list.h] function, called by "list_del"
  [include/linux/list.h] inside "__remove_wait_queue"
  [include/linux/wait.h]:


  ***   function __list_del [include/linux/list.h]  ***


  // classic doubly linked list delete
  static __inline__ void __list_del (struct list_head * prev, struct list_head * next) {
     next->prev = prev;
     prev->next = next;
  }
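
  The two primitives can be exercised together in a small user-space
  sketch. The demo_* names are hypothetical; the insert and delete
  bodies follow the pointer updates shown above, and the list is
  circular, so an empty list is a head that points to itself.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty circular list: the head points to itself. */
void demo_list_init(struct list_head *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}

/* Link `entry` between two known neighbours. */
static void demo_link(struct list_head *entry,
                      struct list_head *prev,
                      struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Insert `entry` right after `head`. */
void demo_list_add(struct list_head *entry, struct list_head *head)
{
    demo_link(entry, head, head->next);
}

/* Unlink `entry` by joining its two neighbours. */
void demo_list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
}

/* Count the elements, not including the head itself. */
size_t demo_list_len(const struct list_head *head)
{
    size_t n = 0;
    const struct list_head *p;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

  Because each element carries both links, deletion needs only the
  element itself and never a walk from the head.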


  13.2.2.  Stack consideration

  A typical list (or queue) is usually managed by allocating it on the
  Heap (see Chapter 10 for the Heap and Stack definitions and for where
  variables are allocated). Here, instead, we statically allocate the
  Wait Queue data in a local variable (on the Stack); the function is
  then interrupted by scheduling and, in the end (on returning from
  scheduling), the local variable is erased.


    new task <----|          task1 <------|          task2 <------|
                  |                       |                       |
                  |                       |                       |
  |..........|    |       |..........|    |       |..........|    |
  |wait.flags|    |       |wait.flags|    |       |wait.flags|    |
  |wait.task_|____|       |wait.task_|____|       |wait.task_|____|
  |wait.prev |-->         |wait.prev |-->         |wait.prev |-->
  |wait.next |-->         |wait.next |-->         |wait.next |-->
  |..        |            |..        |            |..        |
  |schedule()|            |schedule()|            |schedule()|
  |..........|            |..........|            |..........|
  |__________|            |__________|            |__________|

     Stack                   Stack                   Stack
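
  A user-space sketch of this pattern follows. It is an assumption-laden
  illustration, not the kernel's sleep_on: the demo_* names are
  hypothetical and demo_schedule only records how many waiters are
  queued instead of really suspending the task.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

struct demo_wait {                  /* hypothetical stand-in for wait_queue_t */
    int task_id;
    struct list_head task_list;
};

int demo_waiters_seen;              /* filled in by demo_schedule() */

static int demo_count(const struct list_head *head)
{
    int n = 0;
    const struct list_head *p;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Stand-in for schedule(): records how many waiters are queued. */
void demo_schedule(struct list_head *head)
{
    demo_waiters_seen = demo_count(head);
}

/* Same shape as sleep_on(): the queue element is a local variable.
   Returns the number of waiters left once this one has been removed. */
int demo_sleep_on(struct list_head *head, int task_id)
{
    struct demo_wait wait;          /* lives in this function's stack frame */

    assert(head != NULL);
    wait.task_id = task_id;

    /* enqueue: link the local element right after the head */
    wait.task_list.next = head->next;
    wait.task_list.prev = head;
    head->next->prev = &wait.task_list;
    head->next = &wait.task_list;

    demo_schedule(head);            /* a real task would sleep here */

    /* dequeue before returning, or the list would point into a dead frame */
    wait.task_list.next->prev = wait.task_list.prev;
    wait.task_list.prev->next = wait.task_list.next;
    return demo_count(head);
}
```

  The element must be unlinked before the function returns, because its
  storage belongs to the stack frame and is gone afterwards.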


  14.  Static variables

  14.1.  Overview

  Linux is written in the ''C'' language and, like every application, it
  has:


  1. Local variables

  2. Module variables (inside the source file and visible only to that
     module)

  3. Global/Static variables present in only 1 copy (the same for all
     modules)

  When a Static variable is modified by a module, all other modules will
  see the new value.


  Static variables under Linux are very important, because they are the
  way new support is added to the kernel: they typically are pointers to
  the head of a list of registered elements, which can be:


  o  added

  o  deleted

  o  maybe modified


                             _______      _______      _______
  Global variable  -------> |Item(1)| -> |Item(2)| -> |Item(3)|  ..
                            |_______|    |_______|    |_______|
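
  The register/unregister pattern behind such a list can be sketched as
  below. The demo_* names and the struct layout are hypothetical; real
  kernel lists add locking and carry more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_item {                  /* hypothetical registered element */
    const char *name;
    struct demo_item *next;
};

static struct demo_item *demo_items;    /* the single global list head */

/* Append `item`; returns 0 on success, -1 if the name is already there. */
int demo_register(struct demo_item *item)
{
    struct demo_item **p;

    assert(item != NULL && item->name != NULL);
    for (p = &demo_items; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, item->name) == 0)
            return -1;
    item->next = NULL;
    *p = item;
    return 0;
}

/* Unlink `item`; returns 0 on success, -1 if it was not registered. */
int demo_unregister(struct demo_item *item)
{
    struct demo_item **p;

    for (p = &demo_items; *p != NULL; p = &(*p)->next)
        if (*p == item) {
            *p = item->next;
            return 0;
        }
    return -1;
}

/* Look an element up by name; returns NULL if it is not registered. */
struct demo_item *demo_find(const char *name)
{
    struct demo_item *p;

    for (p = demo_items; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

  Every caller of demo_find sees the effect of a registration made
  elsewhere, because there is only one copy of the list head.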


  14.2.  Main variables

  14.2.1.  Current


                             ________________
  Current ----------------> | Actual process |
                            |________________|


  Current points to the ''task_struct'' structure, which contains all
  data about a process, like:


  o  pid, name, state, counter, scheduling policy

  o  pointers to many data structures like: files, vfs, other processes,
     signals...

  Current is not a real variable; it is defined as:


  static inline struct task_struct * get_current(void) {
     struct task_struct *current;
     __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
     return current;
  }
  #define current get_current()


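  The lines above take the value of the ''esp'' register (stack
  pointer), clear its low 13 bits (8191 = 8 KB - 1) and make the result
  available like a variable. The result is the base of the current 8 KB
  kernel stack area, which is where our task_struct structure is stored.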
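
  The masking step can be reproduced with ordinary integers. The sketch
  below is only an arithmetic illustration: demo_stack_base is a
  hypothetical name, and the addresses passed to it are made-up sample
  values, not real stack pointers.

```c
#include <assert.h>

#define DEMO_STACK_SIZE 8192UL      /* 8 KB, matching the ~8191UL mask */

/* Round a stack pointer value down to the base of its 8 KB block. */
unsigned long demo_stack_base(unsigned long esp)
{
    /* the mask trick only works for power-of-two sizes */
    assert((DEMO_STACK_SIZE & (DEMO_STACK_SIZE - 1)) == 0);
    return esp & ~(DEMO_STACK_SIZE - 1);
}
```

  For instance, the sample value 0xC0123ABC maps to 0xC0122000: any
  address inside the same 8 KB block yields the same base.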


  From the ''current'' element we can directly access the kernel data
  structures of any other process (ready, stopped or in any other
  state), for example to change its STATE (like an I/O driver does),
  PID, presence in the ready list or blocked list, etc.


  14.2.2.  Registered filesystems


                         ______      _______      ______
  file_systems  ------> | ext2 | -> | msdos | -> | ntfs |
   [fs/super.c]         |______|    |_______|    |______|


  When you use a command like ''modprobe some_fs'' you add a new entry
  to the file systems list, while removing the module (by using
  ''rmmod'') deletes it.


  14.2.3.  Mounted filesystems


                          ______      _______      ______
  mount_hash_table  ---->|   /  | -> | /usr  | -> | /var |
  [fs/namespace.c]       |______|    |_______|    |______|


  When you use the ''mount'' command to add a fs, the new entry is
  inserted in the list, while a ''umount'' command deletes the entry.


  14.2.4.  Registered Network Packet Type


                          ______      _______      ______
       ptype_all  ------>|  ip  | -> |  x25  | -> | ipv6 |
  [net/core/dev.c]       |______|    |_______|    |______|


  For example, if you add support for IPv6 (by loading the relevant
  module) a new entry will be added to the list.


  14.2.5.  Registered Network Internet Protocol


                            ______      _______      _______
  inet_protocol_base ----->| icmp | -> |  tcp  | -> |  udp  |
  [net/ipv4/protocol.c]    |______|    |_______|    |_______|


  Other packet types (like IPv6) also have many internal protocols, each
  in its own list.


                            ______      _______      _______
  inet6_protos ----------->|icmpv6| -> | tcpv6 | -> | udpv6 |
  [net/ipv6/protocol.c]    |______|    |_______|    |_______|


  14.2.6.  Registered Network Device


                            ______      _______      _______
  dev_base --------------->|  lo  | -> |  eth0 | -> |  ppp0 |
  [drivers/net/Space.c]    |______|    |_______|    |_______|


  14.2.7.  Registered Char Device


                            ______      _______      ________
  chrdevs ---------------->|  lp  | -> | keyb  | -> | serial |
  [fs/devices.c]           |______|    |_______|    |________|


  ''chrdevs'' is not a pointer to a real list; it is a standard vector.


  14.2.8.  Registered Block Device


                            ______      ______      ________
  bdev_hashtable --------->|  fd  | -> |  hd  | -> |  scsi  |
  [fs/block_dev.c]         |______|    |______|    |________|


  15.  Glossary

  16.  Links

  Official Linux kernels and patches download site
  <http://www.kernel.org>


  Great documentation about Linux Kernel
  <http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html>


  Official Kernel Mailing list
  <http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html>


  Linux Documentation Project Guides <http://www.tldp.org/guides.html>