LDP/LDP/guide/docbook/sag/ch07.sgml

<chapter id="system-monitoring">
<title>System Monitoring</title>

	<blockquote><para><quote>That's Hall Monitor to you!</quote>Spongebob
	Squarepants</para></blockquote>

	<para>One of the most important responsibilities a system administrator
	has, is monitoring their systems.  As a system administrator you'll need
	the ability to find out what is happening on your system at any given
	time.  Whether it's the percentage of system's resources currently used,
	what commands are being run, or who is logged on.  This chapter will cover
	how to monitor your system, and in some cases, how to resolve problems
	that may arise.</para>

	<para>When a performance issue arises, there are 4 main areas to consider:
	CPU, Memory, Disk I/O, and Network.  The ability to determine where the
	bottleneck is can save you a lot of time.</para>

<sect1 id="system-resources">
<title>System Resources</title>
	<para>Being able to monitor the performance of your system
        is essential.  If system resources become too low it can cause a lot of
	problems.  System resources can be taken up by individual users, or by
	services your system may host such as email or web pages.  The ability to
	know what is happening can help determine whether system upgrades are needed,
	or if some services need to be moved to another machine.</para>

<sect2 id="top">
<title>The <command>top</command> command.</title>
	<para>The most common of these commands is <command>top</command>.
	The <command>top</command> will display a continually updating report
	of system resource usage.

<screen>
<prompt>#</prompt> <userinput>top</userinput>
<computeroutput> 12:10:49  up 1 day,  3:47,  7 users,  load average: 0.23, 0.19, 0.10
125 processes: 105 sleeping, 2 running, 18 zombie, 0 stopped
CPU states:   5.1% user   1.1% system   0.0% nice   0.0% iowait  93.6% idle
Mem:   512716k av,  506176k used,    6540k free,       0k shrd,   21888k buff
Swap: 1044216k av,  161672k used,  882544k free                  199388k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 2330 root      15   0  161M  70M  2132 S     4.9 14.0  1000m   0 X
 2605 weeksa    15   0  8240 6340  3804 S     0.3  1.2   1:12   0 kdeinit
 3413 weeksa    15   0  6668 5324  3216 R     0.3  1.0   0:20   0 kdeinit
18734 root      15   0  1192 1192   868 R     0.3  0.2   0:00   0 top
 1619 root      15   0   776  608   504 S     0.1  0.1   0:53   0 dhclient
    1 root      15   0   480  448   424 S     0.0  0.0   0:03   0 init
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kapmd
    4 root      35  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU0
    9 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    5 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kswapd
   10 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kupdated
   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
   15 root      15   0     0    0     0 SW    0.0  0.0   0:01   0 kjournald
   81 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 khubd
 1188 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kjournald
 1675 root      15   0   604  572   520 S     0.0  0.1   0:00   0 syslogd
 1679 root      15   0   428  376   372 S     0.0  0.0   0:00   0 klogd
 1707 rpc       15   0   516  440   436 S     0.0  0.0   0:00   0 portmap
 1776 root      25   0   476  428   424 S     0.0  0.0   0:00   0 apmd
 1813 root      25   0   752  528   524 S     0.0  0.1   0:00   0 sshd
 1828 root      25   0   704  548   544 S     0.0  0.1   0:00   0 xinetd
 1847 ntp       15   0  2396 2396  2160 S     0.0  0.4   0:00   0 ntpd
 1930 root      24   0    76    4     0 S     0.0  0.0   0:00   0 rpc.rquotad
</computeroutput>
</screen></para>

	<para>The top portion of the report lists information such as
	the system time, uptime, CPU usage, physical and swap memory usage,
	and number of processes.  Below that is a list of the processes sorted
	by CPU utilization.</para>

	<para>You can modify the output of <command>top</command> while
	it is running.  If you hit an <option>i</option>, top will no longer
	display idle processes.  Hit <option>i</option> again to see them
	again.  Hitting <option>M</option> will sort by memory usage,
	<option>S</option> will sort by how long they processes have been
	running, and <option>P</option> will sort by CPU usage again.</para>

	<para>In addition to viewing options, you can also modify processes
	from within the <command>top</command> command.  You can use
	<option>u</option> to view processes owned by a specific user,
	<option>k</option> to kill processes, and <option>r</option> to
	renice them.</para>

	<para>For more in-depth information about processes you can look in
	the <filename>/proc</filename> filesystem.  In the <filename>/proc</filename>
	filesystem you will find a series of sub-directories with numeric names.
	These directories are associated with the processes ids of currently
	running processes.  In each directory you will find a series of files
	containing information about the process.</para>

	<para>YOU MUST TAKE EXTREME CAUTION TO NOT MODIFY THESE FILES, DOING
	SO MAY CAUSE SYSTEM PROBLEMS!</para>
</sect2>

<sect2 id="iostat">
<title>The <command>iostat</command> command.</title>
	<para>The <command>iostat</command> will display the current CPU load
	average and disk I/O information.  This is a great command to monitor
	your disk I/O usage.

<screen>
<prompt>#</prompt> <userinput>iostat</userinput>
<computeroutput>Linux 2.4.20-24.9 (myhost)       12/23/2003

avg-cpu:  %user   %nice    %sys   %idle
          62.09    0.32    2.97   34.62

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            2.22        15.20        47.16    1546846    4799520
</computeroutput>
</screen>

	For 2.4 kernels the devices is names using the device's major
	and minor number.  In this case the device listed is <filename>
	/dev/hda</filename>.  To have <command>iostat</command> print this
	out for you, use the <option>-x</option>.
<screen>
<prompt>#</prompt> <userinput>iostat -x</userinput>
<computeroutput>Linux 2.4.20-24.9 (myhost)       12/23/2003

avg-cpu:  %user   %nice    %sys   %idle
          62.01    0.32    2.97   34.71

Device:  rrqm/s wrqm/s r/s  w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
/dev/hdc   0.00   0.00 .00 0.00   0.00   0.00  0.00  0.00     0.00     2.35  0.00  0.00 14.71
/dev/hda   1.13   4.50 .81 1.39  15.18  47.14  7.59 23.57    28.24     1.99 63.76 70.48 15.56
/dev/hda1  1.08   3.98 .73 1.27  14.49  42.05  7.25 21.02    28.22     0.44 21.82  4.97  1.00
/dev/hda2  0.00   0.51 .07 0.12   0.55   5.07  0.27  2.54    30.35     0.97 52.67 61.73  2.99
/dev/hda3  0.05   0.01 .02 0.00   0.14   0.02  0.07  0.01     8.51     0.00 12.55  2.95  0.01
</computeroutput>
</screen>
	<para>The <command>iostat</command> man page contains a detailed
	explanation of what each of these columns mean.</para>

</sect2>

<sect2 id="ps">
<title>The <command>ps</command> command</title>
	<para>The <command>ps</command> will provide you a list of
	processes currently running.  There is a wide variety of options
	that this command gives you.</para>

	<para>A common use would be to list all processes currently running.
	To do this you would use the <command>ps -ef</command> command.
	(Screen output from this command is too large to include, the following
	is only a partial output.)

<screen>
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Dec22 ?        00:00:03 init
root         2     1  0 Dec22 ?        00:00:00 [keventd]
root         3     1  0 Dec22 ?        00:00:00 [kapmd]
root         4     1  0 Dec22 ?        00:00:00 [ksoftirqd_CPU0]
root         9     1  0 Dec22 ?        00:00:00 [bdflush]
root         5     1  0 Dec22 ?        00:00:00 [kswapd]
root         6     1  0 Dec22 ?        00:00:00 [kscand/DMA]
root         7     1  0 Dec22 ?        00:01:28 [kscand/Normal]
root         8     1  0 Dec22 ?        00:00:00 [kscand/HighMem]
root        10     1  0 Dec22 ?        00:00:00 [kupdated]
root        11     1  0 Dec22 ?        00:00:00 [mdrecoveryd]
root        15     1  0 Dec22 ?        00:00:01 [kjournald]
root        81     1  0 Dec22 ?        00:00:00 [khubd]
root      1188     1  0 Dec22 ?        00:00:00 [kjournald]
root      1675     1  0 Dec22 ?        00:00:00 syslogd -m 0
root      1679     1  0 Dec22 ?        00:00:00 klogd -x
rpc       1707     1  0 Dec22 ?        00:00:00 portmap
root      1813     1  0 Dec22 ?        00:00:00 /usr/sbin/sshd
ntp       1847     1  0 Dec22 ?        00:00:00 ntpd -U ntp
root      1930     1  0 Dec22 ?        00:00:00 rpc.rquotad
root      1934     1  0 Dec22 ?        00:00:00 [nfsd]
root      1942     1  0 Dec22 ?        00:00:00 [lockd]
root      1943     1  0 Dec22 ?        00:00:00 [rpciod]
root      1949     1  0 Dec22 ?        00:00:00 rpc.mountd
root      1961     1  0 Dec22 ?        00:00:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root      2057     1  0 Dec22 ?        00:00:00 /usr/bin/spamd -d -c -a
root      2066     1  0 Dec22 ?        00:00:00 gpm -t ps/2 -m /dev/psaux
bin       2076     1  0 Dec22 ?        00:00:00 /usr/sbin/cannaserver -syslog -u bin
root      2087     1  0 Dec22 ?        00:00:00 crond
daemon    2195     1  0 Dec22 ?        00:00:00 /usr/sbin/atd
root      2215     1  0 Dec22 ?        00:00:11 /usr/sbin/rcd
weeksa    3414  3413  0 Dec22 pts/1    00:00:00 /bin/bash
weeksa    4342  3413  0 Dec22 pts/2    00:00:00 /bin/bash
weeksa   19121 18668  0 12:58 pts/2    00:00:00 ps -ef
</screen>
</para>

	<para>The first column shows who owns the process.  The second
	column is the process ID.  The Third column is the parent process
	ID.  This is the process that generated, or started, the process.
	The forth column is the CPU usage (in
	percent). The fifth column is the start time, of date if the process
	has been running long enough. The sixth column is the tty associated
	with the process, if applicable. The seventh column is the cumulitive
	CPU usage (total amount of CPU time is has used while running). The
	eighth column is the command itself.</para>

	<para>With this information you can see exacly what is running on
	your system and kill run-away processes, or those that are causing
	problems.</para>
</sect2>

<sect2 id="vmstat">
<title>The <command>vmstat</command> command</title>
	<para>The <command>vmstat</command> command  will provide a report
	showing statistics for system processes, memory, swap,
	I/O, and the CPU.  These statistics are generated using data from the
	last time the command was run to the present.  In the case of the
	command never being run, the data will be from the last reboot until
	the present.<para>

<screen>
<prompt>#</prompt> <userinput>vmstat</userinput>
<computeroutput>
   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 0  0  0 181604  17000  26296 201120    0    2     8    24  149     9 61  3 36
</computeroutput>
</screen></para>

	<para>The following was taken from the
	<command>vmstat</command> man page.</para>

	<blockquote><para><literallayout>
FIELD DESCRIPTIONS
Procs
    r: The number of processes waiting for run time.
    b: The number of processes in uninterruptable sleep.
    w: The number of processes swapped out but otherwise runnable.  This
       field is calculated, but Linux never desperation swaps.

Memory
    swpd: the amount of virtual memory used (kB).
    free: the amount of idle memory (kB).
    buff: the amount of memory used as buffers (kB).

Swap
    si: Amount of memory swapped in from disk (kB/s).
    so: Amount of memory swapped to disk (kB/s).

IO
    bi: Blocks sent to a block device (blocks/s).
    bo: Blocks received from a block device (blocks/s).

System
    in: The number of interrupts per second, including the clock.
    cs: The number of context switches per second.

CPU
    These are percentages of total CPU time.
    us: user time
    sy: system time
    id: idle time</literallayout>
</para></blockquote>
</sect2>

<sect2 id="lsof">
<title>The <command>lsof</command> command</title>
	<para>The <command>lsof</command> command will print out a list of
	every file that is in use.  Since Linux considers everythihng a file,
	this list can be very long. However, this command
	can be useful in diagnosing problems.  An example of this is if you wish
	to unmount a filesystem, but you are being told that it is in use.  You
	could use this command and <command>grep</command> for the name of the
	filesystem to see who is using it.</para>

	<para>Or suppose you want to see all files in use by a particular process.
	To do this you would use <command>lsof -p -processid-</command>.</para>

</sect2>

<sect2 id="more-utils">
<title>Finding More Utilities</title>
	<para>To learn more about what command line tools are available, Chris
	Karakas has wrote a reference guide titled <ulink
	url="http://www.tldp.org/LDP/GNU-Linux-Tools-Summary/html/GNU-Linux-Tools-Summary.html"> GNU/Linux
	Command-Line Tools Summary</ulink>.  It's a good resource for learning
	what tools are out there and how to do a number of tasks.</para>
</sect2>

<sect1 id="fs-usage">
<title>Filesystem Usage</title>

	<para>Many reports are currently talking about how cheap storage has
	gotten, but if you're like most of us it isn't cheap enough.  Most of
	us have a limited amount of space, and need to be able to monitor it
	and control how it's used.</para>


<sect2 id="df">
<title>The df command</title>
	<para>The <command>df</command> is the simplest tool available to
	view disk usage.  Simply type in <command>df</command> and you'll
	be shown disk usage for all your mounted filesystems in 1K blocks

<screen>
<prompt>user@server:~> <prompt><userinput>df</userinput>
<computeroutput>
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              5242904    759692   4483212  15% /
tmpfs                   127876         8    127868   1% /dev/shm
/dev/hda1               127351     33047     87729  28% /boot
/dev/hda9             10485816     33508  10452308   1% /home
/dev/hda8              5242904    932468   4310436  18% /srv
/dev/hda7              3145816     32964   3112852   2% /tmp
/dev/hda5              5160416    474336   4423928  10% /usr
/dev/hda6              3145816    412132   2733684  14% /var
</computeroutput>
</screen></para>

	<para>You can also use the <command>-h</command> to see the output in
	"human-readable" format.  This will be in K, Megs, or Gigs depending
	on the size of the filesystem.  Alternately, you can also use the
	<command>-B</command> to specify block size.</para>

	<para>In addition to space usage, you could use the
	<command>-i</command> option to view the number of used and available
	inodes.

<screen>
<prompt>user@server:~> </prompt><userinput>df -i</userinput>
<computeroutput>
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/hda3                  0       0       0    -  /
tmpfs                  31969       5   31964    1% /dev/shm
/dev/hda1              32912      47   32865    1% /boot
/dev/hda9                  0       0       0    -  /home
/dev/hda8                  0       0       0    -  /srv
/dev/hda7                  0       0       0    -  /tmp
/dev/hda5             656640   26651  629989    5% /usr
/dev/hda6                  0       0       0    -  /var
</computeroutput>
</screen></para>
</sect2>

<sect2 id="du">
<title>The du command</title>
	<para>Now that you know how much space has been used on a filesystem
	how can you find out where that data is?  To view usage by a directory
	or file you can use <command>du</command>.  Unless you specify a
	filename <command>du</command> will act recursively.  For example:

<screen>
<prompt>user@server:~> </prompt><userinput>du file.txt</userinput>
<computeroutput>
1300    file.txt
</computeroutput>
</screen>

	Or like the <command>df</command> I can use the <command>-h</command>
	and get the same output in "human-readable" form.

<screen>
<prompt>user@server:~> </prompt><userinput>du -h file.txt</userinput>
<computeroutput>
1.3M     file.txt
</computeroutput>
</screen>
</para>

	<para>Unless you specify a filename <command>du</command> will act
	recursively.

<screen>
<prompt>user@server:~> </prompt><userinput>du -h /usr/local</userinput>
<computeroutput>
4.0K    /usr/local/games
16K     /usr/local/include/nessus/net
180K    /usr/local/include/nessus
208K    /usr/local/include
62M     /usr/local/lib/nessus/plugins/.desc
97M     /usr/local/lib/nessus/plugins
164K    /usr/local/lib/nessus/plugins_factory
97M     /usr/local/lib/nessus
12K     /usr/local/lib/pkgconfig
2.7M    /usr/local/lib/ladspa
104M    /usr/local/lib
112K    /usr/local/man/man1
4.0K    /usr/local/man/man2
4.0K    /usr/local/man/man3
4.0K    /usr/local/man/man4
16K     /usr/local/man/man5
4.0K    /usr/local/man/man
</computeroutput>
</screen></para>

	<para>If you just want a summary of that directory you can use the
	<command>-s</command> option.

<screen>
<prompt>user@server:~> </prompt><userinput>du -hs /usr/local</userinput>
<computeroutput>
210M    /usr/local
</computeroutput>
</screen></para>
</sect2>


<sect2 id="quotas">
<title>Quotas</title>
	<para>For more information about quotas you can read
	<ulink url="http://www.tldp.org/HOWTO/Quota.html">The Quota HOWTO
	</ulink>.
</sect2>

</sect1>

<sect1 id="monitoring-users">
<title>Monitoring Users</title>

	<blockquote><para>Just because you're paranoid doesn't mean they
	AREN'T out to get you... Source Unknown</para></blockquote>

	<para>From time to time there are going to be occasions where you will
	want to know exactly what people are doing on your system.  Maybe you
	notice that a lot of RAM is being used, or a lot of CPU activity.
	You are going to want to see who is on the system, what they are
	running, and what kind of resources they are using.</para>

<sect2 id="who">
<title>The who command</title>
	<para>The easiest way to see who is on the system is to do a
	<command>who</command> or <command>w</command>.  The -->
	<command>who</command> is a simple tool that lists out who is logged -->
	on the system and what port or terminal they are logged on at.

<screen>
<prompt>user@server:~></prompt> <userinput> who</userinput>
<computeroutput>
bjones   pts/0        May 23 09:33
wally    pts/3        May 20 11:35
aweeks   pts/1        May 22 11:03
aweeks   pts/2        May 23 15:04
</computeroutput>
</screen></para>
</sect2>

<sect2 id="ps-u">
<title>The ps command -again!</title>

	<para>In the previous section we can see that user aweeks is logged
	onto both <filename>pts/1</filename> and <filename>pts/2</filename>,
	but what if we want to see what they are doing?  We could to a
	<command>ps -u aweeks</command> and get the following output

<screen>
<prompt>user@server:~> </prompt><userinput>ps -u aweeks</userinput>
<computeroutput>
20876 pts/1    00:00:00 bash
20904 pts/2    00:00:00 bash
20951 pts/2    00:00:00 ssh
21012 pts/1    00:00:00 ps
</computeroutput>
</screen>

	From this we can see that the user is doing a <command>ps</command>
	<command>ssh</command>.</para>

	<para>This is a much more consolidated use of the
	<command>ps</command> than discussed previously.<para>
</sect2>

<sect2 id="w">
<title>The w command</title>

	<para>Even easier than using the <command>who</command> and
	<command>ps -u</command> commands is to use the <command>w</command>.
	<command>w</command> will print out not only who is on the system,
	but also the commands they are running.

<screen>
<prompt>user@server:~> </prompt><userinput>w</userinput>
<computeroutput>
aweeks   :0        09:32   ?xdm?  30:09   0.02s -:0
aweeks   pts/0     09:33    5:49m  0.00s  0.82s kdeinit: kded
aweeks   pts/2     09:35    8.00s  0.55s  0.36s vi sag-0.9.sgml
aweeks   pts/1     15:03   59.00s  0.03s  0.03s /bin/bash
</computeroutput>
</screen></para>

	<para>From this we can see that I have a <command>kde</command> session
	running, I'm working in this document :-), and have another terminal
	open sitting idle at a bash prompt.</para>
</sect2>

<sect2 id="skill">
<title>The skill command</title>

	<para> To Be Added</para>

</sect2>

<sect2 id="nice">
<title>nice and renice</title>

	<para>To Be Added</para>

</sect2>


</sect1>
</chapter>