
<chapter id="apps">
  <title>Application Tuning</title>
  <para>
    While one can tune hardware and the OS and get great measurements from it,
    most applications for Linux have their own rules for improving performance.
    Tuning a hard drive for an application that uses a lot of memory will not
    improve the speed of that application, whereas investing in more memory will
    create a vast improvement in speed.  Let us examine some of the applications
    for Linux and how you can tune a box specifically for it.
  </para>

  <section id="appssquid">
    <title>Squid Proxy Server</title>
    <para>
      Squid proxy server has two common uses.  The first is as a proxy
      cache for internal clients accessing the external web.  The second is
      as a reverse proxy, caching content from internal web servers.
    </para>
    <section id="appssquidcache">
      <title>Squid as a proxy cache</title>
      <para>
        Squid as a proxy cache stores large numbers of web pages, enabling
	them to be sent to clients upon request instead of requiring
	another internet fetch.  Squid uses large amounts of disk space
        for the cache, but by nature does not access any one cached object
        much more frequently than any other.
      </para>
      <para>
        The first thing to avoid is a disk bottleneck.  In a high-usage
        environment, distribute the cache across as many spindles as is
        practical.  Since the cache data is expendable, in that it is
        flushed out as a matter of course and can always be fetched
        again, the additional overhead of RAID 5 or another
        fault-tolerant multi-spindle system is not necessary.  RAID 0
        striping provides the fastest access when no fault tolerance is
        required.
      </para>
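      <para>
        As an alternative to striping, Squid can spread the cache over
        several disks itself when it is given one
        <literal>cache_dir</literal> line per spindle.  The following
        <filename>squid.conf</filename> fragment is only a sketch.  It
        assumes a Squid version whose <literal>cache_dir</literal> takes a
        storage type such as <literal>ufs</literal>, and the mount points
        and sizes are hypothetical values to adjust for your hardware.
      </para>
      <programlisting>
# One cache_dir per physical disk (hypothetical mount points).
# Arguments: storage type, directory, size in MB,
# number of first-level and second-level subdirectories.
cache_dir ufs /cache1 8000 16 256
cache_dir ufs /cache2 8000 16 256
cache_dir ufs /cache3 8000 16 256
      </programlisting>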
      <para>
        The next thing to avoid is a memory bottleneck.  There should
        be absolutely no OS paging with Squid; Squid will page enough on
        its own.  Squid has in effect its own virtual memory system, with
        thousands of pages on disk and tens or hundreds in memory.  If the
        OS is itself paging out the tens or hundreds that Squid holds in
        memory, your performance will be horrible.
      </para>
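      <para>
        One <filename>squid.conf</filename> setting that influences how
        much memory Squid uses is <literal>cache_mem</literal>.  The
        fragment below is a sketch with an assumed value, not a
        recommendation.  Note that <literal>cache_mem</literal> does not
        cap the total size of the Squid process, so the value should
        leave plenty of real memory free for the rest of Squid and for
        the OS.
      </para>
      <programlisting>
# Memory set aside for in-transit and frequently requested objects.
# This is not a limit on the total Squid process size, so keep it
# well below the amount of physical RAM (64 MB is an assumed value).
cache_mem 64 MB
      </programlisting>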
      <para>
        Network performance will probably not be a big deal.  If your
        byte hit rate is 25% and you are maxing out a T1 (1.544 Mbps),
        the T1 is carrying the 75% of bytes that miss the cache, so you
        are still only putting around 1.544 / 0.75, or about 2 Mbps, out
        your Ethernet connection.
      </para>
    </section> <!-- appssquidcache -->
    <section id="appssquidproxy">
      <title>Squid as a reverse proxy</title>
      <para>
        As a reverse proxy, the total set of web pages that Squid is
	proxying is reduced greatly.  It must store only the limited
	number of pages behind it on your web servers, as opposed to
	the entire internet.  Since disk storage requirements are down
	by an order of magnitude, so are memory requirements.
      </para>
      <para>
	However, one approach to this situation would be to build up
	real memory until the entire working set of pages could be
	stored, with no paging required.
      </para>
      <para>
        In this case we're trading large store/infrequent access for a
	smaller store/frequent access.  We want to optimize the byte
	hit rate in Squid with real memory to reduce disk operations.
	Again, there should be NO OS PAGING.  If any DNS lookups are
	required, it might be a good idea to run a DNS cache locally.
	Otherwise, this shouldn't be too difficult to tune to get
	good performance.
      </para>
    </section> <!-- appssquidproxy -->
  </section> <!-- appssquid -->

  <section id="appsmail">
    <title>SMTP and Mail Delivery</title>
    <para>
      While this section addresses sendmail directly, most of the
      concepts should apply to any mail delivery system.
    </para>

    <section id="appsmailsendmail">
      <title>Sendmail</title>
      <para>
        <indexterm>
	  <primary>
	    sendmail
	  </primary>
	</indexterm>
        A sendmail system is the application most likely to be
        network bottlenecked, but that bottleneck will also
        most likely be outside the system itself.  There are no
	special tricks since a typical sendmail system will be
	fairly balanced against typical hardware.  Depending
	on the size of the mail store, some type of RAID
	system may be implemented, but the mail store data
	is very valuable so use a RAID 5 or other fault-tolerant
	option.  If the store does get very large, you might
	also investigate the benefits of multiple caching
	layers down to the card.
      </para>
      <para>
        Sendmail itself is very dependent on DNS lookups for its
	operation, and may be doing ident lookups as well.  As such,
	some tricks on the network can really help.  Linux 2.4 kernels
        offer a lot of QoS features; these can be used to prioritize
        DNS and ident lookups over bulk traffic, so that POP, IMAP, and
        SMTP are not left blocked waiting on a DNS lookup.  I would
	also strongly suggest implementing a DNS cache locally,
	maybe even on the sendmail server itself.
      </para>
    </section> <!-- appsmailsendmail -->
    <section id="appsmaildelivery">
      <title>Mail Delivery (POP and IMAP)</title>
        <para>
	  <indexterm>
	    <primary>
	      Post Office Protocol (POP)
	    </primary>
	  </indexterm>
	  <indexterm>
	    <primary>
	      IMAP
	    </primary>
	  </indexterm>
	  Memory and processor requirements will depend on the number
          of users and whether you are providing POP or IMAP services,
	  or only SMTP.  POP and IMAP require logins, and each login
	  will spawn another POP or IMAP process.  If concurrent usage
	  of these services is high, watch for OS paging.
	</para>
	<para>
	  Using POP for delivery has a very low memory and hard drive
	  footprint.  In POP, e-mail is typically downloaded from the
	  server to the client.  The POP server merely handles authentication
	  and moving the data to the client.  When a mail message is sent
	  to the client, it is typically removed from the server.
	</para>
	<para>
	  Using IMAP puts much more stress on the system.  IMAP will copy
	  an entire mailbox into memory, and perform sorting and other
	  functions on the mailbox.  When a client requests a message, the
	  message is sent to the client, but is not removed from the server.
          All mail messages therefore remain on the server until the client
          explicitly deletes them.  The advantage of this is that a user can
          have IMAP clients
	  on multiple machines and still be able to access all the messages.
	  Since most of the processing is offloaded to the server, this makes
	  IMAP a great mechanism for light clients or wireless devices
	  that do not have much memory or processing power.
	</para>
	<para>
	  Since the data on a POP or IMAP server are critical, you should
	  be using RAID-5 or RAID-1 to protect the data and keep
	  the server running.  POP has low memory requirements, while
	  IMAP has very high memory and disk requirements, and both
	  will likely increase as more users are introduced to the system
	  and they start working with mailboxes in the thousands of messages.
	</para>
	<para>
	  In either case, using SSL on top of POP or IMAP will also
	  increase the CPU load of the protocols, but will have minimal
          effect on disk or memory usage.
	</para>
	<para>
	  It is not advisable
	  to put a shell account on the same machine as one running IMAP
	  or POP, due to the CPU and memory used by these protocols.
          Nor is it advisable to use NFS to export a mailbox to a shell
          machine, as file locking over NFS is unreliable and a mailbox
          can be corrupted if sendmail tries to deliver a message while
          the user is deleting another message from the same mailbox.
	</para>
    </section> <!-- appsmaildelivery -->
  </section> <!-- appsmail -->

  <section id="appsdatabase">
    <title>MySQL, PostgreSQL, other databases</title>
      <para></para>
  </section> <!-- appsdatabase -->

  <section id="appsfirewall">
    <title>Routers and firewalls</title>
      <para></para>
  </section> <!-- appsfirewall -->

  <section id="appsapache">
    <title>Apache web server</title>
    <para>
      Apache works best when it is delivering static content.  But if you're
      only delivering static content, you should consider using
      <application>Tux</application>, the Linux kernel HTTP server.
    </para>
    <para>
      There are a number of small quick-tips you can use to increase the
      throughput and responsiveness of Apache.
    </para>
    <para>
      One such tip is to turn off DNS resolving in Apache.  When Apache receives
      a request from a remote client, it will request a reverse DNS lookup on
      that client's IP address.  While the request winds its way through
      the DNS hierarchy, Apache is stuck waiting for the answer before it can
      send the page back to the requesting client.  With this feature turned
      off, Apache will merely log the IP address without a name in its log
      file, and then deliver the page.  A drawback is that if the machine is
      cracked through the web server, you will have to do your own reverse DNS
      lookup to find out what country and domain the attacking machine is from.
      A way to have the best of both worlds is to run a caching-only name
      server on the local network or on the web server box itself.  This allows
      Apache to quickly get the DNS information, and store a hostname and
      domain instead of just an IP address.
    </para>
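    <para>
      In <filename>httpd.conf</filename>, the directive that controls this
      behavior is <literal>HostnameLookups</literal>.  The fragment below is
      a minimal sketch.  Apache 1.3 and later already default to
      <literal>Off</literal>, so it is mainly worth checking that the
      setting has not been turned on somewhere in your configuration.  If
      names are needed later, the <command>logresolve</command> utility
      that ships with Apache can resolve the addresses in an existing log
      file offline.
    </para>
    <programlisting>
# Log client IP addresses only and skip the reverse DNS lookup
# that would otherwise be made for each request.
HostnameLookups Off
    </programlisting>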
    <para>
      Cutting the size of graphic images is also a way to improve performance.
      Using compressed formats such as JPEG or PNG will cut the size of the
      image file without reducing the quality of the image by much.  Smaller
      files allow more images to go out per second.
    </para>
    <para>
      Instead of reinventing the wheel, see if Apache has a module for the
      feature you need.  Modules for authentication against SMB, NIS, and LDAP
      servers all exist for Apache, and you will not have to add the
      functionality to your server side scripts.
    </para>
    <para>
      Make sure your database is tuned properly.  This has been covered in
      <xref linkend="appsdatabase">, but a poorly tuned database can slow
      everything down.  If you expect heavy usage, do not put the database and
      web server on the same machine.  One machine running the database, and
      another running the web server, connected by a high-speed link such as
      gigabit Ethernet, will allow both machines to operate quickly.
    </para>
    <para>
      Since so many web sites are using server-based applications and CGI, it
      makes sense to take a look at the server interpreters being used.  If you
      use a language like Java that is known to be pretty slow, you can expect
      lower performance as compared to using applications compiled in C.
    </para>
    <para>
      But Java has its own advantages over C.  The biggest is that Java is
      designed to be much more portable than C is.  Since Java is an
      object-oriented language, its code reuse is much better, allowing
      developers to create applications quickly.
    </para>
    <para>
      There are three other major server languages used by Apache.  These are
      Perl, Python, and PHP.  All three are included in most Linux
      distributions, and if they're not on your system, each is easy to install
      and get working.
    </para>
    <section id="appsapacheperl">
      <title>Perl</title>
      <para>
        Perl is pretty much the granddaddy of CGI and server programs.  It has
	excellent string manipulation abilities, plenty of libraries to let you
	create web applications quickly, and has drivers for just about anything
        you want to do.  It is known as the programmer's Swiss Army knife, and
	for good reason.
      </para>
      <para>
        The downside to perl is that it is an interpreted language, meaning each
	time a perl application is run, the perl interpreter has to be loaded,
	the application has to be loaded into memory, and the interpreter has to
	compile the application.  This can create a heavy load on a system.
      </para>
      <para>
        <indexterm>
	  <primary>
	    mod_perl
	  </primary>
	</indexterm>
        The best way around this is to use the mod_perl library with
	<application>apache</application>.  This library first loads the perl
	interpreter so it is always in memory and ready to run.  It also caches
	frequently-used perl CGI scripts in memory already in a precompiled
        state.  When a perl application starts up, much of the slow work
        (loading and compiling) has already been done.
        The cost to the sysadmin and programmer is that mod_perl takes up a lot
	of memory, and its memory use will increase as more perl scripts are
	added to a web server.
      </para>
      <note>
        <para>
	  Adding more memory and CPU speed to a machine running mod_perl will
	  improve its performance.
	</para>
      </note>
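      <para>
        As a rough sketch of what this looks like in practice, the
        <filename>httpd.conf</filename> fragment below runs existing perl
        CGI scripts under mod_perl's <literal>Apache::Registry</literal>
        handler.  It assumes mod_perl 1.x with Apache 1.3; the URL prefix
        and directory are hypothetical.  Under mod_perl 2 the handler names
        differ (<literal>PerlResponseHandler ModPerl::Registry</literal>).
      </para>
      <programlisting>
# Load the registry module once, when the server starts.
PerlModule Apache::Registry

# Hypothetical URL prefix and script directory.
Alias /perl/ /var/www/perl/

&lt;Location /perl&gt;
    # Hand requests to the embedded perl interpreter, which keeps
    # compiled scripts cached in memory between requests.
    SetHandler perl-script
    PerlHandler Apache::Registry
    Options +ExecCGI
    PerlSendHeader On
&lt;/Location&gt;
      </programlisting>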
    </section> <!-- appsapacheperl -->
    <section id="appsapachepython">
      <title>Python</title>
      <para>
        Python was one of the first of the interpreted languages to implement
        object-oriented programming, allowing programmers to reuse and share
        code.  It also learned a bit from perl and its interpreted nature by
        saving compiled python code in a precompiled format.  This format is
        not portable like Java, but it saves much of the grunt work of
        compiling a python application.  The python interpreter then spends
        less time compiling the application each time it's run.
      </para>
      <para>
        Programmers may cringe a bit when they start writing in python.  Syntax
        is much more strict than C or Perl, using indentation and newlines to
        delimit blocks of code.  And, like perl, python requires the
        interpreter to load
	into memory and run.
      </para>
      <para>
        There is also a mod_python for apache that operates similarly to mod_perl.
	More memory and CPU speed will be a great improvement.
      </para>
    </section> <!-- appsapachepython -->
    <section id="appsapachephp">
      <title>PHP</title>
      <para>
        PHP is a relative newcomer to the languages listed here.  But PHP was
	designed for server side applications.  The interpreter is a module in
        <application>apache</application>, and PHP code can exist within an HTML
	page, allowing authors to combine the web page and application into one
	file.
      </para>
      <para>
        PHP has two downsides from a programmer's perspective.  First, PHP
        is still relatively new.  It is not quite as seasoned as Perl or Python,
        each of which has over ten years' worth of development behind it.
        Second, PHP does not have modules that can be easily loaded and used
        like Perl and Python have.  If your implementation of PHP does not have
        the graphics libraries installed, you will have to recompile PHP and add
	them in.  Neither of these should prevent you from giving a serious look
	at using PHP for your server side application.
      </para>
    </section> <!-- appsapachephp -->
  </section> <!-- appsapache -->

  <section id="appssamba">
    <title>Samba File Sharing</title>
      <para></para>
  </section> <!-- appssamba -->

  <section id="appsnfs">
    <title>Network File System</title>
      <para></para>
  </section> <!-- appsnfs -->
</chapter> <!-- apps -->