LDP/LDP/howto/docbook/Spam-Filtering-for-MX/chapter-background.xml

<?xml version='1.0' encoding='ISO-8859-1'?>
<chapter id="background" xreflabel="Background">
  <?dbhtml filename="background.html"?>
  <title>Background</title>


  <abstract>
    <para>
      Here we cover the advantages of filtering mail during an
      incoming SMTP transaction, rather than following the more
      conventional approach of offloading this task to the mail
      routing and delivery stage.  We also provide a brief
      introduction to the SMTP transaction.
    </para>
  </abstract>

  <section id="whysmtptime" >
    <?dbhtml filename="whysmtptime.html"?>
    <title>Why Filter Mail During the SMTP Transaction?</title>

    <section id="statusquo">
      <title>Status Quo</title>

      <para>
	If you receive spam, raise your hands.  Keep them up.
      </para>

      <para>
	If you receive computer virii or other malware, raise your
	hands too.
      </para>

      <para>
	If you receive bogus <xref linkend="dsn"/>s (DSNs), such as
	<quote>Message Undeliverable</quote>, <quote>Virus
	found</quote>, <quote>Please confirm delivery</quote>, etc,
	related to messages you never sent, raise your hands as well.
	This is known as <xref linkend="colspam"/>.
      </para>

      <para>
	This last form is particularly troublesome, because it is
	harder to weed out than <quote>standard</quote> spam or
	malware, and because such messages can be quite confusing to
	recipients who do not possess godly skills in parsing message
	headers.  In the case of virus warnings, this often causes
	unnecessary concern on the recipient's end; more generally, a
	common tendency will be to ignore all such messages, thereby
	missing out on legitimate DSNs.
      </para>

      <para>
	Finally, I want those of you who have lost legitimate mail into
	a big black hole - due to misclassification by spam or virus
	scanners - to lift your feet.
      </para>

      <para>
	If you were standing before and are still standing, I suggest
	that you may not be fully aware of what is happening to your
	mail.  If you have been doing any type of spam filtering, even
	by manually moving mails to the trash can in your mail reader,
	let alone by experimenting with primitive filtering techniques
	such as DNS blacklists (SpamHaus, SPEWS, SORBS...), chances
	are that you have lost some valid mail.
      </para>
    </section>

    <section id="cause">
      <title>The Cause</title>

      <para>
	Spam, just like many other artifacts of greed, is a social
	disease.  Call it affluenza, or whatever you like; lower life
	forms seek to destroy a larger ecosystem, and if successful,
	will actually end up ruining their own habitat in the end.
      </para>

      <para>
	Larger social issues and philosophy aside: You - the mail system
	administrator - face the very concrete and real life dilemma of
	finding a way to deal with all this junk.
      </para>

      <para>
	As it turns out, there are some limitations with the
	conventional way that mail is being processed and delegated by
	the various components of mail transport and delivery
	software.  In a traditional setup, one or more <xref
	linkend="mx"/>(s) accept most or all incoming mail deliveries
	to addresses within a domain.  Often, they then forward the
	mail to one or more internal machines for further processing,
	and/or delivery to the user's mailboxes.  If any of these
	servers discovers that it is unable to perform the requested
	delivery or function, it generates and returns a DSN back to
	the sender address in the original mail.
      </para>

      <para>
	As organizations started deploying spam and virus scanners,
	they often found that the path of least resistance was to work
	these into the message delivery path, as mail is transferred
	from the incoming <xref linkend="mx"/>(s) to internal delivery
	hosts and/or software.  For instance, a common way filter out
	spam is by <emphasis>routing</emphasis> the mail through
	SpamAssassin or other software before it is delivered to a
	user's mailbox, and/or rely on spam filtering capabilities in
	the user's <xref linkend="mua" />.
      </para>

      <para>
	Options for dealing with mail that is classified as spam or
	virus at this point are limited:
      </para>

      <itemizedlist>
	<listitem>
	  <para>
	    You can return a <xref linkend="dsn"/> back to the sender.
	    The problem is that nearly all spam and e-mail borne
	    virii are delivered with faked sender addresses.  If you
	    return this mail, it will invariably go to innocent third
	    parties -- perhaps warning a grandmother in Sweden, who
	    uses Mac OS X and does not know much about computers, that
	    she is infected by the Blaster worm.  In other words, you
	    will be generating <xref linkend="colspam"/>.
	  </para>
	</listitem>

	<listitem>
	  <para>
	    You can drop the message into the bit bucket, without
	    sending any notification back to the sender.  This is an
	    even bigger problem in the case of <xref
	    linkend="falsepos"/>s, because neither the sender nor
	    the receiver will ever know what happened to the message
	    (or in the receiver's case, that it ever existed).
	  </para>
	</listitem>

	<listitem>
	  <para>
	    Depending on how your users access their mail (for
	    instance, if they access it via the IMAP protocol or use a
	    web-based mail reader, but not if they retreive it over
	    POP-3), you may be able to file it into a separate junk
	    folder for them -- perhaps as an option in their account
	    settings.
	  </para>

	  <para>
	    This may be the best of these three options.  Even so, the
	    messages may remain unseen for some time, or simply
	    overlooked as the receiver more-or-less periodically scans
	    through and deletes mail in their <quote>Junk</quote>
	    folder.
	  </para>
	</listitem>
      </itemizedlist>
    </section>


    <section id="solution">
      <title>The Solution</title>

      <para>
	As you would have guessed by now, the <emphasis>One
	True</emphasis> solution to this problem is to do spam and
	virus filtering during the SMTP dialogue from the remote host,
	as the mail is being received by the inbound mail exchanger
	for your domain.  This way, if the mail turns out to be
	undesirable, you can issue a SMTP <emphasis>reject</emphasis>
	response rather than face the dilemma described above.  As a
	result:
      </para>

      <itemizedlist>
	<listitem>
	  <para>
	    You will be able to stop the delivery of most junk mail
	    early in the SMTP transaction, before the actual message
	    data has been received, thus saving you both network
	    bandwidth and CPU processing.
	  </para>
	</listitem>

	<listitem>
	  <para>
	    You will be able to deploy some spam filtering techniques
	    that are not possible later, such as
	    <xref linkend="smtpdelays"/> and
	    <xref linkend="greylisting"/>.
	  </para>
	</listitem>

	<listitem>
	  <para>
	    You will be able to notify the sender in case of a
	    delivery failure (e.g. due to an invalid recipient
	    address) without directly generating <xref
	    linkend="colspam"/>
	  </para>

	  <para>
	    We will discuss how you can avoid causing collateral spam
	    indirectly as a result of rejecting mail forwarded from
	    trusted sources, such as mailing list servers or mail
	    accounts on other sites
	    <footnote>
	      <para>
		Untrusted third party hosts may still generate
		collateral spam if you reject the mail.  However,
		unless that host is an <xref linkend="openproxy"/> or
		<xref linkend="openrelay"/>, it presumably delivers
		mail only from legitimate senders, whose addresses are
		valid.  If it <emphasis>is</emphasis> an Open Proxy or
		SMTP Relay - well, it is better that you reject the
		mail and let it freeze in <emphasis>their</emphasis>
		outgoing mail queue than letting it freeze in yours.
		Eventually, this ought to give the owners of that
		server a clue.
	      </para>
	    </footnote>.
	  </para>
	</listitem>

	<listitem>
	  <para>
	    You will be able to protect yourself against collateral
	    spam from others (such as bogus <quote>You have a
	    virus</quote> messages from anti-virus software).
	  </para>
	</listitem>
      </itemizedlist>

      <para>
	OK, you can lower your hands now.  If you were standing, and
	your feet disappeared from under you, you can now also stand up
	again.
      </para>
    </section>
  </section>


  <section id="goodbadugly" xreflabel="The Good, The Bad, The Ugly">
    <?dbhtml filename="goodbadugly.html"?>
    <title>The Good, The Bad, The Ugly</title>

    <para>
      Some filtering techniques are more suitable for use during the
      SMTP transaction than others.  Some are simply better than
      others.  Nearly all have their proponents and opponents.
    </para>

    <para>
      Needless to say, these controversies extend to the methods
      described here as well.  For instance:
    </para>

    <itemizedlist>
      <listitem>
	<para>
	  Some argue that <xref linkend="dnschecks"/> penalize
	  individual mail senders purely based on their Internet
	  Service Provider (ISP), not on the merits of their
	  particular message.
	</para>
      </listitem>

      <listitem>
	<para>
	  Some point out that ratware traps like <xref
	  linkend="smtpdelays"/> and <xref linkend="greylisting"/> are
	  easily overcome and will be less effective over time, while
	  continuing to degrade the Quality of Service for legitimate
	  mail.
	</para>
      </listitem>

      <listitem>
	<para>
	  Some find that <xref linkend="senderauth"/> like the <xref
	  linkend="spf"/> give ISPs a way to lock their customers in,
	  and do not adequately address users who roam between
	  different networks or who forward their e-mail from one host
	  to another.
	</para>
      </listitem>
    </itemizedlist>

    <para>
      I will steer away from most of these controversies.  Instead, I
      will try to provide a functional description of the various
      techniques available, including their possible side effects, and
      then talk a little about my own experiences using some of them.
    </para>


    <para>
      That said, there are some filtering methods in use today that I
      deliberately omit from this document:
    </para>

    <itemizedlist>
      <listitem>
	<para>
	  Challenge/response systems (like <ulink
	  url="http://tmda.net/">TMDA</ulink>).  These are not
	  suitable for SMTP time filtering, as they rely on first
	  accepting the mail, then returning a confirmation request to
	  the <xref linkend="envfrom"/>.  This technique is therefore
	  outside the scope of this document.
	  <footnote>
	    <para>
	      Personally I do not think challenge/response systems are
	      a good idea in any case.  They generate <xref
	      linkend="colspam"/>, they require special attention for
	      mail sent from automated sources such as monthly bank
	      statements, and they degrade the usability of e-mail as
	      people need to jump through hoops to get in touch with
	      each other.  Many times, senders of legitimate mail will
	      not bother to or know that they need to follow up to the
	      confirmation request, and the mail is lost.
	    </para>
	  </footnote>
	</para>
      </listitem>

      <listitem>
	<para>
	  <xref linkend="bayesian"/>.  These require training specific
	  to a particular user, and/or a particular language.  As
	  such, these too are not normally suitable for use during the
	  SMTP transaction (But see <xref linkend="usersettings"/>).
	</para>
      </listitem>

      <listitem>
	<para>
	  <xref linkend="micropay"/> are not really suitable for
	  weeding out junk mail until all the world's legitimate mail
	  is sent with a virtual <emphasis>postage stamp</emphasis>.
	  (Though in the mean time, they can be used for the opposite
	  purpose - that is, to accept mail carrying the stamp that
	  would otherwise be rejected).
	</para>
      </listitem>
    </itemizedlist>

    <para>
      Generally, I have attempted to offer techniques that are as
      precise as possible, and to go to great lengths to avoid <xref
      linkend="falsepos"/>s.  People's e-mail is important to them,
      and they spend time and effort writing it.  In my view,
      willfully using techniques or tools that reject large amounts of
      legitimate mail is a show of disrespect, both to the people that
      are directly affected and to the Internet as a whole.
      <footnote>
	<para>
	  My view stands in sharp contrast to that of a large number
	  of <quote>spam hacktivists</quote>, such as the maintainers
	  of the <ulink url="http://www.spews.org/">SPEWS</ulink>
	  <link linkend="dnsbl">blacklist</link>.  One of the stated
	  aims of this list is precisely to inflict <xref
	  linkend="coldamage"/> as a means of putting pressure on ISPs
	  to react on abuse complaints.  Listing complaints are
	  typically met with knee-jerk responses such as <quote>bother
	  your ISP, not us</quote>, or <quote>get another ISP</quote>.
	</para>

	<para>
	  Often, these are not viable options.  Consider developing
	  countries.  For that matter, consider the fact that nearly
	  everywhere, broadband providers are regulated monopolies.  I
	  believe that these attitudes illustrate the exact crux of
	  the problem with trusting these groups.
	</para>

	<para>
	  Put plainly, there are much better and more accurate ways
	  available to filter junk mail.
	</para>
      </footnote>

      This is especially true for SMTP-time system wide filtering,
      because end recipients usually have little or no control over
      the criteria being used to filter their mail.
    </para>
  </section>


  <section id="smtpintro" xreflabel="The SMTP Transaction">
    <?dbhtml filename="smtpintro.html"?>
    <title>The SMTP Transaction</title>

    <para>
      SMTP is the protocol that is used for mail delivery on the
      Internet.  For a detailed description of the protocol, please
      refer to <ulink url="http://www.ietf.org/rfc/rfc2821.txt">RFC
      2821</ulink>, as well as Dave Crocker's introduction to
      <ulink url="http://www.brandenburg.com/specifications/draft-crocker-mail-arch-00.htm">Internet Mail Architecture</ulink>.
    </para>

    <para>
      Mail deliveries involve an SMTP transaction between the
      connecting host (client) and the receiving host (server).  For
      this discussion, the connecting host is the peer, and the
      receiving host is your server.
    </para>

    <para>
      In a typical SMTP transaction, the client issues SMTP commands
      such as <command>EHLO</command>, <command>MAIL FROM:</command>,
      <command>RCPT TO:</command>, and <command>DATA</command>.  Your
      server responds to each command with a 3-digit numeric code
      indicating whether the command was accepted
      (<command>2<parameter>xx</parameter></command>), was subject to
      a temporary failure or restriction
      (<command>4<parameter>xx</parameter></command>), or failed
      definitively/permanently
      (<command>5<parameter>xx</parameter></command>), followed by
      some human readable explanation.  A full description of these
      codes is included in
      <ulink url="http://www.ietf.org/rfc/rfc2821.txt">RFC 2821</ulink>.
    </para>

    <para>
      A best case scenario SMTP transaction typically consists of the
      following relevant steps:
    </para>

    <table id="smtpdialogue" frame="all">
      <title>Simple SMTP dialogue</title>

      <tgroup cols="2" align="left" colsep="1" rowsep="1">
	<thead>
	  <row>
	    <entry>Client</entry>
	    <entry>Server</entry>
	  </row>
	</thead>

	<tbody>
	  <row>
	    <entry>
	      <para>
		Initiates a TCP connection to server.
	      </para>
	    </entry>

	    <entry>
	      <para>
		Presents an SMTP banner - that is, a greeting that
		starts with the code <command>220</command> to indicate
		that it is ready to speak SMTP (or usually ESMTP, a
		superset of SMTP):

		<screen>220 <parameter>your.f.q.d.n</parameter> ESTMP...</screen>
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		Introduces itself by way of an Hello command, either
		<command>HELO</command> (now obsolete) or
		<command>EHLO</command>, followed by its own <xref
		linkend="fqdn" />:
		<screen>EHLO <parameter>peers.f.q.d.n</parameter></screen>
	      </para>
	    </entry>

	    <entry>
	      <para>
		Accepts this greeting with a <command>250</command>
		response.  If the client used the
		<emphasis>extended</emphasis> version of the Hello
		command (<command>EHLO</command>), your server knows
		that it is capable of handling multi-line responses,
		and so will normally send back several lines
		indicating the capabilities offered by your server:

<screen>
250-<parameter>your.f.q.d.n</parameter> Hello ...
250-SIZE 52428800
250-8BITMIME
250-PIPELINING
250-STARTTLS
250-AUTH
250 HELP
</screen>
	      </para>

	      <para>
		If the <command>PIPELINING</command> capability is
		included in this response, the client can from this point
		forward issue several commands at once, without waiting
		for the response to each one.
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		Starts a new mail transaction by specifying the
		<xref linkend="envfrom" />:

		<screen>MAIL FROM:&lt;<parameter>sender</parameter>@<parameter>address</parameter>&gt;
		</screen>
	      </para>
	    </entry>

	    <entry>
	      <para>
		Issues a <command>250</command> response to indicate
		that the sender is accepted.
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		Lists the <xref linkend="envto"/>s of the message, one
		at a time, using the command:

		<screen>RCPT TO:&lt;<parameter>receiver</parameter>@<parameter>address</parameter>&gt;</screen>
	      </para>
	    </entry>

	    <entry>
	      <para>
		Issues a response to each command
		(<command>2<parameter>xx</parameter></command>,
		<command>4<parameter>xx</parameter></command>, or
		<command>5<parameter>xx</parameter></command>,
		depending on whether delivery to this recipient was
		accepted, subject to a temporary failure, or
		rejected).
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		Issues a <command>DATA</command> command to indicate
		that it is ready to send the message.
	      </para>
	    </entry>

	    <entry>
	      <para>
		Responds <command>354</command> to indicate that the
		command has been provisionally accepted.
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		Transmits the message, starting with RFC 2822
		compliant header lines (such as:
		<option>From:</option>, <option>To:</option>,
		<option>Subject:</option>, <option>Date:</option>,
		<option>Message-ID:</option>).  The header and the
		body are separated by an empty line.  To indicate the
		end of the message, the client sends a single period
		(".") on a separate line.
	      </para>
	    </entry>

	    <entry>
	      <para>
		Replies <command>250</command> to indicate that the
		message has been accepted.
	      </para>
	    </entry>
	  </row>

	  <row>
	    <entry>
	      <para>
		If there are more messages to be delivered, issues the
		next <command>MAIL FROM:</command> command.
		Otherwise, it says <command>QUIT</command>, or in rare
		cases, simply disconnects.
	      </para>
	    </entry>

	    <entry>
	      <para>
		Disconnects.
	      </para>
	    </entry>
	  </row>
	</tbody>
      </tgroup>
    </table>
  </section>
</chapter>