<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML
><HEAD
><TITLE
>Message data checks</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
REL="HOME"
TITLE="Spam Filtering for Mail Exchangers"
HREF="index.html"><LINK
REL="UP"
TITLE="Techniques"
HREF="techniques.html"><LINK
REL="PREVIOUS"
TITLE="Sender Authorization Schemes"
HREF="senderauth.html"><LINK
REL="NEXT"
TITLE="Blocking Collateral Spam"
HREF="collateral.html"></HEAD
><BODY
CLASS="section"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Spam Filtering for Mail Exchangers: </TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="senderauth.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
>Chapter 2. Techniques</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="collateral.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="section"
><H1
CLASS="section"
><A
NAME="datachecks"
></A
>2.6. Message data checks</H1
><P
>      The time has come to look at the content of the message itself.
      This is what conventional spam and virus scanners do, as they
      normally operate on the message after it has been accepted.
      However, in our case, we perform these checks
      <EM
>before</EM
> issuing the final
      <B
CLASS="command"
>250</B
> response, so that we have a chance to
      reject the mail on the spot rather than later generating <A
HREF="gloss.html#colspam"
><I
CLASS="glossterm"
>Collateral Spam</I
></A
>.
    </P
><P
>&#13;      If your incoming mail exchangers are very busy (i.e. large site,
      few machines), you may find that performing some or all of these
      checks directly in the mail exchanger is too costly.  In
      particular, running <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
> and <A
HREF="datachecks.html#spamscanners"
>Spam Scanners</A
> takes up a fair amount of CPU
      capacity and time.
    </P
><P
>&#13;      If so, you will want to set up dedicated machines for these
      scanning operations.  Most server-side anti-spam and anti-virus
      software can be invoked over the network, i.e. from your mail
      exchanger.  More on this in the following chapters, where we
      discuss implementation for the various MTAs.
    </P
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="headerchecks"
></A
>2.6.1. Header checks</H2
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headersmissing"
></A
>2.6.1.1. Missing Header Lines</H3
><P
>&#13;	  <A
HREF="http://www.ietf.org/rfc/rfc2822.txt"
TARGET="_top"
>RFC
	  2822</A
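> strictly requires only
	  <B
CLASS="command"
>Date:</B
> and <B
CLASS="command"
>From:</B
>, and says that
	  <B
CLASS="command"
>Message-ID:</B
> <EM
>should</EM
> be present.  In practice, mainstream
	  mail software also adds a recipient and a subject line, so a
	  legitimate message can be expected to contain at least the
	  following header lines: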

<TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="screen"
>&#13;From: ...
To: ...
Subject: ...
Message-ID: ...
Date: ...
</PRE
></FONT
></TD
></TR
></TABLE
>
	</P
><P
>	  The absence of any of these lines suggests that the message
	  was not generated by a mainstream <A
HREF="gloss.html#mua"
><I
CLASS="glossterm"
>Mail User Agent</I
></A
>, and
	  that it is probably junk
	  <A
NAME="AEN1045"
HREF="#FTN.AEN1045"
><SPAN
CLASS="footnote"
>[1]</SPAN
></A
>.
	</P
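><P
>	  As a rough sketch of this check, the Python fragment below
	  uses the standard <TT
CLASS="literal"
>email</TT
> package to list which of the five
	  header lines above are absent from a message.  The function
	  name <TT
CLASS="function"
>missing_headers</TT
> and the sample message are
	  made up for this illustration; a real mail exchanger would
	  express the same test in its own filter or ACL language.
	</P
><PRE
CLASS="programlisting"
>
```python
from email import message_from_bytes

# The five header lines listed above.
EXPECTED_HEADERS = ("From", "To", "Subject", "Message-ID", "Date")


def missing_headers(raw_message: bytes) -> list:
    """Return the expected header names that do not appear in the message."""
    msg = message_from_bytes(raw_message)
    # Header lookup in the email package is case-insensitive.
    return [name for name in EXPECTED_HEADERS if msg.get(name) is None]


# Hypothetical sample message without Message-ID: and Date: lines.
sample = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: lunch\r\n"
    b"\r\n"
    b"See you at noon.\r\n"
)
print(missing_headers(sample))
```
</PRE
><P
>	  For the sample message this reports that
	  <B
CLASS="command"
>Message-ID:</B
> and <B
CLASS="command"
>Date:</B
> are missing.  Keeping in
	  mind the footnote about bounced messages, a production rule
	  might exempt messages with an empty envelope sender from the
	  <B
CLASS="command"
>Message-ID:</B
> test.
	</P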
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headersyntax"
></A
>2.6.1.2. Header Address Syntax Check</H3
><P
>&#13;	  Addresses presented in the message header (i.e. the
	  <B
CLASS="command"
>To:</B
>, <B
CLASS="command"
>Cc:</B
>,
	  <B
CLASS="command"
>From:</B
> ... fields) should be syntactically
	  valid.  Enough said.
	</P
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headeraddress"
></A
>2.6.1.3. Simple Header Address Validation</H3
><P
>&#13;	  For each address in the message header:
	</P
><P
></P
><UL
><LI
><P
>&#13;	      If the address is local, is the <EM
>local
	      part</EM
> (before the @ sign) a valid mailbox?
	    </P
></LI
><LI
><P
>&#13;	      If the address is remote, does the <EM
>domain
	      part</EM
> (after the @ sign) exist?
	    </P
></LI
></UL
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headercallout"
></A
>2.6.1.4. Header Address Callout Verification</H3
><P
>	  This works similarly to <A
HREF="smtpchecks.html#callback"
>Sender Callout Verification</A
> and <A
HREF="smtpchecks.html#callforward"
>Recipient Callout Verification</A
>.  Each remote header address is
	  verified by calling the primary MX for the corresponding
	  domain to determine if a <A
HREF="gloss.html#dsn"
><I
CLASS="glossterm"
>Delivery Status Notification</I
></A
> would be
	  accepted.
	</P
></DIV
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="jmsr"
></A
>2.6.2. Junk Mail Signature Repositories</H2
><P
>&#13;	One trait of junk mail is that it is sent to a large number of
	addresses.  If 50 other recipients have already flagged a
	particular message as spam, why couldn't you use this fact to
	decide whether or not to accept the message when it is
	delivered to you?  Better yet, why not set up <A
HREF="gloss.html#spamtrap"
><I
CLASS="glossterm"
>Spam Trap</I
></A
>s that feed a public pool of known spam?
      </P
><P
>&#13;	I am glad you asked.  As it turns out, such pools do exist:
      </P
><P
></P
><UL
><LI
><P
>&#13;	    <A
HREF="http://razor.sf.net/"
TARGET="_top"
>Razor</A
>
	  </P
></LI
><LI
><P
>&#13;	    <A
HREF="http://pyzor.sf.net/"
TARGET="_top"
>Pyzor</A
>
	  </P
></LI
><LI
><P
>&#13;	    <A
HREF="http://rhyolite.com/anti-spam/dcc/"
TARGET="_top"
>Distributed
	    Checksum Clearinghouse (DCC)</A
>
	  </P
></LI
></UL
><P
>&#13;	These tools have progressed beyond simple signature checks
	that only trigger if you receive an identical copy of a
	message that is known to be junk mail.  Rather, they evaluate
	common patterns, to account for slight variations in the
	message header and body.
      </P
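><P
>	To illustrate the general idea of a signature that survives
	slight variations, the sketch below normalizes a message body
	before hashing it.  This is only a toy: the normalization
	rules and the name <TT
CLASS="function"
>fuzzy_signature</TT
> are
	assumptions made for this example, and Razor, Pyzor and DCC
	each use their own, more elaborate algorithms.
      </P
><PRE
CLASS="programlisting"
>
```python
import hashlib
import re


def fuzzy_signature(body: str) -> str:
    """Hash a body after discarding details that spammers commonly vary."""
    text = body.lower()
    text = re.sub(r"[0-9]+", "", text)    # drop numbers (tracking IDs, prices)
    text = re.sub(r"[^a-z]+", " ", text)  # collapse punctuation and whitespace
    return hashlib.sha1(text.strip().encode("ascii")).hexdigest()


# Two hypothetical variants of the same junk message.
first = fuzzy_signature("Hello!  Your order #48213 is READY.")
second = fuzzy_signature("hello, your order #77 is ready")
print(first == second)
```
</PRE
><P
>	Both variants reduce to the same normalized text, so they
	yield the same signature even though the raw messages differ.
      </P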
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="garbagechars"
></A
>2.6.3. Binary garbage checks</H2
><P
>&#13;	Messages containing non-printable characters are rare.  When
	they do show up, the message is nearly always a virus, or in
	some cases spam written in a non-Western language, without the
	appropriate MIME encoding.
      </P
><P
>&#13;	One particular case is where the message contains NUL
	characters (ordinal zero).  Even if you decide that figuring
	out what a <EM
>non-printable</EM
> character means
	is more complex than beneficial, you might consider checking
	for this character.  That is because some <A
HREF="gloss.html#mda"
><I
CLASS="glossterm"
>Mail Delivery Agent</I
></A
>s, such as the <A
HREF="http://asg.web.cmu.edu/cyrus/"
TARGET="_top"
>Cyrus Mail Suite</A
>,
	will ultimately reject messages that contain it
	<A
NAME="AEN1096"
HREF="#FTN.AEN1096"
><SPAN
CLASS="footnote"
>[2]</SPAN
></A
>.

	If you use such software, you should definitely consider
	getting rid of NUL characters.
      </P
><P
>&#13;	On the other hand, the (now obsolete) RFC 822 specification
	did not explicitly prohibit NUL characters in the message.
	For this reason, as an alternative to rejecting messages that
	contain them, you may choose to strip these characters from
	the message before delivering it to Cyrus.
      </P
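><P
>	Both options are easy to express on the raw message bytes.
	The following is a minimal sketch, not taken from any
	particular MTA; the function names <TT
CLASS="function"
>contains_nul</TT
>
	and <TT
CLASS="function"
>strip_nul</TT
> are invented for this example.
      </P
><PRE
CLASS="programlisting"
>
```python
def contains_nul(raw_message: bytes) -> bool:
    """True if the message contains at least one NUL (ordinal zero) byte."""
    return b"\x00" in raw_message


def strip_nul(raw_message: bytes) -> bytes:
    """Return a copy of the message with every NUL byte removed."""
    return raw_message.replace(b"\x00", b"")


# Hypothetical message with a NUL byte in the body.
raw = b"Subject: test\r\n\r\nbody\x00 text\r\n"
print(contains_nul(raw))
print(strip_nul(raw))
```
</PRE
><P
>	A mail exchanger would use the first function to decide on a
	rejection, or the second one just before handing the message
	to the delivery agent.
      </P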
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="mimeerrors"
></A
>2.6.4. MIME checks</H2
><P
>	Similarly, it might be worthwhile to validate the MIME
	structure of incoming messages.  MIME decoding errors or
	inconsistencies do not happen very often; but when they do,
	the message is almost certainly junk.  Moreover, such errors may
	indicate potential problems in subsequent checks, such as
	<A
HREF="datachecks.html#fileext"
>File Attachment Check</A
>s, <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
>,
	or <A
HREF="datachecks.html#spamscanners"
>Spam Scanners</A
>.
      </P
><P
>&#13;	In other words, if the MIME encoding is illegal, reject the
	message.
      </P
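><P
>	As an illustration of what such a validation can look like,
	the sketch below relies on the parser in Python's standard
	<TT
CLASS="literal"
>email</TT
> package, which records structural problems as
	<SPAN
CLASS="QUOTE"
>"defects"</SPAN
> instead of failing.  The function name
	<TT
CLASS="function"
>mime_defects</TT
> and the sample message are
	assumptions for this example; which defects justify a
	rejection remains a local policy decision.
      </P
><PRE
CLASS="programlisting"
>
```python
from email import message_from_bytes


def mime_defects(raw_message: bytes) -> list:
    """Return the names of the structural problems the parser noticed."""
    msg = message_from_bytes(raw_message)
    names = []
    # walk() visits the top-level message and every nested MIME part.
    for part in msg.walk():
        names.extend(type(defect).__name__ for defect in part.defects)
    return names


# Hypothetical multipart message that never declares its boundary string.
broken = (
    b"From: alice@example.com\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/mixed\r\n"
    b"\r\n"
    b"This part can never be located.\r\n"
)
print(mime_defects(broken))
```
</PRE
><P
>	An empty list means the parser found nothing to complain
	about; for the sample above the list is not empty, because a
	multipart message without a boundary cannot be split into
	its parts.
      </P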
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="fileext"
></A
>2.6.5. File Attachment Check</H2
><P
>&#13;	When was the last time someone sent you a Windows screensaver
	(<SPAN
CLASS="QUOTE"
>".scr"</SPAN
> file) or Windows Program Information File
	(<SPAN
CLASS="QUOTE"
>".pif"</SPAN
>) that you actually wanted?
      </P
><P
>&#13;	Consider blocking messages with <SPAN
CLASS="QUOTE"
>"Windows
	executable"</SPAN
> file attachment(s) - i.e. file names that
	end with a period followed by any of a number of three-letter
	combinations such as the above.  This check consumes
	significantly fewer resources on your server than <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
>, and may also catch new viruses for
	which a signature does not yet exist in your anti-virus
	scanner.
      </P
><P
>&#13;	For a more-or-less comprehensive list of such <SPAN
CLASS="QUOTE"
>"file name
	extensions"</SPAN
>, please visit: <A
HREF="http://support.microsoft.com/default.aspx?scid=kb;EN-US;290497"
TARGET="_top"
>http://support.microsoft.com/default.aspx?scid=kb;EN-US;290497</A
>.
      </P
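><P
>	A file name check of this kind could be sketched as follows
	in Python.  The extension set is a small illustrative subset
	rather than the full list referenced above, and
	<TT
CLASS="function"
>blocked_attachments</TT
> is a name made up for this
	example.
      </P
><PRE
CLASS="programlisting"
>
```python
from email.message import EmailMessage

# Illustrative subset only; the list referenced above is much longer.
BLOCKED_EXTENSIONS = {".scr", ".pif", ".exe", ".bat", ".com"}


def blocked_attachments(msg: EmailMessage) -> list:
    """Return attachment file names that end in a blocked extension."""
    hits = []
    for part in msg.walk():
        filename = part.get_filename()
        # Compare in lower case so that ".SCR" is caught as well.
        if filename and filename.lower().endswith(tuple(BLOCKED_EXTENSIONS)):
            hits.append(filename)
    return hits


# Hypothetical message with one suspicious and one harmless attachment.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "holiday photos"
msg.set_content("See attached.")
msg.add_attachment(b"MZ", maintype="application",
                   subtype="octet-stream", filename="photos.JPG.scr")
msg.add_attachment("hello", filename="notes.txt")
print(blocked_attachments(msg))
```
</PRE
><P
>	Only the last extension of a file name is tested, so a
	double extension such as <SPAN
CLASS="QUOTE"
>".JPG.scr"</SPAN
> is still
	caught, while the plain text attachment passes.
      </P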
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="virusscanners"
></A
>2.6.6. Virus Scanners</H2
><P
>&#13;	A number of different server-side virus scanners are
	available.  To name a few:
      </P
><P
></P
><UL
><LI
><P
>&#13;	    <A
HREF="http://www.vanja.com/tools/sophie/"
TARGET="_top"
>Sophie</A
>
	  </P
></LI
><LI
><P
>&#13;	    <A
HREF="http://www.kaspersky.com/"
TARGET="_top"
>KAVDaemon</A
>
	  </P
></LI
><LI
><P
>&#13;	    <A
HREF="http://clamav.elektrapro.com/"
TARGET="_top"
>ClamAV</A
>
	  </P
></LI
><LI
><P
>&#13;	    <A
HREF="http://www.sald.com/"
TARGET="_top"
>DrWeb</A
>
	  </P
></LI
></UL
><P
>&#13;	In situations where you are not willing to block all
	potentially dangerous files based on their file names alone
	(consider <SPAN
CLASS="QUOTE"
>".zip"</SPAN
> files), such scanners are
	helpful.  Also, they will be able to catch viruses that are
	not transmitted as file attachments, such as the
	<SPAN
CLASS="QUOTE"
>"Bagle.R"</SPAN
> virus that appeared in March 2004.
      </P
><P
>&#13;	In most cases, the machine performing the virus scan does not
	need to be your mail exchanger.  Most of these anti-virus
	scanners can be invoked on a different host over a network
	connection.
      </P
><P
>	Anti-virus software mainly detects viruses based on a set of
	signatures for known viruses, or <EM
>virus
	definitions</EM
>.  These need to be updated regularly,
	as new viruses are developed.  The software itself should
	also be kept up to date for maximum accuracy.
      </P
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="spamscanners"
></A
>2.6.7. Spam Scanners</H2
><P
>&#13;	Similarly, anti-spam software can be used to classify messages
	based on a large set of heuristics, including their content,
	standards compliance, and various network checks such as <A
HREF="dnschecks.html#dnsbl"
>DNS Blacklists</A
> and <A
HREF="datachecks.html#jmsr"
>Junk Mail Signature Repositories</A
>.  In the end,
	such software typically assigns a composite
	<SPAN
CLASS="QUOTE"
>"score"</SPAN
> to each message, indicating the
	likelihood that the message is spam, and classifies the message
	as spam if the score is above a certain threshold.
      </P
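><P
>	The scoring principle can be sketched in a few lines of
	Python.  The three rules, their scores and the threshold
	below are entirely hypothetical and far simpler than what
	real anti-spam software ships with; they only show how
	individual test results add up to a composite score.
      </P
><PRE
CLASS="programlisting"
>
```python
# Hypothetical rules: (description, test function, score).
RULES = [
    ("subject is all upper case",
     lambda m: m["subject"].isupper(), 1.5),
    ("body mentions 'free money'",
     lambda m: "free money" in m["body"].lower(), 2.5),
    ("sending host is on a DNS blacklist",
     lambda m: m["dnsbl_listed"], 3.0),
]
THRESHOLD = 5.0  # arbitrary cut-off chosen for this example


def spam_score(message: dict) -> float:
    """Add up the scores of every rule that matches the message."""
    return sum(score for _, test, score in RULES if test(message))


def is_spam(message: dict) -> bool:
    """Classify the message as spam if its score is above the threshold."""
    return spam_score(message) > THRESHOLD


# Hypothetical message, reduced to the fields the rules look at.
sample = {"subject": "FREE OFFER", "body": "Claim your free money now",
          "dnsbl_listed": True}
print(spam_score(sample), is_spam(sample))
```
</PRE
><P
>	All three rules match the sample, so its score of 7.0 is
	above the threshold and the message is classified as spam.
      </P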
><P
>&#13;	Two of the most popular server-side heuristic anti-spam
	filters are:

	<P
></P
><UL
><LI
><A
NAME="spamassassin"
></A
><P
>&#13;	      <A
HREF="http://www.spamassassin.org/"
TARGET="_top"
>SpamAssassin</A
>
	    </P
></LI
><LI
><A
NAME="brightmail"
></A
><P
>&#13;	      <A
HREF="http://www.brightmail.com/"
TARGET="_top"
>BrightMail</A
>
	    </P
></LI
></UL
>
      </P
><P
>&#13;	These tools undergo a constant evolution as spammers find ways
	to circumvent their various checks.  For instance, consider
	<SPAN
CLASS="QUOTE"
>"creative"</SPAN
> spelling, such as <SPAN
CLASS="QUOTE"
>"GR0W lO
	1NCH35"</SPAN
>.  So, just like anti-virus software, if you use
	anti-spam software, you should update it frequently for the
	highest level of accuracy.
      </P
><P
>&#13;	I use SpamAssassin, although to minimize impact on machine
	resources, it is no longer my first line of defense.  Out of
	approximately 500 junk mail delivery attempts to my personal
	address per day, about 50 reach the point where they are being
	checked by SpamAssassin (mainly because they are forwarded
	from one of my other accounts, so the checks described above
	are not effective).  Out of these 50 messages, one message
	ends up in my inbox approximately every 2 or 3 days.
      </P
></DIV
></DIV
><H3
CLASS="FOOTNOTES"
>Notes</H3
><TABLE
BORDER="0"
CLASS="FOOTNOTES"
WIDTH="100%"
><TR
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="5%"
><A
NAME="FTN.AEN1045"
HREF="datachecks.html#AEN1045"
><SPAN
CLASS="footnote"
>[1]</SPAN
></A
></TD
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="95%"
><P
>&#13;	      Some specialized MTAs, such as certain mailing list
	      servers, do not automatically generate a
	      <TT
CLASS="option"
>Message-ID:</TT
> header for
	      <SPAN
CLASS="QUOTE"
>"bounced"</SPAN
> messages (<A
HREF="gloss.html#dsn"
><I
CLASS="glossterm"
>Delivery Status Notification</I
></A
>s).  These messages are identified by an
	      empty <A
HREF="gloss.html#envfrom"
><I
CLASS="glossterm"
>Envelope Sender</I
></A
>.
	    </P
></TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="5%"
><A
NAME="FTN.AEN1096"
HREF="datachecks.html#AEN1096"
><SPAN
CLASS="footnote"
>[2]</SPAN
></A
></TD
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="95%"
><P
>&#13;	    The IMAP protocol does not allow for NUL characters to be
	    transmitted to the mail user agent, so the Cyrus
	    developers decided that the easiest way to deal with messages
	    that contain such characters was to reject them.
	  </P
></TD
></TR
></TABLE
></BODY
></HTML
>