<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML
><HEAD
><TITLE
>Message data checks</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
REL="HOME"
TITLE="Spam Filtering for Mail Exchangers"
HREF="index.html"><LINK
REL="UP"
TITLE="Techniques"
HREF="techniques.html"><LINK
REL="PREVIOUS"
TITLE="Sender Authorization Schemes"
HREF="senderauth.html"><LINK
REL="NEXT"
TITLE="Blocking Collateral Spam"
HREF="collateral.html"></HEAD
><BODY
CLASS="section"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Spam Filtering for Mail Exchangers</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="senderauth.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
>Chapter 2. Techniques</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="collateral.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="section"
><H1
CLASS="section"
><A
NAME="datachecks"
></A
>2.6. Message data checks</H1
><P
>The time has come to look at the content of the message itself.
This is what conventional spam and virus scanners do, as they
normally operate on the message after it has been accepted.
However, in our case, we perform these checks
<EM
>before</EM
> issuing the final
<B
CLASS="command"
>250</B
> response, so that we have a chance to
reject the mail on the spot rather than later generating <A
HREF="gloss.html#colspam"
><I
CLASS="glossterm"
>Collateral Spam</I
></A
>.
</P
><P
>&#13; If your incoming mail exchangers are very busy (i.e. large site,
few machines), you may find that performing some or all of these
checks directly in the mail exchanger is too costly. In
particular, running <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
> and <A
HREF="datachecks.html#spamscanners"
>Spam Scanners</A
> takes up a fair amount of CPU
time.
</P
><P
>&#13; If so, you will want to set up dedicated machines for these
scanning operations. Most server-side anti-spam and anti-virus
software can be invoked over the network, i.e. from your mail
exchanger. More on this in the following chapters, where we
discuss implementation for the various MTAs.
</P
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="headerchecks"
></A
>2.6.1. Header checks</H2
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headersmissing"
></A
>2.6.1.1. Missing Header Lines</H3
><P
>&#13; <A
HREF="http://www.ietf.org/rfc/rfc2822.txt"
TARGET="_top"
>RFC
2822</A
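> strictly requires only the
<B
CLASS="command"
>Date:</B
> and <B
CLASS="command"
>From:</B
> fields, and says that a message
<EM
>should</EM
> carry a <B
CLASS="command"
>Message-ID:</B
>. In practice, however, legitimate mail
nearly always contains at least the following header lines: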
<TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="screen"
>&#13;From: ...
To: ...
Subject: ...
Message-ID: ...
Date: ...
</PRE
></FONT
></TD
></TR
></TABLE
>
</P
><P
>The absence of any of these lines suggests that the message
was not generated by a mainstream <A
HREF="gloss.html#mua"
><I
CLASS="glossterm"
>Mail User Agent</I
></A
>, and
that it is probably junk
<A
NAME="AEN1045"
HREF="#FTN.AEN1045"
><SPAN
CLASS="footnote"
>[1]</SPAN
></A
>.
</P
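><P
>The check described above can be sketched roughly as follows. This is
a minimal illustration using Python's standard <TT
CLASS="option"
>email</TT
> package, not a configuration for any particular MTA. The names
<TT
CLASS="option"
>EXPECTED_HEADERS</TT
>, <TT
CLASS="option"
>missing_headers</TT
> and <TT
CLASS="option"
>is_bounce</TT
> are hypothetical. The <TT
CLASS="option"
>is_bounce</TT
> flag is an assumption that stands for a message with an empty
envelope sender, for which a missing Message-ID: is tolerated.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="programlisting"
>
```python
from email import message_from_bytes

# Header lines that legitimate mail is expected to carry (see the list above).
EXPECTED_HEADERS = ("From", "To", "Subject", "Message-ID", "Date")


def missing_headers(raw_message, is_bounce=False):
    """Return the expected header names that are absent from raw_message.

    raw_message is the message as bytes.  When is_bounce is True (assumed
    to mean an empty envelope sender), a missing Message-ID is tolerated.
    """
    msg = message_from_bytes(raw_message)
    expected = [name for name in EXPECTED_HEADERS
                if not (is_bounce and name == "Message-ID")]
    # Header lookup in the email package is case-insensitive.
    return [name for name in expected if msg.get(name) is None]


sample = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.net\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"No Message-ID or Date header here.\r\n"
)
print(missing_headers(sample))
```
</PRE
></FONT
></TD
></TR
></TABLE
><P
>For the sample message, the function reports that Message-ID and Date
are missing. Whether to reject on one missing line or only on several
is a policy decision left to the administrator.</P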
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headersyntax"
></A
>2.6.1.2. Header Address Syntax Check</H3
><P
>&#13; Addresses presented in the message header (i.e. the
<B
CLASS="command"
>To:</B
>, <B
CLASS="command"
>Cc:</B
>,
<B
CLASS="command"
>From:</B
> ... fields) should be syntactically
valid. Enough said.
</P
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headeraddress"
></A
>2.6.1.3. Simple Header Address Validation</H3
><P
>&#13; For each address in the message header:
</P
><P
></P
><UL
><LI
><P
>&#13; If the address is local, is the <EM
>local
part</EM
> (before the @ sign) a valid mailbox?
</P
></LI
><LI
><P
>&#13; If the address is remote, does the <EM
>domain
part</EM
> (after the @ sign) exist?
</P
></LI
></UL
></DIV
><DIV
CLASS="section"
><H3
CLASS="section"
><A
NAME="headercallout"
></A
>2.6.1.4. Header Address Callout Verification</H3
><P
>This works similarly to <A
HREF="smtpchecks.html#callback"
>Sender Callout Verification</A
> and <A
HREF="smtpchecks.html#callforward"
>Recipient Callout Verification</A
>. Each remote header address is
verified by calling the primary MX for the corresponding
domain to determine if a <A
HREF="gloss.html#dsn"
><I
CLASS="glossterm"
>Delivery Status Notification</I
></A
> would be
accepted.
</P
></DIV
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="jmsr"
></A
>2.6.2. Junk Mail Signature Repositories</H2
><P
>&#13; One trait of junk mail is that it is sent to a large number of
addresses. If 50 other recipients have already flagged a
particular message as spam, why couldn't you use this fact to
decide whether or not to accept the message when it is
delivered to you? Better yet, why not set up <A
HREF="gloss.html#spamtrap"
><I
CLASS="glossterm"
>Spam Trap</I
></A
>s that feed a public pool of known spam?
</P
><P
>&#13; I am glad you asked. As it turns out, such pools do exist:
</P
><P
></P
><UL
><LI
><P
>&#13; <A
HREF="http://razor.sf.net/"
TARGET="_top"
>Razor</A
>
</P
></LI
><LI
><P
>&#13; <A
HREF="http://pyzor.sf.net/"
TARGET="_top"
>Pyzor</A
>
</P
></LI
><LI
><P
>&#13; <A
HREF="http://rhyolite.com/anti-spam/dcc/"
TARGET="_top"
>Distributed
Checksum Clearinghouse (DCC)</A
>
</P
></LI
></UL
><P
>&#13; These tools have progressed beyond simple signature checks
that only trigger if you receive an identical copy of a
message that is known to be junk mail. Rather, they evaluate
common patterns, to account for slight variations in the
message header and body.
</P
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="garbagechars"
></A
>2.6.3. Binary garbage checks</H2
><P
>&#13; Messages containing non-printable characters are rare. When
they do show up, the message is nearly always a virus, or in
some cases spam written in a non-western language, without the
appropriate MIME encoding.
</P
><P
>&#13; One particular case is where the message contains NUL
characters (ordinal zero). Even if you decide that figuring
out what a <EM
>non-printable</EM
> character means
is more complex than beneficial, you might consider checking
for this character. That is because some <A
HREF="gloss.html#mda"
><I
CLASS="glossterm"
>Mail Delivery Agent</I
></A
>s, such as the <A
HREF="http://asg.web.cmu.edu/cyrus/"
TARGET="_top"
>Cyrus Mail Suite</A
>,
will ultimately reject mails that contain it
<A
NAME="AEN1096"
HREF="#FTN.AEN1096"
><SPAN
CLASS="footnote"
>[2]</SPAN
></A
>.
If you use such software, you should definitely consider
getting rid of NUL characters.
</P
><P
>&#13; On the other hand, the (now obsolete) RFC 822 specification
did not explicitly prohibit NUL characters in the message.
For this reason, as an alternative to rejecting mails
containing it, you may choose to strip these characters from
the message before delivering it to Cyrus.
</P
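><P
>Both options, rejecting and stripping, come down to a very small
operation on the raw message. The following sketch assumes the message
is available as a Python <TT
CLASS="option"
>bytes</TT
> object. The function names are hypothetical, and how the result is
wired into your MTA is not shown.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="programlisting"
>
```python
def contains_nul(raw_message):
    """Return True if the raw message (bytes) contains a NUL byte."""
    return b"\x00" in raw_message


def strip_nul(raw_message):
    """Return a copy of the raw message with every NUL byte removed."""
    return raw_message.replace(b"\x00", b"")


sample = b"Subject: test\r\n\r\nbody with a NUL\x00 byte\r\n"
if contains_nul(sample):
    # Either reject the message here, or clean it before delivery.
    cleaned = strip_nul(sample)
    print(cleaned)
```
</PRE
></FONT
></TD
></TR
></TABLE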
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="mimeerrors"
></A
>2.6.4. MIME checks</H2
><P
>Similarly, it might be worthwhile to validate the MIME
structure of incoming messages.  MIME decoding errors or
inconsistencies do not happen very often; but when they do,
the message is almost certainly junk.  Moreover, such errors may
indicate potential problems in subsequent checks, such as
<A
HREF="datachecks.html#fileext"
>File Attachment Check</A
>s, <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
>,
or <A
HREF="datachecks.html#spamscanners"
>Spam Scanners</A
>.
</P
><P
>In other words, if the MIME encoding is invalid, reject the
message.
</P
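><P
>As a rough illustration of such a check, Python's standard
<TT
CLASS="option"
>email</TT
> parser records structural problems it encounters in a
<TT
CLASS="option"
>defects</TT
> list on each message part. The sketch below collects them; the
function name <TT
CLASS="option"
>mime_defects</TT
> is hypothetical. Note that this parser is lenient and only reports
the problems it knows about, so an empty result is not proof that a
message is well-formed.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="programlisting"
>
```python
from email import message_from_bytes


def mime_defects(raw_message):
    """Collect parser defects from the message and all of its MIME parts."""
    msg = message_from_bytes(raw_message)
    defects = []
    for part in msg.walk():
        defects.extend(part.defects)
    return defects


# A message that claims to be multipart but declares no boundary.
broken = (
    b"From: alice@example.com\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/mixed\r\n"
    b"\r\n"
    b"This multipart message declares no boundary.\r\n"
)
for defect in mime_defects(broken):
    print(type(defect).__name__)
```
</PRE
></FONT
></TD
></TR
></TABLE
><P
>A mail exchanger applying the policy above would reject the message
whenever the returned list is not empty.</P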
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="fileext"
></A
>2.6.5. File Attachment Check</H2
><P
>&#13; When was the last time someone sent you a Windows screensaver
(<SPAN
CLASS="QUOTE"
>".scr"</SPAN
> file) or Windows Program Information File
(<SPAN
CLASS="QUOTE"
>".pif"</SPAN
>) that you actually wanted?
</P
><P
>&#13; Consider blocking messages with <SPAN
CLASS="QUOTE"
>"Windows
executable"</SPAN
> file attachment(s) - i.e. file names that
end with a period followed by any of a number of three-letter
combinations such as the above.  This check consumes
significantly fewer resources on your server than <A
HREF="datachecks.html#virusscanners"
>Virus Scanners</A
>, and may also catch new viruses for
which a signature does not yet exist in your anti-virus
scanner.
</P
><P
>&#13; For a more-or-less comprehensive list of such <SPAN
CLASS="QUOTE"
>"file name
extensions"</SPAN
>, please visit: <A
HREF="http://support.microsoft.com/default.aspx?scid=kb;EN-US;290497"
TARGET="_top"
>http://support.microsoft.com/default.aspx?scid=kb;EN-US;290497</A
>.
</P
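><P
>A file name check of this kind can be sketched as follows, again
using Python's standard <TT
CLASS="option"
>email</TT
> package. The set <TT
CLASS="option"
>BLOCKED_EXTENSIONS</TT
> and the function <TT
CLASS="option"
>blocked_attachments</TT
> are hypothetical names. Only ".scr" and ".pif" come from the text
above; the other extensions are assumptions added for illustration,
and a real list would be considerably longer.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="programlisting"
>
```python
import os
from email import message_from_bytes
from email.message import EmailMessage

# Illustrative subset only; extend it from a fuller list of extensions.
BLOCKED_EXTENSIONS = {".scr", ".pif", ".exe", ".bat"}


def blocked_attachments(raw_message):
    """Return attachment file names whose extension is in BLOCKED_EXTENSIONS."""
    msg = message_from_bytes(raw_message)
    hits = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename is None:
            continue  # this part carries no file name
        # Compare case-insensitively, so "X.SCR" matches ".scr".
        extension = os.path.splitext(filename)[1].lower()
        if extension in BLOCKED_EXTENSIONS:
            hits.append(filename)
    return hits


# Build a small test message with one suspicious attachment.
demo = EmailMessage()
demo["From"] = "alice@example.com"
demo["To"] = "bob@example.net"
demo["Subject"] = "holiday pictures"
demo.set_content("See attached.")
demo.add_attachment(b"MZ", maintype="application",
                    subtype="octet-stream", filename="beach.SCR")
print(blocked_attachments(demo.as_bytes()))
```
</PRE
></FONT
></TD
></TR
></TABLE
><P
>Because only the declared file name is inspected, no attachment needs
to be decoded or unpacked, which is what keeps this check cheap
compared to a full virus scan.</P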
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="virusscanners"
></A
>2.6.6. Virus Scanners</H2
><P
>&#13; A number of different server-side virus scanners are
available. To name a few:
</P
><P
></P
><UL
><LI
><P
>&#13; <A
HREF="http://www.vanja.com/tools/sophie/"
TARGET="_top"
>Sophie</A
>
</P
></LI
><LI
><P
>&#13; <A
HREF="http://www.kapersky.com/"
TARGET="_top"
>KAVDaemon</A
>
</P
></LI
><LI
><P
>&#13; <A
HREF="http://clamav.elektrapro.com/"
TARGET="_top"
>ClamAV</A
>
</P
></LI
><LI
><P
>&#13; <A
HREF="http://www.sald.com/"
TARGET="_top"
>DrWeb</A
>
</P
></LI
></UL
><P
>&#13; In situations where you are not willing to block all
potentially dangerous files based on their file names alone
(consider <SPAN
CLASS="QUOTE"
>".zip"</SPAN
> files), such scanners are
helpful.  Also, they will be able to catch viruses that are
not transmitted as file attachments, such as the
<SPAN
CLASS="QUOTE"
>"Bagle.R"</SPAN
> virus that arrived in March 2004.
</P
><P
>&#13; In most cases, the machine performing the virus scan does not
need to be your mail exchanger. Most of these anti-virus
scanners can be invoked on a different host over a network
connection.
</P
><P
>Anti-virus software mainly detects viruses based on a set of
signatures for known viruses, or <EM
>virus
definitions</EM
>.  These need to be updated regularly,
as new viruses are developed.  Also, the software itself
should be kept up to date at all times for maximum accuracy.
</P
></DIV
><DIV
CLASS="section"
><H2
CLASS="section"
><A
NAME="spamscanners"
></A
>2.6.7. Spam Scanners</H2
><P
>&#13; Similarly, anti-spam software can be used to classify messages
based on a large set of heuristics, including their content,
standards compliance, and various network checks such as <A
HREF="dnschecks.html#dnsbl"
>DNS Blacklists</A
> and <A
HREF="datachecks.html#jmsr"
>Junk Mail Signature Repositories</A
>.  In the end,
such software typically assigns a composite
<SPAN
CLASS="QUOTE"
>"score"</SPAN
> to each message, indicating the
likelihood that the message is spam, and classifies the message
as spam if the score is above a certain threshold.
</P
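><P
>The scoring idea can be illustrated with a toy example. The rule
names, weights and threshold below are invented for this sketch and
do not correspond to the rules of any real anti-spam product; actual
filters use far larger rule sets with carefully tuned weights.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="programlisting"
>
```python
# Hypothetical rule names and weights, chosen only for illustration.
RULE_WEIGHTS = {
    "missing_message_id": 1.5,
    "listed_in_dnsbl": 3.0,
    "known_junk_signature": 4.0,
    "obfuscated_spelling": 2.0,
}
SPAM_THRESHOLD = 5.0  # assumed cut-off, not taken from any product


def spam_score(triggered_rules):
    """Sum the weights of the rules that fired for one message."""
    return sum(RULE_WEIGHTS[name] for name in triggered_rules)


def is_spam(triggered_rules, threshold=SPAM_THRESHOLD):
    """Classify a message as spam when its score is above the threshold."""
    return spam_score(triggered_rules) > threshold


print(is_spam(["missing_message_id"]))
print(is_spam(["listed_in_dnsbl", "known_junk_signature"]))
```
</PRE
></FONT
></TD
></TR
></TABLE
><P
>A single weak indicator scores 1.5 and stays below the threshold,
while the combination of a blacklist hit and a known signature scores
7.0 and is classified as spam.</P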
><P
>&#13; Two of the most popular server-side heuristic anti-spam
filters are:
<P
></P
><UL
><LI
><A
NAME="spamassassin"
></A
><P
>&#13; <A
HREF="http://www.spamassassin.org/"
TARGET="_top"
>SpamAssassin</A
>
</P
></LI
><LI
><A
NAME="brightmail"
></A
><P
>&#13; <A
HREF="http://www.brightmail.com/"
TARGET="_top"
>BrightMail</A
>
</P
></LI
></UL
>
</P
><P
>&#13; These tools undergo a constant evolution as spammers find ways
to circumvent their various checks. For instance, consider
<SPAN
CLASS="QUOTE"
>"creative"</SPAN
> spelling, such as <SPAN
CLASS="QUOTE"
>"GR0W lO
1NCH35"</SPAN
>. So, just like anti-virus software, if you use
anti-spam software, you should update it frequently for the
highest level of accuracy.
</P
><P
>&#13; I use SpamAssassin, although to minimize impact on machine
resources, it is no longer my first line of defense. Out of
approximately 500 junk mail delivery attempts to my personal
address per day, about 50 reach the point where they are being
checked by SpamAssassin (mainly because they are forwarded
from one of my other accounts, so the checks described above
are not effective). Out of these 50 messages, one message
ends up in my inbox approximately every 2 or 3 days.
</P
></DIV
></DIV
><H3
CLASS="FOOTNOTES"
>Notes</H3
><TABLE
BORDER="0"
CLASS="FOOTNOTES"
WIDTH="100%"
><TR
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="5%"
><A
NAME="FTN.AEN1045"
HREF="datachecks.html#AEN1045"
><SPAN
CLASS="footnote"
>[1]</SPAN
></A
></TD
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="95%"
><P
>&#13; Some specialized MTAs, such as certain mailing list
servers, do not automatically generate a
<TT
CLASS="option"
>Message-ID:</TT
> header for
<SPAN
CLASS="QUOTE"
>"bounced"</SPAN
> messages (<A
HREF="gloss.html#dsn"
><I
CLASS="glossterm"
>Delivery Status Notification</I
></A
>s). These messages are identified by an
empty <A
HREF="gloss.html#envfrom"
><I
CLASS="glossterm"
>Envelope Sender</I
></A
>.
</P
></TD
></TR
><TR
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="5%"
><A
NAME="FTN.AEN1096"
HREF="datachecks.html#AEN1096"
><SPAN
CLASS="footnote"
>[2]</SPAN
></A
></TD
><TD
ALIGN="LEFT"
VALIGN="TOP"
WIDTH="95%"
><P
>&#13; The IMAP protocol does not allow for NUL characters to be
transmitted to the mail user agent, so the Cyrus
developers decided that the easiest way to deal with mails
containing such characters was to reject them.
</P
></TD
></TR
></TABLE
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="senderauth.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="collateral.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Sender Authorization Schemes</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="techniques.html"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Blocking Collateral Spam</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>