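In order to put up a decent "passive" web-site defense against address harvesting, you are going to have to make slight changes to your web server, your DNS, and your mailer. The general idea is to have the web server feed harvesters pages of randomly-generated bogus addresses, have DNS make those addresses look valid, and have the mailer reject whatever is sent to them.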
All of this may be fairly difficult or even impossible when using some operating systems, web servers, or mail servers. However, it turns out to be quite easy on the configuration which I have (Apache web server, SMTPD mailer front-end).
Note: I can't claim ownership of these ideas; almost all of them were originated by others. I'm just publishing a sample implementation.
Important Note: If you do make use of these ideas, please customize
your scripts and names (e.g., use something other than
"/laughing-place", re-word the HTML output of the bogus address
generator, etc.). If lots of people use these scripts, and the output is
essentially identical, then spamware writers will change their code to
recognize this stuff and bypass it. If there are no consistent patterns to
the output, then they won't be able to avoid it easily.
This site's web-harvester re-direction has two parts: known harvesters are recognized by their "HTTP_USER_AGENT" signature, and an invisible link on every page leads any other harvester into a trap directory.
Web harvesters are often smart enough to avoid CGI-BIN programs. So the
implementation (which I got from C. Brabec) is intended to make a script look
like a normal web page. This is done with an Apache rewriting rule, in the
"httpd.conf" file:
# Redirect various web-harvesters to POISON program
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{REQUEST_URI} ^/laughing-place
RewriteRule ^.*$ /cgi-bin/killspam.pl [L,T=application/x-httpd-cgi]
The rewriting rule matches the signatures of spammers' web-harvesters (at
least those signatures which would not also rule out some browsers). No matter
what page they request, they get the output of "killspam.pl". The
line for "laughing-place" also re-directs any access to a
file within that (non-existent) directory.
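The decision made by the rewriting rule can be restated in a few lines of Python. This is only an illustrative sketch, not part of the site's configuration: the function name "should_poison" and the constant names are hypothetical, and the patterns are simply copied from the RewriteCond lines above.

```python
import re

# Patterns copied from the RewriteCond lines; each is anchored at the
# start of the User-Agent string, as in the Apache configuration.
HARVESTER_PATTERNS = [
    r"^EmailSiphon",
    r"^ExtractorPro",
    r"^Mozilla.*NEWT",
    r"^Crescent",
    r"^CherryPicker",
    r"^NICErsPRO",
    r"^WebBandit",
    r"^EmailCollector",
]
TRAP_PREFIX = "/laughing-place"


def should_poison(user_agent: str, request_uri: str) -> bool:
    """Return True when a request would be rewritten to the poison script."""
    # Any one matching agent pattern is enough (the [OR] flags).
    if any(re.search(p, user_agent) for p in HARVESTER_PATTERNS):
        return True
    # The final condition tests the requested URI instead of the agent.
    return request_uri.startswith(TRAP_PREFIX)


print(should_poison("EmailSiphon", "/index.html"))                            # True
print(should_poison("Mozilla/4.0 (compatible)", "/index.html"))               # False
print(should_poison("Mozilla/4.0 (compatible)", "/laughing-place/bait.html")) # True
```

An ordinary browser is poisoned only if it requests something under the trap directory, which a human visitor has no visible way to do.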
The code for "killspam.pl" is
available for download here. You'll see that it
generates a list of bogus addresses, and also includes links to "other" lists
in the non-existent "/laughing-place" directory (which will be
another run of the same address generator).
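For readers who want a feel for what such a generator does before downloading the script, here is a minimal sketch in Python. It is not "killspam.pl" and makes no claim to match it: the function names, the page layout, the address counts, and the placeholder domain "bulk.example.com" are all assumptions made for illustration.

```python
import random
import string

TRAP_DOMAIN = "bulk.example.com"  # hypothetical; substitute your own trap domain
TRAP_DIR = "/laughing-place"      # the non-existent directory named above


def random_word(rng, min_len=4, max_len=10):
    """Return a random lowercase word to serve as <junk> or <garbage>."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def bogus_address(rng):
    """Build <junk>@TRAP_DOMAIN or <junk>@<garbage>.TRAP_DOMAIN."""
    if rng.random() < 0.5:
        return f"{random_word(rng)}@{TRAP_DOMAIN}"
    return f"{random_word(rng)}@{random_word(rng)}.{TRAP_DOMAIN}"


def poison_page(rng, n_addresses=20, n_links=5):
    """Return an HTML page of bogus mailto links plus links to more pages."""
    lines = ["<HTML><BODY>"]
    for _ in range(n_addresses):
        addr = bogus_address(rng)
        lines.append(f'<A HREF="mailto:{addr}">{addr}</A><BR>')
    for _ in range(n_links):
        # Each link points back into the trap directory, so following it
        # just runs the generator again.
        lines.append(f'<A HREF="{TRAP_DIR}/{random_word(rng)}.html">More</A><BR>')
    lines.append("</BODY></HTML>")
    return "\n".join(lines)


if __name__ == "__main__":
    # A CGI program must print a header block before the page body.
    print("Content-Type: text/html\n")
    print(poison_page(random.Random()))
```

Because every "More" link leads back into the trap directory, a harvester that keeps following links keeps collecting fresh junk.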
On each real web page on my site, I include an invisible link to that directory. Actually this is done with a Server-Side Include that is placed in every web page. Each web page on my site has this at the bottom:
<!--#exec cmd="/cgi-bin/tail.pl"-->
</BODY>
</HTML>
Among the code in "tail.pl" is:
# Emit the invisible trap link at the bottom of every page
print <<"END_OF_PRINT";
<P>
<CENTER>
<A HREF="/laughing-place/bait.html"><IMG SRC="/images/tiny.gif"
ALT="Do not follow this link" WIDTH=1 HEIGHT=1 BORDER=0></A>
</CENTER>
END_OF_PRINT
The advantage of the server-side-include is that there is one central
location where changes can be made. You don't have to edit every HTML file on
your site in order to get a global change to the HTML that your web server
hands out. (For example: if I chose to rename "laughing-place"
to something else, I would only have to make the change in "tail.pl"
instead of having to make it in every HTML file on the web server.)
The link is essentially invisible because it is a one-pixel-square transparent .GIF file with no border (you're welcome to copy the image from here). Users wouldn't see the link to click on it, but robot harvesters will follow it.
Just to be nice, I disallow access to "/laughing-place" by
adding the following lines to "robots.txt" (in the root
HTML-document directory):
User-agent: *
Disallow: /laughing-place
Well-behaved robots pay attention to the restrictions in
"/robots.txt" and will not waste time in my spam-trap area.
Spammers' address-harvesters are generally poorly behaved, and will pay the
price for ignoring those restrictions.
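To see how a well-behaved robot would read those two lines, they can be fed to the robots.txt parser in Python's standard library. This is just a quick check of the rule; the robot name "PoliteBot" and the helper "may_fetch" are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The same two lines that were added to robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /laughing-place
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, path):
    """Return True if a rule-abiding robot is allowed to request the path."""
    return parser.can_fetch(user_agent, path)


print(may_fetch("PoliteBot", "/index.html"))                # True
print(may_fetch("PoliteBot", "/laughing-place/bait.html"))  # False
```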
The DNS change is actually the simplest one, but it requires that you be in
charge of DNS for your own domain. If you look at "killspam.pl", you'll see that my
version of the script generates addresses of the form
"<junk>@bulk.stassen.com" and
"<junk>@<garbage>.bulk.stassen.com".
In order for MX-resolving spam-sending programs to think that these randomly-generated domains are valid, the domains don't have to resolve to an IP address, but they do have to have an MX (Mail eXchanger) record. The following lines were added to the domain's zone file to take care of that:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; BULK and *.BULK are spam-traps
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
bulk    IN MX 0 mail.stassen.com.
*.bulk  IN MX 0 mail.stassen.com.
Those lines cause the DNS server to claim that there is an MX record for
"bulk.stassen.com" or any sub-domain of it (e.g.,
"foo.bulk.stassen.com" or
"foo.bar.bulk.stassen.com"). This will direct all E-mail sent to
the bogus addresses to my real mail server. In the next (and final) section,
we'll fix the mail server to properly handle the unwanted E-mail.
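The set of names covered by those two MX records can be described with a small Python predicate. This is a sketch of the matching idea only, assuming no other records exist under "bulk"; it performs no DNS lookups, and the function name "has_trap_mx" is hypothetical.

```python
TRAP_ZONE = "bulk.stassen.com"


def has_trap_mx(hostname: str) -> bool:
    """Return True if the name is bulk.stassen.com or anything beneath it."""
    # DNS names are case-insensitive and may carry a trailing dot.
    name = hostname.rstrip(".").lower()
    return name == TRAP_ZONE or name.endswith("." + TRAP_ZONE)


print(has_trap_mx("bulk.stassen.com"))          # True  (the "bulk" record)
print(has_trap_mx("foo.bar.bulk.stassen.com"))  # True  (the "*.bulk" record)
print(has_trap_mx("www.stassen.com"))           # False (not a trap name)
```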
The configuration changes described above have forced spammers' address-harvesters to swallow a large number of randomly-generated bogus E-mail addresses. We have modified DNS so that the spammers' programs can't tell that the addresses are bad. Unfortunately, the side effect of those changes is to direct a load of unwanted spam at our mail server. (However, it is the only ethical way to go; it wouldn't be fair to direct unwanted spam at others' mail servers.)
In the final step, we deal with the unwanted spam. With SMTPD, it is
trivial to reject it during the SMTP conversation (meaning that the message
never actually gets to this site). The bogus addresses are of the form
"<junk>@bulk.stassen.com" and
"<junk>@<garbage>.bulk.stassen.com".
And so we add the following text to the SMTPD configuration file:
##############################################################################
# TAR-PITTING FOR SPAMMERS
##############################################################################
noto_delay:ALL:ALL:*@bulk.stassen.com *@*.bulk.stassen.com:552 Bad spammer! %F (%H [%I])
We use the "noto_delay" directive so that each individual
recipient is rejected -- and there is a 30-second (configurable at
compile-time) pause before SMTPD delivers each response. Our site and users
will never be bothered with the spam that is generated.
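The effect of the rule on a single recipient can be imitated in Python. This is a rough sketch of the tar-pitting idea, not of SMTPD's actual code: the function name "respond_to_rcpt" and the "250 OK" reply for other recipients are assumptions, shell-style wildcard matching stands in for SMTPD's own pattern rules, and the %F, %H and %I substitutions are left out.

```python
import time
from fnmatch import fnmatchcase

# The two recipient patterns from the noto_delay line.
TRAP_PATTERNS = ["*@bulk.stassen.com", "*@*.bulk.stassen.com"]
DELAY_SECONDS = 30  # the pause described above


def respond_to_rcpt(recipient, sleep=time.sleep):
    """Return the reply for one RCPT TO address, stalling on trap addresses."""
    address = recipient.lower()
    if any(fnmatchcase(address, pattern) for pattern in TRAP_PATTERNS):
        sleep(DELAY_SECONDS)  # make the sender wait before refusing
        return "552 Bad spammer!"
    return "250 OK"
```

Passing the sleep function in as a parameter makes the delay easy to stub out when trying the sketch. A spammer who submits a thousand trap addresses in one session would, under this model, wait 30 seconds for each refusal.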