Nov 19 2003

Making it fun to fight blog spammers

As you can tell from some of the entries here, I get a lot of spam in the comments on my posts. I am sick of it and want to make sure I can react quickly when one of those idiots shows up on my site …

The first remedy was to modify Movable Type’s comment CGI script so that it no longer records working http links for comment authors. This was the single most annoying source of links to places that sell Viagra, HGH and other useless crap. Now when somebody enters an “http://” link it gets magically transformed into an “hddp://” link. Those few good comments and pointers to other web pages are transformed back manually after I have checked them out (you may call it a moderation system for MT comments).
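The transformation itself boils down to a one-line substitution. The actual MT patch isn’t shown here, so this is just a sketch (the variable name is made up):

    # defuse URLs in the comment fields: "http://" becomes "hddp://",
    # so spam links are never stored or rendered as working links
    $comment_text =~ s{http://}{hddp://}gi;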

Next I noticed that I always had an ssh window open with a “tail -f” running, following my server’s access_log. Last night I changed that into a script running (in a secured area) on my server.
The Perl script reads the last 500 lines from my server’s access_log backwards (using the nifty File::ReadBackwards module). It parses each line and separates it into its individual components. All the information is neatly formatted into a table:

Main Log Screen
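
The parsing core of the script looks roughly like this; the log path and the field names are my own choices for the sketch, not necessarily the ones the script uses:

    use strict;
    use warnings;
    use File::ReadBackwards;

    # read the newest 500 lines of the access_log, newest first
    my $bw = File::ReadBackwards->new('/var/log/apache/access_log')
        or die "cannot open access_log: $!";

    my @hits;
    while (defined(my $line = $bw->readline)) {
        last if @hits >= 500;
        # Combined Log Format:
        # host ident user [date] "request" status bytes "referer" "agent"
        next unless $line =~ m/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)(?: "([^"]*)" "([^"]*)")?/;
        push @hits, {
            ip      => $1,
            date    => $2,
            request => $3,
            status  => $4,
            bytes   => $5,
            referer => $6,
            agent   => $7,
        };
    }

Each of those hashes then becomes one row of the table.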

Every 60 seconds the page automatically refreshes and shows me the latest hits. Anything to which the Apache server did not respond with a “200 OK” status code (or a “304” or “301”) is flagged in red. Referer entries (the column is not shown in this screenshot) that point outside of kahunaburger.com are automatically wrapped in a redirect script, which lets me safely visit the page the request was linked from.
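
Both of those are simple per-row checks. A sketch, with the CSS class and the redirect script name made up:

    use URI::Escape;   # for uri_escape()

    # flag anything that is not a normal 200/301/304 response
    my $row_class = ($hit->{status} =~ /^(200|301|304)$/) ? 'ok' : 'flagged';

    # wrap off-site referers in a redirect script before linking to them
    my $referer = $hit->{referer} || '';
    if ($referer =~ m{^https?://}i && $referer !~ m{kahunaburger\.com}i) {
        $referer = '/cgi-bin/redirect.cgi?url=' . uri_escape($referer);
    }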

The IP address is a link back to the log script with one extra argument on the URL. This argument drills down on the specific IP address I selected:

Filter Log Screen
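
Each address in that column is just the script’s own URL with the address appended; the script and parameter names here are invented:

    # build the drill-down link for one row of the table
    my $ip_link = qq{<a href="logview.cgi?ip=$hit->{ip}">$hit->{ip}</a>};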

The detailed view for the IP address tries to find the DNS name for the selected IP address. It then uses CAIDA’s NetGeo API (see NetGeo API) to find location information for that IP address. I know that the data is old, but it works in a lot of cases.
If the NetGeo lookup happens to return a latitude and longitude, the data is fed to Tiger (see http://tiger.census.gov/cgi-bin/mapbrowse-tbl) to produce a little image with a map of where the request supposedly came from (this part only works for the US at the moment).
On the right-hand side I get the whois information for the selected IP address, telling me immediately who owns the particular IP address I’m looking at.
And finally, below all this, I get the last 500 log-file entries from just this particular IP address.
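
The reverse DNS lookup, the whois call and the per-address filter are all short; here is a sketch (the NetGeo and Tiger parts are left out, and in real code $ip should be validated before shelling out):

    use Socket qw(inet_aton AF_INET);

    # reverse DNS for the selected address
    my $name = gethostbyaddr(inet_aton($ip), AF_INET) || '(no reverse DNS)';

    # ownership information via the external whois command
    my $whois = `whois $ip`;

    # only the log entries that came from this address
    my @hits_from_ip = grep { $_->{ip} eq $ip } @hits;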

But that’s not all: at the top of the second screenshot you may notice a link saying Deny all future access (if you notice it, you have damned good eyes). Clicking on this link calls the same script yet another time. This time it updates a file called hosts.deny in my Apache configuration directory: it appends the IP address to the end of the file and then redisplays the current log-file entries.
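Since the file doubles as a RewriteMap text map (one “key value” pair per line), the append step only has to write the address plus a dummy value; a sketch with an assumed path:

    # add the offending address to the RewriteMap text file;
    # the value ("deny") is arbitrary, only the key lookup matters
    open my $deny, '>>', '/path/to/hosts.deny'
        or die "cannot append to hosts.deny: $!";
    print {$deny} "$ip deny\n";
    close $deny;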
The addition to the hosts.deny file is consumed by a mod_rewrite rule in my Apache configuration file:

    RewriteEngine on
    # map of denied hosts/addresses, maintained by the log script
    RewriteMap    hosts-deny  txt:/path/to/hosts.deny
    # deny if either the hostname or the IP address appears in the map
    RewriteCond   ${hosts-deny:%{REMOTE_HOST}|NOT-FOUND} !=NOT-FOUND [OR]
    RewriteCond   ${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND
    # answer every request from a denied client with 403 Forbidden
    RewriteRule   ^/.*  -  [F]

This little section checks the REMOTE_HOST and REMOTE_ADDR variables against the entries in the hosts.deny file. If a match is found, the client automatically gets a “403” status code for any request to my server. If no entry is found, the request is served normally.

So whenever I see some activity I don’t like (blog spamming, exploit checking, …) I just click the Deny all future access link and I’ll not be bothered by that host again.
