Dec 10 2004

Caching RSS Proxy in Perl

Somebody recently asked me how I would go about creating a caching RSS proxy. Here’s one potential solution to the problem. All it requires is a web server, some Perl and access to a MySQL database.

The background for this script is the following situation: imagine a big company with a few thousand employees. They all work behind a firewall, and only a few public (transparent) proxies carry traffic from inside the firewall to the outside. As a result, web servers on the outside see only the IP addresses of those few proxies, no matter which employee makes a request.

Now add RSS to the mix. Imagine a few hundred of those employees subscribe to the same popular RSS feeds. Even if each individual employee follows the rules set by the feed provider and only checks the feed every 1-2 hours (or whatever the minimum update interval may be), the RSS server will still see a few hundred hits from the same small set of IP addresses in that period. There have been cases where a server administrator blocked access to a feed because of what looked like an obvious violation of the rules (too many requests for the feed in too little time from the same IP address).

What’s needed in this case is a caching proxy server: an internal system to which everyone behind the firewall sends their feed requests. The internal system checks (based on the URL of the feed) whether it has downloaded the RSS feed recently and, if so, serves the cached copy instead of bothering the RSS server. Once the minimum time between updates has passed, the proxy fetches a fresh copy of the feed.
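The freshness rule described above can be sketched in isolation. This is only an illustration under stated assumptions: the in-memory `%cache`, the `$min_interval` value and the `get_feed` helper are hypothetical names, and the fetch step is passed in as a code reference so the logic can be tried without a network or database.

```perl
use strict;
use warnings;

# Hypothetical in-memory cache: feed URL => { fetched => epoch seconds, body => string }
my %cache;
# Assumed minimum time between upstream fetches, in seconds (one hour)
my $min_interval = 60 * 60;

# Return the feed body for $url. $fetch->($url) is only called when the
# cached copy is missing or older than $min_interval seconds.
sub get_feed {
    my ($url, $fetch, $now) = @_;
    $now = time() unless defined $now;
    my $entry = $cache{$url};
    if ($entry && $now - $entry->{fetched} <= $min_interval) {
        return $entry->{body};    # still fresh: serve the cached copy
    }
    my $body = $fetch->($url);
    if (defined $body) {
        # remember the fresh copy and when we got it
        $cache{$url} = { fetched => $now, body => $body };
        return $body;
    }
    # upstream failed: fall back to the stale copy if there is one
    return $entry ? $entry->{body} : undef;
}
```

Calling `get_feed` twice within the interval invokes the fetch callback only once; a failed refresh keeps serving the old copy rather than returning nothing.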

The script below implements such a caching proxy. It also does some housekeeping on the feeds, displays a list of all known feeds when called without any argument, and compresses feed contents before storing them in the database.
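As a small illustration of the compression step, here is a round trip through `compress` and `uncompress` from Compress::Zlib, the same two functions the script uses. The feed text is made-up sample data, not taken from a real feed.

```perl
use strict;
use warnings;
use Compress::Zlib;    # exports compress() and uncompress()

# Made-up, repetitive feed text; real feeds tend to compress well for the same reason
my $feed = '<rss version="2.0"><channel>'
         . ('<item><title>Example</title></item>' x 50)
         . '</channel></rss>';

my $packed   = compress($feed);        # this is what would be stored
my $restored = uncompress($packed);    # this is what would be served

printf "original: %d bytes, compressed: %d bytes\n", length($feed), length($packed);
```

Note that the compressed value is binary data, which is worth keeping in mind when choosing the column type it is stored in.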

To use it, you would do the following:
1) let’s assume you install the script at http://rssproxy.mycompany.com/rssproxy.pl
2) you want to provide a proxy for the Slashdot RSS feed at http://slashdot.org/index.rss
3) you would ask your users to subscribe to http://rssproxy.mycompany.com/rssproxy.pl/slashdot.org/index.rss instead of the original feed.
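The URL mapping in the steps above can be sketched as two small helpers. The function names `proxy_url_for` and `upstream_url_for` are hypothetical and only illustrate the convention; the script itself does the second step inline by stripping the leading "/" from the path info and prepending "http://".

```perl
use strict;
use warnings;

# Assumed base URL of the proxy script, taken from the example above
my $proxy_base = 'http://rssproxy.mycompany.com/rssproxy.pl';

# Turn an original feed URL into the URL users should subscribe to.
sub proxy_url_for {
    my ($feed_url) = @_;
    (my $rest = $feed_url) =~ s{^http://}{};
    return "$proxy_base/$rest";
}

# Turn the path info the proxy receives back into the upstream feed URL.
# Returns undef when no feed was named (the "list all feeds" case).
sub upstream_url_for {
    my ($path_info) = @_;
    (my $rest = $path_info) =~ s{^/}{};
    return length($rest) ? "http://$rest" : undef;
}
```

For example, `proxy_url_for('http://slashdot.org/index.rss')` yields the proxy URL from step 3, and `upstream_url_for('/slashdot.org/index.rss')` yields the original feed URL again. Only plain http feeds fit this scheme, since the proxy always prepends "http://".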

Hitting the proxy via http://rssproxy.mycompany.com/rssproxy.pl would display a list of all known feeds and how often they’ve been used, etc.

Just after I had finished this I found out that http://www.rsscache.com/ is providing the same functionality as a public service.

I hope someone still finds it useful.

#!/usr/bin/perl

use strict;
use CGI;
use Compress::Zlib;
use DBI;
use LWP::UserAgent;
use HTTP::Response;

# how often do we fetch the rss feed from a site? (every hour in this case)
use constant CHECKTIME => 1*60*60;
# what is our own URL?
use constant SELF => qq{http://rssproxy.mycompany.com/rssproxy.pl};
# what user agent string should we use if the client does not pass one?
use constant USERAGENT => qq{SharpReader/0.9.4.1 (.NET CLR 1.1.4322.573; WinNT 5.1.2600.0)};

# how do we access the database?
my $dbi_db ="rssproxy";
my $dbi_user ="username";
my $dbi_passwd ="password";
my $dbi_datasource ="dbi:mysql:$dbi_db";

#
# database definition (mysql 4.1.3):
#
# CREATE TABLE `rssproxy` (
# `id` int(11) NOT NULL auto_increment,
# `url` varchar(255) NOT NULL default '',
# `lastcheck` int(10) unsigned default '0',
# `lastrequest` int(10) unsigned default '0',
# `hits` int(10) unsigned default '0',
# `added` int(10) unsigned default '0',
# `rss` mediumtext NOT NULL,
# PRIMARY KEY (`id`),
# UNIQUE KEY `index_url` (`url`)
# ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

# -- main ------------------------------
my $cgi=CGI->new;
# get path_info and strip the leading "/"
my $pinfo=$cgi->path_info(); $pinfo =~ s/^\///;
# connect to database
my $dbh=DBI->connect($dbi_datasource,
                     $dbi_user,
                     $dbi_passwd,
                    { PrintError => 0 }
                   ) || die "Can't connect to database! - $DBI::errstr";
# did somebody supply path_info?
unless(defined($pinfo) && length($pinfo)) {
    # no path_info - let's list all RSS feeds we know about
    print $cgi->header();
    print qq{<table border="1"><tr bgcolor="#cccccc"><td colspan="7">}.
      qq{<center><b>rss proxy entries (tobias\@kahunaburger.com)</b></center></td></tr>\n};
    # table header
    print qq{<tr bgcolor="#dddddd"><td>id</td><td>added</td><td>lastcheck</td>};
    print qq{<td>lastrequest</td><td>hits</td><td>rss length</td><td>url</td></tr>\n};

    # list all database entries
    my $sth=$dbh->prepare_cached(qq{select id, url,
                                           added, lastcheck, lastrequest,
                                           hits, length(rss)
                                    from rssproxy});
    $sth->execute();
    while(my $entry = $sth->fetchrow_arrayref()) {
        # 'id' and 'added'
        my $line=qq{<tr><td>$entry->[0]</td><td>}.scalar(localtime($entry->[2])).qq{</td>};
        # 'lastcheck' and 'lastrequest'
        $line .=qq{<td>}.scalar(localtime($entry->[3])).qq{</td>};
        $line .=qq{<td>}.scalar(localtime($entry->[4])).qq{</td>};
        # 'hits' and 'rss length'
        $line .=qq{<td>}.$entry->[5].qq{</td><td>$entry->[6]</td>};
        # 'url'
        $line .=qq{<td><a href="}.SELF.qq{/$entry->[1]">$entry->[1]</a></td></tr>\n};
        print $line;
    }
    $sth->finish();
    $dbh->disconnect();
    print qq{</table>\n};
    # nothing more to do in this case
    exit(0);
}

# path_info supplied - let's see if we know about the rss feed
my ($rss); # this will hold an HTTP::Response object
my $sth=$dbh->prepare_cached(qq{select id, lastcheck, rss, hits
                from rssproxy
                where url = ?});
$sth->execute(lc($pinfo));
my @row = $sth->fetchrow_array;
$sth->finish();
# if @row is non-empty it means that we have an entry for this feed
# (defined(@array) is no longer valid in current versions of Perl)
if(@row) {
    # is it more than CHECKTIME seconds since we last fetched the feed?
    if(time() - $row[1] > CHECKTIME) {
        # time to fetch a new copy
        $rss = fetchRSS(qq{http://}.$pinfo,$cgi->user_agent() || USERAGENT);
        if(defined($rss)) {
            # successfully fetched the feed
            my $sth=$dbh->prepare_cached(qq{update rssproxy set lastcheck = ?, rss = ?
                                            where id = ?});
            $sth->execute(time(), compress($rss->as_string), $row[0]);
            $sth->finish();
        }else{
            # not successful - use old copy
            $rss=HTTP::Response->parse(uncompress($row[2]));
        }
    }else{
        # it's not time to fetch a new feed - use existing copy from DB
        $rss=HTTP::Response->parse(uncompress($row[2]));
    }
    # mark in database when record was last requested and increase hit counter
    # (incrementing in SQL avoids losing hits when requests arrive concurrently)
    my $sth=$dbh->prepare_cached(qq{update rssproxy set lastrequest = ?, hits = hits + 1
                                    where id = ?});
    $sth->execute(time(), $row[0]);
    $sth->finish();
}else{
    # no such record yet - see if we can get rss
    $rss=fetchRSS(qq{http://}.$pinfo,$cgi->user_agent() || USERAGENT);
    if(defined($rss)) {
        # yes - we were successful in downloading the rss feed
        my $now=time();
        # create a new database record for this feed
        my $sth=$dbh->prepare_cached(qq{insert into rssproxy (url,lastcheck,lastrequest,added,rss,hits)
                                        values (?,?,?,?,?,?)});
        $sth->execute(lc($pinfo),$now,$now,$now,compress($rss->as_string),1);
        $sth->finish();
    }
}

$dbh->disconnect();
# if we have $rss (HTTP::Response) send it back to the user
if(defined($rss)) {
    print $rss->as_string;
} else {
    # we could not get the feed from the upstream server
    print $cgi->header(-status => '502 Bad Gateway');
}

# fetchRSS
#
# GET a given URL and return HTTP::Response if successful, otherwise return undef

sub fetchRSS {
    my($url,$impersonate)=@_;
    # create a new user-agent
    my $ua=LWP::UserAgent->new();
    # we are going to wait for 2 mins max
    $ua->timeout(120);
    # pretend to be someone else
    $ua->agent($impersonate);
    # get the rss feed
    my $response=$ua->get($url);
    # were we successful?
    if($response->is_success) {
      return $response;
    }else{
      return undef;
    }
}
