RSS, feed readers, and news aggregators are a hot topic right now. Given the massive amount of news available on the net, it makes sense to be able to read all of the news you're interested in, without having to check a dozen or more websites. With a good feed reader, you can subscribe to your favourite news sites - or even to sites you wouldn't read as often as you would like - and have the software follow the news for you.
RSS is one component of the Semantic Web - and the only element currently enjoying widespread use. RSS is not the only format available for providing news feeds, and the name RSS even describes two different formats. I say name rather than acronym because "RSS" has been variously used to mean "RDF Site Summary", "Rich Site Summary", and "Really Simple Syndication", and because of this, it is best regarded as a name, to avoid confusion.
For those who are interested: it was originally RDF Site Summary, and bore a close resemblance to RSS 1.0. RDF was removed for the first public version, which was renamed Rich Site Summary; Dave Winer (author of RSS 0.92 and RSS 2.0) subsequently dubbed it Really Simple Syndication.
Before RSS was created, several sites, such as Slashdot, used custom text files to describe their headlines so other news sites could make use of them.
    1 line for info, other line for date.
    %%
    <A HREF="http://linux.ie/newsitem.php?n=540">Article on PAM</A>
    23-July-2004 9:19
    %%
    <A HREF="http://linux.ie/newsitem.php?n=483">Articles wanted for Beginners Linux Guide</A>
    4-July-2004 2:55

Sample of Linux.ie's text backend.
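To show how little work a consumer of such a backend has to do, here is a rough Ruby sketch that reads the Linux.ie sample. It assumes each record is a `%%` line followed by one link line and one date line; `parse_backend` is a name made up for this illustration, not part of any site's tooling.

```ruby
# Sample input, assuming one "%%" line, one link line, and one date line
# per record (two records here).
BACKEND = <<~TEXT
  1 line for info, other line for date.
  %%
  <A HREF="http://linux.ie/newsitem.php?n=540">Article on PAM</A>
  23-July-2004 9:19
  %%
  <A HREF="http://linux.ie/newsitem.php?n=483">Articles wanted for Beginners Linux Guide</A>
  4-July-2004 2:55
TEXT

# Hypothetical helper: split on the "%%" separator lines, skip the leading
# comment, and pull the link, title, and date out of each record.
def parse_backend(text)
  records = text.split(/^%%[ \t]*$/).drop(1)
  records.map do |record|
    info, date = record.strip.lines.map(&:strip)
    match = info.match(/<A HREF="([^"]*)">(.*)<\/A>/i)
    { link: match[1], title: match[2], date: date }
  end
end

headlines = parse_backend(BACKEND)
headlines.each { |h| puts "#{h[:date]}  #{h[:title]}  <#{h[:link]}>" }
```

Every site had its own layout, so every consumer needed a parser like this per site - which is exactly the problem a shared format solves.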
When the "version 4" browsers were released, both Netscape and Internet Explorer included their own mechanisms for supplying "push content" - sections of websites which were designed to be subscribed to. If the user subscribed to the channel, their browser read the channel definition, and periodically checked that site for updates, downloading the relevant portions.
Netscape used an HTML-based "sidebar" (which is still present in Mozilla etc.), while Microsoft created the "Channel Definition Format" (CDF), which it submitted to the W3C. Push content didn't catch on at the time, but the later work on RSS has, though not as "push content". (Dave Winer did, however, add the <enclosure> element to RSS 0.92 to allow push in RSS.)
The original version of RSS was 0.90, designed by Netscape for use on their "My Netscape" portal. It was originally based on RDF, but according to the RSS timeline, Netscape's marketing people caused it to be stripped down. Netscape later expanded the format to be closer to Dave Winer's Scripting News XML.
Winer maintained the RSS specification, but meanwhile the RSS-DEV Working Group released its own RSS specification, based on the original idea of an RDF-based format for syndication. This removed most of the RSS-specific tags, and used RDF schema to provide analogous tags. Since that group called its version RSS 1.0, Winer named his updated version RSS 2.0.
Because of this confusion, an IETF working group was started to produce an alternate format, called Atom (formerly Echo), which, unfortunately, only served to add to the confusion.
RSS 2.0 is backward-compatible with RSS 0.92, which is backward-compatible with RSS 0.91. The main difference between RSS 2.0 and RSS 1.0 is that RSS 1.0 removes all terms from the RSS schema which are available in other schema - in place of the <title> element, for example, you could include a reference to the Dublin Core schema, and use <dc:title> instead. RSS 2.0 can be extended with RDF schema - as (to the best of my knowledge) can Atom.
There are also websites available which will convert feeds to your
preferred format, for example 2rss.com and
RSS2Atom.
Linux Gazette can then be accessed as an Atom feed from http://www.tnl.net/channels/rss2atom/http://linuxgazette.net/lg.rss.
Liferea is my current favourite feed reader. It features a UI that is similar to most of the popular Windows readers, supports several formats (including CDF!), and integrates nicely with the system tray. When minimised to the tray, it is very unassuming. When new news arrives, the icon is coloured; otherwise, it's greyed. (Though the version I'm using has a problem with my Blogspot Atom feed - try it if you're curious).
Newer versions of both Liferea and Snownews allow you to use the output of another program as a feed source. RSSscraper, which I'll look at in a little while, is a great tool to use with either of these programs.
URSS is
another Mozilla-based news ticker, originally based on the Devedge ticker.
It, however, allows you to add your own feeds, and adds a sidebar that
allows headlines and descriptions to be read. I don't use it much though,
because it's unable to read LG's feed, though this may not be an issue by
the time this article hits the newsstands, as I found invalid code in the
RSS which I've fixed.
LG uses the RSS 0.91 format, the original format designed by Netscape. This is an excerpt from last month's issue:
    <rss version="0.91">
      <channel>
        <title>Linux Gazette</title>
        <link>http://linuxgazette.net</link>
        <description>An e-zine dedicated to making Linux just a little bit more fun. Published the first day of every month.</description>
        <language>en</language>
        <webMaster>(email omitted)</webMaster>
        <image>
          <title>Linux Gazette RSS</title>
          <url>http://linuxgazette.net/gx/2004/newlogo-blank-100-gold2.jpg</url>
          <link>http://www.linuxgazette.net/</link>
        </image>
        <issue>104</issue>
        <month>July</month>
        <year>2004</year>
        <item>
          <title>The Mailbag</title>
          <link>http://linuxgazette.net/104/lg_mail.html</link>
          <description></description>
        </item>
      </channel>
    </rss>
Pretty straightforward, isn't it? From issue 105 on, there are a few extra bits. The <webMaster> tag has been added to complement the <managingEditor> tag, and the <image> tag has the optional <height> and <width> tags:
    <image>
      <title>Linux Gazette RSS</title>
      <url>http://linuxgazette.net/gx/2004/newlogo-blank-100-gold2.jpg</url>
      <link>http://www.linuxgazette.net/</link>
      <height>42</height>
      <width>99</width>
    </image>
One feature which would have been nifty is <textinput>, which allows you to search a site from your feed reader. Unfortunately, since LG uses Google's site search, we can't use it as that requires an extra parameter which RSS doesn't support.
We'll take this heavily pruned Google search URL: http://www.google.com/search?q=test. Here's how a textinput for that URL would look:
    <textinput>
      <title>Search</title>
      <description>Search Google.</description>
      <name>q</name>
      <link>http://www.google.com/search</link>
    </textinput>
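To make the limitation concrete, here is a small Ruby sketch of what a feed reader presumably does with a textinput: it treats <link> as the address of a GET form and <name> as the one field in it. `textinput_url` is a hypothetical helper written for this article, not taken from any particular reader.

```ruby
require 'uri'

# Hypothetical helper: combine a <textinput>'s <link> and <name> with the
# text the user typed, the way an HTML GET form would.
def textinput_url(link, name, query)
  "#{link}?#{URI.encode_www_form(name => query)}"
end

puts textinput_url('http://www.google.com/search', 'q', 'test')
```

There is room for exactly one name/value pair, so a site search that needs a second parameter has nowhere to put it.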
One issue with LG's RSS feed is that <issue>, <month> and <year> are not valid tags. From next issue on, this information is in the channel description - it was unused before anyway.
Our <description> now looks like this:
    <description>An e-zine dedicated to making Linux just a little bit more fun. Published the first day of every month. &lt;br&gt; Issue 105: August, 2004 </description>
Note the use of escaped HTML - this can be used in any <description>.
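If you are building a description by hand, the escaping is a simple substitution. The Ruby below is only a sketch: `escape_for_description` and `XML_ESCAPES` are names invented for the example, and a real script would more likely lean on a library routine.

```ruby
# Characters that would otherwise be read as XML markup, and their entities.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Hypothetical helper: escape HTML so it can sit inside a <description>.
def escape_for_description(html)
  html.gsub(/[&<>"]/, XML_ESCAPES)
end

text = 'Published the first day of every month. <br> Issue 105: August, 2004'
puts "<description>#{escape_for_description(text)}</description>"
```

The feed reader reverses the substitution when it parses the XML, and is then free to render the result as HTML.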
RSS 0.92 is Dave Winer's expanded version of RSS 0.91. The main difference between RSS 0.91 and 0.92 is that several tags were made optional, and the limit of 15 items per feed was removed. Some experimental tags were added, but they're not really useful for the majority of people - aside from the <enclosure> element I already mentioned, there is also a <cloud> element, which is used to provide a link to an XML-RPC or SOAP service, which is used to tell aggregators that the feed has been updated.
RSS 2.0 builds on RSS 0.92. I couldn't make out what the differences between RSS 2.0 and 0.92 were, though.
RSS 1.0 is also backward-compatible with RSS 0.91, though instead of adding new tags to support new concepts, a different RDF schema may be referenced. Slashdot, for example, has its own schema which represents the section, department ("from the do-the-shake-and-vac dept" etc.), the number of comments, and the "hit parade".
The first difference you'll notice is that the root element is <rdf:RDF> instead of <rss>. Another difference is that the <channel>, <image> and <item> tags now have rdf:about attributes, and that the <image> and <item> tags must now appear outside of the <channel> tag, and must have matching tags with rdf:resource attributes within the channel element (this applies to the <textinput> element too).
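That layout is easier to see than to describe. The Ruby sketch below holds a hand-written RSS 1.0 skeleton (my own minimal example, reusing URLs from this article, not LG's actual feed) and checks that every <item> outside the channel has a matching reference inside it. The regexes are a shortcut for illustration; a real tool would use an XML parser.

```ruby
# A hand-written RSS 1.0 skeleton; the URLs reuse ones from this article.
RSS10 = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://linuxgazette.net/lg.rss">
      <title>Linux Gazette</title>
      <link>http://linuxgazette.net</link>
      <description>An e-zine dedicated to making Linux just a little bit more fun.</description>
      <items>
        <rdf:Seq>
          <rdf:li rdf:resource="http://linuxgazette.net/105/odonovan.html"/>
        </rdf:Seq>
      </items>
    </channel>
    <item rdf:about="http://linuxgazette.net/105/odonovan.html">
      <title>Securing a New Linux Installation</title>
      <link>http://linuxgazette.net/105/odonovan.html</link>
    </item>
  </rdf:RDF>
XML

# Items declared outside <channel> (rdf:about) ...
declared = RSS10.scan(/<item rdf:about="([^"]*)"/).flatten
# ... and the references to them inside <channel> (rdf:resource).
referenced = RSS10.scan(/<rdf:li rdf:resource="([^"]*)"/).flatten

puts "items match channel references: #{declared.sort == referenced.sort}"
```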
Atom was designed out of frustration with the competing versions of RSS. It also shows that it was designed after blogs became popular, whereas RSS predates this. Atom feeds, as is to be expected, are similar to RSS feeds.
Here's a sample Atom feed:
    <?xml version='1.0' encoding='utf-8' ?>
    <!-- If you are running a bot please visit this policy page outlining rules you must respect. http://www.livejournal.com/bots/ -->
    <feed version='0.3' xmlns='http://purl.org/atom/ns#'>
      <title mode='escaped'>Dung by any other name...</title>
      <tagline mode='escaped'>jimregan</tagline>
      <link rel='alternate' type='text/html' href='http://www.livejournal.com/users/jimregan/' />
      <modified>2004-07-26T01:12:32Z</modified>
      <link rel='service.feed' type='application/x.atom+xml' title='AtomAPI-enabled feed' href='http://www.livejournal.com/interface/atomapi/jimregan/feed'/>
      <entry>
        <title mode='escaped'>Sample title</title>
        <id>urn:lj:livejournal.com:atom1:jimregan:4251</id>
        <link rel='alternate' type='text/html' href='http://www.livejournal.com/users/jimregan/4251.html' />
        <issued>2004-07-26T02:12:00</issued>
        <modified>2004-07-26T01:12:32Z</modified>
        <author>
          <name>jimregan</name>
        </author>
        <content type='text/html' mode='escaped'>A simple example.</content>
      </entry>
    </feed>
Atom is not as simple as RSS - even RSS 1.0, with RDF, is easier. It's also much newer than RSS, and not as well documented, or as well supported. It is, however, the only feed type supported by Blogger, Google's Blog site, so we can expect to see greater support for it.
I tried to set up an Atom creator for LG, so we can support the three feed types, but none of the feed readers I have installed were able to display anything useful from my feed. If you're skeptical, I have included it - you can check it using FeedValidator.org.
A screen scraper is a program which checks a website for certain information. In this context, we want to take a site which lacks a feed, and generate one - because we're all sold on the idea of feed readers now, being the ultra-hip people that we are, right? - but they are also used to check stock prices, flight details, auctions, etc. - there are several companies who base their entire business around scraping other sites.
So, basically, a screen scraper is a script which grabs a webpage, runs it through a regex or two, and spits out results in the desired format.
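Stripped of any framework, those three steps fit in a few lines. The Ruby sketch below is only an illustration: the page markup, the example.org URLs, and the channel details are all invented, and the "download" is a string constant so the example runs without a network connection.

```ruby
# Stand-in for a downloaded page; a real scraper would fetch this over HTTP.
# The markup and URLs are invented for this example.
PAGE = <<~HTML
  <h2><a href="http://example.org/posts/2">Second post</a></h2>
  <p>Some text.</p>
  <h2><a href="http://example.org/posts/1">First post</a></h2>
  <p>More text.</p>
HTML

# Step 1: one regex, with a capture group for each field we want to keep.
HEADLINE_RE = %r{<h2><a href="([^"]*)">([^<]*)</a></h2>}

# Step 2: turn every match into an RSS <item>. The sample titles contain
# no markup characters, so no escaping is done here.
items = PAGE.scan(HEADLINE_RE).map do |link, title|
  "  <item>\n    <title>#{title}</title>\n    <link>#{link}</link>\n  </item>"
end

# Step 3: wrap the items in a minimal RSS 0.91 channel and print the result.
feed = <<~RSS
  <rss version="0.91">
  <channel>
  <title>Example scrape</title>
  <link>http://example.org/</link>
  <description>Headlines scraped from example.org</description>
  <language>en</language>
  #{items.join("\n")}
  </channel>
  </rss>
RSS

puts feed
```

The weak point is step 1: the regex is tied to the page's exact markup, so a site redesign usually means rewriting it.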
I'm still stuck in Perl-novice land, and since I don't want to place too much extra pressure on Ben, our resident Perl guru, I'll use RSSscraper, a project which provides the framework for scrapers, including the RSS generation code, so you have little to do other than provide the URL and regex. (Plus, it's written in Ruby, so I get to horrify Thomas with the poor quality of my code instead (\o/) - check this to see what I'm talking about!)
So, here's my example, formatted slightly to fit on the screen. (Text version).
    class BenScanner < RSSscraper::AbstractScanner
      def initialize
        @url_string = 'http://okopnik.freeshell.org/blog/blog.cgi'
        @url_proper = 'http://okopnik.freeshell.org/blog/'
        @postsRE = /div class="HeadText"> \n\n([^<]*)\n\n <\/div>\n<\/td>\n<\/tr>\n\n<tr>\n <td bgcolor="#fdfded" class="UsualText" valign="middle"> \n\n<br><b>([^<]*)<\/b><p>\n\n ([^\]]*)\n\n<p>\[ <a href="([^\s\t\r\n\f]*)"> ([^<]*)<\/a>/m
      end

      def find_items
        require 'cgi'
        items = Array.new
        request_feed.scan(@postsRE).each{ |date, title, content, comments_link, comments|
          items << {
            :title => title,
            :description => "#{CGI::escapeHTML(content)}",
            :comments_link => @url_proper+comments_link,
            :comments => @url_proper+comments_link
          }
        }
        items
      end
    end

    class Ben < RSSscraper::AbstractScraper
      def scanner
        BenScanner.new
      end

      def description
        {
          :link => 'http://okopnik.freeshell.org/blog/blog.cgi',
          :title => 'The Bay of Tranquility',
          :description => 'Ben Okopnik\'s blog.',
          :language => 'en-us',
          :generator => generator_string
        }
      end
    end
So, looking at the parts in bold, you can probably guess that you need to customise the main class name, and the scanner class. The file must be named [Foo].scraper.rb, and these classes are then named [Foo] and [Foo]Scanner.
After that, the items in description are pretty obvious. @url_string is the URL of the site to be parsed; @url_proper is only required in this case because the comments link is not a string appended to the blog link, as it is in most cases. @postsRE is the regex, and date, etc. are the elements we wish to extract.
The regex is the hardest part, and even at that, it's not too hard - just remember that the parts you wish to extract go in parentheses.
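If the parentheses are new to you, this tiny Ruby sketch (with a made-up line of HTML) shows how they map onto block variables: `scan` hands back one value per capturing group, in order. A group written as `(?:...)` groups without capturing, which is handy when you need alternation but don't want the result.

```ruby
line = '<b>Perl tips</b> posted 26-July-2004'

# Two capturing groups (title, date); the (?:...) group only provides
# alternation and contributes no value to the result.
pattern = /<b>([^<]*)<\/b> (?:posted|updated) (\S+)/

line.scan(pattern).each do |title, date|
  puts "#{title} (#{date})"
end
```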
When you've filled in the blanks, you place it in your scrapers directory, and run it like so: scrape.rb [Foo].
Well... I lied about wanting to spare Ben - I'm going to show a simple example of an RSS generator in Perl.
CPAN has a great module for generating RSS (both 1.0 and 0.9x), called XML::RSS.
Let's look at a sample for generating RSS 0.91 (Text, sample output).
    #!/usr/bin/perl -w

    use strict;
    use XML::RSS;

    my $rss = new XML::RSS (version => '0.91');

    $rss->channel(title          => 'Linux Gazette',
                  link           => 'http://linuxgazette.net',
                  language       => 'en',
                  description    => 'An e-zine dedicated to making Linux just a little bit more fun. Published the first day of every month. <br>Issue 105: August, 2004',
                  copyright      => 'Copyright (c) 1996-2004 the Editors of Linux Gazette',
                  managingEditor => 'email@here.com',
                  webMaster      => 'email@here.com');

    $rss->image(title  => 'Linux Gazette',
                url    => 'http://linuxgazette.net/gx/2004/newlogo-blank-100-gold2.jpg',
                link   => 'http://www.linuxgazette.net/',
                width  => '99',
                height => '42');

    $rss->add_item(title       => 'Securing a New Linux Installation',
                   link        => 'http://linuxgazette.net/105/odonovan.html',
                   description => 'By Barry O\'Donovan');

    $rss->save("perl-test.rss");
To create an RSS 1.0 feed, we could simply take that example, and change the version to 1.0 (which would look like this), but with RSS 1.0 we can add different schema. XML::RSS uses Dublin Core and dmoz taxonomies by default (you can, of course, easily add others, but I'm not going to cover that).
(Text and sample output)
    #!/usr/bin/perl -w

    use strict;
    use XML::RSS;

    my $rss = new XML::RSS (version => '1.0');

    $rss->channel(title       => 'Linux Gazette',
                  link        => 'http://linuxgazette.net',
                  description => 'An e-zine dedicated to making Linux just a little bit more fun. Published the first day of every month. <br>Issue 105: August, 2004',
                  #XML::RSS will do the dc stuff for us, but just to illustrate...
                  dc => {rights   => 'Copyright (c) 1996-2004 the Editors of Linux Gazette',
                         creator  => 'email@here.com',
                         language => 'en',},
                  taxo => ['http://dmoz.org/Computers/Software/Operating_Systems/Linux/']);

    $rss->image(title  => 'Linux Gazette',
                url    => 'http://linuxgazette.net/gx/2004/newlogo-blank-100-gold2.jpg',
                link   => 'http://www.linuxgazette.net/',
                width  => '99',
                height => '42');

    $rss->add_item(title       => 'Securing a New Linux Installation',
                   link        => 'http://linuxgazette.net/105/odonovan.html',
                   description => 'By Barry O\'Donovan');

    $rss->save("perl-test.rss");
And that's it. I hope some of you found this interesting, and I know from experience that if I've made any mistakes you won't hesitate to contact me. Take care.
Jimmy is a single father of one, who enjoys long walks... Oh, right.
Jimmy has been using computers from the tender age of seven, when his father
inherited an Amstrad PCW8256. After a few brief flirtations with an Atari ST
and numerous versions of DOS and Windows, Jimmy was introduced to Linux in 1998
and hasn't looked back.
In his spare time, Jimmy likes to play guitar and read: not at the same time,
but the picks make handy bookmarks.