18 November 2009

General Problems encountered when parsing RSS or ATOM feeds

RSS and ATOM are XML specifications for web syndication. 9am depends entirely on RSS/ATOM feeds discovered on the web and uses the Argotic Syndication Framework .NET library to parse them. The library does the job most of the time, but whenever it encounters an unexpected element in a feed it throws an exception, which is the expected behaviour. My question is: why are these unwanted elements there in the first place? Have a look at these two RSS 2.0 feeds:

<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Liftoff News</title>
<link>http://liftoff.msfc.nasa.gov/</link>
<description>Liftoff to Space Exploration.</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>
<docs>http://blogs.law.harvard.edu/tech/rss</docs>
<generator>Weblog Editor 2.0</generator>
<managingEditor>editor@example.com</managingEditor>
<webMaster>webmaster@example.com</webMaster>
<item>
<title>Star City</title>
<link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
<description>How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's Star City </description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid>
</item>
</channel>
</rss>
Listing 1

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<item>DDI NEWS</item>
<link>http://www.ddinews.com</link>
<description>The latest news from Doordashan news - India's largest broadcaster</description>
<copyright>Copyright: (C) Doordarshan News</copyright>
<item>
<title>Climate talks make progress, pressure on US </title>
<description>Environment ministers made progress on Tuesday towards a scaled-down climate deal in Copenhagen next month, with Washington facing pressure to promise deep cuts by 2020 in greenhouse gas emissions. </description>
<link>http://www.ddinews.gov.in/Homepage/Homepage+-+Headlines/Climate+talks+make+progress.htm</link>
<pubDate>11/18/2009 1:53:18 PM</pubDate>
</item>
</channel>
</rss>
Listing 2

Listing 1 shows a correctly formed RSS 2.0 feed. Listing 2 does not: immediately after the opening <channel> tag comes an <item> element containing only the text "DDI NEWS", where the channel's <title> evidently belongs. Argotic Syndication Framework throws an exception on Listing 2 and does not parse it any further.
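
A check like the following could be run on a feed before handing it to a strict parser. It is only a rough sketch in Python using the standard library, not something 9am or Argotic actually does; the function name channel_structure_problems is made up for this illustration. It reports a <channel> that lacks any of the three elements RSS 2.0 requires (title, link, description) and any <item> that contains text but no child elements, which is exactly the situation in Listing 2.

```python
import xml.etree.ElementTree as ET


def channel_structure_problems(feed_xml):
    """Return a list of structural problems found in an RSS 2.0 document.

    Hypothetical helper for illustration; it only looks at <channel> and
    its direct children.
    """
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        return ["not an RSS document with a <channel> element"]

    problems = []
    # RSS 2.0 requires these three elements directly under <channel>.
    for name in ("title", "link", "description"):
        if channel.find(name) is None:
            problems.append(f"channel is missing <{name}>")

    # An <item> is a container, so one without child elements is suspect.
    for index, item in enumerate(channel.findall("item")):
        if len(item) == 0:
            text = (item.text or "").strip()
            problems.append(f"item {index} has no child elements: {text!r}")
    return problems
```

Run against Listing 2, this would report the missing channel title and the stray first item instead of stopping at the first exception, so the feed could be logged or repaired rather than simply dropped.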

This is only one of the problems I have observed. Others include:
1. The <language> tag is missing, or it is set to "en-us" regardless of the language actually used in the feed.
2. The <pubDate> tag is missing or holds a wrong value (Listing 2, for example, uses "11/18/2009 1:53:18 PM" rather than the RFC 822 date format that RSS 2.0 requires).
3. The actual format of the feed does not match the version declared in the <rss version="??"> tag.
4. The description is missing.
5. The feed title is missing.
6. The item title is missing, which is the most irritating of all.
7. The <link> tag holds an incomplete or relative URL.
These are a few of the problems I have seen in web feeds. Although I am talking only about RSS 2.0 here, similar problems can be expected with ATOM, BlogML and other syndication formats.
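
Several of the per-item problems in the list can be detected mechanically. The following is a minimal sketch in Python (standard library only), under the assumption that each <item> has already been parsed into an element; the helper names text_of, is_rfc822_date and item_problems are invented for this example. Note that it flags what 9am cares about, which is stricter than the RSS 2.0 specification: the spec only requires an item to have a title or a description, not both.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse


def text_of(parent, name):
    """Return the stripped text of a direct child element, or '' if absent."""
    return (parent.findtext(name) or "").strip()


def is_rfc822_date(value):
    """True if the value parses as an RFC 822 style date."""
    try:
        parsedate_to_datetime(value)
        return True
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return False


def item_problems(item):
    """Return a list of problems found in one <item> element."""
    problems = []
    if not text_of(item, "title"):
        problems.append("item title is missing")
    if not text_of(item, "description"):
        problems.append("item description is missing")

    pub_date = text_of(item, "pubDate")
    if not pub_date:
        problems.append("pubDate is missing")
    elif not is_rfc822_date(pub_date):
        problems.append(f"pubDate is not an RFC 822 date: {pub_date!r}")

    # A usable link needs both a scheme (http) and a host name.
    link = urlparse(text_of(item, "link"))
    if not (link.scheme and link.netloc):
        problems.append("link is missing or relative")
    return problems
```

For example, item_problems(ET.fromstring(item_xml)) on the news item from Listing 2 would flag its pubDate, since "11/18/2009 1:53:18 PM" is not an RFC 822 date. A wrong <language> value or a mismatched version attribute cannot be caught this simply, because that requires looking at the content itself.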