18 November 2009

General Problems encountered when parsing RSS or ATOM feeds

RSS and ATOM are XML formats for web syndication. 9am depends entirely on RSS/ATOM feeds discovered on the web and uses the Argotic Syndication Framework, a .NET library, to parse them. It does the job most of the time, but whenever it encounters an unexpected element in a feed it throws an exception, which is the expected behaviour. My question is: why are these unexpected elements there in the first place? Have a look at these two RSS 2.0 feeds:

<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Liftoff News</title>
<link>http://liftoff.msfc.nasa.gov/ </link>
<description>Liftoff to Space Exploration.</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>
<docs>http://blogs.law.harvard.edu/tech/rss </docs>
<generator>Weblog Editor 2.0</generator>
<managingEditor>editor@example.com</managingEditor>
<webMaster>webmaster@example.com</webMaster>
<item>
<title>Star City</title>
<link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp </link>
<description>How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's Star City </description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573 </guid>
</item>
</channel>
</rss>
Listing 1

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<item>DDI NEWS</item>
<link>http://www.ddinews.com </link>
<description>The latest news from Doordashan news - India's largest broadcaster</description>
<copyright>Copyright: (C) Doordarshan News</copyright>
<item>
<title>Climate talks make progress, pressure on US </title>
<description>Environment ministers made progress on Tuesday towards a scaled-down climate deal in Copenhagen next month, with Washington facing pressure to promise deep cuts by 2020 in greenhouse gas emissions. </description>
<link>http://www.ddinews.gov.in/Homepage/Homepage+-+Headlines/Climate+talks+make+progress.htm</link>
<pubDate>11/18/2009 1:53:18 PM</pubDate>
</item>
</channel>
</rss>
Listing 2

Listing 1 shows the correct format of an RSS 2.0 feed. Listing 2 is incorrect: immediately after the opening <channel> tag it has an <item> tag containing plain text ("DDI NEWS"), where the channel's <title> belongs. The Argotic Syndication Framework throws an exception on Listing 2 and does not parse it any further.
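One way to catch this kind of problem before handing the feed to a strict parser is a small structural pre-check. The sketch below is only an illustration in Python using the standard library, not what 9am or Argotic actually does; the function name channel_problems and the wording of the messages are my own assumptions.

```python
import xml.etree.ElementTree as ET

def channel_problems(feed_xml):
    """Return a list of structural problems found in an RSS 2.0 channel."""
    problems = []
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return ["missing <channel>"]
    # RSS 2.0 requires these three elements as direct children of <channel>.
    for required in ("title", "link", "description"):
        if channel.find(required) is None:
            problems.append(f"channel is missing <{required}>")
    # A real <item> wraps child elements; a text-only <item> is a misplaced tag.
    for item in channel.findall("item"):
        if len(item) == 0:
            text = (item.text or "").strip()
            problems.append(f"<item> has no child elements (text: {text!r})")
    return problems
```

Run against a feed shaped like Listing 2, this reports both the missing channel <title> and the text-only <item>, so the feed can be skipped or repaired instead of failing with an exception halfway through.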

This is just one of the problems I have observed. Others include:
1. The <language> tag is missing, or (most of the time) it has the value "en-us" no matter what language the feed is actually written in.
2. The <pubDate> tag is missing or has an invalid value. In Listing 2, for example, the date is not in the RFC 822 format that RSS 2.0 requires.
3. The actual format of the feed does not match the version declared in the <rss version="??"> tag.
4. The description is missing.
5. The feed title is missing.
6. The item title is missing, which is the most irritating one.
7. The <link> tag contains an incomplete or relative URL.
These are a few of the problems I have seen in web feeds. Although I am only talking about RSS 2.0 here, similar problems can be expected with ATOM, BlogML and other syndication formats.
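Several of the item-level problems in this list (missing title or description, bad <pubDate>, relative <link>) can be detected with a few lines of code. Again, this is only a rough sketch in Python with the standard library under my own assumptions: the helper names are hypothetical, and it flags a missing title and a missing description separately because that is how they are listed above, even though the RSS 2.0 specification only requires an item to have one of the two.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_tz
from urllib.parse import urlparse

def _text(parent, tag):
    """Stripped text of a direct child element, or '' if absent or empty."""
    node = parent.find(tag)
    return (node.text or "").strip() if node is not None else ""

def item_problems(item):
    """Check one <item> element for the problems listed in the post."""
    problems = []
    if not _text(item, "title"):
        problems.append("missing title")
    if not _text(item, "description"):
        problems.append("missing description")
    pub_date = _text(item, "pubDate")
    if not pub_date:
        problems.append("missing pubDate")
    elif parsedate_tz(pub_date) is None:
        # parsedate_tz returns None when the date is not RFC 822 style.
        problems.append(f"pubDate is not RFC 822: {pub_date!r}")
    link = urlparse(_text(item, "link"))
    if not (link.scheme and link.netloc):
        problems.append("link is missing or relative")
    return problems

def feed_item_problems(feed_xml):
    """Return one list of problems per <item> that has child elements."""
    root = ET.fromstring(feed_xml)
    return [item_problems(item) for item in root.iter("item") if len(item)]
```

A feed reader could log these results and decide per item whether to skip it, fall back to a default value, or resolve a relative link against the channel's own <link>.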

7 comments:

  1. This is the contact shown for http://www.natmac.org/9am/home - please remove my site (snipe.net) from your rss scraping sourcelist.

  2. Please do not republish RSS feeds from StorageMojo.com. You are welcome to link to StorageMojo, but please remove StorageMojo.com from your RSS feed list.

    Regards,

    Robin Harris, publisher
    StorageMojo.com

  3. Please do not publish the Rubis Business Solutions RSS feed (rubissolutions.com) on the site 9AM Tech.

    Thank you and best of luck in your development.

    Shawn Cheatham
    Site Owner
