29 August 2009

Feed Crawler Revisit Policy

Every crawler has a revisit policy that decides when to crawl a web document again. If a web site changes frequently, the crawler will visit it more often than a site that rarely changes; for example, a crawler is more interested in a news site than in a static About Us page. This saves time for the crawler and avoids unwanted traffic on the web servers.

A feed crawler is a crawler that gets its data from web site feeds, such as RSS, Atom, and BlogML. A feed crawler should expect more rapid changes than a normal web crawler, and it should also penalize feeds that change slowly. I have used an exponential function to implement the revisit policy for my feed crawler. Here are three different functions that can be used as a revisit policy:

nd : Number of days to wait until the next visit
x : Number of visits without a feed update (blank visits)
lc : Last Change (or Update)
nv : Next Visit
lv : Last Visit
s : Staleness

Function 1: nd = 2^((x+1)/x)

Function 2: nd = x^2 - 2*x + k

Function 3: nd = e^(x/2) * x
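As a quick sketch, the three candidates can be tabulated in Python (function names are my own, and k = 3 is an arbitrary choice for Function 2):

```python
import math

def f1(x):
    # Function 1: nd = 2^((x+1)/x)
    return 2 ** ((x + 1) / x)

def f2(x, k=3):
    # Function 2: nd = x^2 - 2*x + k
    return x ** 2 - 2 * x + k

def f3(x):
    # Function 3: nd = e^(x/2) * x
    return math.exp(x / 2) * x

# Days to wait after x blank visits, for each function
for x in range(1, 6):
    print(x, round(f1(x), 1), f2(x), round(f3(x), 1))
```

Functions 2 and 3 grow quickly with the number of blank visits, which is what penalizes slowly changing feeds.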

The revisit algorithm works in the following way (shown for Function 3):

IF (s == 0)                          -- blank visit: the feed has not changed
    x = x + 1
    nv = TODAY + e^(x/2) * x         --(any function can be used!)
ELSE                                 -- the feed was updated
    nv = TODAY + e^((x-1)/2) * (x-1)
If a feed is not updated after a certain number of visits, the crawler should treat it as a dead feed and stop visiting that page. Also, if a feed does get updated after some blank visits (say y), the next visit can be calculated in the following way (for Function 3):
nv = TODAY + e^((y-1)/2) * (y-1)
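A minimal sketch of the scheduling step in Python (the function name is my own, and the step-back-by-one rule for updated feeds is my reading of the formulas above):

```python
import math
from datetime import date, timedelta

def schedule_next_visit(x, updated, today=None):
    """Return (new_blank_visit_count, next_visit_date).

    x       : blank visits so far (visits without a feed update)
    updated : True if this visit found new content (s != 0)
    """
    today = today or date.today()
    if updated:
        x = max(x - 1, 1)   # feed updated: step the counter back, don't reset
    else:
        x = x + 1           # blank visit: back off further
    nd = math.exp(x / 2) * x   # Function 3: nd = e^(x/2) * x
    return x, today + timedelta(days=round(nd))
```

A real crawler would also track the dead-feed cutoff here, dropping the feed once x passes some threshold.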
Example: take Function 2: nd = x^2 - 2*x + k (say k = 3).

If a feed is updated after 5 blank visits, then y = 5:
F(y-1) = (y-1)^2 - 2*(y-1) + 3
       = (5-1)^2 - 2*(5-1) + 3 = 16 - 8 + 3 = 11
so the next visit will be after 11 days.

If the feed is still not updated after 5 blank visits, then k = 3 and y = 5:
F(y) = y^2 - 2*y + 3
     = 5^2 - 2*5 + 3 = 25 - 10 + 3 = 18
so the next visit will be after 18 days.
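The arithmetic above is easy to check in a couple of lines of Python:

```python
def f2(x, k=3):
    # Function 2: nd = x^2 - 2*x + k
    return x ** 2 - 2 * x + k

y = 5
print(f2(y - 1))  # feed updated after 5 blank visits -> 11
print(f2(y))      # feed still not updated after 5    -> 18
```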