vikasing: August 2009

Every crawler has some revisit policy for crawling a web document. If a web site changes very frequently, the crawler will try to visit the site more than a web site which does not change so frequently, for example a crawler will be more interested in a news site than a static About Us page. This saves time for the crawler and avoids unwanted traffic for the web servers.

A feed crawler is crawler which tries to get the data from web site feeds, which may include: rss, atom, blogml etc. A feed crawler should expect more rapid changes than a normal web crawler and also it should penalize the slow changes. I have used an exponential function to implement the revisit policy for my feed crawler. Following graphs show three different functions which can be used as a revisit policy:

Definitions:

nd : Number of days to wait till next visit
x : Number of visits without a feed update (Blank Visits)
lc : Last Change (or Update)
nv : Next Visit
lv : Last Visit
s: staleness

Function 1: nd = 2^(x+1)/x

Function 2: nd = x²-2*x+k

Function 3: nd = e^x/2*x

The revisit algorithm works in following way (for Function 3):

s=lv-lc;
IF (s==0)
THEN nv=TOMORROW;
ELSE nv=TODAY + e^x/2*x --(any function can be used!)

If a feed is not updated after a certain number of visits the crawler should assume the feed as a dead feed and stop visiting that page. Also if a feed gets updated after certain time after some blank visits (say y), next visit can be calculated in following way (for Function 3):

nv= TODAY + e^(y-1)/2*(y-1)

Example: Take Function 2: x²-2*x+k (say k =3)

If a feed is updated after 5 blank visits, then: y=5;

F(y-1)= (y-1)²-2*(y-1)+3
= (5-1)²-2*(5-1)+3
=16-8+3=11

so next visit will be after 11 days.

If a feed is not updated after 5 blank visits, then: k=3 & y=5;

F(y) = (y)²-2*(y)+3
= (5)²-2*(5)+3
=25-10+3=18

now next visit will be after 18 days.

vikasing

29 August 2009

Feed Crawler Revisit Policy