Every crawler has a revisit policy for crawling web documents. If a web site changes very frequently, the crawler will try to visit it more often than a site that does not change as frequently; for example, a crawler will be more interested in a news site than in a static About Us page. This saves time for the crawler and avoids unwanted traffic for web servers.
A feed crawler is a crawler that tries to get its data from web site feeds, such as RSS, Atom, BlogML, etc. A feed crawler should expect more rapid changes than a normal web crawler, and it should also penalize feeds that change slowly. I have used an exponential function to implement the revisit policy for my feed crawler. The following graphs show three different functions that can be used as a revisit policy:
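As a quick numeric sketch of how such curves grow, here is a small Python snippet tabulating two of them (Function 2 and Function 3, whose formulas appear later in this post; the first function's formula is not given in the text, so it is omitted, and these function names are mine):

import math

# x = number of blank visits, F(x) = days to wait before the next visit
def function_2(x, k=3):
    return x**2 - 2*x + k          # Function 2: x^2 - 2x + k

def function_3(x):
    return math.exp(x / 2) * x     # Function 3: e^(x/2) * x

for x in range(1, 6):
    print(x, function_2(x), round(function_3(x), 1))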
Definitions:
nd : Number of days to wait till next visit
x : Number of visits without a feed update (Blank Visits)
lc : Last Change (or Update)
nv : Next Visit
lv : Last Visit
s : Staleness
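As a sketch, these quantities could live in a small per-feed record; the following is a hypothetical Python structure whose field names mirror the symbols above:

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FeedState:
    lv: date                   # last visit
    lc: date                   # last change (or update)
    x: int = 0                 # blank visits since the last update
    nv: Optional[date] = None  # next visit, set by the revisit algorithm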
The revisit algorithm works in the following way (for Function 3):

s = lv - lc;
IF (s == 0)
THEN nv = TOMORROW;
ELSE nv = TODAY + e^(x/2)*x --(any function can be used!)

If a feed is not updated after a certain number of visits, the crawler should assume the feed is dead and stop visiting that page. Also, if a feed gets updated after some blank visits (say y), the next visit can be calculated in the following way (for Function 3):

nv = TODAY + e^((y-1)/2)*(y-1)
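Here is a minimal Python sketch of this algorithm. The cutoff of 10 blank visits for declaring a feed dead is my assumption (the post does not fix a number), and names like schedule_next_visit are mine:

import math
from datetime import date, timedelta
from typing import Optional

DEAD_AFTER = 10  # assumption: blank visits after which a feed is treated as dead

def f3(x):
    # Function 3: days to wait after x blank visits
    return math.exp(x / 2) * x

def schedule_next_visit(lv: date, lc: date, x: int) -> Optional[date]:
    s = (lv - lc).days                               # staleness
    if x >= DEAD_AFTER:
        return None                                  # dead feed: stop visiting
    if s == 0:
        return date.today() + timedelta(days=1)      # just updated: come back tomorrow
    return date.today() + timedelta(days=f3(x))      # back off exponentially

def reschedule_after_update(y: int) -> date:
    # Feed finally updated after y blank visits: penalize one step less, F(y-1)
    return date.today() + timedelta(days=f3(y - 1))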
Example: Take Function 2: F(x) = x^2 - 2*x + k (say k = 3).

If a feed is updated after 5 blank visits, then y = 5:

F(y-1) = (y-1)^2 - 2*(y-1) + 3
       = (5-1)^2 - 2*(5-1) + 3
       = 16 - 8 + 3
       = 11

so the next visit will be after 11 days.
If a feed is not updated after 5 blank visits, then k = 3 and y = 5:

F(y) = y^2 - 2*y + 3
     = 5^2 - 2*5 + 3
     = 25 - 10 + 3
     = 18

now the next visit will be after 18 days.
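The same arithmetic as a quick Python check (f2 is a hypothetical name for Function 2):

def f2(x, k=3):
    # Function 2: x^2 - 2x + k
    return x**2 - 2*x + k

y = 5
print(f2(y - 1))  # updated after 5 blank visits      -> 11 days
print(f2(y))      # still not updated after 5 visits  -> 18 days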