25 June 2013

Extracting (meaningful) text from webpages - II

I was looking at snacktory, a Java clone of Readability, to replace the very slow boilerpipe library in my project. Snacktory seemed faster, but it did not produce better results than boilerpipe. For example, for the following URL: http://alumniconnect.wordpress.com/2013/06/04/a-monk-who-didnt-care-for-ferrari-teaching-to-serve-society/ snacktory extracted just one small paragraph, about 25% of the whole text.
I forked the project and tried to fix it, but it soon turned out to be a challenging job. Snacktory's approach is based entirely on the HTML markup, which makes it fail sometimes; for example, text inside <span></span> is ignored. That's exactly what is going on in the above case.

I thought of modifying snacktory so that it could ignore HTML tags and still give better results, but I soon realized that I'd have to change the whole logic behind the library. So I went ahead and created my own project, NiceText.

How does NiceText work?

Instead of looking for particular tags and block sizes, NiceText calculates the ratio of every text block's size to that of the largest text block. It then excludes the blocks whose ratio is below a given limit (say, 0.15). After that, it groups nearby blocks into clusters by checking the distance between two blocks, so each cluster contains several text blocks. The largest cluster is marked as the main text.
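To make the idea concrete, here is a minimal sketch of the filter-and-cluster steps. It is not the actual NiceText code: the class and method names are made up for illustration, block size is assumed to be the character count, the distance between two blocks is assumed to be the difference of their positions in document order, and the "largest" cluster is assumed to be the one with the most characters in total.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and the distance measure are assumptions.
class BlockClusterSketch {

    /** A text block plus its position in document order (hypothetical model). */
    static final class Block {
        final int index;
        final String text;
        Block(int index, String text) { this.index = index; this.text = text; }
    }

    /** Keep blocks whose length is at least minRatio times the longest block's length. */
    static List<Block> filterByRatio(List<Block> blocks, double minRatio) {
        int max = 0;
        for (Block b : blocks) max = Math.max(max, b.text.length());
        List<Block> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (max > 0 && (double) b.text.length() / max >= minRatio) kept.add(b);
        }
        return kept;
    }

    /** Walk the kept blocks in order; start a new cluster when the index gap exceeds maxGap. */
    static List<List<Block>> cluster(List<Block> kept, int maxGap) {
        List<List<Block>> clusters = new ArrayList<>();
        List<Block> current = new ArrayList<>();
        for (Block b : kept) {
            if (!current.isEmpty()
                    && b.index - current.get(current.size() - 1).index > maxGap) {
                clusters.add(current);
                current = new ArrayList<>();
            }
            current.add(b);
        }
        if (!current.isEmpty()) clusters.add(current);
        return clusters;
    }

    /** Return the text of the cluster with the most characters (assumed meaning of "largest"). */
    static String mainText(List<Block> blocks, double minRatio, int maxGap) {
        List<Block> best = new ArrayList<>();
        int bestLen = -1;
        for (List<Block> c : cluster(filterByRatio(blocks, minRatio), maxGap)) {
            int len = 0;
            for (Block b : c) len += b.text.length();
            if (len > bestLen) { bestLen = len; best = c; }
        }
        StringBuilder sb = new StringBuilder();
        for (Block b : best) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(b.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(0, "Home"));
        blocks.add(new Block(1, "First long paragraph of the article body, with plenty of text."));
        blocks.add(new Block(2, "Second long paragraph that continues the same article."));
        blocks.add(new Block(9, "A fairly long footer notice."));
        System.out.println(mainText(blocks, 0.15, 2));
    }
}
```

In this sketch the short "Home" block falls below the 0.15 ratio and is dropped, while the footer survives the ratio test but sits too far from the two body paragraphs to join their cluster. The values 0.15 and 2 are only example settings.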

Google’s Cloud Platform is slowly becoming ay fully featured environment for running complex web apps, but it’s not easy to just give it a quick try. To get started with Cloud Platform, after all, you have to first install the right and other tools on your local machine. Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.Cloud Playground, Google says, is meant to be a place “for developers to experiment and play with some of the services offered by the Google Cloud Platform, such as Google App Engine, Google Cloud Storage and Google Cloud SQL.”For now, Cloud Playground only supports Python 2.7 App Engine apps, and Google considers it to be an experimental service (so it could shut it down anytime).To get started, you simply head for the Cloud Playground or, if you just want to see it at work, head for Google’s getting started documentation, which now features green Run/Modify buttons that allow you to run any of the sample code on these sites. The Cloud Playground itself features numerous sample apps and also gives you the option to clone other open source App Engine template projects written in Python 2.7 from GitHub.The project itself is open source and consists of a basic browser-based code editor and , a Python App Engine app that serves as the development server.
Which is exactly what we need! I'm also working on a summarizer, which will be integrated into NiceText. Part of it is already there, but it needs a lot of improvement. Here is the summary of the above text, produced by SimilaritySummarizer.java (present in the repo):
Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.
I have yet to create a jar; however, you can download the files from the GitHub repository. NiceText depends on Jsoup.