25 June 2013

Extracting (meaningful) text from webpages - II

I was looking at snacktory, a Java clone of Readability, to replace the very slow boilerpipe library in my project. Snacktory seemed faster, but it did not produce better results than boilerpipe. For example, for the following URL: http://alumniconnect.wordpress.com/2013/06/04/a-monk-who-didnt-care-for-ferrari-teaching-to-serve-society/ snacktory extracted just one small paragraph, about 25% of the whole text.
I forked the project and tried to fix it, but it soon turned out to be a challenging job. Snacktory's approach is based entirely on the HTML markup, which makes it fail sometimes; for example, text inside <span></span> is ignored. That is exactly what happens in the case above.

I thought of modifying snacktory so that it could ignore HTML tags and still give better results, but I soon realized that I would have to change the whole logic behind the library. So I went ahead and created my own project, NiceText.

How does NiceText work?

Instead of looking for particular tags and block sizes, NiceText calculates the ratio of each text block to the largest text block. It then excludes the blocks whose ratio is smaller than a given limit (say, 0.15). After that, it groups nearby blocks into clusters by checking the distance between two blocks. Each cluster contains several text blocks, and the largest cluster is marked as the main text.
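The steps above can be sketched roughly as follows. This is only an illustration and not the actual NiceText code: the class and method names are made up, block size is assumed to be the character count, the distance between two blocks is assumed to be the difference of their positions in document order, and the "largest" cluster is assumed to be the one with the most text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and details are assumptions, not taken from NiceText.
class BlockClusterSketch {

    /** Indices of blocks whose length relative to the longest block is at least minRatio. */
    static List<Integer> filterByRatio(List<String> blocks, double minRatio) {
        int maxLen = 0;
        for (String b : blocks) {
            maxLen = Math.max(maxLen, b.length());
        }
        List<Integer> kept = new ArrayList<>();
        if (maxLen == 0) {
            return kept; // no text at all
        }
        for (int i = 0; i < blocks.size(); i++) {
            if ((double) blocks.get(i).length() / maxLen >= minRatio) {
                kept.add(i);
            }
        }
        return kept;
    }

    /** Groups sorted indices into clusters; a gap larger than maxGap starts a new cluster. */
    static List<List<Integer>> cluster(List<Integer> kept, int maxGap) {
        List<List<Integer>> clusters = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int idx : kept) {
            if (!current.isEmpty() && idx - current.get(current.size() - 1) > maxGap) {
                clusters.add(current);
                current = new ArrayList<>();
            }
            current.add(idx);
        }
        if (!current.isEmpty()) {
            clusters.add(current);
        }
        return clusters;
    }

    /** Joins the blocks of the cluster that holds the most characters. */
    static String mainText(List<String> blocks, double minRatio, int maxGap) {
        List<Integer> best = new ArrayList<>();
        int bestSize = -1;
        for (List<Integer> c : cluster(filterByRatio(blocks, minRatio), maxGap)) {
            int size = 0;
            for (int i : c) {
                size += blocks.get(i).length();
            }
            if (size > bestSize) {
                bestSize = size;
                best = c;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i : best) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(blocks.get(i));
        }
        return sb.toString();
    }
}
```

In a real setup the list of blocks would come from the parsed page (NiceText uses Jsoup for that); here it is simply a list of strings in document order, so a call such as BlockClusterSketch.mainText(blocks, 0.15, 2) returns the joined text of the dominant cluster.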

Google’s Cloud Platform is slowly becoming ay fully featured environment for running complex web apps, but it’s not easy to just give it a quick try. To get started with Cloud Platform, after all, you have to first install the right and other tools on your local machine. Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.Cloud Playground, Google says, is meant to be a place “for developers to experiment and play with some of the services offered by the Google Cloud Platform, such as Google App Engine, Google Cloud Storage and Google Cloud SQL.”For now, Cloud Playground only supports Python 2.7 App Engine apps, and Google considers it to be an experimental service (so it could shut it down anytime).To get started, you simply head for the Cloud Playground or, if you just want to see it at work, head for Google’s getting started documentation, which now features green Run/Modify buttons that allow you to run any of the sample code on these sites. The Cloud Playground itself features numerous sample apps and also gives you the option to clone other open source App Engine template projects written in Python 2.7 from GitHub.The project itself is open source and consists of a basic browser-based code editor and , a Python App Engine app that serves as the development server.
Which is what we need! I'm also working on a summarizer, which will be integrated into NiceText. Part of it is already there, but it needs a lot of improvement. Here is the summary of the above text, produced by SimilaritySummarizer.java (present in the repo):
Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.
I have yet to create a jar; however, you can download the files from the GitHub repository. NiceText depends on Jsoup.
