25 June 2013

Extracting (meaningful) text from webpages - II

I was looking at snacktory, a Java clone of Readability, to replace the very slow boilerpipe library in my project. Snacktory seemed faster, but it did not produce better results than boilerpipe. For example, for the following URL: http://alumniconnect.wordpress.com/2013/06/04/a-monk-who-didnt-care-for-ferrari-teaching-to-serve-society/ snacktory extracted just one small paragraph, about 25% of the whole text.
I forked the project and tried to fix it, but it soon turned out to be a challenging job. Snacktory's approach is based entirely on the HTML markup, which makes it fail sometimes; for example, text inside <span></span> is ignored. That is exactly what is going on in the above case.

I thought of modifying snacktory so that it could ignore HTML tags and still give better results, but I soon realized that I'd have to change the whole logic behind the library, so I went ahead and created my own project, NiceText.

How does NiceText work?

Instead of looking for particular tags and block sizes, NiceText calculates the ratio of each text block to the largest text block. It then excludes the blocks whose ratio is smaller than a given limit (say, a ratio of 0.15). After that, it groups nearby blocks into clusters by checking the distance between two blocks. Each cluster contains several text blocks, and the largest cluster is marked as the main text.
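The steps above can be sketched roughly as follows. This is not NiceText's actual code, only a minimal illustration under a few assumptions of mine: blocks are plain strings in document order, a block's size is its character count, the "distance" between two blocks is the difference of their positions in that list, and the "largest" cluster is the one with the most characters in total. The names BlockClusterSketch, mainText, minRatio and maxGap are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockClusterSketch {

    /** Returns the text of the largest cluster of long-enough, nearby blocks. */
    static String mainText(List<String> blocks, double minRatio, int maxGap) {
        // Size of the largest block; every other block is compared against it.
        int largest = 0;
        for (String block : blocks) {
            largest = Math.max(largest, block.length());
        }
        if (largest == 0) {
            return "";
        }

        // Step 1: keep the positions of blocks whose ratio reaches the limit.
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i++) {
            double ratio = (double) blocks.get(i).length() / largest;
            if (ratio >= minRatio) {
                kept.add(i);
            }
        }

        // Step 2: cluster kept blocks; a gap larger than maxGap (assumed
        // distance measure: difference in position) starts a new cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        int previous = -1;
        for (int index : kept) {
            if (clusters.isEmpty() || index - previous > maxGap) {
                clusters.add(new ArrayList<>());
            }
            clusters.get(clusters.size() - 1).add(index);
            previous = index;
        }

        // Step 3: pick the cluster with the most characters (an assumption
        // about what "largest" means) and join its blocks.
        List<Integer> best = new ArrayList<>();
        int bestSize = -1;
        for (List<Integer> cluster : clusters) {
            int size = 0;
            for (int index : cluster) {
                size += blocks.get(index).length();
            }
            if (size > bestSize) {
                bestSize = size;
                best = cluster;
            }
        }
        StringBuilder text = new StringBuilder();
        for (int index : best) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(blocks.get(index));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList(
                "Home",
                "First long paragraph of the article body.",
                "Second long paragraph of the article body.",
                "Share", "x", "y", "z",
                "A lone footer paragraph of similar size.");
        System.out.println(mainText(blocks, 0.15, 2));
    }
}
```

In this toy input the short navigation-like blocks fall below the 0.15 ratio, the two article paragraphs sit next to each other and form one cluster, and the footer paragraph is too far away to join them, so only the two article paragraphs are returned. In a real extractor the blocks would come from the parsed page (NiceText uses Jsoup for that) and the distance measure may well differ.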

Google’s Cloud Platform is slowly becoming ay fully featured environment for running complex web apps, but it’s not easy to just give it a quick try. To get started with Cloud Platform, after all, you have to first install the right and other tools on your local machine. Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.Cloud Playground, Google says, is meant to be a place “for developers to experiment and play with some of the services offered by the Google Cloud Platform, such as Google App Engine, Google Cloud Storage and Google Cloud SQL.”For now, Cloud Playground only supports Python 2.7 App Engine apps, and Google considers it to be an experimental service (so it could shut it down anytime).To get started, you simply head for the Cloud Playground or, if you just want to see it at work, head for Google’s getting started documentation, which now features green Run/Modify buttons that allow you to run any of the sample code on these sites. The Cloud Playground itself features numerous sample apps and also gives you the option to clone other open source App Engine template projects written in Python 2.7 from GitHub.The project itself is open source and consists of a basic browser-based code editor and , a Python App Engine app that serves as the development server.
Which is what we need! I'm also working on a summarizer that will be integrated into NiceText. Part of it is already there, but it needs a lot of improvement. Here is the summary of the above text, produced by SimilaritySummarizer.java (present in the repo):
Today, however, Google is launching its browser-based Cloud Playground, which is meant to give developers a chance to try some sample code and see how actual production APIs will behave, or to just share some code with colleagues without them having to install your whole development environment.
I have yet to create a jar; however, you can download the files from the GitHub repository. NiceText has a dependency on Jsoup.
