13 April 2012

Introduction to node.js

Recently I had to give a presentation on a technology. I shortlisted two: a graph database (neo4j) and node.js. I had heard a lot about node.js but was not sure what it really was, so I chose it for the presentation. Here it is:


27 March 2012

Extracting meaningful text from webpages

I was trying to extract the meaningful text from a webpage for a given URL, for Crowl. For example, if I visit a news site for a particular article, I find a lot of clutter around the news text: ads, related news stories, top news stories, comments on the article, links to other websites and much more.
Let's take this Times of India article as an example:

 http://timesofindia.indiatimes.com/tech/news/hardware/83-year-old-woman-sues-Apple-for-1m/articleshow/12415012.cms

The useful text in the Times of India article makes up around 30% of the total content; the remaining 70% is clutter. You may argue that you need the links to the most popular stories, related stories, etc., but there is still a lot of extra stuff that we really don't care about. Extracting the meaningful information from such a page is a nightmare. We can start by getting the HTML source and stripping the HTML tags from the text. Using regular expressions, let's remove all the links too. The resulting content looks like this:

83-year-old woman sues Apple for $1m - The Times of India | The Times of India | | More More ADVERTISEMENT Hardware The Times of India The Times of India Indiatimes Web (by Google) Video Photos You are here:  »   »   » Hardware Breaking News: 83-year-old woman sues Apple for $1m The writer has posted comments on this articleANI | Mar 26, 2012, 04.42PM IST My Saved articles Read more:||||||| SHARE AND DISCUSS NEW YORK: An 83-year-old American woman has sued for 1 million dollars after she failed to see the glass door at the tech giant's office and smashed her face. Evelyn Paswall, a former Manhattan fur-company vice president, went to to return an on December 13. While approaching the store, Paswall didn't realize she was heading straight for a wall of glass. She smashed her face against it, breaking her nose, Paswall claims in her suit filed in the US Eastern District federal court. Now the Forest Hills, Queens, resident, Paswall claimed in her lawsuit that the company was negligent not elderly-proofing the store's see-through fa ade, The New York Post reports. She argues that Apple should have put marks on the glass that older people could spot before they come face-to-face with disaster. "The defendant was negligent . . . in allowing a clear, see-through glass wall and/or door to exist without proper warning," Paswall suit said. Hi ! Do you like this story? My saved articles RELATED COVERAGE Articles Blogs LATEST NEWS » ......

As you can see, the above text has a lot of extra text which we don't want. Attempts have been made to extract the main content; here is one such article: How to Extract a Webpage’s Main Article Content
The Java program to get the above text (Jsoup can be downloaded from here):

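 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;

 public class LinkStripper {
     public static void main(String[] args) throws Exception {
         // Matches a whole anchor element, from <a ...> to </a>, across line breaks
         String href = "(?s)<a\\b.*?</a>";
         Document doc = Jsoup.connect("http://timesofindia.indiatimes.com/tech/news/hardware/83-year-old-woman-sues-Apple-for-1m/articleshow/12415012.cms").get();
         String source = doc.html();
         // Drop every link first, then let Jsoup strip the remaining tags
         source = source.replaceAll(href, "");
         System.out.println(Jsoup.parse(source).text());
     }
 }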
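
The regex step can also be tried offline on a small HTML string, without Jsoup. The sketch below is only an illustration: the sample markup, the class name and the method names are made up, and stripping tags with regular expressions is good enough for a rough demo but not for real-world HTML.

```java
import java.util.regex.Pattern;

class NaiveTextExtractor {
    // Whole anchor elements, from <a ...> to </a>, across line breaks
    private static final Pattern LINKS =
            Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Any remaining tag
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String stripLinks(String html) {
        return LINKS.matcher(html).replaceAll("");
    }

    static String toText(String html) {
        String noLinks = stripLinks(html);
        // Replace tags with a space so neighbouring words do not get glued together
        String noTags = TAGS.matcher(noLinks).replaceAll(" ");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/top\">Top stories</a>"
                + "<p>She smashed her face against it, <b>breaking</b> her nose.</p></div>";
        System.out.println(toText(html));
    }
}
```

As with the Jsoup version, this removes the link text together with the link, which is why words that were links in the article (such as the company name) go missing from the output.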

The best Java library I could find for getting the main text from a web page was boilerpipe, and it can be tested here. It does a pretty good job of removing the clutter around the meaningful text. Running the Times of India article link through boilerpipe gives the following text:

Tweet
NEW YORK: An 83-year-old American woman has sued Apple for 1 million dollars after she failed to see the glass door at the tech giant's office and smashed her face.
Evelyn Paswall, a former Manhattan fur-company vice president, went to Apple's Manhasset store to return an iPhone on December 13.
While approaching the store, Paswall didn't realize she was heading straight for a wall of glass.
She smashed her face against it, breaking her nose, Paswall claims in her suit filed in the US Eastern District federal court.
Now the Forest Hills, Queens, resident, Paswall claimed in her lawsuit that the company was negligent not elderly-proofing the store's see-through fa ade, The New York Post reports.
She argues that Apple should have put marks on the glass that older people could spot before they come face-to-face with disaster.
"The defendant was negligent . . . in allowing a clear, see-through glass wall and/or door to exist without proper warning," Paswall suit said.
Hi !

The above text is very close to what we want. The boilerpipe library is based on this paper. By combining Jsoup (to get the page title) with boilerpipe (to get the page content), we can get the meaningful content from a webpage.

10 July 2011

A Note on YCSB

Recently we had to benchmark a number of the available in-memory databases, mainly open source ones. I didn't know about YCSB until my architect told me about it.
YCSB = Yahoo! Cloud Serving Benchmark
It didn't impress me at first because it was from Yahoo!. No offense, but Yahoo! still expects us to pay for its email POP3 access (Yahoo! Plus); they haven't learned anything from Gmail, immaturity at its best. Nevertheless, we started our benchmarking with Oracle and MongoDB. I know neither of them is an in-memory database, but we liked MongoDB's concept of memory-mapped data.

I wrote the Oracle client for YCSB; the MongoDB client was included with the benchmark code (thanks to Yen Pai). Writing a client for YCSB is fairly simple, and that's what impressed me. But my impressions were washed away by the horrible glitches I found in the included drivers as well as in the YCSB code itself. There are a number of forks (including mine, which is dead, by the way) that provide a lot of patches to the original YCSB code and include many new clients as well, but the owner of the project, Brian Frank Cooper, has shown very little interest in reviewing them.

I ran the first benchmark on 100,000 data sets for all the workloads provided with YCSB. The default workloads are not sufficient to test all the operations properly, which forced me to create my own workload configuration. It turned out that MongoDB was just 2-4 times faster than Oracle, and that didn't impress us much. So we considered Gemfire and Hazelcast as well, both "real" in-memory databases, one open source and the other commercial (a 60-day trial in this case).

Again I had to write the clients for both the new DBs, and it turned out to be a piece of cake. I have to admit YCSB has great pluggability: plugging in a client for any database just requires the driver libs plus some 20 lines of code, and you are done. YCSB can also run on multiple machines. It offers a great platform for benchmarking any kind of database out there, and Yahoo! or Brian Cooper should realize this and put some more effort into its development.
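
To give a feel for what "plugging in a client" means, here is a rough, self-contained sketch of the idea. It does not use YCSB at all: the KeyValueClient interface, the InMemoryClient adapter and the writeThroughput method are hypothetical stand-ins invented for this illustration, and the real YCSB client class has its own methods and signatures.

```java
import java.util.HashMap;
import java.util.Map;

class MiniBenchmark {
    // Hypothetical stand-in for a benchmark client interface; NOT the real YCSB API
    interface KeyValueClient {
        void insert(String key, String value);
        String read(String key);
    }

    // The small adapter one writes per database; here it wraps a plain HashMap
    static class InMemoryClient implements KeyValueClient {
        private final Map<String, String> store = new HashMap<>();
        public void insert(String key, String value) { store.put(key, value); }
        public String read(String key) { return store.get(key); }
    }

    // Runs the given number of inserts and returns the throughput in operations/sec
    static double writeThroughput(KeyValueClient client, int operations) {
        long start = System.nanoTime();
        for (int i = 0; i < operations; i++) {
            client.insert("user" + i, "value" + i);
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return operations / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        KeyValueClient client = new InMemoryClient();
        System.out.printf("Write: %.3f ops/sec%n", writeThroughput(client, 100_000));
        System.out.println(client.read("user42"));
    }
}
```

The benchmark loop only talks to the interface, so swapping the database under test means swapping the adapter and nothing else.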

Here are the results of the MongoDB, Gemfire and Hazelcast benchmarks on 100,000 data sets:

Throughput (operations/sec) for 100,000 operations

Operation          Gemfire     MongoDB     Hazelcast
Write (ops/sec)    3032.324    5123.475    3709.336
Read (ops/sec)     7634.170    7825.338    4315.367

MongoDB turns out to be the winner. The reason I can think of is that both Gemfire and Hazelcast run on the JVM, while MongoDB leaves everything to the OS by mapping the data into memory.

More about YCSB can be found here and on the wiki

26 June 2011

Find Me Lazy

I was supposed to write a New Year post six months back; I didn't. Last week someone asked me what I am best at, and I didn't (or couldn't) answer. Now I guess laziness is what I am best at. Last year's New Year post, which I posted just 3 days after New Year, can be found here. This year I am late by just 6 months. So I'll summarize what has happened in the last 18 months:

Series I finished:
The Wire
OZ
Generation Kill
Six Feet Under
Twin Peaks
The Life and Times of Tim
24
The Lost Room
The Pacific
Daria
Long Way Round
An Idiot Abroad
Spartacus: Blood and Sand

Series which I started following:
Breaking Bad
Game of Thrones
In Treatment
It's Always Sunny in Philadelphia
The Ricky Gervais Show
The Venture Bros.
The IT Crowd
Boardwalk Empire
Fringe

Apart from series, movies and games, a few more insignificant things happened in my life:
Started a project, Crowl, and released the first revision (0.1).
Shifted to Noida from Bangalore.
Started gizmoage.com.
Finished a couple of novels.
Finished the following games:
  • Crysis 2
  • Blur
  • Battlefield: Bad Company 2
  • Need for Speed: Hot Pursuit
  • Call of Duty Black Ops
  • Call Of Duty Modern Warfare 2
  • Just Cause 2
To add to the list, I bought a car and am still learning how to drive, with an L sign on the front as well as the back. In the Jack Sparrow way: the feeling which someone should have after getting a car, I don't have it.

Caution: A Blurry Pic Ahead!

21 November 2010

A Simple URL Shortening Algorithm in JAVA

We have so many URL shortening services available today, and I am not sure what kind of algorithm they use to shorten a particular URL. Given the limitations on the characters which can be used in a URL, it is pretty much obvious that we are limited to 62 alphanumeric characters, i.e. [a-z 0-9 A-Z]. Though - (hyphen) and _ (underscore) are allowed in a URL, we still want to avoid them for many good reasons. The most obvious one would be a bad-looking URL like http://xyz.com/c0--rw_ or http://xyz.com/______-.
Following is a simple implementation of a base10 to base62 converter; that's all we need to shorten a URL. With 62 characters and a unique string 7 characters long we can shorten:
62^7 = 3,521,614,606,208 URLs
That's a lot of URLs.
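
The figure is easy to double-check. A small sketch (the class and method names are made up for this illustration) that prints how many distinct codes each length gives over a 62-character alphabet:

```java
class ShortUrlCapacity {
    // Number of distinct codes of exactly `length` characters over 62 characters
    static long codesOfLength(int length) {
        long total = 1;
        for (int i = 0; i < length; i++) {
            // multiplyExact throws instead of silently overflowing
            total = Math.multiplyExact(total, 62L);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int length = 1; length <= 7; length++) {
            System.out.println(length + " chars: " + codesOfLength(length));
        }
    }
}
```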

How shortening works in the present case:

Suppose you have a table with following columns:
1. unique auto increment id (long),
2. url (string),
3. base62 string (string)
Now the trick is that we convert the unique id to a base62 string, not the URL, and then the URL is mapped to the unique id. For example, if we want to shorten the following URL:
http://news.xinhuanet.com/english2010/world/2010-11/18/c_13612801.htm
First we need to look up the last unique id in the table, then add 1 to it and convert the resulting number to base62. Suppose the last unique id was 678544325; the next id, 678544326, will be mapped to the above URL, and 678544326 in base62 is:
45*62^4 + 57*62^3 + 6*62^2 + 23*62^1 + 20*62^0
This means a five-character code, made of the array indexes {45}{57}{6}{23}{20} in
String[] elements = {
                "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o",
                "p","q","r","s","t","u","v","w","x","y","z","1","2","3","4",
                "5","6","7","8","9","0","A","B","C","D","E","F","G","H","I",
                "J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X",
                "Y","Z"
                };
which gives the base62 string JVgxu, so the shortened URL can be http://xyz.com/JVgxu

Following is the Java code to convert a number to a base62 string:
/**
 * @author vikasing
 *
 */
public class Base62Converter {
    private final int LENGTH_OF_URL_CODE=6;
    public String convertTo62Base(long toBeConverted)
    {
        String[] elements = {
                "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o",
                "p","q","r","s","t","u","v","w","x","y","z","1","2","3","4",
                "5","6","7","8","9","0","A","B","C","D","E","F","G","H","I",
                "J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X",
                "Y","Z"
                };
        String convertedString="";
        int numOfDiffChars= elements.length;
        if(toBeConverted<numOfDiffChars+1 && toBeConverted>0)
        {
            convertedString=elements[(int) (toBeConverted-1)];
        }
        else if(toBeConverted>numOfDiffChars)
        {
            long mod = 0;
            long multiplier = 0;
            boolean determinedTheLength=false;
            for(int j=LENGTH_OF_URL_CODE;j>=0;j--)
            {
                multiplier=(long) (toBeConverted/Math.pow(numOfDiffChars,j));
                if(multiplier>0 && toBeConverted>=numOfDiffChars)
                {
                    convertedString+=elements[(int) multiplier];
                    determinedTheLength=true;
                }
                else if(determinedTheLength && multiplier==0)
                {
                    convertedString+=elements[0];
                }
                else if(toBeConverted<numOfDiffChars)
                {
                    convertedString+=elements[(int) mod];
                }
                
                mod=(long) (toBeConverted%Math.pow(numOfDiffChars,j));
                toBeConverted=mod;                
            }
            
        }
        return convertedString;
    }

}
The above code is part of the project Crowl, on which I have been working for a while. The file can be browsed under the org.crow.utils package.

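Update: found this concise code for base62 conversion on the web:
// Assumption: the snippet did not show baseDigits; it is defined here with the
// same character order as the elements array above
private static final String baseDigits =
        "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ";

public String converter(int base, long decimalNumber)
{
    String tempVal = decimalNumber == 0 ? "0" : "";
    long mod = 0;

    while (decimalNumber != 0) {
        // Take the lowest digit and prepend its character
        mod = decimalNumber % base;
        tempVal = baseDigits.substring((int) mod, (int) mod + 1) + tempVal;
        decimalNumber = decimalNumber / base;
    }
    System.out.print(tempVal);
    return tempVal;
}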

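I didn't check the performance of the above code, but it is shorter than the first one. For ids above 62 (and below 62^7) both give the same output, provided baseDigits holds the same 62 characters in the same order as elements. For ids from 1 to 62 my version returns elements[id-1], so it is offset by one there.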
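
When somebody visits the short URL, the code has to be turned back into the id (or looked up through the base62 column). Below is a minimal round-trip sketch, assuming the same character order as the elements array. The class and method names are made up for the illustration, and it uses plain positional base62 for every id, so it matches the shorter converter rather than the special case for ids 1 to 62 in my first version.

```java
class Base62RoundTrip {
    // Same character order as the elements array in the post
    private static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String encode(long id) {
        if (id < 0) throw new IllegalArgumentException("id must be non-negative");
        if (id == 0) return String.valueOf(ALPHABET.charAt(0));
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            // Digits come out lowest first, so the string is reversed at the end
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("not a base62 character: " + c);
            id = id * 62 + digit;
        }
        return id;
    }

    public static void main(String[] args) {
        String code = encode(678544326L);
        System.out.println(code);          // JVgxu
        System.out.println(decode(code));  // 678544326
    }
}
```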

25 October 2010

10 TV series You Don't Want to Miss

I watched (or started following) 15 TV series in 2009 alone, and in 2010 I was able to finish another 6. Here is a list of some of the great series you wouldn't wanna miss:

1. The Sopranos: You'll have to get through a couple of episodes first to get to know what's really going on. It's one of the best series HBO ever produced, depicting a New Jersey mafia family. Watch Tony Soprano running a mafia family and struggling with his own.


2. OZ: This is my favorite from HBO; it shows the daily life of prisoners in a maximum security penitentiary. It has got every possible face of crime. One of the many things I like about HBO is that it gives its characters complete freedom and doesn't hesitate to show anything, things which can't be seen on any other channel. The narration by the character Augustus Hill is one of the best parts of the series.


3. Dexter: When it comes to narration, Dexter takes the cake. Michael C. Hall is a great actor and he fits perfectly in the role of a serial killer. Although luck often plays an important role in Dexter's life, the character is very strong. A serial killer who can't feel any emotions: that's new and fresh.



4. Curb Your Enthusiasm: Seinfeld co-creator Larry David teams up with HBO to produce this masterpiece. Believe me, this series is better than Seinfeld. I have to admit I could not watch Seinfeld beyond the 5th season because of its repetitive expressions/actions/dialogues. I liked Kramer, but the others became dull and boring after a certain number of episodes. Curb Your Enthusiasm shows David's daily life and his unusual way of handling everyday matters.


5. Band of Brothers and The Pacific: If you liked Saving Private Ryan, you don't wanna miss these two mini-series produced by HBO in collaboration with Steven Spielberg and Tom Hanks. Aren't the names of these two masters enough? Both are set in World War II.


6. Rome: The BBC has produced a few documentary series on Rome before. For this 2-season series, the BBC teams up with HBO to produce a great show which covers the most important period in Roman history: the lives of Julius Caesar and Augustus, when Rome expanded the most.

7. Carnivàle: This is again from HBO, a serious and disturbing drama series set in 1934. It's the story of a travelling carnival and a fugitive, Ben, who joins it. It is great to watch the superb acting from these nobodies. It's (sadly) a 2-season series with some amazing background music.


8. Arrested Development: It's neither like FRIENDS nor like Seinfeld; it's a different kind of comedy series from Fox. A story of a broken (or stupid) family. Jason Bateman's character tries to fix the family's problems, only to find himself in the funniest situations.



9. Twin Peaks: Twin Peaks is the name of the town in which a murder takes place. An FBI agent visits the town to investigate the murder, and this investigation spans the whole two seasons. The series is known for its bizarre characters; e.g. the main character, Dale Cooper, likes black coffee and cherry pie, which can be seen very often. Twin Peaks became a huge hit and almost gave birth to a cult.


10. Californication: I was surprised to see this kind of stuff coming from a non-HBO channel. This is an amazing drama/comedy series from Showtime. The main character, Hank Moody, is played by David Duchovny, who played Fox Mulder in The X-Files. Moody is an interesting character who tries to live his life in his own way and does many 'nasty' things. Another interesting character is Charlie Runkle, played by Evan Handler.

09 January 2010

available memory less than 128mb!!! -1 half life 2

Half Life 2 has a problem on 64-bit Vista: it keeps crashing with a pop-up message:
 available memory less than 128mb!!! -1
The solution is to run Half Life in Windows XP compatibility mode: right-click the Half Life shortcut, choose Properties, open the Compatibility tab, tick Run this program in compatibility mode for: and pick a Windows XP entry from the list.


06 January 2010

Enable GZip Compression on Glassfish v3

Log in to https://localhost:4848 (the admin panel). Go to Network Config > Network Listener and select the listener for which you want to enable gzip. Click on the HTTP tab, see below:



Scroll down until you get the following entries:


Set Compression to on, enter the MIME types and click the Save button.
It's done!

03 January 2010

mysql and jtpl template engine tutorial

jtpl is a lightweight Java template engine which is good for small applications but becomes sluggish for data-intensive apps. The following example shows how to use jtpl with MySQL.

jtpl replaces everything which is put inside {} and uses HTML comments as entry and exit points:
<!-- BEGIN: main -->
{This will be replaced by jtpl}
{ThisToo}
<!-- END: main -->
This template file should be saved with the extension .jtpl.
In your servlet you need to create a Template object, which takes the template file as its input parameter:
Template tpl = new Template(new File("FULL_PATH\\home.jtpl"));
Next you need to assign values to the template parameters, like:
tpl.assign("ThisToo", "Assigned Value Here !");
In the end the template is parsed using:
tpl.parse("main");
If you have nested template regions in a template like this:

<!-- BEGIN: main -->
<!-- BEGIN: header -->
{Links}
<!-- END: header -->
<!-- END: main -->
Everything remains the same, except that when you parse you'll have to parse the inner region first, like:
tpl.parse("main.header");
then the outer (or main) region:
tpl.parse("main");
You can put as many regions as you want inside a main region.
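
Why the inner region goes first becomes clearer with a toy model. The sketch below is not jtpl and does not use its classes. ToyTemplate, its render method and the {ROWS} placeholder are invented for this illustration, and the real engine may treat things like unknown placeholders differently. It only mimics the assign-then-parse idea: each inner parse renders one copy of the block with the values assigned at that moment, and the outer parse wraps the collected copies at the end.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToyTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");
    private final Map<String, String> values = new HashMap<>();

    void assign(String name, String value) {
        values.put(name, value);
    }

    // Renders one copy of the block with the values assigned so far;
    // in this toy version unknown placeholders become empty strings
    String render(String block) {
        Matcher m = PLACEHOLDER.matcher(block);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        ToyTemplate tpl = new ToyTemplate();
        String rowBlock = "<li>{TITLE}</li>";
        StringBuilder rows = new StringBuilder();
        for (String title : new String[] {"First", "Second"}) {
            tpl.assign("TITLE", title);
            rows.append(tpl.render(rowBlock));   // inner region: once per row
        }
        tpl.assign("ROWS", rows.toString());
        System.out.println(tpl.render("<ul>{ROWS}</ul>")); // outer region: once, at the end
    }
}
```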
Using jtpl with MySQL (or any other database) is as simple as explained above. Here is a sample template file, home.jtpl:
<!-- BEGIN: main1 -->
<html>
    <head>
        <title>{PTITLE}</title>       
    </head>
    <body>
            <div>               
                    <div>
                       <a class="a" href="/anylink1">{LINK1}</a>
                       <a class="a" href="/anylink2">{LINK2}</a>
                       <a class="a" href="/anylink3">{LINK3}</a>
                    </div>
            </div>
            <div>
                <div>
                    <div>
                        <!-- BEGIN: div -->
                            <div>
                                <a target="_blank" href ="{LINK}">{TITLE}</a>
                                <br><span>{CONTENT}</span>
                            </div>
                        <!-- END: div -->
                    </div>
                </div>
        </div>
    </body>
</html>
<!-- END: main1 --> 
And here is the servlet:
import net.sf.jtpl.Template;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.sql.*;

/**
 *
 * @author viksin
 */
public class sample extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request,
            HttpServletResponse response) throws ServletException, IOException {
        PrintWriter out = response.getWriter();
        try {
            out.print(this.generatePage());
        } catch (Exception e) {
            e.printStackTrace(out);
        } finally {
            out.close();
        }
    }

    protected String generatePage() throws Exception {
        Template tpl = null;
        Connection conn = null;
        Statement st = null;
        ResultSet rs = null;
        tpl = new Template(new File("FULL_PATH\\home.jtpl"));
        tpl.assign("PTITLE", "MySite");
        tpl.assign("LINK1", "Home");
        tpl.assign("LINK2", "News");
        tpl.assign("LINK3", "About");
        try {
            Class.forName("org.gjt.mm.mysql.Driver").newInstance();
            conn = DriverManager.getConnection("mysql_URL", "USERNAME", "PASSWORD");
            st = conn.createStatement();
            // TABLE_NAME is a placeholder; "table" itself is a reserved word in MySQL
            rs = st.executeQuery("select title,link,content from TABLE_NAME");
            while (rs.next()) {
                tpl.assign("TITLE", rs.getString("title"));
                tpl.assign("CONTENT", rs.getString("content"));
                tpl.assign("LINK", rs.getString("link"));
                tpl.parse("main1.div");
            }           
            tpl.parse("main1");
        } catch (Exception ex) {
            return ex.toString();
        } finally {
            if (rs != null) {
                rs.close();
            }
            if (st != null) {
                st.close();
            }
            if (conn != null) {
                conn.close();
            }
        }
        return (tpl.out());
    }
}
jtpl is meant for small and simple applications; it does not have many of the features which other template engines like Velocity, StringTemplate, etc. have.
jtpl also uses the SingleThreadModel, which is not recommended, and it gets slower with large data.
Presently I am using StringTemplate, which is faster and better than jtpl.

Bad Old Year! Happy New Year!

2009: One of the worst years of my life!

Here are a few facts which support the above statement:

1. Got cheated over a USB modem in March.
2. Got chickenpox in April.
3. Lost money in shares in June.
4. Father got into a major accident in Oct.
5. Got back pain in Nov.
6. Troubles at the workplace (whole year!)

There were a few good moments too:

1. Got a beautiful niece (Asmi) on 7th March.
2. Had fun with college friends in Ahmedabad and Mumbai.
3. Watched the following TV series:
1) The Sopranos
2) Carnivale
3) Scrubs
4) Dead Like Me
5) 24
6) The Life and Times of Tim
7) Rome
8) John Adams
9) Avatar: The Last Airbender
10) Psych
11) That '70s Show
12) Curb Your Enthusiasm
13) Dexter
14) Heroes
15) Bones
2010: Expectations!

1. Complete the development of 9am.in, algowiki.com, showthe.info/about, paltan.org, index3r.com
2. Complete the remaining TV series: The Wire, OZ, Deadwood, Futurama, Meerkat Manor, Generation Kill, Mash, Six Feet Under, Twin Peaks, From the Earth to the Moon
3. Self Employment
4. Get a powerful server for natmac.org