Showing posts with label algo. Show all posts
Showing posts with label algo. Show all posts

05 May 2015

Word vectors using LSA, Part - 2

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. [1] More about LSA can be found here and here. LSA uses Singular Value Decomposition (SVD), a matrix factorization method. For a given matrix A,

SVD (A) = U*S*VT

In the current scenario matrix A is a term-document matrix (m terms * n documents). Visually SVD looks like this:


Unlike word2vec, LSA does not require any training. But it suffers from curse of dimensionality because SVD calculations get slower and slower as we increase the number of documents, i.e. size of matrix A. On a single machine it can take hours. The overall cost of calculating SVD is O(mn2) Flops. This means if we had m =100,000 unique words with n = 80,000 documents, it would require 6.4 x 1014 Flops or 640,000 GFlops. At stock clock speed (4.0 GHz) my AMD FX-8350 gives around 40 GFlops. So it will take around 640,000/40 = 16,000 Seconds which is around 4 hours 30 minutes. [2]
In my previous post I had used 1.7 million sentences and 44 million words for training word2vec, i.e. if we run SVD on this large matrix, it might end up taking centuries on my machine. However SVD calculations on large matrices can be done using a large cluster of Spark. [3] [4]
Results

I kept the document size constant at 2500 and let the term size vary. In order to rank the terms in relation to query term I used cosine distance. This time along with named entities I also added the noun phrases. The data is the news articles from yesterday (4th May, 2015). Here is the vector for the first query "delhi":
[law_minister=0.34, jitender_singh_tomar=0.23, chief_minister=0.22, fake=0.21, arvind_kejriwal=0.21, degree=0.18, protest=0.16, law_degree=0.16, win=0.15, congress=0.15, aam_aadmi_party=0.14, incident=0.14]
Notice that the vector contains terms like chief_minister and law_degree which are not named entities.
Query for "chief_minister":
[arvind_kejriwal=0.34, parkash_singh_badal=0.29, today=0.28, delhi=0.28, mamata_banerjee=0.27, office=0.26, state=0.26, people=0.26, act=0.25, mufti_mohammad_sayeed=0.25, bjp=0.25, jammu_and_kashmir=0.25, governor=0.24]
The vector gives the name of all the chief ministers which were in the news recently. Same goes for the query "prime_minster":
[shinzo_abe=0.37, japanese=0.33, sushil_koirala=0.27, david_cameron=0.27, tony_abbott=0.26, 2015=0.24, benjamin_netanyahu=0.24, president=0.23, country=0.23, government=0.22, washington=0.22]
Lets look up for a person now, "rohit_sharma":
[mumbai_indians=0.45, skipper=0.4, ritika_sajdeh=0.37, captain=0.36, batsman=0.35, indian=0.34, lendl_simmons=0.34, kieron_pollard=0.33, parthiv_patel=0.31, ipl=0.29, runs=0.29, good=0.29, ambati_rayudu=0.29, mitchell_mcclenaghan=0.27, unmukt_chand=0.27]
Finding relations

What if I query for chief_minister and west_bengal and add both the vectors?  
[mamata_banerjee=0.69, bjp=0.67, state=0.6] 
It gives the correct result, Mamata Banerjee is the current Chief Minister of West Bengal. Note that now numbers don't represent the cosine distance.

What if we want to find out a relationship, instead of querying? Query for india and narendra_modi:
[prime_minister=0.5, make=0.42, government=0.4, country=0.4]
Querying mumbai_attack with charged gives a list of a few names of those who were involved/charged:
[people=1.14, left=1.08, november=1.05, dead=1.05, 166=1.05, executing=1.04, planning=1.04, 2008=1.02, hamad_amin_sadiq=1.0, shahid_jameel_riaz=1.0, mazhar_iqbal=1.0, jamil_ahmed=1.0, younis_anjum=0.94, abdul_wajid=0.94, zaki-ur_rehman_lakhvi=0.62]
Although above results look good, they are not always accurate, for example, query for captain and royal_challengers_bangalore does not return virat_kohli as the first result:
[ipl=0.67, rcb=0.66, match=0.64, kolkata_knight_riders=0.6, virat_kohli=0.57]
I guess more data from different time periods can help in establishing concrete relationships.

Word vectors obtained from LSA can be useful in expanding the search queries, guessing the relationships (as shown above), generating similarity based recommendations and many other tasks related to text.
I wrote a one file implementation of LSA in Java (its buggy and design patterns free!), it uses jBLAS for SVD and other matrix operations, code can be found at github.
A couple of more links to understand LSA through examples:


24 March 2015

Word vectors (word2vec) on named entities and phrases - I

word2vec is a C lib to compute the vector representation of a given word (or a phrase). It was released by a few Googlers and being maintained at word2vec. A couple of nice articles on what word2vec is capable of (roughly):
Word vectors can boost performance of many ML and NLP applications, for example, sentiment analysis, recommendations, chat threading etc.
I used deeplearning4j's implementation of word2vec. The example given on that page does not work with the latest release of dl4j (at present 0.0.3.3) , working example can be found here. I ended up using Stanford's CoreNLP for named entity recognition, OpenNLP works fine too.

The training was done on the recent news data gathered from various sources, articles were split into sentences (using OpenNLP), duplicate and short sentences were removed.  The size of the corpus was around 300MB containing 1.7 million sentences and 44 million words. The training took almost 36 Hours with 3 iteration and a layer size of 200. Lets start with simple examples:

For the word water I get the following word vector (limited to 21 words):
[groundwater, vapor, heater, pollutant, rainwater, dioxide, wastewater, sewage, potable, moisture, seawater, methane, nitrogen, vegetation, vapour, oxide, reservoir, hydrogen, plume, monoxide, sediment]
We can see that almost all the words are used in the context of water, but this is limited to the trained corpus. With different corpus you'll get different set of results. Lets look at something which was there in the news recently, e.g., the term plane:
[mh370, c17, crashland, skidd, 777, takeoff, qz8501, transasia, malaysia_airline, aircraft, globemaster, turboprop, cockpit, laguardia_airport, 1086, locator, singleengine, atr, solarpower, midair]
Except 1086 and atr, every other word (or phrase) in the vector makes sense, but if you search for 1086 and atr, you'll find that 1086 was a Delta Air Lines Flight which crashed recently and ATR is an aircraft manufacturer company. Lets look for an entity (specially phrase) vector, for example Leslee Udwin was in the news recently:
[mukesh_singh, gangrape, nirbhaya, storyville, documentary, rapist, tihar, andrew_jarecki, citizenfour, telecast, bar_council_of_india, udwin, laura_poitra, filmmaker, jinx, bci, bbc, derogatory, chai_j, leslie_udwin, hansal_mehta, bbc_storyville]
You can relate most of the words/phrases in the vector to Leslee Udwin or her documentary India's Daughter. Other words in the vector are either the names of the documentaries or the documentary makers, for example, The Jinx is an HBO documentary mini-series directed by Andrew Jarecki.

dl4j library also provides the vector addition and subtraction mechanism, for subtraction code is as follows:

List<String>  p = new ArrayList<>(), n = new ArrayList<>();
p.add("imitation");
n.add("oscar");
vec.wordsNearest(p, n, 20);

Here is how the subtraction works, vector for imitation:
[grand_budapest_hotel, michael_keaton, screenplay, birdman, boyhood, eddie_redmayne, whiplash, benedict_cumberbatch, felicity_jone, jk_simmon, richard_linklater, julianne_moore, j_k_simmon, wes_anderson, patricia_arquette, edward_norton, graham_moore, alejandro_gonzalez_inarritu, stephen_hawk, alejandro_g_inarritu, alexandre_desplat]
It contains many Oscars entries and related terms, so subtracting the vector of term oscar should remove all those entries and give us something related to The Imitation Game:
[changer, throne, lllp, alan_tur, chris_kyle, oneindia, lilih620150308, sniper, rarerbeware, grand_budapest_hotel, benedict_cumberbatch, mockingjay, iseven, cable_news_network, extractable, theory, watchapple, enigma, codebreaker, washington_posttv, mathematician]
This vector is not a very good representation of the movie The Imitation Game, there is a lot of noise. This is because of the poor and small training data. But we see a few terms in the vector which are related to the movie, e.g.,
alan_tur*, benedict_cumberbatch, enigma, theory, mathematician
ing was removed by the tokenizer

I have trained the data on entities for now (by replacing the space with underscore), I am planning to train it on general phrases as well, like Member of Parliament should be combined into a single term member_of_parliament. Will publish the results in the second part. Next I want to compare it with Brown Clustering, it is also used for the similar purpose.

09 September 2013

Keyword extraction in Java

Around two months back I started working on a Java library: NiceText, to isolate data mining stuff from Crowl (another weekend project!) and make it an independent lib. I already implemented a text extractor for web pages (hence the name NiceText), which I observed sometimes performs faster and better than boilerpipe library.

I also wanted to extract keywords (or keyphrases) from text. I considered widely used algo TF-IDF for this purpose. Since keyphrases can contain more than one word, I also considered n-grams (mono, bi and tri), e.g., consider the following text:

Astronomers have gotten the first-ever peek at our solar system's tail, called the heliotail, finding that it's shaped like a four-leaf clover, NASA scientists announced.
The discovery was made using NASA's Interstellar Boundary Explorer (IBEX), a coffee-table-sized spacecraft that is studying the edge of the solar system.
'We always drew pictures where the tail of the solar system just trailed off the page.'
- David McComas, IBEX principal investigator
"Many models have suggested the heliotail might look like this or like that, but we have had no observations," David McComas, IBEX principal investigator at Southwest Research Institute in San Antonio, Tex., said in a statement. "We always drew pictures where the tail of the solar system just trailed off the page, since we couldn't even speculate about what it really looked like." [Images: NASA's IBEX Sees Our Solar System's Tail]
The heliotail is "a much larger structure with a much more interesting configuration" than scientists had previously predicted, McComas added during a news conference announcing the finding.
The heliotail is inflated by the solar wind of particles streaming off the sun, and the four-leaf clover shape is the result of fast solar wind shooting out near the sun's poles and slower wind flowing from near the sun's equator, researchers say. The finding is based on the first three years of IBEX's measurements of energetic neutral atoms.
In the interstellar boundary region, charged particles from the sun stream outward far beyond the planets toward the gas- and dust-filled space between stars. Collisions between these particles and interstellar material create fast-moving particles with no charge, known as energetic neutral atoms, or ENAs. Some of these particles speed inward toward the sun, where IBEX can detect them from its perch 200,000 miles above Earth.
"Scientists have always presumed that the heliosphere had a tail," Eric Christian, IBEX mission scientist at Goddard Space Flight Center in Greenbelt, Md., said during a Google+ Hangout announcing the finds.  "But this is actually the first real data that we have to give us the shape of the tail."
Though IBEX data has given scientists an idea of the shape and structure of the heliotail, they say they have not been able to measure its length particularly well. They think it is probably evaporating over something like the 1000 times the distance between the Earth and the sun, McComas said.
The $169 million IBEX spacecraft, launched in 2008, was built for an initial two-year mission, which has since been extended.  Early on in its mission, IBEX detected ENAs flowing toward the sun in an unexpected pattern: They were significantly enhanced in a mysterious ribbon on the edge of the solar system that scientists now think is a reflection of the solar wind, shot back toward the sun by a strong galactic magnetic field.
IBEX has made several other important discoveries throughout its mission. In 2010, the spacecraft turned its gaze back toward Earth and got the first-ever peek at the solar wind crashing into the planet's magnetosphere. Last year, NASA announced that the spacecraft made its first detection of matter from outside the solar system, finding alien particles of hydrogen, oxygen and neon in the interstellar wind.

The extracted keywords are (arranged in decreasing order of tf-idf score):
ibex, solar system, heliotail, interstellar, mccomas, nasa, solar wind, spacecraft, particles, tail, toward sun, shape
As you can observe keyphrases "solar wind" and "solar system" has a common word "solar" but it does not appear as a keyword, because the program takes care of this ambiguity by comparing monograms with bigrams and then bigrams with trigrams to check if certain words appear together at the same number of time. For example: if the word "solar" appeared 10 times with the word "system" and they individually occurred exactly 10 times each, then both words are merged to form a keyphrase "solar system".

The tf can be calculated in many ways, but in the above case it is calculate by the below formula (it worked better than others!):
$$tf = log(TermFrequency+1)$$
Another way to calculate the tf is (it is commented out, you can uncomment and comment the above one, if you decide to use the below one):
$$tf = {\alpha + (1-\alpha)*TermFrequency \over \max\{f(w,d): w \in d\}} $$
Above method somewhat takes care of the bias created by document length. You can read more about it here.

As we know tf-idf depends on multiple documents, more number of documents leads to a better accuracy. Above results are based on only 40+ documents present @ https://github.com/vikasing/NiceText/tree/master/data

CODE:

Code for extracting the keyphrases from the text is present in https://github.com/vikasing/NiceText/blob/master/src/com/vikasing/nicetext/TfIdf.java

I would suggest to take (or fork it to play with it) the whole project to avoid any dependency issues. This is still a WIP, I'll release a lib (a jar) soon after adding more features and fixing some bugs.

21 November 2010

A Simple URL Shortening Algorithm in JAVA

We have so many url shortening services available today, I am not sure what kind of algorithm they use to shorten a particular url. Given the limitations over the characters which can be used in a url it becomes pretty much obvious that we are limited to 62 alpha numeric chars i.e. [a-z 0-9 A-Z]. Though - (hyphen) and _ (underscore) are allowed in a url still we want to avoid them for many good reasons. Very obvious would be a bad looking url like http://xyz.com/c0--rw_ or http://xyz.com/______-.
Following is the simple implementation of base10 to base62 converter, that's all we need to shorten a url. With 62 chars and a unique string 7 char long we can shorten:
627 = 3,521,614,606,208 urls
that's a lots of urls.

How shortening works in the present case:

Suppose you have a table with following columns:
1. unique auto increment id (long),
2. url (string),
3. base62 string (string)
Now the trick is that we convert unique id to base62 string not the url, and then the url is mapped to the unique id. For example if we want to shorten the following url:
http://news.xinhuanet.com/english2010/world/2010-11/18/c_13612801.htm
First we need to look for the last unique id in the table then add 1 to it and convert the resulting number to base62. Suppose last unique id was 678544325 now the next id 678544326 will be mapped to the above url and base62 of a 678544326 will be:
45*624+57*623+6*622+23*621+20*620
means a five char url, having following array indexes {45}{57}{6}{23}{20} in
String[] elements = {
                "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o",
                "p","q","r","s","t","u","v","w","x","y","z","1","2","3","4",
                "5","6","7","8","9","0","A","B","C","D","E","F","G","H","I",
                "J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X",
                "Y","Z"
                };
which will give a base62 string: JVgxu and a shortened url can be http://xyz.com/JVgxu

Following is the java code to convert a number to base62 string
/**
 * @author vikasing
 *
 */
public class Base62Converter {
    private final int LENGTH_OF_URL_CODE=6;
    public String convertTo62Base(long toBeConverted)
    {
        String[] elements = {
                "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o",
                "p","q","r","s","t","u","v","w","x","y","z","1","2","3","4",
                "5","6","7","8","9","0","A","B","C","D","E","F","G","H","I",
                "J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X",
                "Y","Z"
                };
        String convertedString="";
        int numOfDiffChars= elements.length;
        if(toBeConverted<numOfDiffChars+1 && toBeConverted>0)
        {
            convertedString=elements[(int) (toBeConverted-1)];
        }
        else if(toBeConverted>numOfDiffChars)
        {
            long mod = 0;
            long multiplier = 0;
            boolean determinedTheLength=false;
            for(int j=LENGTH_OF_URL_CODE;j>=0;j--)
            {
                multiplier=(long) (toBeConverted/Math.pow(numOfDiffChars,j));
                if(multiplier>0 && toBeConverted>=numOfDiffChars)
                {
                    convertedString+=elements[(int) multiplier];
                    determinedTheLength=true;
                }
                else if(determinedTheLength && multiplier==0)
                {
                    convertedString+=elements[0];
                }
                else if(toBeConverted<numOfDiffChars)
                {
                    convertedString+=elements[(int) mod];
                }
                
                mod=(long) (toBeConverted%Math.pow(numOfDiffChars,j));
                toBeConverted=mod;                
            }
            
        }
        return convertedString;
    }

}
Above code is part of the project Crowl on which I have been working for a while. File can be browsed under org.crow.utils package.

Update: found this precise code for base62 conversion on the web:
public String converter ( int base, long decimalNumber)
 {
  
   String tempVal = decimalNumber == 0 ? "0" : "";
         long mod = 0;

         while( decimalNumber != 0 ) {
             mod = decimalNumber % base;
             tempVal = baseDigits.substring( (int)mod, (int)mod + 1 ) + tempVal;
             decimalNumber = decimalNumber / base;
         }
         System.out.print(tempVal);
         return tempVal;
 }

I didn't check the performance of the above code but it is smaller than the first one but both give the same output.