vikasing: ml

05 May 2015

Word vectors using LSA, Part - 2

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. [1] More about LSA can be found here and here. LSA uses Singular Value Decomposition (SVD), a matrix factorization method. For a given matrix A,

SVD (A) = U*S*V^T

In the current scenario matrix A is a term-document matrix (m terms * n documents). Visually SVD looks like this:
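In terms of matrix shapes, the factorization can be written out as follows. This is a sketch using the standard compact SVD plus the rank-k truncation that LSA usually keeps; the symbols r and k are introduced here for illustration, and the value of k is not stated in this post.

```latex
A_{m \times n} = U_{m \times r} \, S_{r \times r} \, V^{T}_{r \times n},
\qquad r = \operatorname{rank}(A) \le \min(m, n)

A \approx A_k = U_k \, S_k \, V_k^{T},
\qquad U_k \in \mathbb{R}^{m \times k},\;
S_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k},\;
k \ll r
```

Under this reading, each of the m rows of U_k S_k can serve as a k-dimensional vector for one term, and each row of V_k S_k as a vector for one document.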

Unlike word2vec, LSA does not require any training. But it scales poorly: SVD calculations get slower and slower as we increase the number of documents, i.e. the size of matrix A, and on a single machine they can take hours. The overall cost of calculating the SVD is O(mn²) flops. This means that with m = 100,000 unique words and n = 80,000 documents, it would require 6.4 × 10¹⁴ flops, or 640,000 GFlop. At stock clock speed (4.0 GHz) my AMD FX-8350 gives around 40 GFLOPS, so it would take around 640,000 / 40 = 16,000 seconds, which is roughly 4 hours 30 minutes. [2]
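The estimate above can be reproduced with a few lines of Java. This is only a back-of-the-envelope sketch: it assumes the cost is exactly m·n² flops with no constant factor, and the class and method names are made up for this example.

```java
/** Back-of-the-envelope SVD cost estimate, assuming cost = m * n^2 flops. */
class SvdCostEstimate {

    /** Estimated flops for an m x n term-document matrix under the m * n^2 model. */
    static double svdFlops(long terms, long documents) {
        return (double) terms * documents * documents;
    }

    /** Seconds needed at a sustained rate given in GFLOPS (1e9 flops per second). */
    static double seconds(double flops, double gflops) {
        return flops / (gflops * 1e9);
    }

    public static void main(String[] args) {
        double flops = svdFlops(100_000, 80_000); // m = 100,000 terms, n = 80,000 documents
        double secs = seconds(flops, 40.0);       // 40 GFLOPS, the figure quoted for the FX-8350
        System.out.printf("flops = %.2e, seconds = %.0f, hours = %.2f%n",
                flops, secs, secs / 3600.0);
    }
}
```

Changing the arguments shows how quickly the cost grows: doubling the number of documents quadruples the estimate.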

In my previous post I used 1.7 million sentences and 44 million words for training word2vec. If we ran SVD on a matrix that large, it might end up taking centuries on my machine. However, SVD calculations on large matrices can be done on a large Spark cluster. [3] [4]

Results

I kept the number of documents constant at 2,500 and let the number of terms vary. To rank the terms in relation to a query term I used cosine similarity (a higher score means a closer term). This time, along with named entities, I also added noun phrases. The data is news articles from yesterday (4th May, 2015). Here is the vector for the first query, "delhi":
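The ranking step can be sketched as follows. This is a minimal illustration, not the code behind the results in this post: the three-dimensional vectors are made up, and the class and method names are hypothetical. It assumes every term already has a dense vector (for example a row taken from the SVD output).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CosineRanking {

    /** Cosine similarity of two equal-length vectors; 0 if either one is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** All terms except the query, ordered by decreasing similarity to the query. */
    static Map<String, Double> rank(String query, Map<String, double[]> vectors) {
        double[] q = vectors.get(query);
        if (q == null) throw new IllegalArgumentException("unknown term: " + query);
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            if (!e.getKey().equals(query)) {
                scored.add(Map.entry(e.getKey(), cosine(q, e.getValue())));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scored) ranked.put(e.getKey(), e.getValue());
        return ranked;
    }

    /** Made-up toy vectors, only for demonstration. */
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> v = new LinkedHashMap<>();
        v.put("delhi", new double[] {1.0, 0.0, 0.0});
        v.put("arvind_kejriwal", new double[] {0.8, 0.6, 0.0});
        v.put("ipl", new double[] {0.0, 0.0, 1.0});
        return v;
    }

    public static void main(String[] args) {
        System.out.println(rank("delhi", toyVectors()));
    }
}
```

With the toy data, arvind_kejriwal ranks ahead of ipl for the query delhi because its vector points in a similar direction.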

[law_minister=0.34, jitender_singh_tomar=0.23, chief_minister=0.22, fake=0.21, arvind_kejriwal=0.21, degree=0.18, protest=0.16, law_degree=0.16, win=0.15, congress=0.15, aam_aadmi_party=0.14, incident=0.14]

Notice that the vector contains terms like chief_minister and law_degree which are not named entities.
Query for "chief_minister":

[arvind_kejriwal=0.34, parkash_singh_badal=0.29, today=0.28, delhi=0.28, mamata_banerjee=0.27, office=0.26, state=0.26, people=0.26, act=0.25, mufti_mohammad_sayeed=0.25, bjp=0.25, jammu_and_kashmir=0.25, governor=0.24]

The vector gives the names of all the chief ministers who were in the news recently. The same goes for the query "prime_minister":

[shinzo_abe=0.37, japanese=0.33, sushil_koirala=0.27, david_cameron=0.27, tony_abbott=0.26, 2015=0.24, benjamin_netanyahu=0.24, president=0.23, country=0.23, government=0.22, washington=0.22]

Let's look up a person now, "rohit_sharma":

[mumbai_indians=0.45, skipper=0.4, ritika_sajdeh=0.37, captain=0.36, batsman=0.35, indian=0.34, lendl_simmons=0.34, kieron_pollard=0.33, parthiv_patel=0.31, ipl=0.29, runs=0.29, good=0.29, ambati_rayudu=0.29, mitchell_mcclenaghan=0.27, unmukt_chand=0.27]

Finding relations

What if I query for chief_minister and west_bengal and add both the vectors?

[mamata_banerjee=0.69, bjp=0.67, state=0.6]

It gives the correct result: Mamata Banerjee is the current Chief Minister of West Bengal. Note that the numbers no longer represent a cosine similarity.
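The post does not say exactly how the two vectors are added. One plausible reading, and it is only an assumption, is that the per-term scores of the two query results are summed and re-sorted, which would also explain why combined scores can exceed 1. A minimal sketch of that reading, with hypothetical class and method names and partly invented numbers:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombinedQuery {

    /** Sums the scores of each term across two query results, sorted by decreasing total. */
    static List<Map.Entry<String, Double>> combine(Map<String, Double> first,
                                                   Map<String, Double> second) {
        Map<String, Double> totals = new HashMap<>(first);
        // Terms present in both results get the sum; others keep their single score.
        second.forEach((term, score) -> totals.merge(term, score, Double::sum));
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Scores for chief_minister are taken from the result shown earlier.
        Map<String, Double> chiefMinister = Map.of(
                "arvind_kejriwal", 0.34, "mamata_banerjee", 0.27, "state", 0.26);
        // Scores for west_bengal are invented for this example.
        Map<String, Double> westBengal = Map.of(
                "mamata_banerjee", 0.40, "kolkata", 0.30, "state", 0.20);
        System.out.println(combine(chiefMinister, westBengal));
    }
}
```

A term that scores moderately for both queries ends up above a term that scores well for only one of them, which is the behaviour seen in the combined results.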

What if we want to find out a relationship, instead of querying? Query for india and narendra_modi:

[prime_minister=0.5, make=0.42, government=0.4, country=0.4]

Querying mumbai_attack with charged gives a list of a few names of those who were involved/charged:

[people=1.14, left=1.08, november=1.05, dead=1.05, 166=1.05, executing=1.04, planning=1.04, 2008=1.02, hamad_amin_sadiq=1.0, shahid_jameel_riaz=1.0, mazhar_iqbal=1.0, jamil_ahmed=1.0, younis_anjum=0.94, abdul_wajid=0.94, zaki-ur_rehman_lakhvi=0.62]

Although the above results look good, they are not always accurate. For example, the query for captain and royal_challengers_bangalore does not return virat_kohli as the first result:

[ipl=0.67, rcb=0.66, match=0.64, kolkata_knight_riders=0.6, virat_kohli=0.57]

I guess more data from different time periods can help in establishing concrete relationships.

Word vectors obtained from LSA can be useful for expanding search queries, guessing relationships (as shown above), generating similarity-based recommendations and many other text-related tasks.
I wrote a one-file implementation of LSA in Java (it's buggy and design-pattern free!). It uses jBLAS for SVD and other matrix operations; the code can be found at github.
A couple more links to understand LSA through examples:

24 March 2015

Word vectors (word2vec) on named entities and phrases - I

word2vec is a C library for computing the vector representation of a given word (or phrase). It was released by a few Googlers and is maintained at word2vec. A couple of nice articles on what word2vec is capable of (roughly):

Word vectors can boost the performance of many ML and NLP applications, for example sentiment analysis, recommendations, chat threading etc.
I used deeplearning4j's implementation of word2vec. The example given on that page does not work with the latest release of dl4j (at present 0.0.3.3); a working example can be found here. I ended up using Stanford's CoreNLP for named entity recognition, though OpenNLP works fine too.

The training was done on recent news data gathered from various sources. Articles were split into sentences (using OpenNLP), and duplicate and short sentences were removed. The corpus was around 300 MB, containing 1.7 million sentences and 44 million words. Training took almost 36 hours with 3 iterations and a layer size of 200. Let's start with simple examples:

For the word water I get the following word vector (limited to 21 words):

[groundwater, vapor, heater, pollutant, rainwater, dioxide, wastewater, sewage, potable, moisture, seawater, methane, nitrogen, vegetation, vapour, oxide, reservoir, hydrogen, plume, monoxide, sediment]

We can see that almost all the words are used in the context of water, but this is limited to the training corpus. With a different corpus you'll get a different set of results. Let's look at something that was in the news recently, e.g., the term plane:

[mh370, c17, crashland, skidd, 777, takeoff, qz8501, transasia, malaysia_airline, aircraft, globemaster, turboprop, cockpit, laguardia_airport, 1086, locator, singleengine, atr, solarpower, midair]

Except for 1086 and atr, every word (or phrase) in the vector makes sense at first glance. But if you search for 1086 and atr, you'll find that 1086 was a Delta Air Lines flight which crashed recently and ATR is an aircraft manufacturer. Let's look at the vector for an entity (specifically a phrase); for example, Leslee Udwin was in the news recently:

[mukesh_singh, gangrape, nirbhaya, storyville, documentary, rapist, tihar, andrew_jarecki, citizenfour, telecast, bar_council_of_india, udwin, laura_poitra, filmmaker, jinx, bci, bbc, derogatory, chai_j, leslie_udwin, hansal_mehta, bbc_storyville]

You can relate most of the words/phrases in the vector to Leslee Udwin or her documentary India's Daughter. Other words in the vector are either the names of the documentaries or the documentary makers, for example, The Jinx is an HBO documentary mini-series directed by Andrew Jarecki.

The dl4j library also provides vector addition and subtraction. For subtraction, the code is as follows:

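import java.util.ArrayList;
import java.util.List;

// vec is the trained dl4j Word2Vec model
List<String> p = new ArrayList<>(), n = new ArrayList<>();
p.add("imitation"); // positive term
n.add("oscar");     // negative term, subtracted from the positive one
vec.wordsNearest(p, n, 20); // 20 nearest terms to (imitation - oscar)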

Here is how the subtraction works. The vector for imitation:

[grand_budapest_hotel, michael_keaton, screenplay, birdman, boyhood, eddie_redmayne, whiplash, benedict_cumberbatch, felicity_jone, jk_simmon, richard_linklater, julianne_moore, j_k_simmon, wes_anderson, patricia_arquette, edward_norton, graham_moore, alejandro_gonzalez_inarritu, stephen_hawk, alejandro_g_inarritu, alexandre_desplat]

It contains many Oscar entries and related terms, so subtracting the vector of the term oscar should remove those entries and give us something related to The Imitation Game:

[changer, throne, lllp, alan_tur, chris_kyle, oneindia, lilih620150308, sniper, rarerbeware, grand_budapest_hotel, benedict_cumberbatch, mockingjay, iseven, cable_news_network, extractable, theory, watchapple, enigma, codebreaker, washington_posttv, mathematician]

This vector is not a very good representation of the movie The Imitation Game; there is a lot of noise, because the training data is small and of poor quality. But we do see a few terms in the vector which are related to the movie, e.g.,

alan_tur*, benedict_cumberbatch, enigma, theory, mathematician

* ing was removed by the tokenizer
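The idea behind the subtraction can be sketched in plain Java, without dl4j. This is not how dl4j implements wordsNearest internally; it is a generic illustration that sums the positive vectors, subtracts the negative ones, and ranks the remaining terms by cosine similarity. The two-dimensional vectors and all names are made up, with the first dimension standing loosely for "Oscars" and the second for "code breaking".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class VectorArithmetic {

    /** Sum of the positive term vectors minus the sum of the negative ones. */
    static double[] combine(List<String> positive, List<String> negative,
                            Map<String, double[]> vectors, int dim) {
        double[] out = new double[dim];
        // Assumes every query term is present in the map.
        for (String w : positive) addScaled(out, vectors.get(w), 1.0);
        for (String w : negative) addScaled(out, vectors.get(w), -1.0);
        return out;
    }

    private static void addScaled(double[] acc, double[] v, double factor) {
        for (int i = 0; i < acc.length; i++) acc[i] += factor * v[i];
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Up to `top` terms closest to the combined vector, excluding the query terms. */
    static List<String> wordsNearest(List<String> positive, List<String> negative,
                                     Map<String, double[]> vectors, int dim, int top) {
        double[] target = combine(positive, negative, vectors, dim);
        List<String> candidates = new ArrayList<>();
        for (String w : vectors.keySet()) {
            if (!positive.contains(w) && !negative.contains(w)) candidates.add(w);
        }
        // Highest cosine similarity first.
        candidates.sort((x, y) -> Double.compare(
                cosine(target, vectors.get(y)), cosine(target, vectors.get(x))));
        return candidates.subList(0, Math.min(top, candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, double[]> toy = Map.of(
                "imitation", new double[] {1.0, 1.0},
                "oscar", new double[] {1.0, 0.0},
                "birdman", new double[] {1.0, 0.1},
                "enigma", new double[] {0.1, 1.0});
        System.out.println(wordsNearest(List.of("imitation"), List.of("oscar"), toy, 2, 2));
    }
}
```

In the toy data, subtracting oscar from imitation leaves a vector that points along the second dimension, so enigma ranks ahead of birdman.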

For now I have trained only on entities (by replacing spaces with underscores). I am planning to train on general phrases as well, so that Member of Parliament is combined into a single term, member_of_parliament. I will publish the results in the second part. Next I want to compare it with Brown clustering, which is used for a similar purpose.
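The underscore replacement could look roughly like this. It is a simplified sketch, not the preprocessing code used for this post: it assumes the list of phrases is already known (for example from the NER step), lowercases everything, and uses plain substring replacement. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;

class PhraseJoiner {

    /**
     * Lowercases the sentence and rewrites each known phrase as a single
     * underscore-joined token, e.g. "Member of Parliament" -> "member_of_parliament".
     * Plain substring matching is used, so it can also match inside longer words.
     */
    static String joinPhrases(String sentence, List<String> phrases) {
        String out = sentence.toLowerCase(Locale.ROOT);
        for (String phrase : phrases) {
            String p = phrase.toLowerCase(Locale.ROOT);
            out = out.replace(p, p.replace(' ', '_'));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinPhrases(
                "He is a Member of Parliament from West Bengal",
                List.of("Member of Parliament", "West Bengal")));
    }
}
```

When phrases overlap, replacing the longer ones first avoids a shorter phrase splitting a longer one.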