## 05 May 2015

### Word vectors using LSA, Part - 2

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.  More about LSA can be found here and here. LSA uses Singular Value Decomposition (SVD), a matrix factorization method. For a given matrix A,

SVD (A) = U*S*VT

In the current scenario matrix A is a term-document matrix (m terms * n documents). Visually SVD looks like this:

Unlike word2vec, LSA does not require any training. But it suffers from curse of dimensionality because SVD calculations get slower and slower as we increase the number of documents, i.e. size of matrix A. On a single machine it can take hours. The overall cost of calculating SVD is O(mn2) Flops. This means if we had m =100,000 unique words with n = 80,000 documents, it would require 6.4 x 1014 Flops or 640,000 GFlops. At stock clock speed (4.0 GHz) my AMD FX-8350 gives around 40 GFlops. So it will take around 640,000/40 = 16,000 Seconds which is around 4 hours 30 minutes. 
In my previous post I had used 1.7 million sentences and 44 million words for training word2vec, i.e. if we run SVD on this large matrix, it might end up taking centuries on my machine. However SVD calculations on large matrices can be done using a large cluster of Spark.  
Results

I kept the document size constant at 2500 and let the term size vary. In order to rank the terms in relation to query term I used cosine distance. This time along with named entities I also added the noun phrases. The data is the news articles from yesterday (4th May, 2015). Here is the vector for the first query "delhi":
[law_minister=0.34, jitender_singh_tomar=0.23, chief_minister=0.22, fake=0.21, arvind_kejriwal=0.21, degree=0.18, protest=0.16, law_degree=0.16, win=0.15, congress=0.15, aam_aadmi_party=0.14, incident=0.14]
Notice that the vector contains terms like chief_minister and law_degree which are not named entities.
Query for "chief_minister":
[arvind_kejriwal=0.34, parkash_singh_badal=0.29, today=0.28, delhi=0.28, mamata_banerjee=0.27, office=0.26, state=0.26, people=0.26, act=0.25, mufti_mohammad_sayeed=0.25, bjp=0.25, jammu_and_kashmir=0.25, governor=0.24]
The vector gives the name of all the chief ministers which were in the news recently. Same goes for the query "prime_minster":
[shinzo_abe=0.37, japanese=0.33, sushil_koirala=0.27, david_cameron=0.27, tony_abbott=0.26, 2015=0.24, benjamin_netanyahu=0.24, president=0.23, country=0.23, government=0.22, washington=0.22]
Lets look up for a person now, "rohit_sharma":
[mumbai_indians=0.45, skipper=0.4, ritika_sajdeh=0.37, captain=0.36, batsman=0.35, indian=0.34, lendl_simmons=0.34, kieron_pollard=0.33, parthiv_patel=0.31, ipl=0.29, runs=0.29, good=0.29, ambati_rayudu=0.29, mitchell_mcclenaghan=0.27, unmukt_chand=0.27]
Finding relations

What if I query for chief_minister and west_bengal and add both the vectors?
[mamata_banerjee=0.69, bjp=0.67, state=0.6]
It gives the correct result, Mamata Banerjee is the current Chief Minister of West Bengal. Note that now numbers don't represent the cosine distance.

What if we want to find out a relationship, instead of querying? Query for india and narendra_modi:
[prime_minister=0.5, make=0.42, government=0.4, country=0.4]
Querying mumbai_attack with charged gives a list of a few names of those who were involved/charged: