05 May 2015

Word vectors using LSA, Part - 2

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. [1] More about LSA can be found here and here. LSA uses Singular Value Decomposition (SVD), a matrix factorization method. For a given matrix A,

SVD (A) = U*S*VT

In the current scenario matrix A is a term-document matrix (m terms * n documents). Visually SVD looks like this:

Unlike word2vec, LSA does not require any training. But it suffers from curse of dimensionality because SVD calculations get slower and slower as we increase the number of documents, i.e. size of matrix A. On a single machine it can take hours. The overall cost of calculating SVD is O(mn2) Flops. This means if we had m =100,000 unique words with n = 80,000 documents, it would require 6.4 x 1014 Flops or 640,000 GFlops. At stock clock speed (4.0 GHz) my AMD FX-8350 gives around 40 GFlops. So it will take around 640,000/40 = 16,000 Seconds which is around 4 hours 30 minutes. [2]
In my previous post I had used 1.7 million sentences and 44 million words for training word2vec, i.e. if we run SVD on this large matrix, it might end up taking centuries on my machine. However SVD calculations on large matrices can be done using a large cluster of Spark. [3] [4]
Results

I kept the document size constant at 2500 and let the term size vary. In order to rank the terms in relation to query term I used cosine distance. This time along with named entities I also added the noun phrases. The data is the news articles from yesterday (4th May, 2015). Here is the vector for the first query "delhi":
[law_minister=0.34, jitender_singh_tomar=0.23, chief_minister=0.22, fake=0.21, arvind_kejriwal=0.21, degree=0.18, protest=0.16, law_degree=0.16, win=0.15, congress=0.15, aam_aadmi_party=0.14, incident=0.14]
Notice that the vector contains terms like chief_minister and law_degree which are not named entities.
Query for "chief_minister":
[arvind_kejriwal=0.34, parkash_singh_badal=0.29, today=0.28, delhi=0.28, mamata_banerjee=0.27, office=0.26, state=0.26, people=0.26, act=0.25, mufti_mohammad_sayeed=0.25, bjp=0.25, jammu_and_kashmir=0.25, governor=0.24]
The vector gives the name of all the chief ministers which were in the news recently. Same goes for the query "prime_minster":
[shinzo_abe=0.37, japanese=0.33, sushil_koirala=0.27, david_cameron=0.27, tony_abbott=0.26, 2015=0.24, benjamin_netanyahu=0.24, president=0.23, country=0.23, government=0.22, washington=0.22]
Lets look up for a person now, "rohit_sharma":
[mumbai_indians=0.45, skipper=0.4, ritika_sajdeh=0.37, captain=0.36, batsman=0.35, indian=0.34, lendl_simmons=0.34, kieron_pollard=0.33, parthiv_patel=0.31, ipl=0.29, runs=0.29, good=0.29, ambati_rayudu=0.29, mitchell_mcclenaghan=0.27, unmukt_chand=0.27]
Finding relations

What if I query for chief_minister and west_bengal and add both the vectors?
[mamata_banerjee=0.69, bjp=0.67, state=0.6]
It gives the correct result, Mamata Banerjee is the current Chief Minister of West Bengal. Note that now numbers don't represent the cosine distance.

What if we want to find out a relationship, instead of querying? Query for india and narendra_modi:
[prime_minister=0.5, make=0.42, government=0.4, country=0.4]
Querying mumbai_attack with charged gives a list of a few names of those who were involved/charged:
[people=1.14, left=1.08, november=1.05, dead=1.05, 166=1.05, executing=1.04, planning=1.04, 2008=1.02, hamad_amin_sadiq=1.0, shahid_jameel_riaz=1.0, mazhar_iqbal=1.0, jamil_ahmed=1.0, younis_anjum=0.94, abdul_wajid=0.94, zaki-ur_rehman_lakhvi=0.62]
Although above results look good, they are not always accurate, for example, query for captain and royal_challengers_bangalore does not return virat_kohli as the first result:
[ipl=0.67, rcb=0.66, match=0.64, kolkata_knight_riders=0.6, virat_kohli=0.57]
I guess more data from different time periods can help in establishing concrete relationships.

Word vectors obtained from LSA can be useful in expanding the search queries, guessing the relationships (as shown above), generating similarity based recommendations and many other tasks related to text.
I wrote a one file implementation of LSA in Java (its buggy and design patterns free!), it uses jBLAS for SVD and other matrix operations, code can be found at github.
A couple of more links to understand LSA through examples:

24 March 2015

Word vectors (word2vec) on named entities and phrases - I

word2vec is a C lib to compute the vector representation of a given word (or a phrase). It was released by a few Googlers and being maintained at word2vec. A couple of nice articles on what word2vec is capable of (roughly):
Word vectors can boost performance of many ML and NLP applications, for example, sentiment analysis, recommendations, chat threading etc.
I used deeplearning4j's implementation of word2vec. The example given on that page does not work with the latest release of dl4j (at present 0.0.3.3) , working example can be found here. I ended up using Stanford's CoreNLP for named entity recognition, OpenNLP works fine too.

The training was done on the recent news data gathered from various sources, articles were split into sentences (using OpenNLP), duplicate and short sentences were removed.  The size of the corpus was around 300MB containing 1.7 million sentences and 44 million words. The training took almost 36 Hours with 3 iteration and a layer size of 200. Lets start with simple examples:

For the word water I get the following word vector (limited to 21 words):
[groundwater, vapor, heater, pollutant, rainwater, dioxide, wastewater, sewage, potable, moisture, seawater, methane, nitrogen, vegetation, vapour, oxide, reservoir, hydrogen, plume, monoxide, sediment]
We can see that almost all the words are used in the context of water, but this is limited to the trained corpus. With different corpus you'll get different set of results. Lets look at something which was there in the news recently, e.g., the term plane:
[mh370, c17, crashland, skidd, 777, takeoff, qz8501, transasia, malaysia_airline, aircraft, globemaster, turboprop, cockpit, laguardia_airport, 1086, locator, singleengine, atr, solarpower, midair]
Except 1086 and atr, every other word (or phrase) in the vector makes sense, but if you search for 1086 and atr, you'll find that 1086 was a Delta Air Lines Flight which crashed recently and ATR is an aircraft manufacturer company. Lets look for an entity (specially phrase) vector, for example Leslee Udwin was in the news recently:
[mukesh_singh, gangrape, nirbhaya, storyville, documentary, rapist, tihar, andrew_jarecki, citizenfour, telecast, bar_council_of_india, udwin, laura_poitra, filmmaker, jinx, bci, bbc, derogatory, chai_j, leslie_udwin, hansal_mehta, bbc_storyville]
You can relate most of the words/phrases in the vector to Leslee Udwin or her documentary India's Daughter. Other words in the vector are either the names of the documentaries or the documentary makers, for example, The Jinx is an HBO documentary mini-series directed by Andrew Jarecki.

dl4j library also provides the vector addition and subtraction mechanism, for subtraction code is as follows:

```List<String>  p = new ArrayList<>(), n = new ArrayList<>();
p.add("imitation");
n.add("oscar");
vec.wordsNearest(p, n, 20);
```

Here is how the subtraction works, vector for imitation:
[grand_budapest_hotel, michael_keaton, screenplay, birdman, boyhood, eddie_redmayne, whiplash, benedict_cumberbatch, felicity_jone, jk_simmon, richard_linklater, julianne_moore, j_k_simmon, wes_anderson, patricia_arquette, edward_norton, graham_moore, alejandro_gonzalez_inarritu, stephen_hawk, alejandro_g_inarritu, alexandre_desplat]
It contains many Oscars entries and related terms, so subtracting the vector of term oscar should remove all those entries and give us something related to The Imitation Game:
[changer, throne, lllp, alan_tur, chris_kyle, oneindia, lilih620150308, sniper, rarerbeware, grand_budapest_hotel, benedict_cumberbatch, mockingjay, iseven, cable_news_network, extractable, theory, watchapple, enigma, codebreaker, washington_posttv, mathematician]
This vector is not a very good representation of the movie The Imitation Game, there is a lot of noise. This is because of the poor and small training data. But we see a few terms in the vector which are related to the movie, e.g.,
alan_tur*, benedict_cumberbatch, enigma, theory, mathematician
ing was removed by the tokenizer

I have trained the data on entities for now (by replacing the space with underscore), I am planning to train it on general phrases as well, like Member of Parliament should be combined into a single term member_of_parliament. Will publish the results in the second part. Next I want to compare it with Brown Clustering, it is also used for the similar purpose.

28 February 2015

Side Projects

I started my professional career in July 2008, fresh out of college I was really excited about working on real projects. After two months of training I was assigned to the Call Handling (IVR) team, a .Net based project. Soon I realized that the project did not require much coding, whole day went into writing test cases and IVR workflows in XML. This motivated me to work on the following side projects:

Indews.in (2008)
Soon after 2008 Mumbai attacks, I decided to create my own news aggregation site for India, hence the name Indews (Indian News). I was not happy how Google News clubbed the news and sometimes ended up showing stale information as the main headline. I implemented some part of the site, wrote a very basic crawler in C#, hosted the site on a local IIS server at home. It had a really bad interface and did not survive more than 3 months. I had lost all my interest in a general news aggregator, now I was mainly interested in the tech/programming news.

9AM.in (2009-2010)
After shutting down indews.in, I started a project called 9am under http://www.natmac.org/9am/. I got bored with ASP.Net technologies, there was a lot of abstraction and many times I did not understand how things worked underneath, for example ajax implementation of ASP.Net. It was all magic and lots of dlls. And there were a very few open source projects in C#. So decided to move away from .Net and migrated all my code to Java. Having worked on Java and J2EE in college projects, it was easy.
9am was again an RSS/ATOM aggregator, but this time I was crawling the whole web for the tech related stuff and some news sites for general news (kept general news anyway from indews!). You can find an internet archive snapshot here.
9am had a many features, like: finding top keywords, grouping similar items, inbuilt search, categorizing a feed item into one of these categories:
DBs, UI, .Net, S/W Engg, Languages, Mobile, Java, XML, OS
Categorization was based on bag of words and worked fairly well. It was also hosted at my home computer using a static IP and had a 384 Kbps network connection. Crawler used to crawl a few thousand URLs everyday based on a Revisit Policy. The database had around 60,000 feed (RSS/ATOM) URLs and everyday it used to discover new ones. Some of the website owners got pissed with the crawling and asked me to remove the URLs. Since everything was automated, I had no control over URL discovery.
9am was really fun, it used to discover really good articles on the web everyday and I always had something amazing to read in my office. Following is an internet archive screenshot of the Language category under TECH tab:

The whole setup had many issues, day time power cuts, internet outage, slow internet, slow machine, poor MySQL full text search. Despite of all, it used to get ~5000 visits/day from Google.
When I was moving to another city, I had to shut it down. For unknown reasons it remained that way forever. The crawler code can be found at https://code.google.com/p/crowl/ and https://github.com/vikasing/crowl

Mozvo.com (2011-2013)
Mozvo analyzed the sentiments of tweets, reviews and blogs to create a Mozvo score for a movie. It had many other cool features like: movie recommendations, actor profiles, friends' tweets about a movie, movie explorer based on many attributes etc. This was the most ambitious side project I ever did. It also involved two more guys from the same company I was working at. We worked after office, almost everyday, initially it felt like it might end up evolving in a startup. I mainly worked on the back-end part of it, which had MongoDB as its database and a data layer written in Java. It was fun building the core parts. I ended up learning lots of new stuff.

We kept on adding many features without asking our users whether they really wanted them or not. It was like a playground for us, whatever we (or any one of us) thought was cool, we ended up implementing that ignoring the outcome. We did not analyze whether any feature was helping us in retaining the users. Google brought all the traffic and that was not really enough, ~200 visits/day. Gradually we lost our interest and in April 2013 we altogether stopped working on it. It is still alive at mozvo.com but in a dormant state.

GizmoAge (2012)
This was an Android app built on top of PhoneGap, main aim was to collect latest gadget news and group it to remove ambiguity.  The first version of the app was ready to use and did look much better than many apps in the Play Store. I published the app in Play Store, but removed it after a couple of months, don't remember why :), I guess there were some server issues.

Cryptocurrency Mining (2014)
This was my first hardware hacking project. I ended up investing around \$1000 in this project, bought 2 top end graphics cards, a 850 watt SMPS and lots of hacky stuff like PCI risers, power buttons from Hong Kong etc.
The rig mined all the popular alt coin currencies at that time from Dogcoin to Coinocoin. I also did some trading at various exchanges. After three months of mining all the fun was gone so I stopped my rig and decided to sell the hardware. But before that following happened:
I had to RMA one of the graphics card and the motherboard short circuited (no RMA). Also lost around 0.1 bitcoin in trading. Sometime later I sold my 0.42 bitcoin and stopped the crypto currency madness all together.
Nevertheless it was fun, got to learn many things about crypto currencies eg. bitcoin, blockchain, ASIC, primecoin, mintcoin, CPU only coins, and there were these crazy ideas of coin drops, also country specific coins like Auroracoin for Iceland.

Others
There were some other small projects here and there:
• Crowl (2009- ): The web crawler which powered 9am, still working on it, its much more powerful now.
• NiceText (2012-): This is a very small library I wrote to extract the text from a webpage. Other libs boilerpipe and readability port did not work that well on many pages. This is a part of crowl project. I wrote a post about it and here is the github link.
• jaLSA (2014-): A lib I wrote to do Latent Semantic Analysis. It was needed for a project I was working on in my previous company.
• velocityplus (2011): An eclipse plugin for Apache velocity templating engine. Worked on it when I was working in my first company. It is unfinished, got really bored while developing it.
• Fing.in (2012): A Bollywood news portal, it was supposed to be a sub project of mozvo.com. Finished it locally but did not publish it anywhere.
• Paltan.org (2008): I briefly worked on creating a social networking website for my college group, it was based on Wordpress based Buddypress. But the Buddypress itself was in beta development and lacked many obvious features. It was all PHP, lost my interest very soon, did not go anywhere, shut it down after sometime.
• letsj.com (2012): An aggregator for Java related articles. Intention was to use Lucene as a database as well as indexing engine, got into many issues, abandoned.