24 March 2015

Word vectors (word2vec) on named entities and phrases - I

word2vec is a C library for computing vector representations of words (or phrases). It was released by a few Googlers and is maintained at the word2vec project page. A couple of nice articles cover (roughly) what word2vec is capable of: word vectors can boost the performance of many ML and NLP applications, for example sentiment analysis, recommendations, chat threading, etc.
I used deeplearning4j's implementation of word2vec. The example given on that page does not work with the latest release of dl4j (0.0.3.3 at the time of writing); a working example can be found here. I ended up using Stanford's CoreNLP for named entity recognition; OpenNLP works fine too.
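The NER step is what lets multi-word entities be collapsed into single tokens before training (more on that at the end of this post). A simplified sketch of such a merging step with CoreNLP is below; the annotator list, lowercasing, and class/method names here are illustrative choices on my part, not the exact pipeline I used:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class EntityMerger {

    private final StanfordCoreNLP pipeline;

    public EntityMerger() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        pipeline = new StanfordCoreNLP(props);
    }

    // Joins consecutive tokens that share a named-entity tag into one
    // underscore-separated token, e.g. "Leslee Udwin" -> "leslee_udwin".
    // (This naive check will also merge adjacent but distinct entities of
    // the same type; good enough for a sketch.)
    public String mergeEntities(String text) {
        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        StringBuilder out = new StringBuilder();
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            StringBuilder sb = new StringBuilder();
            String prevTag = "O";
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class).toLowerCase();
                String tag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                // glue this token to the previous one if both belong to the same entity
                boolean sameEntity = !"O".equals(tag) && tag.equals(prevTag);
                if (sb.length() > 0) {
                    sb.append(sameEntity ? "_" : " ");
                }
                sb.append(word);
                prevTag = tag;
            }
            out.append(sb).append('\n');
        }
        return out.toString();
    }
}

Consecutive tokens with the same entity tag, e.g. Leslee followed by Udwin, end up as a single token, which is what makes the phrase lookups shown later possible.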

The training was done on recent news data gathered from various sources; articles were split into sentences (using OpenNLP), and duplicate and short sentences were removed. The corpus was around 300 MB, containing 1.7 million sentences and 44 million words. Training took almost 36 hours with 3 iterations and a layer size of 200.
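For reference, a training setup along these lines can be sketched as below; the iteration count and layer size are the ones just mentioned, while the corpus path, window size, and minimum word frequency are placeholders, and the exact builder API varies between dl4j releases:

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;
import java.util.Collection;

public class TrainNewsVectors {
    public static void main(String[] args) throws Exception {
        // one pre-processed sentence per line; the path is a placeholder
        SentenceIterator iter = new LineSentenceIterator(new File("news-sentences.txt"));
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        Word2Vec vec = new Word2Vec.Builder()
                .iterations(3)          // 3 passes over the corpus
                .layerSize(200)         // 200-dimensional vectors
                .windowSize(5)          // placeholder context window
                .minWordFrequency(5)    // placeholder frequency cut-off
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // query the model once it has been trained
        Collection<String> nearest = vec.wordsNearest("water", 21);
        System.out.println(nearest);
    }
}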

Let's start with simple examples. For the word water I get the following nearest words (limited to 21):
[groundwater, vapor, heater, pollutant, rainwater, dioxide, wastewater, sewage, potable, moisture, seawater, methane, nitrogen, vegetation, vapour, oxide, reservoir, hydrogen, plume, monoxide, sediment]
We can see that almost all the words are used in the context of water, though this is limited to the training corpus; with a different corpus you'll get a different set of results. Let's look at something that was in the news recently, e.g., the term plane:
[mh370, c17, crashland, skidd, 777, takeoff, qz8501, transasia, malaysia_airline, aircraft, globemaster, turboprop, cockpit, laguardia_airport, 1086, locator, singleengine, atr, solarpower, midair]
Except for 1086 and atr, every word (or phrase) in the list makes sense immediately, but if you search for those two you'll find that 1086 was a Delta Air Lines flight that crash-landed recently and ATR is an aircraft manufacturer. Let's look at the nearest words for an entity (specifically, a phrase); for example, Leslee Udwin was in the news recently:
[mukesh_singh, gangrape, nirbhaya, storyville, documentary, rapist, tihar, andrew_jarecki, citizenfour, telecast, bar_council_of_india, udwin, laura_poitra, filmmaker, jinx, bci, bbc, derogatory, chai_j, leslie_udwin, hansal_mehta, bbc_storyville]
You can relate most of the words/phrases in the list to Leslee Udwin or her documentary India's Daughter. The other entries are names of documentaries or documentary makers; for example, The Jinx is an HBO documentary miniseries directed by Andrew Jarecki.
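Since merged entities are just tokens with underscores in them, querying them works exactly like querying a single word; a quick sketch (the token obviously has to exist in your corpus's vocabulary):

// phrase entities are ordinary tokens, so the lookup is the same
if (vec.hasWord("leslee_udwin")) {
    Collection<String> similar = vec.wordsNearest("leslee_udwin", 21);
    System.out.println(similar);
}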

The dl4j library also provides vector addition and subtraction; the code for subtraction is as follows:

// positive terms are added, negative terms are subtracted
List<String> p = new ArrayList<>(), n = new ArrayList<>();
p.add("imitation");
n.add("oscar");
// top 20 words nearest to (imitation - oscar)
Collection<String> result = vec.wordsNearest(p, n, 20);

Here is how the subtraction works. The nearest words for imitation are:
[grand_budapest_hotel, michael_keaton, screenplay, birdman, boyhood, eddie_redmayne, whiplash, benedict_cumberbatch, felicity_jone, jk_simmon, richard_linklater, julianne_moore, j_k_simmon, wes_anderson, patricia_arquette, edward_norton, graham_moore, alejandro_gonzalez_inarritu, stephen_hawk, alejandro_g_inarritu, alexandre_desplat]
It contains many Oscar-related entries and terms, so subtracting the vector of the term oscar should remove those and give us something more specific to The Imitation Game:
[changer, throne, lllp, alan_tur, chris_kyle, oneindia, lilih620150308, sniper, rarerbeware, grand_budapest_hotel, benedict_cumberbatch, mockingjay, iseven, cable_news_network, extractable, theory, watchapple, enigma, codebreaker, washington_posttv, mathematician]
This result is not a very good representation of the movie The Imitation Game; there is a lot of noise, which comes down to the small and noisy training data. But we do see a few terms related to the movie, e.g.,
alan_tur*, benedict_cumberbatch, enigma, theory, mathematician
(* the ing in alan_turing was removed by the tokenizer)
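Given how noisy the subtraction result is on a corpus this small, another quick sanity check is the pairwise cosine similarity that dl4j exposes; a sketch (the scores themselves will depend entirely on the corpus):

// cosine similarity between two tokens; related pairs should score higher
double related = vec.similarity("imitation", "enigma");
double unrelated = vec.similarity("imitation", "water");
System.out.println(related + " vs " + unrelated);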

I have trained on named entities for now (by replacing spaces with underscores); I plan to train on general phrases as well, so that, for example, Member of Parliament is combined into the single term member_of_parliament. I will publish those results in the second part. I also want to compare this with Brown clustering, which is used for a similar purpose.