09 September 2013

Keyword extraction in Java

Around two months back I started working on a Java library: NiceText, to isolate data mining stuff from Crowl (another weekend project!) and make it an independent lib. I already implemented a text extractor for web pages (hence the name NiceText), which I observed sometimes performs faster and better than boilerpipe library.

I also wanted to extract keywords (or keyphrases) from text. I considered widely used algo TF-IDF for this purpose. Since keyphrases can contain more than one word, I also considered n-grams (mono, bi and tri), e.g., consider the following text:

Astronomers have gotten the first-ever peek at our solar system's tail, called the heliotail, finding that it's shaped like a four-leaf clover, NASA scientists announced.
The discovery was made using NASA's Interstellar Boundary Explorer (IBEX), a coffee-table-sized spacecraft that is studying the edge of the solar system.
'We always drew pictures where the tail of the solar system just trailed off the page.'
- David McComas, IBEX principal investigator
"Many models have suggested the heliotail might look like this or like that, but we have had no observations," David McComas, IBEX principal investigator at Southwest Research Institute in San Antonio, Tex., said in a statement. "We always drew pictures where the tail of the solar system just trailed off the page, since we couldn't even speculate about what it really looked like." [Images: NASA's IBEX Sees Our Solar System's Tail]
The heliotail is "a much larger structure with a much more interesting configuration" than scientists had previously predicted, McComas added during a news conference announcing the finding.
The heliotail is inflated by the solar wind of particles streaming off the sun, and the four-leaf clover shape is the result of fast solar wind shooting out near the sun's poles and slower wind flowing from near the sun's equator, researchers say. The finding is based on the first three years of IBEX's measurements of energetic neutral atoms.
In the interstellar boundary region, charged particles from the sun stream outward far beyond the planets toward the gas- and dust-filled space between stars. Collisions between these particles and interstellar material create fast-moving particles with no charge, known as energetic neutral atoms, or ENAs. Some of these particles speed inward toward the sun, where IBEX can detect them from its perch 200,000 miles above Earth.
"Scientists have always presumed that the heliosphere had a tail," Eric Christian, IBEX mission scientist at Goddard Space Flight Center in Greenbelt, Md., said during a Google+ Hangout announcing the finds.  "But this is actually the first real data that we have to give us the shape of the tail."
Though IBEX data has given scientists an idea of the shape and structure of the heliotail, they say they have not been able to measure its length particularly well. They think it is probably evaporating over something like the 1000 times the distance between the Earth and the sun, McComas said.
The $169 million IBEX spacecraft, launched in 2008, was built for an initial two-year mission, which has since been extended.  Early on in its mission, IBEX detected ENAs flowing toward the sun in an unexpected pattern: They were significantly enhanced in a mysterious ribbon on the edge of the solar system that scientists now think is a reflection of the solar wind, shot back toward the sun by a strong galactic magnetic field.
IBEX has made several other important discoveries throughout its mission. In 2010, the spacecraft turned its gaze back toward Earth and got the first-ever peek at the solar wind crashing into the planet's magnetosphere. Last year, NASA announced that the spacecraft made its first detection of matter from outside the solar system, finding alien particles of hydrogen, oxygen and neon in the interstellar wind.

The extracted keywords are (arranged in decreasing order of tf-idf score):
ibex, solar system, heliotail, interstellar, mccomas, nasa, solar wind, spacecraft, particles, tail, toward sun, shape
As you can observe keyphrases "solar wind" and "solar system" has a common word "solar" but it does not appear as a keyword, because the program takes care of this ambiguity by comparing monograms with bigrams and then bigrams with trigrams to check if certain words appear together at the same number of time. For example: if the word "solar" appeared 10 times with the word "system" and they individually occurred exactly 10 times each, then both words are merged to form a keyphrase "solar system".

The tf can be calculated in many ways, but in the above case it is calculate by the below formula (it worked better than others!):
$$tf = log(TermFrequency+1)$$
Another way to calculate the tf is (it is commented out, you can uncomment and comment the above one, if you decide to use the below one):
$$tf = {\alpha + (1-\alpha)*TermFrequency \over \max\{f(w,d): w \in d\}} $$
Above method somewhat takes care of the bias created by document length. You can read more about it here.

As we know tf-idf depends on multiple documents, more number of documents leads to a better accuracy. Above results are based on only 40+ documents present @ https://github.com/vikasing/NiceText/tree/master/data

CODE:

Code for extracting the keyphrases from the text is present in https://github.com/vikasing/NiceText/blob/master/src/com/vikasing/nicetext/TfIdf.java

I would suggest to take (or fork it to play with it) the whole project to avoid any dependency issues. This is still a WIP, I'll release a lib (a jar) soon after adding more features and fixing some bugs.