Monrai Blog

News about Cypher, Semantic Web, Natural Language Processing, and Computational Linguistics

Saturday, August 05, 2006

Google's N-gram Set

eBiquity's semantic web blog has reported Google's announcement that it will share a gigantic n-gram dataset generated from a corpus of one trillion words from Web pages. Google found 1.1B five-word sequences that appear at least 40 times and 13.8M words that appear at least 200 times. The dataset will be distributed by the Linguistic Data Consortium.
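The thresholds mentioned above are just frequency cutoffs over n-gram counts. As a rough illustration of what generating such a dataset involves (the function name and toy corpus here are my own, not Google's), counting 5-grams and discarding rare ones can be sketched as:

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=2):
    """Count all n-grams in a token sequence, keeping only those
    that appear at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}

tokens = "the cat sat on the mat the cat sat on the rug".split()
# The 5-gram "the cat sat on the" occurs twice, so it survives the cutoff.
print(ngram_counts(tokens, n=5, min_count=2))
```

Google's pipeline of course had to do this at web scale (a trillion tokens, with a cutoff of 40 for five-word sequences), which means distributed counting rather than an in-memory dictionary, but the underlying idea is the same.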

Google describes its motivation as follows:

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.

I wonder whether the dataset will be annotated with linguistic information such as part of speech. It sounds promising for semantic web data mining, and will hopefully provide another corpus for Cypher and NLP work.

