Jay Taylor's notes


LexVec, a word embedding model written in Go that outperforms word2vec | Hacker News

Original source (news.ycombinator.com)
Tags: golang go nlp lexvec word2vec news.ycombinator.com
Clipped on: 2016-07-27

LexVec, a word embedding model written in Go that outperforms word2vec (github.com)
69 points by atrudeau 5 hours ago | 19 comments


As pre-built word vectors go, ConceptNet Numberbatch [1], introduced less flippantly as the ConceptNet Vector Ensemble [2], already outperforms this on all the measures evaluated in its paper: Rare Words, MEN-3000, and WordSim-353.

This fact is hard to publicize because somehow the luminaries of the field decided that they didn't care about these evaluations anymore, back when RW performance was around 0.4. I have had reviewers dismiss as "incremental improvements" both raising Rare Words from 0.4 to 0.6 and raising MEN-3000 to match a high estimate of inter-annotator agreement.

It is possible to do much, much better than Google News skip-grams ("word2vec"), and one thing that helps get there is lexical knowledge of the kind that's in ConceptNet.

[1] https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch...

[2] https://blog.luminoso.com/2016/04/06/an-introduction-to-the-...

That said: LexVec gives quite good results on word-relatedness for using only distributional knowledge, and only from Wikipedia at that. Adding ConceptNet might give something that is more likely to be state-of-the-art.

...And just distributional knowledge makes it easy to train new models on domain-specific corpora, or new languages. Is it possible to do the same with ConceptNet?

I generally find that expert-derived ontologies suffer from poor coverage of low-frequency items and rigidly discrete relationships, and are usually limited to a single language. That said, they're vastly better than nothing for a lot of tasks (same goes for WordNet).

You can retrain your distributional knowledge and keep your lexical knowledge. Moving to a new domain shouldn't mean you have to forget everything about what words mean and hope you manage to learn it again.

The whole idea of Numberbatch is that a combination of distributional and lexical knowledge is much better than either one alone.
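One published way to combine distributional vectors with lexical links is retrofitting (Faruqui et al., 2015). The sketch below is a swapped-in illustration of that general idea on toy data, not Numberbatch's actual pipeline; the function name, the 2-d vectors, and the weights `alpha` and `beta` are assumptions made for the example.

```python
def retrofit(vectors, neighbors, iterations=10, alpha=1.0):
    """Pull each vector toward its lexical neighbors while keeping it
    anchored to its original (distributional) position.

    vectors:   dict word -> list of floats (the distributional vectors)
    neighbors: dict word -> list of lexically related words
    """
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, related in neighbors.items():
            related = [r for r in related if r in new]
            if word not in new or not related:
                continue
            beta = 1.0 / len(related)  # equal weight per neighbor (assumption)
            updated = []
            for d in range(len(new[word])):
                pulled = sum(beta * new[r][d] for r in related)
                # Weighted average of the original vector and the neighbors.
                updated.append((alpha * vectors[word][d] + pulled)
                               / (alpha + beta * len(related)))
            new[word] = updated
    return new


def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy vectors, invented for illustration only.
toy_vectors = {"car": [1.0, 0.0], "automobile": [0.0, 1.0], "banana": [-1.0, -1.0]}
toy_links = {"car": ["automobile"], "automobile": ["car"]}

fitted = retrofit(toy_vectors, toy_links)
before = distance(toy_vectors["car"], toy_vectors["automobile"])
after = distance(fitted["car"], fitted["automobile"])
```

Here `after` is smaller than `before`: the two linked words move toward each other, while "banana", which has no lexical links, keeps its original vector.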

BTW, ConceptNet is only partially expert-derived (much of it is crowd-sourced), aims not to be rigid like WordNet is, and is in a whole lot of languages.

"Retraining" ConceptNet itself is a bit of a chore, but you can do it. That is, you can get the source [1], add or remove sources of data, and rebuild it. Meanwhile, if you wanted to retrain word2vec's Google News skip-gram vectors, you would have to get a machine learning job at Google.

[1] https://github.com/commonsense/conceptnet5

Thanks for bringing these tools to my attention! Awesome stuff!

It feels weird how the names of word embedding models have come to refer to both the underlying model and the implementation. word2vec is the implementation of two models, the continuous bag-of-words and skip-gram models by Mikolov, while LexVec implements a version of the PPMI-weighted count matrix as referenced in the README file. But the papers also discuss implementation details of LexVec that have no bearing on the final accuracy. I feel like we should make more effort to keep the models and reference implementations separate.
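For readers unfamiliar with the PPMI matrix mentioned above, here is a minimal sketch of plain positive pointwise mutual information computed from co-occurrence counts. The function name and the toy counts are made up for illustration, and this is not LexVec's implementation, which per the comment above uses its own variant.

```python
import math
from collections import Counter


def ppmi(pair_counts):
    """Compute PPMI for each (word, context) pair.

    pair_counts: dict mapping (word, context) -> co-occurrence count.
    PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))).
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n

    scores = {}
    for (word, context), n in pair_counts.items():
        # P(w,c) / (P(w) P(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log((n * total) / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)
    return scores


# Toy counts, invented for illustration only.
counts = {
    ("ice", "cold"): 8, ("ice", "the"): 10,
    ("fire", "hot"): 8, ("fire", "the"): 10,
}
scores = ppmi(counts)
```

In this toy data, "cold" co-occurs with "ice" twice as often as chance would predict, so its score is log 2, while the uninformative context "the" scores 0 for both words.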

Aren't skip-grams equivalent to NMF of the PPMI matrix?


If anyone else is wondering what the heck "word embedding" means, it's a natural language processing technique.

Here's a nice blog post about it: http://sebastianruder.com/word-embeddings-1/

It represents words as vectors, so you can do arithmetic on meanings, something like this: king - man + woman ≈ queen
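A minimal sketch of how that analogy is usually evaluated: add and subtract the vectors, then pick the nearest remaining word by cosine similarity. The 3-d vectors and the `analogy` helper are invented for illustration; real embeddings have hundreds of dimensions and the match is only approximate.

```python
import math

# Toy vectors, made up for this example only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman", vectors)  # "queen" for these toy vectors
```

Excluding the three input words from the candidates is the usual convention; without it the nearest vector is often one of the inputs.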


Excellent blog post. Thanks for the link, I was going to ask about it.

Has anyone done any work on handling words that have overloaded meanings? Something like 'lead' has two really distinct uses. It's really multiple words that happen to be spelt the same.

Google "word sense induction" or "word sense disambiguation". Intuitively, distributional information of the same sort that is used to derive representations for different word types in W2V or LexVec is useful for distinguishing word senses. Two (noun) senses of lead, two senses of bat, etc. are pretty easy to distinguish on the basis of a bag of words (or syntactic features) around them. Other words are polysemous: they have multiple related senses (across the language, names for materials can be used for containers, and animal names for the corresponding food, but with exceptions). For some high-frequency words it's a crazy gradient combination of polysemy and homonymy: 'home' can refer to 1) a place someone lives, 2) the corresponding physical structure, 3) where something resides (a more 'metaphorical' sense), among other things. Obviously an individual use of a word has a gradient relationship to these senses, and speakers differ regarding what they think the substructure is (polysemous or homonymous, hierarchical or not, etc.). I've been working in my PhD on a technique to figure this out, but people clearly use a lot of information that isn't available in language corpora alone (e.g. intuitive physics).
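As a rough illustration of the bag-of-words point above, here is a toy disambiguator that picks the sense whose signature words overlap most with the surrounding context (in the spirit of the simplified Lesk approach). The sense labels and signature sets are hand-written assumptions for the example; a real system would learn them from a corpus.

```python
# Hand-written signature words per sense, invented for illustration only.
SENSE_SIGNATURES = {
    "lead/metal": {"pipe", "paint", "toxic", "metal", "poisoning", "heavy"},
    "lead/leash": {"dog", "walk", "collar", "leash", "pull"},
}


def disambiguate(context_words, signatures):
    """Return the sense whose signature shares the most words with the context.

    Ties (including zero overlap) fall back to the first sense listed.
    """
    context = {w.lower() for w in context_words}
    return max(signatures, key=lambda sense: len(context & signatures[sense]))


metal = disambiguate("the old paint contained toxic lead".split(), SENSE_SIGNATURES)
leash = disambiguate("keep the dog on a short lead during the walk".split(), SENSE_SIGNATURES)
```

This only handles cleanly separated homonyms. The gradient, polysemous cases described above do not reduce to picking one label from a fixed list.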

It's tricky because we don't have good ground truth on what different word senses there are. (WordNet is not the final answer, especially as it separates every metaphorical use of the same word into its own sense.)

My experience is that you can distinguish word senses, but it seems the data isn't good enough to improve anything but a task that specifically evaluates that same vocabulary of word senses.

I see a sibling comment with a link to spaCy's sense2vec, which uses the coarsest possible senses -- one sense for nouns, one sense for verbs, one sense for proper nouns, etc. It's a start.

Well, there is Sense2Vec: https://github.com/spacy-io/sense2vec

Sense2Vec can solve this one, but what if both meanings of the word have the same POS tag?

Reminds me of Chord[1], word2vec written in Chapel

[1] https://github.com/briangu/chord

Are there IP considerations? Word2vec is patented.

System and method for generating a relationship network - K Franks, CA Myers, RM Podowski - US Patent 7,987,191, 2011 - http://www.google.com/patents/US7987191

Would this really be usable in court? It seems super general to me, using a lot of common techniques. Silly question, is it infringement to use any part of the patent?

It is only infringement if you do something matching every part of some claim. There may be lots of stuff in the description, and that doesn't matter. That is, if a claim is a system comprising A, B, C, and D, and you do just A, B, and C, then you're fine.
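The rule described above amounts to a subset check: a claim is matched only if every one of its elements is present in what you do. A small sketch, with the element names A through E as placeholders and the function name made up for the example:

```python
def matches_claim(claim_elements, practiced_elements):
    """True only if every element of the claim is present in what is practiced."""
    return set(claim_elements) <= set(practiced_elements)


claim = {"A", "B", "C", "D"}

missing_one = matches_claim(claim, {"A", "B", "C"})            # False: D is absent
all_plus_extra = matches_claim(claim, {"A", "B", "C", "D", "E"})  # True: extras don't help
```

Doing more than the claim lists does not avoid a match; only leaving out at least one claimed element does.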
