Would Word2Vec get into grad school? Evaluating Word2Vec against the Miller Analogies Test

So, I’m using word2vec in my dissertation. As I was writing today about how word embedding models are surprisingly accurate in their modeling of complex semantic relationships like analogy, I realized I had very little data to support this claim. Sure, we know from the original GloVe paper (Table 2) that a word2vec model trained on a large English-language corpus can accurately solve about 66% of the analogies in a test dataset of 19,544 analogies, and GloVe about 75%. But the analogies in this dataset are trivial: they consist of relationships like capital city to nation, city to state, currency to nation, and familial relationships like “brother is to sister as son is to daughter,” as well as a bunch of syntactic relationships, like adjectives (“glad”) to their adverbial counterparts (“gladly”).

A far more interesting and informative set of analogies would be those given to humans, like the analogies in the Miller Analogies Test given to grad-school-bound college students. The analogies in this test are quite diverse, testing logical and analytical reasoning through analogies between words, as well as between proper nouns for places and key figures of Western culture.1 Luckily, I was able to find an informational booklet for the 2002-2003 MAT that lists how many of its 100 analogies each percentile of students accurately solved. I was also able to scrape 150 sample analogies from an admittedly shady website that claims to have real MAT analogies from sample tests.

So, what would happen if a word2vec model took the MAT? I used a methodology similar to that of the GloVe paper: for any analogy of the form “A is to B as C is to what?”, the model chose the word from the four multiple-choice options that had the highest cosine similarity with V(C) + V(B) – V(A).2 The only difference from the GloVe methodology is that the model chooses among only the four multiple-choice options; but since the human test-takers have the same advantage, this makes for a fairer comparison.
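For illustration, here’s roughly what that scoring rule looks like in gensim (a minimal sketch, not the actual miller.py linked below; the example answer options are invented):

import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors in word2vec binary format
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, b, c, options):
    # "a is to b as c is to ?": pick the option closest to V(c) + V(b) - V(a)
    target = model[c] + model[b] - model[a]
    return max(options, key=lambda opt: cosine(target, model[opt]))

# e.g. solve_analogy("man", "woman", "king", ["queen", "prince", "duke", "jester"])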

Here’s the result, using the pre-trained word2vec model that Google provides, trained on a Google News corpus of about 100 billion words:

############################################
Model: GoogleNews-vectors-negative300.bin.gz
# of analogies attempted: 145
# correctly solved: 94
% correctly solved: 64.83%
############################################

At 2.59 times better than chance (which would be 25% correct, given four options), this is already heartening. But the comparison with human test-takers is even more striking. According to the informational booklet for the 2002-2003 MAT (Appendix D, p. 48) mentioned above, scoring ~65% of the analogies correctly placed you in the 85th percentile of all the grad-school-bound college students taking the test that year. So, only about 15% of students were more capable analogy-solvers than a word2vec model.

This is, to me, really remarkable. Word2Vec is going to grad school! And now I have something a bit more meaningful to cite in my dissertation with respect to how semantically accurate a word2vec model can be. In any case: a fun bout of procrastination this was! Now I’d better get back to writing…

Appendix

Notes on Method

For bigram components in the analogies, if the bigram itself was not present in the model, I used the average of the vectors of the words in the bigram [hat tip to Ben Schmidt].
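Concretely, that fallback looks something like this (again a sketch of the approach rather than the exact code; it assumes the underscore-joined phrases of the Google News vocabulary and gensim’s KeyedVectors lookup):

import numpy as np

def vector_for(model, term):
    # Try the bigram itself first, joined the way Google News phrases are stored
    joined = term.replace(" ", "_")
    if joined in model:
        return model[joined]
    # Otherwise average the vectors of the component words the model does know
    known = [w for w in term.split() if w in model]
    if not known:
        return None   # term entirely unknown: the analogy gets discarded
    return np.mean([model[w] for w in known], axis=0)

Returning None for a completely unknown term corresponds to the discard rule described in the next paragraph.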

The model discarded five of the 150 analogies because at least one of the seven words involved (the three words in the analogy and the four in the possible answers) was not present in the model. The missing words were “aesthete” and a bunch of numbers (19, 17, 37, 39, 36, 34, 31, 360, 540, 0.75, 0.125, ⅑, 1/16, 1/20, 1916, 1815). Of these, it’s only “aesthete” that I feel the model should have known; numerical analogies seem like an unfair test for a language-based model. So, if you were to argue that the model should simply have gotten the “aesthete” question wrong, since, like a student, it didn’t know the meaning of that word, then the model would instead have scored 94/146 (64.38%), which is still in the 85th percentile of students.

Note: This post was updated on Friday, 28 Aug 2017. I implemented the bigram fix as described above; fixed some spelling problems in the original tests (it’s “Devanagari”, not “Devangiri”); and Americanized “plough” to “plow.”

Data and Code

The analogies I scraped from MajorTests.com are here in an Excel file, in case anyone would like to replicate my experiment on a different word2vec model, or using GloVe or another kind of semantic model. The folder of the original scraped pages is here.

The above analogies as solved by the Google News word2vec model are here.

The code I used is here. You’ll need to install the Python modules gensim, xlrd, and xlwt. You can run it like so:

python miller.py my_word2vec_model_in_binary_format.bin


  1. “Unlike analogies found on past editions of the GRE and the SAT, the MAT’s analogies demand a broad knowledge of Western culture, testing subjects such as science, music, literature, philosophy, mathematics, art, and history. Thus, exemplary success on the MAT requires more than a nuanced and cultivated vocabulary” (Wikipedia).
  2. So, for instance, “man is to woman as king is to what” is translated to V(king) + V(woman) – V(man).