From Fields to Vectors

The third episode of Word Vectors in the Eighteenth Century, originally published on 10 Sep 2016.

0. Preface

Unlike the previous two posts in the series, this post is presented in slideshow form. It's an attempt to rein in my hobby-horse of verbosity; to accommodate our shrinking attention spans; and (more somberly) to experiment with DH-inspired forms of visual rhetoric.

In any case: ahoy! ...

1. Introduction

Last time on Word Vectors in the Eighteenth Century, we looked at how word vectors work under the hood. And before that, we close-read how word vectors might model Edward Young's influential analogy that "riches are to virtue as learning is to genius."

But how can we distant-read word vectors? Surprisingly, this is not an easy question. Unlike topic modeling and other unsupervised methods, it's not immediately clear how to use word vectors for large-scale text analysis. All word vectors give us is a multidimensional semantic space, into which there's no one particular or privileged way to enter. The matrix, as it were, won't tell us what questions to ask it.

2A. Back to semantic fields

Anyway, I didn't know where to begin. So I thought I'd piggy-back on work that Long Le-Khac and I have done in the past on semantic fields, or "cohorts."

For us, a semantic cohort is a group of words that are both semantically similar (i.e. a semantic field) and historically similar, in that they rise or fall together across time (i.e. an historical cohort). I won't go into how we made these semantic fields/cohorts here, but if you're interested, check out the pamphlet.

Instead, this post asks the questions: How do semantic fields relate to semantic vectors? Do vector-based approaches to semantics corroborate field-based approaches? More importantly, how can vectors help us think in new ways about semantics in DH?

2B. Semantic cohort #1: “Abstract Values”

Long and I believe we discovered two large semantic cohorts. The first we called the "abstract values." It contained words like the following, categorized into sub-fields:

-- Moral Valuation: character, honour, conduct, respect, worthy, …
-- Social Restraint: gentle, pride, proud, proper, agreeable, …
-- Sentiment: heart, feeling, passion, bosom, emotion, …
-- Partiality: correct, prejudice, partial, disinterested, …

Note that most of these words are Latinate abstractions, and that as a cohort they fall in frequency across the nineteenth century.

2C. Semantic cohort #2: “Hard Seed”

Our second semantic cohort we called “hard seed,” after its seed word, "hard." It contained words like the following, categorized into sub-fields:

-- Action Verbs: see, come, go, came, look, let, looked, …
-- Body Parts: eyes, hand, face, head, hands, eye, arms, …
-- Physical Adjectives: round, hard, low, clear, heavy, hot, straight, …
-- Colors: white, black, red, blue, green, gold, grey, …
-- Locative Prepositions: out, up, over, down, away, back, through, …
-- Numbers: two, three, ten, thousand, four, five, hundred, …

Note that most of these words are concrete and Anglo-Saxon, and that they rise in frequency across the nineteenth century.

3A. Semantic fields in vector space

So, what would happen if we located each semantic field's words in the vector space? Would words from the same field appear closer to each other than to words from other fields?

3B. Semantic fields in vector space: All fields

It appears so. Words here are colored by their semantic field. If they are closer together in this image, then they are, roughly, closer together in the vector space. The image is made by a t-SNE dimensionality reduction of the cosine distances between each pair of words (where the distances come from the ECCO-TCP word2vec model). In effect, t-SNE tries to flatten the multidimensional geography of the data onto two dimensions with as little information loss as possible.
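
A projection of this kind can be sketched roughly as follows. This is a minimal illustration, not the code behind the image: the toy vectors are random stand-ins for the ECCO-TCP word2vec vectors, the function names are hypothetical, and scikit-learn's TSNE is assumed as the t-SNE implementation.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    # Rounding error can leave tiny negatives; t-SNE needs non-negative distances.
    dist = np.clip(dist, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    return dist

def project_to_2d(dist, perplexity=5.0, seed=0):
    """Flatten a precomputed distance matrix onto two dimensions with t-SNE."""
    from sklearn.manifold import TSNE  # assumed dependency: scikit-learn
    tsne = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(dist)

# Toy stand-in for the real model: 20 hypothetical "words" in 50 dimensions.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 50))
toy_dist = cosine_distance_matrix(toy_vectors)
# coords = project_to_2d(toy_dist)  # one (x, y) row per word; color by field
```

Passing precomputed distances (rather than raw vectors) keeps the projection tied to cosine distance, the measure named above; with precomputed input, scikit-learn requires a random rather than PCA initialization.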

3C. Semantic fields in vector space: Hard Seed

The separation between the semantic fields is even more obvious if we look only at the words in each large field separately: displayed here is only the "hard seed" field. As a whole, "hard seed" tends to occupy the southwestern quadrant of the graph, and not to occupy the northeastern. Moreover, its sub-fields occupy their own distinct regions of the vector space: Action Verbs in purple, Body Parts in brown, Colors in pink, Numbers in gray, Physical Adjectives in yellow, and Locative Prepositions in light blue.

3D. Semantic cohorts in vector space: Abstract Values

Conversely, almost all of the "abstract values" occupy the northeastern quadrant of the graph. Its sub-fields, however, are less tightly organized than in "hard seed": Social Restraint (red) and Moral Valuation (blue) are stretched together across the northeast, intermixing also with the Sentiment field (green).

3E. Semantic cohorts in vector space: Abstract Values [Zoom]

Zooming in, the abstract values appear to be reorganized along an east-west axis, with the easterly ones positively valued (faultless, refined, admiration) and the westerly ones negatively valued (vulgar, sinful, reckless). This reorganization is one reason the fields look mixed-up: the distinction between abstractions of social vs. moral behavior has been subordinated to the distinction between positive vs. negative abstractions.

4A. From fields…

So, semantic vectors corroborate semantic fields, while also nuancing them. Within a vector space of semantics, words from the same "semantic field"—made from a totally different and independent process—cluster together in meaningful ways.

But corroboration is a bittersweet moment in DH: impressive, empowering even, but unsatisfying, boring. How, then, can word vectors allow us to approach these semantic questions differently? How can they help us ask new questions?

Perhaps we could think less in terms of discrete semantic units, like semantic fields or cohorts...

4B. To vectors

...and instead, we could think more in terms of vectors or axes of meaning.

For example, instead of thinking of concrete and abstract words as belonging to distinct semantic fields, we could think of them as lying at the extreme ends of a semantic spectrum—a vector—that points from one to the other, or from the semantics of concreteness to the semantics of abstractness.

Vectors make it easy to define this new kind of semantic unit: the semantic vector, V(Abstract-Concrete). But how?

4C. Defining V(Abstract) and V(Concrete)

One way would be to build on the abstract and concrete semantic fields we've been looking at.

We could define a generalized abstract word, V(Abstract), as the centroid of the vector positions for all words in the "abstract values" field. Because what words in this field most share is their abstractness, we would expect an artificial word vector pointing there [i.e. V(Abstract)] to primarily capture the semantics of abstractness.

Likewise, we can define a generalized concrete word, V(Concrete), as the centroid of the vector positions for all words in the "hard seed" fields, since what these words most share is their concreteness.

4D. Defining V(Abstract-Concrete)

Finally, we can define a semantic vector pointing from the concrete to the abstract as V(Abstract-Concrete). This is the vector subtraction of V(Concrete) from V(Abstract). By the logic of subtraction, this vector points from the semantics only concrete words have to the semantics only abstract words have.

In effect, V(Abstract-Concrete) expresses the difference between concrete and abstract words, not as two distinct semantic fields, but rather as a single semantic axis of difference: that is, as a vector.

5A. Measuring abstractness everywhere

Our vector of concreteness-vs.-abstractness, V(Abstract-Concrete), affords a whole range of interesting distant readings. For instance, now we can measure the relative abstractness of any word by taking the cosine similarity between its vector and V(Abstract-Concrete). If above 0, the word points toward abstractness; if below, it points toward concreteness; and if around 0, it points orthogonally, neutral with respect to the contrast. Here are the most frequent 1,000 words in the corpus by part-of-speech. That there are more abstract than concrete adjectives, and more concrete than abstract verbs, is not an artifact of our vector, but reappears in contemporary measures of linguistic abstractness.
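
The definitions of V(Abstract), V(Concrete), and V(Abstract-Concrete), together with the cosine-similarity score used to measure a word's abstractness, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the three-dimensional vectors are invented toy values rather than output of the ECCO-TCP model, the two-word fields are drastically shortened, and the names `vectors`, `centroid`, and `abstractness` are hypothetical.

```python
import numpy as np

# Toy 3-d vectors, invented for illustration (not from the ECCO-TCP model).
vectors = {
    "honour":  np.array([0.9, 0.1, 0.2]),
    "conduct": np.array([0.8, 0.2, 0.1]),
    "hand":    np.array([0.1, 0.9, 0.3]),
    "red":     np.array([0.2, 0.8, 0.1]),
    "virtue":  np.array([0.7, 0.1, 0.3]),
    "stone":   np.array([0.1, 0.7, 0.2]),
}
abstract_field = ["honour", "conduct"]  # stand-in for "abstract values"
concrete_field = ["hand", "red"]        # stand-in for "hard seed"

def centroid(words):
    """Mean position of a field's word vectors."""
    return np.mean([vectors[w] for w in words], axis=0)

v_abstract = centroid(abstract_field)            # V(Abstract)
v_concrete = centroid(concrete_field)            # V(Concrete)
v_abstract_concrete = v_abstract - v_concrete    # V(Abstract-Concrete)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def abstractness(word):
    """Above 0 leans abstract, below 0 leans concrete, near 0 is neutral."""
    return cosine(vectors[word], v_abstract_concrete)
```

With these toy values, `abstractness("virtue")` comes out positive and `abstractness("stone")` negative, even though neither word belongs to the fields that define the axis.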

5B. Comparing to contemporary measures of abstractness

We can also compare V(Abstract-Concrete), our measure of abstractness specific to eighteenth-century semantics, with contemporary measures of abstractness. The y-axis here is V(Abstract-Concrete). Along the x-axis is a contemporary measure of concreteness, drawn from a Mechanical Turk study (Brysbaert et al.). As we expect, they negatively correlate: the linear regression explains about a third of the variance (R^2 = 0.32). But the variations from the norm are even more interesting...
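
The comparison amounts to a simple linear regression of one score on the other. Here is a minimal sketch, assuming invented toy numbers in place of the real scores and hypothetical variable names; the residuals are what single out words that deviate from the linear trend.

```python
import numpy as np

# Invented toy scores for five hypothetical words (not real data).
modern_concreteness = np.array([4.8, 4.1, 3.0, 2.2, 1.5])       # x-axis
abstractness_18c = np.array([-0.30, -0.10, -0.12, 0.15, 0.25])  # y-axis

# Least-squares line: abstractness ~ slope * concreteness + intercept.
slope, intercept = np.polyfit(modern_concreteness, abstractness_18c, deg=1)
predicted = slope * modern_concreteness + intercept

# Residuals: positive means more abstract in the 18C than the line predicts,
# negative means more concrete than the line predicts.
residuals = abstractness_18c - predicted

# For a one-predictor linear fit, R^2 is the squared Pearson correlation.
r = np.corrcoef(modern_concreteness, abstractness_18c)[0, 1]
r_squared = r ** 2
```

A negative `slope` corresponds to the negative correlation described above, and the words with the largest absolute residuals are the candidates for closer reading.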

5C. Concrete words can sublimate

Take the word "discovered." The word is more abstract today, and more concrete in the eighteenth century, than we would expect from the linear model. Why is this? This may have a simple historical-linguistic explanation. The concrete usage of "discover" (to un-cover and make visible to the eye), now marked "rare" by the OED, was common in the 18C. As a random example, from Burney's Evelina (1778): "Just then our attention was attracted by a pine-apple; which, suddenly opening, discovered a nest of birds, which immediately began to sing."

5D. Abstract words can ossify

Conversely, the word "human" is highly concrete today (almost maximally), but was highly abstract in the 18C—both much more so than we would expect. Why? My guess is that today we think of a human more than the human: I'm a human, you're a human, this human, that human: it's human with an indefinite article, it's plural, it's concrete. But in the 18C, "human" operated as a sacred, top-level abstraction: as in human nature, or the contrast between the human and animal worlds, or between the human and the divine.

Is "human" an abstraction, then? In the 18C, yes; in the 21C, no. Results such as these provoke further questions. For example, is abstraction best understood along a timeline? Just as metaphors "die", hardening into a new literal meaning, perhaps abstractions also pass away, drifting into concrete meanings.

6. Conclusion

In sum, we saw how semantic fields cluster together in a semantic vector space in interesting ways, and so vector-based approaches to semantics corroborate, even nuance, field-based ones. But they also build on and reframe them: by redefining the relationship between abstract and concrete words as the semantic vector between them, we can measure the relative abstractness of any given word. With this measure, we can do any number of things. Here, we compared it with a contemporary measure of concreteness, and interpreted a couple of outliers.

Next time, on Word Vectors in the Eighteenth Century, we’ll make use of this vector-based measure of abstractness to construct “semantic transportation networks” between abstract nouns. Until then, … stay tuned!