The Story of Stopwords: Topic Modeling an Ekphrastic Tradition

The following paper was first presented on July 9, 2014 at DH2014 in Lausanne, Switzerland. The slides can be found here.

The story I’d like to tell you today is about exploring hermeneutic spaces between poetic convention and artistic invention. It’s a story about the strict algorithmic logic of topic models confronted by the rich ambiguity of poetry to revise long-standing critical assumptions about the syntactic and semantic compression of poetry. It might even be considered a cautionary tale about how exclusively relying on close reading of texts diminishes our interpretive reach. The story I’m going to tell you features the critical tradition of ekphrasis (poetry to, for, and about the visual arts) as it unfolds through a digitally enabled, critical deformance that occurs while preparing and modeling a corpus of 4,500 English-language poems with the latent Dirichlet allocation (LDA) algorithm, commonly called topic modeling. [Slide 2]

The purpose of my story is three-fold: First, to demonstrate through a case study the influence of stopword removal on topic models of poetic corpora. Second, to suggest one way we may respond to the concerns of skeptical poetry scholars about the practical and methodological constraints of distant reading practices such as topic modeling. And, third, to reveal one way in which my understanding of ekphrasis has transformed through an active engagement with the formal constraints of distant reading practices. [Slide 3]

The work I’m going to discuss today is part of a larger project in which I look at the critical and poetic tradition of ekphrasis, its genre definitions and conventions, as well as its treatment of women as active participants in a tradition of looking at and writing about the visual arts. In the twentieth and twenty-first centuries, poetic engagements with the visual arts, in the genre called ekphrasis, have drawn from and reshaped a Western tradition of viewing, describing, creating, and narrating images. From W.H. Auden’s “Musée des Beaux Arts” to Jorie Graham’s “San Sepolcro,” poetic conversations between the visual and verbal arts are as active as they have ever been since Homer’s first description of the shield of Achilles in The Iliad (18.483-601). Likewise, over the 20th century, our critical understanding of the genre has evolved. For example, Jean Hagstrum’s 1950 book The Sister Arts describes ekphrasis as “pictorial” poetry, while New Critics Joseph Frank and Murray Krieger adopt the term to describe the realization of an image through poetic form. More recently, W.J.T. Mitchell in 1992 redefines ekphrasis as “the verbal representation of visual representation,” a move that signals a methodological shift away from a metaphorical comparison of the arts of poetry and painting toward a semiotic and cultural approach.

Similarly, my own study addresses methodological challenges faced by scholars of ekphrasis as they consider its active tradition. As significant and transformative as Mitchell’s contributions have been, “Ekphrasis and the Other” is limited in ways that are difficult to ignore in today’s critical landscape. Specifically, Mitchell’s “Ekphrasis and the Other” is based on four poems by white, male authors, each either a British Romantic or an American Modernist: [Slide 4]

  • Wallace Stevens’ “Anecdote of the Jar”;
  • William Carlos Williams’ “Portrait of a Lady”;
  • John Keats’ “Ode on a Grecian Urn”;
  • Percy B. Shelley’s “On the Medusa of Leonardo Da Vinci in the Florentine Gallery.”

In fact, whether it be in essay or monograph form, the most poems considered in any critical definition of the genre is 75; and of all of those collections, only a handful have included examples of ekphrastic poetry by women. So when Mitchell’s essay confidently concludes: [Slide 5]

My examples are canonical in their staging of ekphrasis as a suturing of dominant gender stereotypes into the semiotic structure of the imagetext, the image identified as feminine, the speaking / seeing subject of the text identified as masculine.

It’s difficult to read such a conclusion without some skepticism. While Mitchell and others, such as James A.W. Heffernan, suggest that the “treatment of the ekphrastic image as female other is commonplace in the genre,” numerous feminist scholars, such as Beth Loizeaux, Barbara Fischer, Jane Hedley, Anne Keefe, and Bonnie Costello, caution us against the dangers of reading ekphrasis as an overly determined gendered contest, suggesting that women in the 20th century write with a much more nuanced attitude toward the gendered dynamics of ekphrasis. However, moving beyond Mitchell’s description has proven difficult. The genre’s saturation with frequently used and gendered terms such as “see,” “say,” “look,” and “still” (all associated with either a dominant male gaze or a still, passive female figure) complicates our access to other critical approaches at close distance.

I came to this particular project, then, with an explicit intent to uncover methodological means for expanding the number of poems we can consider at one time, as well as a method for detecting latent patterns across texts that were perhaps to this point obscured from view. My project, Revising Ekphrasis, asks what opportunities distant reading practices, such as topic modeling, present scholars of ekphrasis as a means for casting the widest possible net to detect latent patterns across an increasingly diverse corpus of ekphrastic poetry. [Slide 6] The Revising Ekphrasis dataset includes 4,771 poems (documents) from three kinds of poetic corpora: the Academy of American Poets, Poetry Daily, and print anthologies of ekphrastic poetry. The collection breaks down in the following way: 2,466 from the Academy of American Poets website; 373 from Poetry Daily (2011–2012); 34 from a print anthology titled The Gazer’s Spirit by John Hollander; 79 poems discovered through Poets on Paintings: A Bibliography by Robert Denham; 19 poems by Jorie Graham, Carol Snow, Barbara Guest, and Cole Swensen. I also lightly tagged each poem in the dataset as “ekphrastic,” “nonekphrastic,” or “unknown” and discovered that about 425 could be identified as ekphrastic.

I chose topic modeling because critical treatments of ekphrasis often hinge upon perceptions that the genre uses a higher ratio of words relating to “stillness” and words relating to “look” and “see” than other genres. Questions about recurring patterns of terms across similar texts are, in theory, excellent questions to explore with topic modeling; however, topic modeling best practices treat words such as “see,” “say,” and “still” as high-frequency, low-semantic-value terms known as stopwords. [Slide 7] Stopwords, as you likely already know, are typically removed from the dataset entirely in order to improve the quality of the model’s results. So I was confronted right away with a dilemma: could I still get useful, reliable, and interesting results from topic modeling if all the words considered to be most influential to the genre’s definition were removed?

As a literary critic, I found the idea of removing so many words from a dataset of poetry preposterous at first. Auden’s opening line, for example, “About suffering they were never wrong, / the Old Masters: how well they understood…” would be dramatically different if instead it read: [Slide 8] “suffering wrong masters understood.” Robert Browning’s pivotal line in “My Last Duchess,” [Slide 9] “she looked on, and her looks went everywhere,” justifying the duke’s “silencing” of his wife, would be removed [Slide 10] entirely from the text of the poem if I were to use the default preprocessing steps for MALLET, the software I chose to use for my topic modeling.
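To make the effect on Auden’s line concrete, here is a minimal Python sketch of stopword removal. The short stoplist is hand-picked for illustration and is not MALLET’s actual default list; the function name `remove_stopwords` is likewise hypothetical.

```python
import re

# Illustrative stoplist only: a handful of common words chosen for this
# example, NOT the actual MALLET default stoplist.
ILLUSTRATIVE_STOPLIST = {
    "about", "they", "were", "never", "the", "old", "how", "well",
}

def remove_stopwords(line, stoplist):
    """Lowercase the line, split it into word tokens, and drop stoplisted tokens."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [token for token in tokens if token not in stoplist]

auden = ("About suffering they were never wrong, "
         "the Old Masters: how well they understood")
filtered = remove_stopwords(auden, ILLUSTRATIVE_STOPLIST)
print(" ".join(filtered))  # prints: suffering wrong masters understood
```

A real preprocessing pipeline would also have to decide how to handle punctuation, capitalization, and line breaks, all of which carry meaning in poetry.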

In order to feel any sense of confidence about the kind of results I would be generating, I first needed to know how the “standard” stopword removal process would influence results, so I designed a test to see how the presence of stopwords affected the usefulness of the topic keyword distributions that LDA produced. [Slide 11] Without introducing the LDA process in detail here, what is useful to know is that LDA is an algorithm that sorts through documents and creates groupings of words that are most likely to co-occur in similar documents. Topic modeling produces distributions of words, or vocabularies, that are most likely to be found in similar documents; these distributions are called “topics,” and they frequently enjoy thematic coherence.
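The grouping of co-occurring words described above can be illustrated with a toy collapsed Gibbs sampler, one common way of fitting LDA. This is a sketch for intuition only and is not MALLET’s implementation; the function name `tiny_lda`, the four miniature documents, and the hyperparameter values (`alpha`, `beta`, the iteration count) are all assumptions made for the example.

```python
import random

def tiny_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sampling sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            topic_word[z][word] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Weight each topic by how much this document already uses it
                # and how strongly the topic already favors this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][word] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic

# Four made-up miniature "poems", already tokenized.
docs = [
    "painting museum artist canvas painting".split(),
    "museum artist portrait canvas".split(),
    "night wind sky dark night".split(),
    "sky wind dark sleep".split(),
]
topic_word, doc_topic = tiny_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:4]
    print(k, " ".join(top_words))
```

On a toy corpus this cleanly divided, the two topics tend to separate art vocabulary from night-sky vocabulary, although the exact assignment depends on the random seed. Real corpora are far messier, which is part of why preprocessing choices matter so much.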

My tests were designed to foreground the presence or absence of important ekphrastic words, such as look, see, still, and stillness, in order to expose how their absence or presence influenced the topic model’s keyword distributions (those lists of key words that topic models spit out). I imported the entire text dataset into MALLET four times, using a different stopword list each time, then ran a 40-topic model on each import. [Slide 12]

In the first test, I skipped preprocessing altogether, leaving every word intact in the corpus. In the second, I heavily edited the MALLET stoplist, so that only 200 words were removed from the dataset, including adverbs (accordingly, actually, and usually); articles (a, the); and conjunctions (and, or). The third test only slightly modifies the stoplist, removing 510 high-frequency words, but leaving words frequently associated with ekphrasis in the text to be modeled, such as: after, before, look, looking, looked, looks, said, saw, say, saying, says, see, sees, seeing. Finally, I ran the last test using MALLET’s default stoplist of all 538 words.
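The four configurations can be sketched as set operations on a stoplist. The Python below is an illustration under stated assumptions: `TOY_DEFAULT_STOPLIST` and `TOY_REDUCED_STOPLIST` are tiny stand-ins invented for the example (the real lists hold 538 and 200 words), the function name `build_stoplists` is hypothetical, and `EKPHRASTIC_TERMS` contains only the words named above, which are introduced with “such as” and so may not be the full set that was kept.

```python
# Words named in the talk as kept in the text for the third test.
EKPHRASTIC_TERMS = {
    "after", "before", "look", "looking", "looked", "looks", "said",
    "saw", "say", "saying", "says", "see", "sees", "seeing",
}

def build_stoplists(default_stoplist, reduced_stoplist):
    """Return the stoplist used in each of the four tests, keyed by test name."""
    default_stoplist = set(default_stoplist)
    return {
        "test1_no_stoplist": set(),                    # nothing removed
        "test2_reduced": set(reduced_stoplist),        # heavily edited list
        "test3_keep_ekphrastic": default_stoplist - EKPHRASTIC_TERMS,
        "test4_default": default_stoplist,             # full default list
    }

# Toy stand-ins for the real lists, for illustration only.
TOY_DEFAULT_STOPLIST = {
    "a", "the", "and", "or", "actually", "usually", "of",
    "see", "saw", "look", "after",
}
TOY_REDUCED_STOPLIST = {"a", "the", "and", "or", "actually", "usually"}

stoplists = build_stoplists(TOY_DEFAULT_STOPLIST, TOY_REDUCED_STOPLIST)
for name, words in stoplists.items():
    print(name, len(words))
```

Each resulting set could then be written to a plain-text file, one word per line, and supplied to the modeling tool as a custom stoplist.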

Rather than focusing on the significance of the individual topics to the composition of the whole model, I want to focus on the composition of the topics themselves. For those unfamiliar with reading topic model results, the number on the far left represents the topic number, [Slide 13] in this case from 0-39 because 40 topics were requested. The next number, called a hyperparameter estimation, shows the model’s prediction as to how much of the collection might be described by each topic. Next, to the right of the hyperparameter estimation, are the top 20 words associated with the topic, called “keys,” listed in descending order from most likely to least likely.
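For readers who want to work with these results programmatically, the three-part layout just described can be parsed in a few lines of Python. This sketch assumes each line of the topic keys output is tab-separated (topic number, weight, then space-separated keys); the names `TopicKeys` and `parse_topic_keys_line` are hypothetical, and the weight of 0.5 in the sample line is made up, since the talk does not report the actual value. The key words are those of topic 8 from the first test, discussed below.

```python
from typing import List, NamedTuple

class TopicKeys(NamedTuple):
    topic_id: int     # topic number on the far left
    weight: float     # the "hyperparameter estimation" for the topic
    keys: List[str]   # top words, most likely first

def parse_topic_keys_line(line):
    """Parse one line, assuming tab-separated columns: id, weight, keys."""
    topic_id, weight, keys = line.rstrip("\n").split("\t", 2)
    return TopicKeys(int(topic_id), float(weight), keys.split())

# Sample line: the weight 0.5 is a placeholder, not a reported result.
sample = ("8\t0.5\twas and a had were that in it to said but could did "
          "came when saw they then i one")
parsed = parse_topic_keys_line(sample)
print(parsed.topic_id, parsed.weight, parsed.keys[:5])
```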

When too many high-frequency words were left in the dataset, the signal-to-noise ratio became too low to interpret the word distributions. Results from the first test are almost too difficult to read. While literary scholars of ekphrasis have understood words such as “see” and “say” and “look” to be more prominent in ekphrastic poetry than in other poetic genres, the degree to which those words are more frequently used is not significant enough to create an “ekphrastic” topic. For example, [Slide 14] topic 8 is dominated by frequently used verbs: “was, and, a, had, were, that, in, it, to, said, but, could, did, came, when, saw, they, then, I, one.” While the topic includes the word “saw,” which might be interesting in terms of understanding ekphrastic poetry, in this case it does little to identify a trend regarding ekphrastic “looking.” Instead, the topics with the highest proportions, such as Topic 1 [Slide 15], represent distributions of articles and pronouns, offering little insight into the texts themselves.

The results from the second experiment, in which only about half the words on the MALLET stoplist were removed, were not much better, as the model again appears heavily influenced by pronouns and prepositions. [Slide 16] For example, Topic 2 forms around collective identities with key words including: they, their, them, are, in, by, men, children, on, have, women, up, themselves, see. Similarly, [Slide 17] Topic 22 combines pronouns with collective bodies, or many body parts, such as: our, us, ourselves, from, how, together, live, bodies, even, heads. Furthermore, despite what we might expect from the art-oriented vocabulary in Topic 33 [Slide 18], which includes words such as “by, new, fear, modern, art, times, painting, museum, mr, order, model, artist, calm…,” none of the ekphrastic poems in the dataset are expected to draw more than 4% of their language from that topic. Parsing the exact relationship between the documents that draw heavily from Topic 33 is not a matter of nuance. Instead the group seems to be primarily created around the use of the word “by.” Having specifically included “by” in the model does seem to have made a difference in terms of locating and identifying poems in which “by” accompanies other kinds of words, many of which relate to other visual aesthetic objects, but sorting through the topic is about as useful as conducting any kind of close reading of the word “by” in a poetic collection. Consequently, the results are too diffuse and don’t help us to answer the questions we hoped to ask about “stillness” or “looking.”

Perhaps the most telling difference between the third and fourth tests concerns the words “look,” “see,” “still,” “at,” and “before”: ekphrastic poems are more likely to concentrate in a single topic when these words are removed as stopwords than when they are left in and appear in the key term distributions.[1] In other words, in the third test, when words such as “see,” “say,” and “still” are left in, ekphrastic poems are less likely to draw from the same topic than in the fourth test, where the words are removed. For example, in the third test, 88 ekphrastic poems were predicted to draw more than 1% of their language from Topic 13. [Slide 19] Topic 13, likewise, is the most dominant topic across the collection and is associated with a predicted 72% of the poems overall, meaning the words with the greatest weight in Topic 13 (“night, at, light, dark, sun, sky, day, wind, sleep, still”) are likely to be found in at least 72% of the corpus as a whole. Other topics from which many ekphrastic poems draw more than 1% of their language include topics [Slide 20] 13, 15, 16, 19, and 29. In the end, by returning to the ekphrastic poems in the dataset and sifting through the topics they are most likely associated with, what becomes evident is that the presence of “see” and “saw” or “look” and “looked” in the dataset tends to decrease the likelihood that the topic model will cluster ekphrastic poems together.
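The counting behind a statement like “88 ekphrastic poems were predicted to draw more than 1% of their language from Topic 13” can be sketched as follows. The Python below assumes per-document topic proportions are already available as lists of numbers; the poem identifiers, proportions, and tags are invented for illustration, and `count_tagged_above_threshold` is a hypothetical helper, not part of any toolkit.

```python
def count_tagged_above_threshold(doc_topics, tags, tag="ekphrastic", threshold=0.01):
    """For each topic, count documents with the given tag whose estimated
    proportion of language from that topic exceeds the threshold."""
    if not doc_topics:
        return []
    num_topics = len(next(iter(doc_topics.values())))
    counts = [0] * num_topics
    for doc_id, proportions in doc_topics.items():
        if tags.get(doc_id) != tag:
            continue  # skip poems not carrying the tag of interest
        for topic, proportion in enumerate(proportions):
            if proportion > threshold:
                counts[topic] += 1
    return counts

# Made-up topic proportions for four poems across three topics.
doc_topics = {
    "poem_a": [0.70, 0.295, 0.005],
    "poem_b": [0.02, 0.975, 0.005],
    "poem_c": [0.50, 0.50, 0.00],
    "poem_d": [0.005, 0.005, 0.99],
}
tags = {
    "poem_a": "ekphrastic",
    "poem_b": "ekphrastic",
    "poem_c": "nonekphrastic",
    "poem_d": "ekphrastic",
}
counts = count_tagged_above_threshold(doc_topics, tags)
print(counts)  # prints: [2, 2, 1]
```

Comparing these per-topic counts across models built with different stoplists is one way to see whether the tagged poems concentrate in a few topics or scatter across many.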

To my surprise, leaving words such as “still” and “see” in the dataset actually impeded topic cohesion among ekphrastic poems, detecting affinities between texts that had more to do with the tense of the word than with its semantic function. The fourth test, however, which uses MALLET’s default stoplist, yielded the most salient and viable results. For example, Topic 35 predicts that 7% of the collection includes language that draws heavily from the visual arts: color, stop, painting, art, artist, painted, model, perfect, cover, museum, wrong, makes, reason, wanted, witness, change, completely, case, light, hope. The model’s prediction closely reflects our pre-existing knowledge from the metadata that approximately 5% of the database’s poems are ekphrastic. These findings are doubly relevant. First, by so [Slide 21] closely estimating the number of poems that draw from the language of visual art, the model promises a higher likelihood of identifying the distributions of language my study wants to explore.

Second, [Slide 22] by estimating that slightly more texts draw from language closely allied to the visual arts, the fourth test offers the tantalizing possibility of discovery. Topic 11, which describes 64% of the entire corpus but is not the most heavily weighted topic in the model, contains 125 ekphrastic poems, almost 50% of the poems with the category tag “ekphrastic.” Looking at the key terms associated with Topic 11, one might describe it as the “body part / physical feature” topic (eyes, back, body, face, hands, hand, head, arms, feet), joined by a further cluster of words (open, air, inside, world, small, close) and by words of shade (light, dark, white, black). While it’s not inconceivable that ekphrastic poems draw heavily from the language of physical description, the degree to which this is the case is striking. Interestingly, too, while most of the poems most closely associated with Topic 11 are ekphrastic, the poem with the highest estimated proportion of language from it is William Carlos Williams’s “Danse Russe,” which I had originally tagged “nonekphrastic.” Williams, however, wrote entire volumes of ekphrastic poetry, of which Pictures from Brueghel is an example. The Ballets Russes, which transformed 20th century ballet, combined efforts across the fine arts. Visual artists, such as Pablo Picasso, Henri Matisse, and Juan Gris collaborated with Russian and French choreographers, producing sets, costumes, curtains, posters, and even programs for performances.[2] While there is no textual, or as far as I can tell critical, discussion of this particular poem in terms of the visual arts, Williams’s prolific ekphrastic writing, his close relationship to visual artists, and the shared language between “Danse Russe” and more than half the other ekphrastic poems in the collection present a rich opportunity for further exploration, particularly in light of some critics’ assertion that Williams’s representation of female bodies differs significantly from that of his male contemporaries.

Herein lies the critical point that I’d like to leave you with today. It is the fourth test, the one in which the language seen as “conventionally ekphrastic” is removed, that opens up future hermeneutic possibility. Akin to Stephen Ramsay’s suggestion in Reading Machines that algorithmic criticism is most useful to the literary scholar when it exposes the text to new hermeneutic potential, topic models are most useful to the study of poetry, and in this case ekphrasis, when the method exposes our scholarly assumptions through de-familiarization: [Slide 23]

Literary-critical insight begins with a change of vision—what Wittgenstein called the “drawing of an aspect.” (Philsophical 194). Sometimes… the noticing is the result of some sort of overt manipulation of the text. We read out of order, we translate and paraphrase, we look only at certain words or certain constellations of surrounding the text. The text hasn’t changed its graphic content any more than the duck-rabbit changes between one’s seeing it one way one moment and another the next. But the text quite literally assumes a different organization from what it had before. Once a new aspect/pattern has been discovered, one immediately begins to test the viability of that pattern (47-8).

In the case of these topic model distant readings, the radical extrusion of words considered vital to the identification of a poetic genre exposes rich opportunities to explore latent patterns and connections heretofore obscured by the over-presence of a limited vocabulary. Removing stopwords, a form of textual “deformance” akin to Jerome McGann and Lisa Samuels’s use of the term, leads us to wonder what more might be gained by adjusting the aperture of our scholarly lens to reconsider the ekphrastic tradition at scale. The value of algorithmic criticism, brought to bear on the Revising Ekphrasis corpus, leverages what Ramsay describes as “a desire to use the narrowing forces of constraint to enable the liberating visions of potentiality” (32). Much in the vein of Matthew Jockers, Ted Underwood, Andrew Goldstone, Ben Schmidt, and Lauren Klein’s work with corpora of fiction, literary genres, literary criticism, and archival materials, what the story of stopwords and the tradition of ekphrasis suggests is that a wealth of hermeneutic potential exists in topic modeling for scholars of poetry as well. [Slide 25]

[1] The exact list of words included in test 2 that were not included in test 1 can be found in Appendix B.

[2] “Visual Art and the Ballets Russes.” Ballet Russes Cultural Partnership. Boston University. Web. 16 Sept. 2012.

Denham, Robert D. Poets on Paintings: A Bibliography. Jefferson, N.C.: McFarland, 2010. Print.
Frank, Joseph. The Idea of Spatial Form. New Brunswick: Rutgers University Press, 1991. Print.
Hagstrum, Jean H. The Sister Arts: The Tradition of Literary Pictorialism and English Poetry from Dryden to Gray. Chicago: University of Chicago Press, 1987. Print.
Hedley, Jane, Nick Halpern, and Willard Spiegelman. In the Frame: Women’s Ekphrastic Poetry from Marianne Moore to Susan Wheeler. Newark, DE: University of Delaware Press, 2009. Print.
Heffernan, James A. W. Museum of Words: The Poetics of Ekphrasis from Homer to Ashbery. Chicago: University of Chicago Press, 2004. Print.
Hollander, John. The Gazer’s Spirit: Poems Speaking to Silent Works of Art. 1st ed. Chicago: University of Chicago Press, 1995. Print.
Krieger, Murray. Ekphrasis: The Illusion of the Natural Sign. Baltimore: The Johns Hopkins University Press, 1992. Print.
Loizeaux, Elizabeth Bergmann. Twentieth-Century Poetry and the Visual Arts. 1st ed. Cambridge, UK; New York: Cambridge University Press, 2008. Print.
McCallum, Andrew Kachites. MALLET: A Machine Learning for Language Toolkit. N.p., 2002. Web.
McGann, Jerome. Radiant Textuality: Literature After the World Wide Web. New York, NY: Palgrave Macmillan, 2004. Print.
Mitchell, W. J. T. Iconology: Image, Text, Ideology. Chicago: University of Chicago Press, 1986. Print.
Mitchell, W. J. T. Picture Theory: Essays on Verbal and Visual Representation. Chicago: University of Chicago Press, 1995. Print.
Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. 1st ed. Urbana: University of Illinois Press, 2011. Print.
Scott, Grant F. “The Rhetoric of Dilation: Ekphrasis and Ideology.” Word & Image 7.4 (1991): 301–310. Taylor and Francis. Web. 12 Oct. 2012.