Small Projects & Limited Datasets

I’ve been thinking a lot lately about the significance of small projects in an increasingly large-scale DH environment. We seem almost inherently to know the value of “big data”: scale changes the name of the game. Still, what about the smaller universes of projects with minimal budgets, fewer collaborators, and limited scopes, which also have large ambitions about what can be done using the digital resources we have on hand? Without detracting from the import of big data projects, I, like Natalie Houston, am wondering what small projects offer the field and whether those potential outcomes are relevant and useful in and of themselves as well as beneficial to large-scale projects, for example in fine-tuning initial results.

My project in its current iteration involves a limited dataset of about 4500 poems and challenges rudimentary assumptions about a particular genre of poetry called ekphrasis: poems regarding the visual arts. It is the capstone project to a dissertation in which I use the methods of social network analysis to explore socially inscribed relationships between visual and verbal media, and in which the results of my analysis are rendered visually to demonstrate the versatility and flexibility available to female poets writing ekphrastic poetry. My MITH project concludes my dissertation by demonstrating that network analysis is one way of disrupting existing paradigms for understanding the social signification of ekphrastic poetry, but that other computational methods, such as text modeling, word frequency analysis, and classification, might also be useful.

To this end, I’ve begun by asking three modest questions about ekphrastic poetry using a machine learning application called MALLET:

1.) Could a computer learn to differentiate between ekphrastic poems by male and female poets? In “Ekphrasis and the Other,” W.J.T. Mitchell argues that were we to read ekphrastic poems by women as opposed to ekphrastic poetry by men, we might find a very different relationship between the active, speaking poetic voice and the passive, silent work of art, a dynamic which informs our primary understanding of how ekphrastic poetry operates. Were this true, and were the difference to occur within recurring topics and language use, a computer might be trained to recognize patterns more likely to co-occur in poetry by men or by women.

2.) Will topic modeling of ekphrastic texts pick out “stillness” as one of the most common topics in the genre? Much of the definition of ekphrasis revolves around the language of stillness: poetic texts, it has been argued, contemplate the stillness and muteness of the images with which they are engaged. Stillness, metaphorically linked to muteness, breathlessness, and death, provides one of the most powerful rationales for understanding how words and images relate to one another within the ut pictura poesis tradition, a relation usually seen as a hostile encounter between rival forms of representation. The argument to this point has been made largely on critical interpretations enacted through close readings of a limited number of texts. Would a computer designed to recognize co-occurrences of words and assign those words to a “topic” based on the probability they would occur together also reveal a similar affiliation between stillness and death, muteness, even femininity?

3.) Would a computer be able to ascertain stylistic and semantic differences between ekphrastic and non-ekphrastic texts and reliably classify them according to whether or not the subject of the poem is an aesthetic object? We tend to believe that there are no real differences between how we describe the natural world and how we describe visual representations of the natural world. We base this assumption on human, interpretive, close readings of poetic texts; however, there is the potential that a computer might recognize subtle differences as statistically significant when considering hundreds of poems at a time. If a classification program such as MALLET could reliably categorize texts as ekphrastic or non-ekphrastic, it is possible that we have missed something along the way.
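To make the classification questions (1 and 3) a little more concrete, here is a minimal sketch of one common approach, a multinomial Naive Bayes classifier over word counts. It is an illustration only: it does not call MALLET or reproduce its internals, the class and function names are hypothetical, and the four “poems” are invented toy lines rather than texts from the dataset. The same code would serve for question 1 by swapping the labels for the poet’s gender.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a poem and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of poems
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return an unnormalized log-probability for each label."""
        total_docs = sum(self.doc_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for word in tokens:
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy lines, NOT drawn from the actual corpus.
train_texts = [
    "the painted figure stands silent in its gilded frame",
    "a marble statue holds her still and silent gaze",
    "the river runs loud through the green morning field",
    "wind shakes the orchard and the birds rise singing",
]
train_labels = ["ekphrastic", "ekphrastic", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("silent marble in a frame"))
print(model.predict("the river and the singing birds"))
```

With only four training lines the predictions mean nothing critically; the point is the shape of the procedure. A real test would train on one portion of the poems and measure accuracy on poems held back from training.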

In general, these are small questions constructed in such a way that there is a reasonable likelihood that we may get useful results.  (I purposefully choose the word results instead of answers, because none of these would be answers.  Instead the result of each study is designed to turn critics back to the texts with new questions.)  And yet, how do we distinguish between useful results and something else?  How do we know if it worked?  Lots of money is spent trying to answer this question about big data, but what about these small and mid-sized data sets?  Is there a threshold for how much data we need to be accurate and trustworthy?  Can we actually develop standards for how much data we need to ask particular kinds of humanities questions to make relevant discoveries?  In part, my project also addresses these questions, because otherwise, I can’t make convincing arguments about the humanities questions I’m asking.
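One narrow, quantifiable piece of the “how much data” question is how far a measured accuracy can be trusted when it comes from a small held-out sample. The sketch below uses the Wilson score interval for a proportion; the function name, the 95% level (z = 1.96), and the sample sizes are assumptions chosen for illustration, not figures from this project. It does not settle what counts as a relevant humanities discovery, only how much a reported accuracy might wobble.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Wilson score interval for a classifier's accuracy.

    correct: number of held-out poems classified correctly
    total:   number of held-out poems
    z:       normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Hypothetical scenario: the same 80% observed accuracy
# measured on held-out sets of three different sizes.
intervals = {}
for total in (50, 450, 4500):
    correct = total * 8 // 10
    low, high = accuracy_interval(correct, total)
    intervals[total] = (low, high)
    print(f"{total:>5} poems: {low:.3f} to {high:.3f}")
```

The interval narrows as the held-out set grows, which is one reason a result on a few dozen poems needs more cautious wording than the same figure on a few thousand.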

Small projects (even mid-sized projects with mid-sized datasets) offer the promise of richly encoded data that can be tested, reorganized, and applied flexibly to a variety of contexts without becoming the entirety of a project director’s career. The space between close, highly supervised readings and distant, unsupervised analysis remains wide open as a field of study, and its potential value as a manageable, not wholly consuming, and reproducible option makes it worth seriously considering. What exactly can be accomplished by small and mid-scale projects is largely unknown, but it may well be that small and mid-sized projects are where many scholars will find the most satisfying and useful results.

6 thoughts on “Small Projects & Limited Datasets”

  1. Lisa, I’m really interested in your project and will be looking forward to what you discover about topic modelling for poetic texts. Around the question of data sets, I think your question “Can we actually develop standards for how much data we need to ask particular kinds of humanities questions to make relevant discoveries?” is an important one — precisely because too often not enough attention is paid to the question of “how much data” is optimal in traditional humanities work.

    1. Natalie, the interest in each other’s work, really, is mutual. I’m really excited about your recent NEH start-up grant and am looking forward to hearing a lot more about it.

      It would be almost impossible for me to respond here without also saying that Ted Underwood’s questions about what makes humanities questions different from what computer science researchers want from LDA and other mining tools are particularly relevant. Travis Brown and I have talked a little bit about ways that we generally know that topic modeling is working, and he has a wide range of experience to draw upon in determining whether or not he’s getting what he wants to see out of the model. What I’m hoping for as we move forward with the project is more conversation between Travis and me about what the topics “say” about themselves. I take to heart Ted’s point in our back and forth conversation over the past couple of days that what he’s seeing is a particular kind of discourse: a poetic diction, various ways of using language within a corpus. Looking at the topics he’s getting, I see that. Perhaps this is what I will find, too. I’m just not far enough along yet to say definitively one way or the other.

      I also wonder if maybe we don’t see it as discourse because of the difference between time periods. Ted is working with 18th and 19th century texts, and perhaps this is part of why it can be more easily interpreted as diction. I’m working primarily (though not exclusively, as I’ll explain in some later post) with texts produced in the 20th and 21st centuries. I wonder if that also doesn’t play into how we read various topics. I’ll certainly be paying attention to that now. Publication date, however, has been one of the major hurdles that I’ve had to overcome with this particular kind of research, and it is also the driving reason why I’m interested in how large data sets need to be to get interesting/reliable/useful results. There simply aren’t publicly available collections of 20th and 21st century poetic texts (or, really, texts in many other literary genres) to use. In order to do the kind of research I want to do, and in order to really bring “topic modeling” into post-1922 American literature, I had to put together my own data set… on my own. So, I think this question of how big the data set needs to be, what we see when we model using the smaller set, and how that changes the digging process (the archaeology, so to speak?) is significant not just to those who use topic modeling to study texts in the humanities… but to whether or not there’s a place for 20th and 21st century American literature in this methodology.
