Literary Mathematics: Introduction Excerpt

Literary Mathematics

Quantitative Theory for Textual Studies

Michael Gavin

INTRODUCTION

THE CORPUS AS AN OBJECT OF STUDY

Only in literary studies is distant reading called “distant reading.” In fact, corpus-based research is practiced by many scholars from a wide range of disciplines across the humanities and social sciences. The availability of large full-text databases has introduced and brought to prominence similar research methods in disciplines like political science, psychology, sociology, public health, law, and geography. Henry E. Brady has recently described the situation in terms that humanists may find familiar: “With this onslaught of data, political scientists can rethink how they do political science by becoming conversant with new technologies that facilitate accessing, managing, cleaning, analyzing, and archiving data.”¹ In that field, Michael Laver, Kenneth Benoit, and John Garry demonstrated methods for using “words as data” to analyze public policy back in 2003.² Scholars like Jonathan B. Slapin, Sven-Oliver Proksch, Will Lowe, and Tamar Mitts have used words as data to study partisanship and radicalization in Europe and elsewhere across multilingual datasets.³ In psychology the history is even deeper. The first mathematical models of word meaning came from psychologist Charles E. Osgood in the 1950s.⁴ The procedures he developed share much in common with latent semantic analysis, of which Thomas K. Landauer, also a psychologist, was a pioneering figure, alongside computer scientists like Susan Dumais.⁵ In sociology, geography, law, public health, and even economics, researchers are using corpus data as evidence in studies on topics of all kinds.⁶

At a glance, research in computational social science often looks very different from what Franco Moretti called “distant reading” or what Lev Manovich and Andrew Piper have called “cultural analytics.”⁷ But the basic practices of quantitative textual research are pretty much the same across disciplines. For example, Laver, Benoit, and Garry worked with a digitized collection of political manifestoes in their 2003 study titled “Extracting Policy Positions from Political Texts Using Words as Data.” Their goal was to develop a general model for automatically identifying the ideological positions held by politicians across Great Britain, Ireland, and Europe. In largely the same way that a digital humanist might study a corpus of fiction by counting words to see how genres change over time, these social scientists sort political documents into ideological categories by counting words. “Moving beyond party politics,” they write, “there is no reason the technique should not be used to score texts generated by participants in any policy debate of interest, whether these are bureaucratic policy documents, the transcripts of speeches, court opinions, or international treaties and agreements.”⁸ The range of applications seemed limitless. Indeed, many scholars have continued this line of research, and in the twenty years since, the study of “words as data” has become a major practice in computational social science. This field of inquiry emerged independently of humanities computing and corpus linguistics, but the basic procedures are surprisingly similar. Across the disciplines, scholars study corpora to better understand how social, ideological, and conceptual differences are enacted through written discourse and distributed over time and space.

Within literary studies, corpus-based inquiry has grown exponentially. When I first sketched out a plan for this book in late 2015, it was possible to imagine that my introduction would survey all relevant work. My plan was to cite Franco Moretti and Matthew Jockers, of course, as well as a few classic works of humanities computing, alongside newer studies by Ted Underwood and, especially, Peter de Bolla.⁹ Now, as I finish the manuscript in 2021, I see so many studies of such incredible variety that I realize it’s no longer possible to sum them up. The last few years have seen the publication of several major monographs. Books by Underwood and Piper have probed the boundaries between genres while updating and better specifying the methods of distant reading.¹⁰ Katherine Bode has traced the history of the Australian novel.¹¹ Sarah Allison and Daniel Shore have explored the dispersion of literary tropes.¹² Numerous collections have been published and new journals founded; countless special issues have appeared. To these can be added a wide range of articles and book chapters that describe individual case studies and experiments in humanities computing. To offer just a few examples: Kim Gallon has charted the history of the African American press; Richard Jean So and Hoyt Long have used machine learning to test the boundaries of literary forms; and Nicole M. Brown has brought text mining to feminist technoscience.¹³ Scholars like Dennis Yi Tenen, Mark Algee-Hewitt, and Peter de Bolla offer new models for basic notions like space, literary form, and conceptuality.¹⁴ Manan Ahmed, Alex Gil, Moacir P. de Sá Pereira, and Roopika Risam use digital maps for explicitly activist purposes to trace immigrant detention, while others use geographic information systems (GIS) for more conventional literary-critical aims.¹⁵ The subfield of geographical textual analysis is one of the most innovative areas of research: important new studies have appeared from Ian Gregory, Anouk Lang, Patricia Murrieta-Flores, Catherine Porter, and Timothy Tangherlini.¹⁶ Within the field of early modern studies, essays by Ruth and Sebastian Ahnert, Heather Froehlich, James Jaehoon Lee, Blaine Greteman, Anupam Basu, Jonathan Hope, and Michael Witmore have used quantitative techniques for describing the social and semantic networks of early print.¹⁷

This work comes from a bewildering variety of disciplinary perspectives. Although controversy still occasionally swirls around the question of whether corpus-based methods can compete with close reading for the purpose of literary criticism, such debates miss a larger and more important point.¹⁸ We are undoubtedly in the midst of a massive shift in the study of textuality. If economists are studying corpora, something important has changed, not just about the discipline of economics but also about the social realization of discourse more broadly. The processes by which discourse is segmented into texts, disseminated, stored, and analyzed have fundamentally altered. From the perspective of any single discipline, this change is experienced as the availability of new evidence (“big data,” “words as data,” “digital archives,” and such) and as the intrusion of alien methods (“topic modeling,” “classification algorithms,” “community detection,” and so on). But when you step back and look at how this kind of work has swept across so many disciplines, the picture looks very different. Considered together, this research represents an extraordinary event in the long history of textuality. More or less all at once, and across many research domains in the humanities and social sciences, the corpus has emerged as a major genre of cultural and scientific knowledge.

To place the emergence of the corpus at the center of our understanding offers a different perspective on the rise of cultural analytics within the humanities. It is often said or assumed that the basic story of computer-based literary analysis involves a confrontation between explanatory regimes represented by a host of binary oppositions: between qualitative and quantitative methods, between close and distant reading, between humans and machines, between minds and tools, or between the humanities and the sciences. This formulation has always struck me as not quite wrong, exactly, but not quite right. Research fields that were already comfortable with many forms of statistical inquiry, like political science, still need to learn and invent techniques for measuring, evaluating, and interpreting corpora. It’s not that quantification is moving from one place to another but that researchers across domains have been suddenly tasked with learning (or, more precisely, with discovering) what methods are appropriate and useful for understanding these new textual forms. For this reason, what other people call “humanists using digital tools,” I see differently: as one small piece of a large and diffuse interdisciplinary project devoted to learning how to use textual modeling to describe and explain society.

Rather than see quantitative humanities research as an intrusion from outside, I think of it as our field’s contribution to this important project. Because of humanists’ hard-won knowledge about the histories of cultures, languages, and literatures, we are uniquely well positioned to evaluate and innovate computational measures for their study. By interdisciplinary standards, we have the advantage of working with data that is relatively small and that is drawn from sources that are already well understood. Whereas an analysis of social media by public-health experts might sift through billions of tweets of uncertain provenance, our textual sources tend to number in the thousands, and they’ve often been carefully cataloged by archivists. Select canonical works have been read carefully by many scholars over decades, and many more lesser-known texts have been studied in extraordinary detail. Even more importantly, we are trained readers and philologists who are sensitive to the vagaries of language and are therefore especially alert to distortions that statistical modeling exerts on its sources. As scholars of meaning and textuality, we are in the best position to develop a general theory for corpora as textual objects and to understand exactly how the signifying practices of the past echo through the data. Put simply, we know the texts and understand how they work, so it makes sense for us to help figure out what can be learned by counting them. Such is the task before us, as I see it.

Here, then, are the guiding questions of this book: What are corpora? What theory is required for their description? What can be learned by studying them?

To ask these questions is different from asking, “What can digital methods contribute to literary studies?” That’s the question most scholars—practitioners and skeptics alike—tend to emphasize, but it’s a very bad question to start with. It places all emphasis on “findings” and “results” while leaving little room for theoretical reflection. It creates an incentive for digital humanists to pose as magicians by pulling findings out of black-box hats, while also licensing a closed-minded “show me whatcha got” attitude among critics who believe they can evaluate the merits of research without having to understand it. I believe this to be unfortunate, because what literary scholars bring to the interdisciplinary table is their robust and sustained attention to textual forms. We have elaborate theories for poetic and novelistic structures, yet cultural analytics and the digital humanities have proceeded to date without a working theory of the corpus as an object of inquiry—without pausing to consider the corpus itself as a textual form. By jumping the gun to ask what corpora can teach us about literature, we miss the chance to ask how quantification transforms the very texts we study. Many impressive case studies have been published in the last ten years, and I believe they have largely succeeded in their stated goals by demonstrating the efficacies of various computational methods. But I do not believe that any provide a satisfactory general answer to the questions (restated from above in somewhat different terms) that have nagged me from the beginning: What are these new textual things? What does the world look like through the perspective they provide? What new genres of critical thinking might they inform or enable? This book represents my best effort to answer these questions and to develop an understanding of corpus-based inquiry from something like the ground up.

The Argument

First, a preview of the argument. In this book, I will argue the following:

Corpus-based analysis involves a specific intellectual practice that shouldn’t be called “distant reading” because it really has little to do with reading. I call that practice describing the distribution of difference. Across any collection of documents, variations that reflect meaningful differences in the histories of their production can be discovered. Depending on the corpus and the analysis that’s brought to bear on it, these variations can be observed at a wide scale, revealing differences across broad outlines, and they can be highly granular, revealing what’s particular about any given document or word. However, in order to engage in this practice more effectively, we need a good working theory, a general account of the corpus that is grounded in mathematics but sensitive to the histories of textual production behind its source documents. We also need good middle-range theories to justify the statistical proxies we use to represent those histories. In the chapters that follow, I’ll borrow concepts from network science, computational linguistics, and quantitative geography to demonstrate how corpora represent relations among persons, words, and places. But I’ll also argue that we have an imperative to innovate. We can’t simply transpose ideas from one domain to another without being willing to get down to the theoretical basics and to offer new accounts of key concepts.

In support of this overarching argument, I will put forward two main supporting claims.

The first is an extremely technical claim, the precise details of which will matter to relatively few readers. I will propose a hyperspecialized definition for the word corpus. Conventionally, a corpus is defined by linguists as “a set of machine-readable texts.”¹⁹ This definition has the merit of simplicity, but I believe it to be inadequate, because it leaves unmentioned the role played by bibliographical metadata. For any corpus-based cultural analysis, whether it involves a few hundred novels or billions of tweets, the key step always hinges on comparing and contrasting different sources. To do this, researchers need both good metadata and an analytical framework for identifying the most relevant lines of comparison and establishing the terms under which they’ll be evaluated statistically. I’ll argue that any corpus is best defined in mathematical terms as a topological space with an underlying set of elements (tokens) described under a topology of lexical and bibliographical subsets. That is admittedly a mouthful. I’ll explain what I mean by it in chapter 4. For now, the main point is simply to say that our understanding of the corpus should be grounded in a theoretical framework that anticipates the quantitative methods we plan to apply. To understand what’s happening in any collection, we need to be able to describe how words cluster together in the documents, and we need to be able to correlate those clusters with the generic, social, temporal, and spatial properties of the source texts. The first goal of corpus-based cultural analysis is to explain with confidence who wrote what, when, and where. With that goal in mind, we should begin with a theory of the corpus that foregrounds its peculiar ability to blend textual data with contextual metadata and thereby to represent both text and context as a single, mutually informing mathematical abstraction. I will invite you to see words as something other than words—as countable instances that mark the points of intersection between language and history.
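The idea of tokens described by lexical and bibliographical subsets can be made concrete with a toy illustration. What follows is only a minimal sketch under stated assumptions, not the formal definition deferred to chapter 4: the three documents, their metadata fields, and the names `build_corpus` and `count` are all hypothetical, and the code makes no attempt to construct a topology. It shows only how each word occurrence can belong at once to a lexical subset (all tokens of the same word) and to bibliographical subsets (all tokens from documents sharing a metadata value), so that a count is the size of an intersection.

```python
from collections import defaultdict

# Hypothetical toy documents: (doc_id, metadata, text).
DOCUMENTS = [
    ("d1", {"year": 1941, "place": "London"}, "war news from the front"),
    ("d2", {"year": 1941, "place": "New York"}, "war bonds and war news"),
    ("d3", {"year": 1935, "place": "London"}, "theatre news and reviews"),
]


def build_corpus(documents):
    """Return the token list plus lexical and bibliographical subsets of token indices."""
    tokens = []                          # underlying set: one entry per word occurrence
    lexical = defaultdict(set)           # word -> indices of tokens of that word
    bibliographical = defaultdict(set)   # (field, value) -> indices of tokens in matching documents
    for doc_id, metadata, text in documents:
        for word in text.lower().split():
            index = len(tokens)
            tokens.append((doc_id, word))
            lexical[word].add(index)
            for field, value in metadata.items():
                bibliographical[(field, value)].add(index)
    return tokens, dict(lexical), dict(bibliographical)


def count(lexical, bibliographical, word, field, value):
    """Count tokens of `word` in documents whose metadata has field == value."""
    word_tokens = lexical.get(word, set())
    context_tokens = bibliographical.get((field, value), set())
    return len(word_tokens & context_tokens)


tokens, lexical, bibliographical = build_corpus(DOCUMENTS)
war_1941 = count(lexical, bibliographical, "war", "year", 1941)
news_london = count(lexical, bibliographical, "news", "place", "London")
print(len(tokens), war_1941, news_london)
```

In this sketch the question “who wrote what, when, and where” reduces to set intersection: the same token index appears under its word, its year, and its place.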

The second claim follows from this reconceptualization of the corpus as an object of inquiry. It’s a much broader claim and will be of interest, I hope, to all readers of this book. Put simply, I’ll argue that corpus-based inquiry works—that to study corpora is an effective way to learn about the world. Why? Because the documents that make up any corpus were written on purpose by people who meant to write them—people who were motivated in many cases because they cared deeply about their topics and believed (or at least hoped) their readers might care just as much.²⁰ This means that their intentions and their lives echo through the data. If that sounds fanciful, it’s not. In fact, if you consider just a few examples, you’ll see that this proposition is quite obvious (at least in its most basic formulation): Imagine a collection of American newspaper articles from 1930 to 1950. You don’t have to read them to know that articles from 1939 to 1945 will have a lot to say about the topic of war. Similarly, a collection of novels set in New York or Los Angeles will refer to different buildings and streets from novels set in London or Tokyo. The same principle holds for any collection of documents. Food blogs will differ from obituaries; obituaries, from job ads. Essays by one author will differ from those by another; sermons from earlier centuries will differ from those preached today. No matter the axis of differentiation—genre, historical period, geographical place, author, subject, or virtually anything recorded in our bibliographical metadata—differences in the ways we categorize texts will tend to correspond with differences in their contents (that is, in the words they use). And those differences will tend to correspond, in turn, with things that mattered to the authors and their readers, and therefore to their shared histories. Of course, in any large corpus there will be plenty of exceptions—outliers or anomalies that have little to do with larger trends—but the general tendencies usually hold, and this correspondence, I’ll argue, is why corpus analysis works.

However, as might be obvious, to say that corpus analysis “works” in this way is to say it does something very different from reading or interpretation—terms that can be very misleading when used to describe the practices of intellection I’ll demonstrate in this book.²¹ To read a text critically is to attempt to comprehend its meaning while remaining on the lookout for ideological distortions, to consider the enabling constraints of genre, to imagine or describe its reception by other readers, and to evaluate the text and its author for credibility and bias. Although many computational methods can be used to assist with critical reading in various ways, reading is not what those methods most directly involve.²² (This disjunction has caused a great deal of confusion and frustration, not only among skeptics of corpus-based literary analysis but also, I think, among practitioners who contort their studies while attempting to meet the demands of such critics.) Instead, the main function of corpus analysis is to measure and describe how discourse is situated in the world. Its fundamental question is not, “What does this text mean?” but “Who wrote what, when, and where?”²³ Theodor Adorno once remarked that “topological thinking . . . knows the place of every phenomenon and the essence of none.”²⁴ He meant it in a bad way, but nonetheless I agree. The goal of corpus analysis is not to explain the presumed essence of its object but to describe the observable relations among its parts. Quantitative textual analysis does not ask what words or texts really are; it describes how they’re actually situated in relation to each other.

To turn from the real to the actual is a powerful act. Corpus-based inquiry is so effective across so many research domains because documents tend to correspond, one way or another, to the actual lives of their authors and readers. The kind of reasoning it facilitates is fundamentally dialectical. To learn about words, you need to know who wrote and read them, when and where. To learn about people, you need to learn what words they wrote and read, when and where. By aligning biographical, temporal, geographical, and lexical metadata and bringing them to bear as descriptive frameworks for historical documents, corpus analysis provides an extraordinarily powerful set of techniques for identifying correspondences across all these axes of comparison.

For this reason, quantitative textual analysis also draws together insights from other disciplines:

• In network science, the principle of homophily suggests that similar people will tend to travel in similar social circles.²⁵ Doctors are more likely to know and interact with other doctors than they are with, say, police officers. These differences correlate to what we might think of as kinds of people, and the weak ties among them form conduits through which information and influence flow. Robert K. Merton and Paul F. Lazarsfeld are sometimes credited with discovering this principle in 1954, but in fact psychologists, sociologists, and others had been working with similar ideas much earlier, and this principle has been foundational to the study of social networks more or less since the beginning.²⁶ The idea itself is simple enough to be stated in the terms of an ancient aphorism: “Birds of a feather flock together.”

• In corpus linguistics, the distributional hypothesis proposes that similar words tend to be used in similar contexts.²⁷ Medical reports use different words from police reports. These differences correlate to meaning—terms like “pancreatic” and “embolization” will appear together in clusters, separate from but much like “burglary” and “perpetrator.” Just as social networks form into cliques, so too words gather into topical clusters. The distributional hypothesis was first formally proposed by the linguist Zellig Harris, but Warren Weaver and other scientists also contributed to its development in the wake of World War II.²⁸ Indeed, the distributional hypothesis is so similar to the principle of homophily that computational linguists often describe it using J. R. Firth’s variation on a similar aphorism: “You shall know a word by the company it keeps.”²⁹

• In quantitative geography, the principle of spatial autocorrelation holds that nearby places will tend to have similar attributes.³⁰ Neighboring towns will sit at relatively even elevations, have similar annual rainfalls, and share common demographics and cultures. For this reason, global averages are usually less meaningful than local ones. It doesn’t tell you much to say that the elevation of Denver is above average, if that average was calculated over the entire United States; city planners need to know whether it’s higher or lower than nearby places in central Colorado. As a premise, spatial autocorrelation goes back at least to John Snow’s map of the cholera outbreak in 1854. Sometimes this principle is given as Tobler’s First Law, which states that, in a geographic system, “everything is related to everything else, but near things are more related than distant things.”³¹
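Of the three principles listed above, spatial autocorrelation is perhaps the easiest to express as a single number. The sketch below uses Moran's I, a standard statistic from quantitative geography; the passage itself names no particular measure, so the choice is an assumption made for illustration, as are the four imaginary towns, their values, and the neighbor matrix.

```python
def morans_i(values, weights):
    """Moran's I for values[i] observed at place i.

    weights[i][j] is positive when places i and j count as neighbors, else 0.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Sum of products of deviations over every neighboring pair.
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / squared)


# Hypothetical example: four towns along a road; each town neighbors the next one.
LINE_NEIGHBORS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 5, 5], LINE_NEIGHBORS)    # similar values sit side by side
alternating = morans_i([1, 5, 1, 5], LINE_NEIGHBORS)  # every neighbor differs
print(round(clustered, 3), round(alternating, 3))
```

A positive result indicates that nearby places resemble one another; a negative result indicates that neighbors tend to differ. The same logic of comparing each element with its neighbors, rather than with a global average, underlies the network and linguistic principles as well.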

Corpus-based inquiry draws together data about words, people, and places, thus bringing the distributional hypothesis, homophily, and spatial autocorrelation into a productive new theoretical relation. Corpora identify that point of intersection between modalities that draws these notions together and reveals them to be variants of a single idea. Reasoning out from these premises, we might ask: What do we actually mean when we say two people, two places, or any two things are similar, except that we use similar words to describe them? What does it actually mean to say that two words are similar, except that they’re used by similar people in similar places when talking about similar things? By the end of the book, this will be my central argument: Corpora propose a measurable tautology between language and actuality that subjects the symbolic production of human experience to new forms of mathematical description. This tautology explains why corpus analysis works, why corpora have so much explanatory power, and why they provide so rich a source of evidence across so many research domains. I express this tautology with the phrase, Similar words tend to appear in documents with similar metadata.
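The phrase “Similar words tend to appear in documents with similar metadata” can be checked on a toy corpus. The following is a minimal sketch under stated assumptions, not a method taken from the chapters that follow: the four miniature documents, the single metadata field (genre), and the function names are hypothetical, and cosine similarity is one common choice of measure among many. Each word is profiled by how often it occurs under each metadata value, and two words count as similar when their profiles point in the same direction.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (genre metadata, text).
DOCS = [
    ("medical", "pancreatic embolization reported pancreatic"),
    ("medical", "embolization procedure reported"),
    ("police", "burglary perpetrator reported"),
    ("police", "perpetrator fled burglary burglary"),
]


def word_profiles(docs):
    """Map each word to its counts across metadata categories."""
    profiles = defaultdict(Counter)
    for genre, text in docs:
        for word in text.split():
            profiles[word][genre] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two Counter profiles (0.0 if either is empty)."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


profiles = word_profiles(DOCS)
same_genre = cosine(profiles["pancreatic"], profiles["embolization"])
cross_genre = cosine(profiles["pancreatic"], profiles["burglary"])
shared = cosine(profiles["reported"], profiles["pancreatic"])
print(same_genre, cross_genre, round(shared, 3))
```

Words confined to the same genre score as maximally similar, words confined to different genres score zero, and a word such as “reported,” which appears in both, falls in between. With richer metadata (author, date, place) the profiles gain more dimensions, but the comparison works the same way.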

The most difficult task facing cultural analytics, I believe, is to develop a body of quantitative theory and a general set of procedures for exploring this correspondence. For lack of a better name, I have taken to calling this nascent project literary mathematics. By that term I refer to a category of intellectual work at the center of cultural analytics that has received relatively little discussion. Not learning to code; learning the math. When commentators talk about “computational methods” in “digital humanities,” they’re usually talking about “tools”—about software that can be used to advance the extremely vague ambition of “finding patterns in the data.”³² But the most interesting and important aspects of such work relate only obliquely to the task of managing computer software. Instead, quantitative textual analysis involves sampling ideas from radically different disciplines and folding them into our own ways of thinking, such that we learn to see in our texts, not patterns, but structures that can be defined, described, and evaluated formally using mathematical expressions that name relations among textual parts. Literary mathematics is not a branch of mathematics; rather, it stretches like a loose and fragile cobweb across already existing and much sturdier branches—most importantly among them: graph theory, matrix algebra, and statistics. However, the phrase is also meant to suggest that there remains work to do. The theories and concepts we need aren’t already somehow given; we can’t take them for granted. The question of which quantitative models are appropriate for describing which textual phenomena remains very much open.

This book’s use of mathematics will have different stakes for different groups of scholars. The primary readers I mean to address are those in the field of cultural analytics, including people who might be hoping to learn or are merely curious. For experienced researchers in the field, the mathematical operations I’ll describe will all be familiar. In some places, my methods might even seem simple or willfully pared down to their basics. The chapters that follow will make no use of cutting-edge machine-learning algorithms. No topic modeling. No word-embedding models. No neural networks. No black boxes. That’s intentional. What’s needed for rigorous computational humanities research, I’ll argue, is not that it borrows methods from the vanguard of technological advancement, but things altogether more pedestrian: good corpora, accurate metadata, and a clear-eyed attention to their interrelation.

For others—for humanists looking to enter the field or students hoping to learn—the math will be unfamiliar and perhaps even daunting. Math can pose a seemingly insurmountable obstacle to one’s own research goals. To these scholars, I hope to show that their fear can be set aside. Armed with a good corpus and a clear sense of what can and can’t be accomplished, anyone can learn to perform corpus-based analysis and to write about their work with confidence and clarity. All methods used in this study could be learned in a couple of semesters by graduate students or aspiring faculty. Each case study provides a model to follow, and the final chapter summarizes the most crucial theoretical concepts. Further, this book’s online supplement features easy-to-use datasets and easy-to-execute sample code, along with tutorials and student-oriented documentation. Those tutorials provide a walkthrough for each chapter, explaining the mathematical procedures in more detail and providing further examples of my own and samples of student work.³³

As corpus-based inquiry continues to gain footing across the disciplines, information scientists will continue to advance methods for modeling language-in-use. Humanists and social scientists will continue to evaluate those methods against known cases, every now and then pushing the methods in new directions where needed. Together, whether working in collaboration or (more often) in parallel, such research will advance our shared understanding of how computationally recorded discourse can be studied to tackle big questions that matter. But to be most successful it will require, I believe, a guiding theory, a motivating proposition about how history and society are represented through corpora and about how discourse, measured at scale, reflects the actualities of its production. The purpose of this book is to sketch an outline of such a theory.

Notes

1. Henry E. Brady, “The Challenge of Big Data and Data Science,” Annual Review of Political Science 22, no. 1 (2019): 297–323, 298.

2. Michael Laver, Kenneth Benoit, and John Garry, “Extracting Policy Positions from Political Texts Using Words as Data,” American Political Science Review 97, no. 2 (2003): 311–31.

3. Jonathan B. Slapin and Sven-Oliver Proksch, “A Scaling Model for Estimating Time-Series Party Positions from Texts,” American Journal of Political Science 52, no. 3 (2008): 705–22; Sven-Oliver Proksch et al., “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches,” Legislative Studies Quarterly 44, no. 1 (2019): 97–131; and Tamar Mitts, “From Isolation to Radicalization: Anti-Muslim Hostility and Support for ISIS in the West,” American Political Science Review 113, no. 1 (2019): 173–94. For overviews of the field, see Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts,” Political Analysis 21, no. 3 (2013): 267–97; and John Wilkerson and Andreu Casas, “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges,” Annual Review of Political Science 20, no. 1 (2017): 529–44.

4. Charles E. Osgood, “The Nature and Measurement of Meaning,” Psychological Bulletin 49, no. 3 (1952): 197–237.

5. Thomas K. Landauer, “LSA as a Theory of Meaning,” in Handbook of Latent Semantic Analysis, ed. Landauer et al. (Mahwah, NJ: Lawrence Erlbaum Associates, 2007), 3–34. In the same volume, see Susan Dumais, “LSA and Information Retrieval: Getting Back to Basics,” 293–322. A more recent application of semantic modeling to the science of meaning can be found in Gabriel Grand et al., “Semantic Projection: Recovering Human Knowledge of Multiple, Distinct Object Features from Word Embeddings,” arXiv preprint arXiv:1802.01241 (2018).

6. Corpus-based research in these fields is abundant. See, for example, John W. Mohr and Petko Bogdanov, “Topic Models: What They Are and Why They Matter,” Poetics 41, no. 6 (2013): 545–69; Kyung Hye Kim, “Examining US News Media Discourses about North Korea: A Corpus-Based Critical Discourse Analysis,” Discourse and Society 25, no. 2 (2014): 221–44; Debarchana Ghosh and Rajarshi Guha, “What Are We ‘Tweeting’ about Obesity?: Mapping Tweets with Topic Modeling and Geographic Information System,” Cartography and Geographic Information Science 40, no. 2 (2013): 90–102; Frank Fagan, “Big Data Legal Scholarship: Toward a Research Program and Practitioner’s Guide,” Virginia Journal of Law and Technology 20, no. 1 (2016): 1–81; and Nick Obradovich et al., “Expanding the Measurement of Culture with a Sample of Two Billion Humans,” NBER Working Paper Series (Cambridge, MA: National Bureau of Economic Research, 2020). For a theoretical overview of quantitative theory for cultural sociology, see John W. Mohr et al., Measuring Culture (New York: Columbia University Press, 2020).

7. Franco Moretti, Distant Reading (London: Verso, 2013); Lev Manovich, “The Science of Culture?: Social Computing, Digital Humanities, and Cultural Analytics,” Journal of Cultural Analytics 1, no. 1 (2016), https://doi.org/10.22148/16.004; Andrew Piper, “There Will Be Numbers,” Journal of Cultural Analytics 1, no. 1 (2016), https://doi.org/10.22148/16.006.

8. Laver, Benoit, and Garry, “Extracting Policy Positions,” 326.

9. The year 2013 was an annus mirabilis for quantitative literary theory. That year saw the publication of Franco Moretti’s Distant Reading; Matthew L. Jockers’s Macroanalysis: Digital Methods and Literary History; Ted Underwood’s Why Literary Periods Mattered; and Peter de Bolla’s The Architecture of Concepts: The Historical Formation of Human Rights.

10. See Ted Underwood, Distant Horizons: Digital Evidence and Literary Change (Chicago: University of Chicago Press, 2019) and Andrew Piper, Enumerations: Data and Literary Study (Chicago: University of Chicago Press, 2018).

11. Katherine Bode, A World of Fiction: Digital Collections and the Future of Literary History (Ann Arbor: University of Michigan Press, 2019).

12. Sarah Allison, Reductive Reading: A Syntax of Victorian Moralizing (Baltimore: Johns Hopkins University Press, 2018); and Daniel Shore, Cyberformalism: Histories of Linguistic Forms in the Digital Archive (Baltimore: Johns Hopkins University Press, 2018).

13. See Kim Gallon, “Making a Case for the Black Digital Humanities,” in Debates in the Digital Humanities, ed. Matthew K. Gold and Lauren F. Klein (Minneapolis: University of Minnesota Press, 2016), 42–49; Kim Gallon, The Black Press Research Collective, http://blackpressresearchcollective.org/about/; Hoyt Long and Richard Jean So, “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning,” Critical Inquiry 42, no. 2 (2016): 235–67, and “Turbulent Flow: A Computational Model of World Literature,” Modern Language Quarterly 77, no. 3 (2016): 345–67; Nicole M. Brown et al., “Mechanized Margin to Digitized Center: Black Feminism’s Contributions to Combatting Erasure within the Digital Humanities,” International Journal of Humanities and Arts Computing 10, no. 1 (2016): 110–25; and “In Search of Zora/When Metadata Isn’t Enough: Rescuing the Experiences of Black Women through Statistical Modeling,” Journal of Library Metadata 19, no. 3–4 (2019): 141–62.

14. See Dennis Yi Tenen, “Toward a Computational Archaeology of Fictional Space,” New Literary History 49, no. 1 (2018): 119–47; Mark Algee-Hewitt, “Distributed Character: Quantitative Models of the English Stage, 1550–1900,” New Literary History 48, no. 4 (2017): 751–82; and Peter de Bolla et al., “Distributional Concept Analysis,” Contributions to the History of Concepts 14, no. 1 (2019): 66–92.

15. For the full list of contributors to the Torn Apart project, see the project’s credits at https://xpmethod.columbia.edu/torn-apart/credits.html. This work is described in Roopika Risam, New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy (Evanston, IL: Northwestern University Press, 2018).

16. See, for example, Ian N. Gregory and Andrew Hardie, “Visual GISting: Bringing Together Corpus Linguistics and Geographical Information Systems,” Literary and Linguistic Computing 26, no. 3 (2011): 297–314; Anouk Lang, “Visual Provocations: Using GIS Mapping to Explore Witi Ihimaera’s Dear Miss Mansfield,” English Language Notes 52, no. 1 (2014): 67–80; Patricia Murrieta-Flores and Naomi Howell, “Towards the Spatial Analysis of Vague and Imaginary Place and Space: Evolving the Spatial Humanities through Medieval Romance,” Journal of Map and Geography Libraries 13, no. 1 (2017): 29–57; Catherine Porter, Paul Atkinson, and Ian Gregory, “Geographical Text Analysis: A New Approach to Understanding Nineteenth-Century Mortality,” Health and Place 36 (2015): 25–34; and Peter M. Broadwell and Timothy R. Tangherlini, “WitchHunter: Tools for the Geo-Semantic Exploration of a Danish Folklore Corpus,” Journal of American Folklore 129, no. 511 (2016): 14–42.

17. See Ruth Ahnert and Sebastian E. Ahnert, “Protestant Letter Networks in the Reign of Mary I: A Quantitative Approach,” ELH 82, no. 1 (2015): 1–33; Heather Froehlich, “Dramatic Structure and Social Status in Shakespeare’s Plays,” Journal of Cultural Analytics 5, no. 1 (2020), https://doi.org/10.22148/001c.12556; James Jaehoon Lee et al., “Linked Reading: Digital Historicism and Early Modern Discourses of Race around Shakespeare’s Othello,” Journal of Cultural Analytics 3, no. 1 (2018), https://doi.org/10.22148/16.018; and Anupam Basu, Jonathan Hope, and Michael Witmore, “The Professional and Linguistic Communities of Early Modern Dramatists,” in Community-Making in Early Stuart Theatres: Stage and Audience, ed. Roger D. Sell, Anthony W. Johnson, and Helen Wilcox (London: Routledge, 2017), 63–94.

18. For my overview and response to recent controversies in cultural analytics, see Michael Gavin, “Is There a Text in My Data? (Part 1): On Counting Words,” Journal of Cultural Analytics: Debates 5, no. 1 (2020), https://doi.org/10.22148/001c.11830.

19. Tony McEnery and Andrew Hardie, Corpus Linguistics: Method, Theory, and Practice (Cambridge: Cambridge University Press, 2012), 1.

20. Adam Kilgarriff advances a similar argument in “Language Is Never, Ever, Ever, Random,” Corpus Linguistics and Linguistic Theory 1, no. 2 (2005): 263–75.

21. The word interpretation is particularly debilitating because of how it promises to explain the meaning of data when visualized without needing to consider whether the underlying mathematical operations are appropriate, or even how they work. I believe this to be the central flaw of Franco Moretti’s later work. This flaw can also be thought of as the flip side of his loose and unprincipled use of data, which Katherine Bode objects to in “The Equivalence of ‘Close’ and ‘Distant’ Reading; or, Toward a New Object for Data-Rich Literary History,” Modern Language Quarterly 78, no. 1 (2017): 77–106.

22. For a virtuoso performance of computationally assisted close reading, see Martin Paul Eve, Close Reading with Computers: Textual Scholarship, Computational Formalism, and David Mitchell’s Cloud Atlas (Stanford, CA: Stanford University Press, 2019).

23. Stanley Fish has argued that “the purpose of literary interpretation is to determine what works of literature mean; and therefore the paradigmatic question in literary criticism is ‘What is this poem (or novel or drama) saying?’” Professional Correctness: Literary Studies and Political Change (Oxford: Clarendon Press, 1995), 25. Fish has been a longtime critic of computational literary analysis and has published many essays in its opposition. For an early example of this particular genre of scholarly performance, see “What Is Stylistics and Why Are They Saying Such Terrible Things about It?” chap. 2 in Is There a Text in This Class? The Authority of Interpretive Communities (Cambridge, MA: Harvard University Press, 1980).

24. Theodor W. Adorno, Prisms, trans. Samuel Weber and Shierry Weber (Cambridge, MA: MIT Press, 1967), 33.

25. For an introduction and overview of the concept of homophily, see Charles Kadushin, Understanding Social Networks: Theories, Concepts, and Findings (Oxford: Oxford University Press, 2012), 18–21.

26. See Paul F. Lazarsfeld and Robert K. Merton, “Friendship as a Social Process: A Substantive and Methodological Analysis,” in Freedom and Control in Modern Society, ed. Morroe Berger (New York: Van Nostrand, 1954), 18–66.

27. For the classic theoretical treatment of this idea, see Zellig Harris, “Distributional Structure,” Word 10, no. 2–3 (1954): 146–62.

28. This history is narrated in W. John Hutchins, Machine Translation: Past, Present, Future (Chichester, UK: Ellis Horwood, 1986).

29. J. R. Firth, “A Synopsis of Linguistic Theory, 1930–1955,” in Studies in Linguistic Analysis (Oxford: Oxford University Press, 1962), 11.

30. For an overview of spatial autocorrelation and techniques for its measurement, see A. Stewart Fotheringham, Chris Brunsdon, and Martin Charlton, Quantitative Geography: Perspectives on Spatial Data Analysis (Thousand Oaks, CA: Sage Publications, 2000), chap. 5.

31. Waldo Tobler, “A Computer Movie Simulating Urban Growth in the Detroit Region,” Economic Geography 46, no. sup1 (1970): 234–40.

32. There is no specific essay I am quoting here, but this instrumentalist assumption—which presumes textual computing is useful only for how it is used—is prevalent throughout the humanities.

33. At the time of publication, the supplementary website is located at www.literarymathematics.org.
