An Evolving Anthology

Text Analysis, Data Mining, and Visualizations in Literary Scholarship

Tanya Clement

A rumor prevails that literary scholars should and do neglect using digital applications that aid interpretation because most of these tools seem too objective or deterministic: digital tools seem to take the “human” (e.g., the significance of gender, race, class, religion, sexuality, and history) out of literary study. The thinking is that twentieth- and twenty-first-century literary (and cultural) theory, which tends to value the literary texts and aspects of them that resist simple evaluative resolutions, is incompatible with digital methodologies, which are supposedly geared toward simplifications and fast solutions.1 A consequence of this thinking is the perception that a literary scholar’s research questions are not readily transferable to modes of research, dissemination, design, preservation, and communication that rely on algorithms, software, and the Internet. Happily, this perspective is changing: in “The State of the Digital Humanities: A Report and a Critique,” Alan Liu asserts that text analysis, visualization, and data mining represent paradigmatic shifts in the work of the humanities that force scholars to reflect on the relation between information and new media and technology and that require them “to investigate underlying database, data-flow, cross-platform data architecture” (14). Liu poses a challenge to digital humanities scholars to show how these methodologies, which Franco Moretti and others have called “distant reading,” compare with and contribute to more traditional close reading practices (27). This essay answers this challenge by presenting several computer-assisted modes of scholarship that depend on differential (close and distant, subjective and objective) reading practices, technologies of self-reflection and collaboration, and the value of plausibility, all of which have always been crucial to literary inquiry.

One main thrust of the argument that literary study and digital methodologies are incompatible is that digital methodologies function outside the contexts that are meaningful to literary study. On the contrary, much of the rigorous literary scholarship that depends on digital methodologies is deeply entrenched in current traditions of humanist inquiry, as I demonstrate in this essay. The first part introduces what Marjorie Perloff calls “differential reading,” which positions close and distant reading practices as both subjective and objective methodologies. The second part discusses textual analysis and visualization and the extent to which differential reading requires the technologies of self-reflection or self-consciousness. In labeling self-reflection a technology, I take my cue from Richard Poirier, who claims, “All literature is to some extent aware of itself as a technology” (113), and from Martha Nell Smith, who defines technology as “the means by which we accomplish various ends—the tools and devices on which our critical suppositions rely” (“Computing” 836); both invite us to understand self-consciousness, access, and collaboration in humanist inquiry as technologies. Technologies of access and collaboration are the main focus of the third part, in which I look at the practice of data mining as a methodology squarely situated within traditionally humanist technologies. Finally, I look at plausibility in differential reading practices, present in both close or traditional and distant or new methodologies.

Differential Reading and the “Double Discipline” of Digital Humanities

There are at least two instances in which the computer is used as a tool for interpreting the work of Gertrude Stein. In his foreword to the 1995 Dalkey edition of Stein’s The Making of Americans (1925), William Gass mentions his computational tool, “a magnifying glass which [he] can draw down out of its shy place in the corner,” to “examine the layout of the page” by “enlarg[ing] and mak[ing] comprehensible some chosen bit” (vii). An earlier mention of computer usage occurs in Carolynn Van Dyke’s 1993 article on Stein’s novel Lucy Church Amicably (1930), “‘Bits of Information and Tender Feeling’: Gertrude Stein and Computer-Generated Prose.” Van Dyke creates “computer-generated texts similar to Stein’s work,” because, she hypothesizes, they “may help to illuminate [Stein’s] writing not only because they furnish a novel basis for comparison but also, more particularly, because their principles of composition can be described with some certainty” (169–70). What is significant about Gass’s and Van Dyke’s uses is not their differences but their similarities. Gass is using his tool to magnify one sentence of Stein’s text because he believes that it will augment his ability to examine the rest of the novel. Doing so is “convenient” and generalizable: he maintains that “almost any sentence would yield the same results” (vii). Van Dyke, in generating syntactic, semantic, and pragmatic patterns and comparing these with the style in Stein’s novel, seeks “to explore both the nature of Stein’s art and certain wider questions about linguistic and literary meaning” (170); that is, Van Dyke is also in pursuit of useful generalizations. Her digital tool is as much a magnifying glass as Gass’s: each is a tool that helps the scholar get a better look at a small part of the text to learn something about the workings of the whole.
In general, computer-assisted methodologies such as text analysis, visualizations, and data mining are just such tools, but they often provide the view the magnifying glass gives the user when he or she turns it upside down. These methodologies defamiliarize texts, making them unrecognizable in a way (putting them at a distance) that helps scholars identify features they might not otherwise have seen, make hypotheses, generate research questions, and figure out prevalent patterns and how to read them.

Both close and distant reading practices can facilitate interpretation through subjective and objective means. A useful term for thinking about how new methodologies engage similar modes of interpretation with what Geoffrey Harpham calls the “double discipline” (32) of subjective and objective practices is what Marjorie Perloff calls “differential reading.” Perloff’s call for interpretive practices that engage differential reading is specifically geared toward reinvigorating or reconceptualizing the practice of literary study in the double discipline of subjective objectivity (xxv–xxxiv). To this end, she identifies the following four approaches to literature:

  • As rhetoric or practical criticism: “the examination of diction and syntax, rhythm and repetition, and the various figures of speech” (6)
  • As philosophy or the “potential expression of truth and knowledge” (7)2
  • As art or a unique aesthetic construct—a form of discourse inherently other, of which the objective is the “pleasure of representation” and the “pleasure of recognition,” or the pleasure “of taking in impersonations, fictions, and language creations of others and recognizing their justice” (17)
  • As “cultural production”—“for its political role, its exposure of the state of a given society” (9)

Perloff argues that each of the classifications above must be adopted in conversation with the others; that is, notions of justice become contingent not only on the historicity of text (on its political role and its exposure of the state of this society positioned in this moment) but also on its rhetorical practices (the argument embodied in its structure and themes) and ultimately on its potential for expression and for giving pleasure in terms of representation and recognition. Perloff’s emphasis on a dialectic of knowledge (“connaissance”) comprising the “pleasure of representation” and the “pleasure of recognition” (“reconnaissance”) (17) is echoed by Stanley Fish, who calls theoretical work “the pleasure of making visible the work of so many hitherto invisible hands” (377), and Harpham, who argues that “the truth of the past exists largely in textual traces” that “can always be reassembled from a different point of view, with different emphases, presumptions, and priorities” (24). This process of assembly, Harpham continues, in addition to being pleasurable, “is accompanied, especially for the scholar, by a distinct sense of power.” Perloff sees this empowerment as engendered by a broadened perspective: “the wider one’s reading in a specified area, the greater the pleasure of a given text and the greater the ability to make connections between texts” (16).

Reading differentially (that is, closely and at a distance, subjectively and objectively) with new methodologies necessitates Liu’s mandated close attention to databases, data flows, and data architectures, including attention to the human element behind them. For example, a scholarly edition that incorporates archival materials and is published and accessed within the technological infrastructure (the digital repository) of an academic archive or library is a database with an interface that reflects, as Derrida reminds us, a structure of “archivization” (17). The interface poses as the objective gatekeeper to archives of seemingly agnostic content, but the content and its functionality generally reflect the situated and subjective practices of a particular institutional setting. In their introduction to the special issue Toward a Poetics of the Archive, Paul Voss and Marta Werner look dreamily toward the electronic archive that is devoid of a situational context:

On the cusp of the twenty-first century—now—we speak of an ex-static archive, of an archive not assembled behind stone walls but suspended in a liquid element behind a luminous screen; the archive becomes a virtual repository of knowledge without visible limits, an archive in which the material now becomes immaterial. (ii)

Yet electronic archives (the source materials of so many text analysis, data mining, and visualization methodologies) are always assembled behind very real stone walls, by very real people. Jerome McGann, in writing about the “categorical systems and subsystems (‘cross-references’)” used by archives, libraries, and museums, implies that any database is the result of an interface between a person and an archive:

No more than databases do these complex systems exhaust, or define, the multiple possible paths through which we may negotiate and (so to say) narrativize our way(s) through these great towers of Babel. . . . The physicality of an archive’s categorical system shows a flexibility that a database does not have, because a card catalog is itself an interfaced database. (“Responses” 1591)

While it is not within the purview of this discussion to debate whether the categorical systems that structure the archive are any less structured than the database, the notion that we are constantly met with interfaces (such as the card catalog) that reflect real structures with real people (with all of their quirks and fallibilities and imaginative wonderfulness) in real institutions reminds us how material and constructed (how situated) is the context in which the reader accesses and analyzes cultural content with text analysis, data mining, and visualization methodologies.

Textual Analysis, Visualization, and the Technologies of Self-Reflective Analysis

Using linguistic text analysis to chart subtle linguistic variations has some precedent in literary studies. John Burrows’s classic 1987 text Computation into Criticism, for instance, attempts “to prevent broad resemblances from obscuring subtle differences” by exploring how counting what had previously been considered “insignificant” words (e.g., “of,” “the,” “in”) can point to “larger” arguments about style in Jane Austen’s oeuvre (179–80). Burrows writes, “From no other evidence than statistical analysis of the relative frequencies of the very common words, it is possible to differentiate sharply and appropriately among the idiolects of Jane Austen’s characters and even to trace the ways in which an idiolect can develop in the course of a novel” (4). The multivariate statistical techniques that Burrows used are still cited as standard protocol in a study performed almost ten years later by Wayne McKenna and Alexis Antonia.3 McKenna and Antonia examine James Joyce’s “Nausicaa” episode from Ulysses by focusing on the modals “could” and “would,” the causal conjunctives “so” and “because,” and prepositional phrases beginning with “like.” When mapped and plotted against characters in other episodes, the authors argue, these words reflect the extent to which the character Gerty MacDowell does not understand her social world or have any power in it. Likewise, Stephen Ramsay has used StageGraph to cluster Shakespearean plays on the basis of low-level structural elements such as the length of acts, scene changes, and character movements (“In Praise of Pattern”). The clusters map remarkably well to genre classifications previously discussed by critics, a finding that allows Ramsay to explore the idea that abstract genres such as tragedy or history have a direct correlation to these “low hanging” structural elements.
By identifying quantifiable pieces of a text using word frequencies and locations, these scholars have generated computer-assisted close readings of the structures of texts that correspond to, contradict, or otherwise provide interesting insight into what has been assumed about the texts on an abstract level.

There are several freely available tools that can help scholars generate hypotheses about a digital text. For example, I used HyperPo (now evolved into Voyant) to compare word frequencies in The Making of Americans with word frequencies in a small sample of other texts on Project Gutenberg. Visualizing these results using a “stack graph for categories” visualization available from Many Eyes immediately shows one way the text is different from other well-known texts of comparable size.4 The results, as seen in figure 1, make it clear that The Making of Americans has the largest number of words, or tokens (see the label “Total”), the smallest number of unique words, or types (“Unique”), and the largest average word frequency (“Avg. Frequency”). This text also has the largest standard deviation (“STDV Frequency”), because each repeated word is repeated more times than is expected in a bell curve based on the previous three statistics. The graph also makes visible the fact that the low number of unique words (or words that are used at least once in each text) and the high average frequency of words in The Making of Americans are almost exactly the inverse of the numbers for Joyce’s Finnegans Wake. The distant view, of course, does not tell us anything about the nature of Joyce’s unique words or Stein’s repetition. For such insight we must turn to the texts, where we can see that Joyce includes words that are “word-sounds” such as “tauftauf thuartpeatrick” and “bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk!,” linked words such as “upturnpikepointandplace” and “devlinsfirst,” and many other experimental words that are only used once. On reading The Making of Americans, we can quickly see its experimentation: almost any page includes passages in which words are repeated over and over.
By using these tools to chart word frequencies from The Making of Americans and to compare these with other texts, we have fast proof that the repetition in the text is highly unusual. We can see that anomalies between and among texts are primarily structural. The next step requires differential reading to discover the function of these anomalies: like flipping the magnifying glass, we can use digital methodologies to move in closer and to draw our attention back out as we seek to understand the relation of patterns we see at a distance and those we see up close.
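
The four statistics charted in figure 1 can be approximated with a few lines of Python. What follows is only a minimal sketch, not the HyperPo or Many Eyes implementation: the function name `frequency_profile`, the crude tokenizer, and the choice of a population standard deviation are all assumptions made for illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def frequency_profile(text):
    """Return token, type, and word-frequency statistics for one text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    frequencies = list(counts.values())
    if not frequencies:
        return {"total": 0, "unique": 0,
                "avg_frequency": 0.0, "stdv_frequency": 0.0}
    return {
        "total": len(tokens),                   # running words (tokens)
        "unique": len(counts),                  # distinct words (types)
        "avg_frequency": mean(frequencies),     # average uses per distinct word
        "stdv_frequency": pstdev(frequencies),  # population standard deviation
    }


# Two of Stein's sentences quoted later in this essay serve as a toy sample.
sample = ("This is now a description of such feeling. "
          "This is now a description of my feeling.")
profile = frequency_profile(sample)
print(profile)
```

Running the same function over several full texts and comparing the resulting dictionaries would give a rough, table-like counterpart to the distant view described above; the numbers will vary with the tokenizer chosen.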

Figure 1. Table of word frequencies from texts comparable in size or composition date to Stein’s The Making of Americans. The novels are arranged in the order that they were published, starting with the earliest at the bottom.

Noting trends in word frequencies, however, provides us with a simplified view of the text. The computer’s ability to sort and illustrate quantified data helps identify patterns, but understanding why a pattern occurs and determining whether it is one that offers insight into a text requires technologies of self-reflective inquiry. Harpham sums up humanistic study as “[t]he scholarly study of documents and artifacts produced by human beings in the past [that] enables us to see the world from different points of view so that we may better understand ourselves” (23): the next step for the computing humanist, then, is to find or create computer-assisted research practices that are self-reflective or self-conscious, and many scholars are doing just that. John F. Sowa writes that “to be useful, a computer program must represent information about things in the world,” much of which can be interpreted in wildly different ways since “computerized information passes through many levels of representations of representations of representations” (186); Ben Shneiderman considers the effect of visualizations, which provide a window into research results but have an inherent limitation of space that results in “occlusion of data, disorientation, and misinterpretation”; and D. Sculley and Bradley Pasanek preface their discussion of data mining with the observation that these methodologies “will always be subject to experimenter bias” (413). Many digital humanists have been trained in literary study and use this training to consider new modes of inquiry in discussions concerning materiality and the digital text (Clement, “Digital Regiving”; Kirschenbaum; McGann, Radiant Textuality; Smith, “Importance”); Lacanian psycholinguistics and the computer screen’s “flickering signifier” (Hayles); and the French Oulipo movement as an influence in exploratory, computational analysis studies (Ramsay, “Reconceiving”; Rockwell; Sinclair).
Further, situated and self-reflective reading practices have directly affected how scholars think about the process of using these digital methodologies. John Unsworth touts “the importance of failure” within digital methods (“Documenting”). Willard McCarty’s notion of a “via negativa,” or a “negative way,” to knowledge involves an iterative, trial-and-error process (5; 39–41). This short list of examples reflects the awareness within digital humanities that “situated knowledges” are entwined in the “silences, absences, and distortions in dominant paradigms” that compose the layers of representations of representations of representations that literature and digital tools employ (Haraway; Hawkesworth 8).

Tools that support differential reading can also foster self-reflective or self-conscious critical practices by helping us interpret the patterns we see in texts and in how others have read texts. Vocabulary Management Profiles (VMP), for example, was developed to serve as a measuring stick that marks changes among passages or between points in a text where an author describes, tells, or analyzes a story. VMP 2.2 enables users to chart patterns that map to points in a text that indicate shifts in style, thus permitting differential reading. The software’s method of analysis assumes that new episodes, new settings, and new characters are signaled by an increase in new vocabulary while the description or analysis of these activities usually involves more repetition of already used words.5 The tool’s analytic procedure entails determining the average frequency ratio (between 0.0 and 1.0) for each word across a text (Youmans, “How to Generate VMP 2.2s”).6 In the visualizations this tool generates (see fig. 2 and fig. 3), the y-axis represents the frequency ratio while the x-axis represents the location of each word across the text. This relation is represented by a line graph that maps narrative style changes. Peaks on the charts signal new vocabulary words and thus new episodes, settings, or characters. Valleys signal repetition and thus “a continuation of the episode, description, or characterization” (Youmans, “Vocabulary-Management Profile”). Figure 2 tells us more about the behavior of repetitive patterns in The Making of Americans that we see in figure 1: the repetition in the text increases and decreases in certain spots of the narrative. In essence, these VMP 2.2 visualizations show points in the text at which the most dynamic changes in repetition occur.
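
The general idea of a ratio between 0.0 and 1.0 plotted against word position can be illustrated with a moving-window type/token ratio. This is a simplified stand-in and should not be read as Youmans’s VMP 2.2 procedure: the function name `moving_type_token_ratio` and the `window` size are hypothetical choices made for this sketch.

```python
def moving_type_token_ratio(tokens, window=35):
    """Type/token ratio for each window position, sliding one word at a time.

    Values near 1.0 mean most words in the window are distinct (a peak);
    values near 0.0 mean the window is dominated by repetition (a valley).
    The default window size is an arbitrary assumption.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    ratios = []
    # If the text is shorter than the window, the range is empty.
    for start in range(len(tokens) - window + 1):
        span = tokens[start:start + window]
        ratios.append(len(set(span)) / window)
    return ratios


# Toy input: four distinct words followed by pure repetition.
tokens = "a b c d a a a a".split()
ratios = moving_type_token_ratio(tokens, window=4)
print(ratios)
```

Plotting the returned list with word position on the x-axis would give a line that falls as the toy text turns repetitive, which is the kind of valley the paragraph above associates with description or analysis.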

Figure 2. A VMP 2.2 visualization of narrative versus description in The Making of Americans.

The VMP 2.2 visualizations afford me a new perspective from which I can read passages that scholars have identified as significant in The Making of Americans. The hump at point B in figure 3 and the zoomed-in version of the same pattern in figure 4 indicate the “Hodder episode.” Most critics date the composition of The Making of Americans, published by Contact in Paris in 1925, to have begun in 1901 (Katz) or 1903 (Wald) and concluded in 1911; during this time, Stein was also working on Q.E.D. (1903), Fernhurst (1904–05), and the character sketches published as Three Lives (1906). The Hodder episode is considered significant because this short narrative mirrors an incident in Stein’s circle between her acquaintances Mary Gwinn and Alfred Hodder and is also fictionalized in Q.E.D., a novel it is argued Stein was unable to publish because it portrays an affair between two women. In the episode, one of the women, Cora Dounor, is having an affair with Martha Hersland’s husband, Phillip Redfern (Alfred Hodder’s supposed fictional counterpart). The episode also marks “the longest sustained narrative in the plot” (Wald 286). Priscilla Wald sees this episode as “a key for the overall project,” as well as a key to reading the more traditional (diegetic) narratives within the text (286). Leon Katz, meanwhile, argues that the Hodder episode marks both the clumsy insertion of earlier drafts and the end of the novel’s “best writing in conventional idiom” with its “stunningly effective incantatory prose rhythms that lend color and great weight to the quality of her observation,” a style that Katz argues stems from Stein’s work in Three Lives and that would eventually become the experimental writing that emerges as Stein’s “aesthetic ideal,” which seeks “absolute consistency” between “overt subject matter and form” (224).

Figure 3. A VMP 2.2 visualization of “Martha Hersland” in a line chart (x-axis = location, y-axis = type/token).

This analysis provides for different readings. The data peak at point B, which corresponds to the Hodder episode in the text, shows that VMP 2.2 can mark the same episode as critics, on the basis of changes in the underlying textual patterns, or form, of the text. That critics’ findings are mirrored in the VMP 2.2 visualization confirms the worth of the tool’s analysis. Yet the overlap also demonstrates how digital methodologies can help us expand our understanding of texts. That is, the VMP 2.2 data for The Making of Americans show what critics have found, that Stein is both telling a story (diegesis) and explaining that story (exegesis) at the same moment in the text, and also elucidate patterns critics have disregarded. We can see from the visualization of the words in the Hodder episode in figure 4, for example, that the narrative style changes before the narrative has moved to the next topic: the words that correspond to the downward trend on the graph are in the paragraph that appears toward the conclusion of the Hodder episode, after Martha Hersland has read a letter from her husband’s desk and discovered that he is having an affair. The paragraph (and thus the episode) concludes with, “She read it to the end, she had her evidence.” This sentence ends at point A in figure 4. Though the story of Redfern’s infidelity ends at point A, the next paragraph shifts into the narrator’s first-person ruminations about the subjective meanings of categories about words (“the meaning of the words they are using . . . later have not any meaning”) in relation to Redfern:

Figure 4. The “Hodder episode” of The Making of Americans in a Spotfire scatter plot.

and some then have a little shame in them when they are copying an old piece of writing where they were using words that sometime had real meaning for them and now have not any real meaning in them . . . now I commence again with words that have meaning, a little perhaps I had forgotten when it came to copying the meaning in some of the words I have just been writing. Now to begin again with what I know of the being in Phillip Redfern, now to begin again a description of Phillip Redfern and always now I will be using words having in my feeling, thinking, imagining very real meaning . . . (441)

These later mentions of Redfern (as seen in fig. 4) map to Stein’s change to a more repetitive style and the narrator’s attempt to describe Redfern through diagrammic typing or repetition with variation. And so “redfern” and thus the subject of Redfern disappear at point B in figure 4, well after the change in style at point A, indicating that the two changes (subject and style), contrary to what critics have contended, are not necessarily concurrent. A clear pattern is emerging as a result of these changes from exegesis to diegesis and back; less clear is the extent to which pattern changes are a comment on the overt subject of the text (the affair), as Wald and Katz contend, or on identity construction, a more subtle subject that seems to emerge from a distant reading of the text (Clement, “‘Thing’”). Ultimately, these analytics and visualizations help us generate new knowledge by facilitating new readings of the text and by affording a self-reflective stance for comparisons, a perspective from which we can begin to ask why we as close readers have found some patterns and yet left others undiscovered.

Data Mining, Visualizations, and the Technologies of Collaboration

I developed a simple hypothesis from these initial explorations of word frequencies: arguments scholars make about The Making of Americans are based on limited knowledge of the text’s underlying structure because the underlying patterns are difficult to discern with close reading. Data-mining procedures proved productive in initially illuminating these complex structural patterns. Predictive data-mining analyses comprise three main steps. The overall goal is to examine a large collection of documents, such as the three thousand paragraphs constituting The Making of Americans. The first step is to determine decision criteria for classification. The decision criteria could include features of the text to be analyzed such as n-grams (n number of characters or words of text), parts of speech, or phonetic sounds and certain behaviors or relations such as words or phrases that are repeated or words or phrases that are collocated (found in proximity to one another). The second step is to use the data-mining algorithm to analyze and map the behavior of or patterns created by the decision criteria in a subset of the document collection or corpus. The third step is to use the data-mining algorithm to apply that mapping to new documents to find similar patterns or behaviors.
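
The three steps can be sketched in miniature. The example below is an illustration under stated assumptions and does not reproduce any particular data-mining toolkit: word bigrams stand in for the decision criteria, a per-label bigram profile stands in for the mapping learned from a subset, and cosine similarity applies that mapping to a new passage. The labels “repetitive” and “narrative” and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def featurize(text, n=2):
    """Step 1: decision criteria, here counts of word n-grams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def train(labeled_passages):
    """Step 2: map feature behavior in a labeled subset to one profile per label."""
    profiles = {}
    for text, label in labeled_passages:
        profiles.setdefault(label, Counter()).update(featurize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(profiles, text):
    """Step 3: apply the mapping to a new passage and return the closest label."""
    features = featurize(text)
    return max(profiles, key=lambda label: cosine(profiles[label], features))


# Hypothetical labels attached to sentences quoted elsewhere in this essay.
training = [
    ("This is now a description of such feeling", "repetitive"),
    ("This is now a description of my feeling", "repetitive"),
    ("She read it to the end, she had her evidence", "narrative"),
]
profiles = train(training)
guess = predict(profiles, "This is now a description of all of them")
print(guess)
```

A real study would need far more labeled passages and a held-out set for checking the predictions; the point here is only the shape of the workflow.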

In our case study for the MONK project, we used a frequent-pattern-analysis algorithm to extract features from Stein’s The Making of Americans for data mining. MONK (Metadata Offer New Knowledge) is a collaborative project funded by the Andrew W. Mellon Foundation that includes departments of computing, design, library science, and English at several universities in the United States and Canada. The goal of MONK was to develop data-mining and visualization applications that would help scholars leverage their access to large-scale text collections.7 The Data to Knowledge (D2K) data-mining environment we used to generate our decision criteria identified thousands of co-occurring, repetitive patterns in The Making of Americans.8 Establishing these patterns is a function of moving a window over trigrams (a three-word series), one word at a time, until each of the text’s three thousand paragraphs has been analyzed for co-occurring trigrams.9 Figure 5 shows how this analysis works on one sentence, in which the first two trigram sequences (A and B) are shown with the last trigram (C) and the resulting set of trigrams for the whole sentence. Looking at n-grams allows for an element of “fuzzy matching” that is useful when considering repetition with variation because it facilitates searching for like patterns that are not exact duplicates. For example, one result of executing the D2K frequent-pattern-analysis algorithm on trigrams from The Making of Americans is a subset of four trigrams (“a description of,” “now a description,” “this is now,” and “is now a”) that co-occur in three different paragraphs on pages 290 and 291. These four trigrams appear in the following three sentences in those paragraphs: “This is now a description of such feeling,” “This is now a description of my feeling,” “This is now a description of all of them.” Executing the frequent-pattern-analysis algorithm on longer n-grams produces matches of greater length.
An analysis executed on 36-grams produces a subset of co-occurring patterns that enables us to find two multiparagraph sections that present an unusual pattern in the text: these sections share the same 495 words on page 444 (in ch. 4) and on page 480 (in ch. 5). The midsection between the two pages (p. 462) is the exact center of the novel, which means these pages form a bridge over the middle of the book. This startling discovery confirms the idea that the repetition in the text is not completely random. Making the same discovery would have been difficult through close reading, since the text is replete with many shorter repetitions, and impossible through more straightforward string searches without preknowledge of its existence.
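
The sliding-window step can be illustrated with the three sentences quoted above. This is a minimal sketch of n-gram extraction and cross-paragraph matching only; it is not the D2K frequent-pattern-analysis algorithm, and the names `tokenize`, `ngrams`, and `shared_ngrams` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Slide a window of n words over the tokens, one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_ngrams(paragraphs, n=3, min_paragraphs=2):
    """Map each n-gram to the paragraphs containing it, keeping only
    n-grams found in at least min_paragraphs different paragraphs."""
    seen_in = defaultdict(set)
    for index, paragraph in enumerate(paragraphs):
        for gram in set(ngrams(tokenize(paragraph), n)):
            seen_in[gram].add(index)
    return {
        " ".join(gram): sorted(indexes)
        for gram, indexes in seen_in.items()
        if len(indexes) >= min_paragraphs
    }


paragraphs = [
    "This is now a description of such feeling",
    "This is now a description of my feeling",
    "This is now a description of all of them",
]
common = shared_ngrams(paragraphs, n=3, min_paragraphs=3)
print(sorted(common))
```

Raising `n` narrows the results to longer exact repetitions, in the spirit of the 36-gram analysis described above, while a small `n` tolerates variation around a shared core, which is the “fuzzy matching” effect.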

Figure 5. Example of trigrams from D2K analysis of The Making of Americans.

Knowing that this bridge existed led me to further hypothesize that the relation of the patterns in the first part of the text to patterns in the second half of the text was purposeful and significant to the text’s experimentation. A combination of close and distant reading, of subjective and objective reading, would be required to expand my understanding of the text. Executing the algorithm on the text generated thousands of patterns (since each slight variation in a repetition generated a new pattern) and thus a long list of results that was impossible to read. To address this difficulty, we developed the interface FeatureLens, which allowed me to sort the results in different ways and view them in the context of the text. Being able to see where patterns map back to the text became increasingly important as I performed close readings to investigate the relation of the patterns in the two parts by reading at a distance. The intricate and complex patterns I discovered confirmed my hypothesis. From this perspective, in which I could read The Making of Americans differentially, I made the argument that the discourse about identity formation continues to develop through the compositional progress of the text even as the narrative progression within the text unweaves and ceases (Clement, “Digital Regiving”).

Close reading and distant reading in data-mining projects are facilitated most by “technologies of collaboration” (Smith, “Computing” 845). In general, using data mining and visualizations as methodologies for exploring literary texts means using cutting-edge tools that are not available to all scholars. As a result, data-mining projects in particular call for the “technologies of collaboration” or “the work of many hands” (Unsworth, “Creating”). MONK, for example, is built on two previous collaborative research projects, Nora and WordHoard, and depends on SEASR (Software Environment for the Advancement of Scholarly Research) to provide tools such as D2K, which was developed by the Automated Learning Group at the National Center for Supercomputing Applications. Further, the online MONK tool contains a selection of texts spanning from the sixteenth century through the late nineteenth century that collaborators have transcribed and encoded in TEI-Analytics (TEI-A), a TEI (Text Encoding Initiative) markup created for analytics, through Abbot, a tool created by Martin Mueller, Steve Ramsay, and Brian L. Pytlik Zillig. I helped develop the FeatureLens application in collaboration with a team of researchers at the Human-Computer Interaction Lab at the University of Maryland, College Park.10 The data that I used to visualize the patterns of repetition were created in coordination with Mueller, who helped develop the algorithm used to parse repetition in the text. This algorithm was developed in coordination with Craig Berry, who wrote the software to study repetitive patterns in the Iliad and Odyssey as well as those in the poems of Hesiod and the Homeric Hymns.11 In the digital humanities community, most projects are replete with collaborators, and resources are continually shared, reused, and remixed; yet even in such a context, data-mining methodologies stand out as being particularly dependent on collaboration.

The extent to which the development of data-mining projects relies on collaborative practices is evident in a recent report published by the Council on Library and Information Resources (CLIR), One Culture: Computationally Intensive Research in the Humanities and Social Sciences: A Report on the Experiences of First Respondents to the Digging into Data Challenge (Williford and Henry). In this report, the authors summarize, contextualize, and analyze eight projects that were the first recipients of Digging into Data Challenge grants, awarded in 2009 and 2011 and funded by the National Science Foundation, the National Endowment for the Humanities, the Social Sciences and Humanities Research Council in Canada, and the Joint Information Systems Committee in the United Kingdom. The grants supported a diverse range of projects with teams of three to thirty-four people and resources that included images, text, and recordings of letters, trial records, and speech. Each project also included a significant investment in the human labor required to preprocess and clean the texts that would be used for the analyses as well as the labor required for testing and fine-tuning algorithms for new and varied data. Student workers are often unseen collaborators, but this report details their significant contributions. These adaptive and iterative processes and the management of so many people and resources also required significant collaborative work from project management staff.

Technologies of collaboration tend to bring to the fore the significant role subjective practices play in data-mining projects. It quickly became clear to the CLIR investigators, for example, that their initial questions set up a binary between “old” (i.e., human) and “new” (i.e., computational) practices that did not correspond well to the varied experiences of the researchers. Originally, the CLIR investigators had asked:

  1. Why do you as a scholar need a computer to do your work?; and
  2. What kinds of new research can be done when computer algorithms are applied to large data corpora? (9–10)

The CLIR investigators noted that the eight projects “reflect more complex, iterative interactions between human- and machine-mediated methods than are implied by our second question. Rather than being a combination of fixed, clearly defined entities—the researcher’s question, the algorithm, and the corpus—the projects are structures built with continually moving parts” (10). Finally, after some initial work with the project participants, the writers realized that “there was never clear separation between past and present, traditional and digital, or other bounded concepts” and that “[m]any of the researchers interviewed for this study assiduously avoided making such distinctions” (10).

Observations about the data-mining process made by collaborators also support the supposition that it is the combination of data mining and visualization (distant reading) with the ability to read and contextualize one’s results (close reading) that generated many of the most productive studies and that is needed to address the remaining gaps and difficulties. Much as I did in looking at The Making of Americans with VMP 2.2, the team working on Structural Analysis of Large Amounts of Music Information depended on a human-generated “ground truth”—here, student-created metadata—by which the computational analysis was measured. Students worked first at labeling various structural features of recordings for which the team already had metadata, to ensure that the students were accurate; then, the students applied their knowledge to a larger set of recordings, to which machine learning was later applied and tweaked until it could produce results similar to the students’. The metadata represents rigorous close reading practices and analysis, subjective practices that dictate in what way the machine learning must be tweaked (Williford and Henry, “Case Studies”). In another project, researchers analyzing images of quilts for the project Digging into Image Data to Answer Authorship Related Questions identified the need for “new tools to facilitate interdisciplinary collaboration and iteration on results of computational analysis of data sets, particularly in a manner that can incorporate the participation of citizen scholars” as an important next step in their work. The struggle faced by the team on the project Digging into the Enlightenment: Mapping the Republic of Letters seems particularly important. The project was somewhat hobbled by “incomplete data,” including missing dates and names on many of the letters; the team concluded that “[h]umanistic inquiry . . . is freeform, fluid, and exploratory; not easily translatable into a computationally reproducible set of actions.” This observation led the CLIR investigators to conclude that there was a common need across all the projects—“the need to ‘bridge’ a gap between automated computational analysis and interpretive reasoning that must make allowances for doubt, uncertainty, and/or multiple possibilities” is a need that “is characteristic of the Challenge projects.”

Data mining is a computational process that is at once exciting and expensive in terms of time, labor, and processing power. The technologies of collaboration that make this work possible also provide opportunities to analyze the situated practices that make up the data architecture behind that work. The milestones in my own research, for example, would have been inaccessible without the entire extended MONK team and a complicated network of grant-funded and institutional support. Data-mining projects, like all projects, are shaped by issues that are philosophical, but the dependence these projects have on expensive resources means they are also colored by broad brushstrokes of technological, practical, financial, and political factors that, as Liu contends, we must learn to read and interrogate. The work can be exclusionary, since individual academics often do not have the resources needed to develop scholarly projects that incorporate digital methodologies and since these limited resources—not the evolving philosophies about the value of digital analysis versus human analysis—remain the largest obstacles to the wider adoption of data mining as a means for producing scholarship in literary studies.12

Valuing the Plausible in Digital Humanities Inquiries

If we consider the digital environment a “provocation to alternatives” (Poirier 113), one in which the reader is provoked to understand every reading as one of plausible alternatives, we begin to see that differential reading in the context of digital methodologies is not new. As with more traditional methodologies, these readings are enacted in the context of the technologies of access, self-reflection, and collaboration that make reading a situated act. My research into works by Gertrude Stein has been greatly facilitated by technologies that enable access to documents in certain forms, to the computing technologies needed to analyze these objects, to a community of scholars and computing scientists with whom I was able to collaborate, and to sources of funding geared toward work in digital methodologies. My self-conscious reflections on these projects were encouraged by the work of many colleagues. Consequently, I see the analysis engaged in these projects in the context of the double discipline of objective and subjective perspectives that refract within one differential prism. My ability to use computational and therefore empirical tools is based on my situated context in which I am afforded the opportunity to experiment and implement. As a result, I am using practices that are not inherently empirical but situated—practices that account for the plausible instead of the truth in literary research.

Mary Hawkesworth describes this notion of plausibility in science as the result of abandoning Wilfrid Sellars’s “myth of the given”:

Once the “myth of the given” has been abandoned and once the belief that the absence of one invariant empirical test for the truth of a theory implies the absence of all criteria for evaluative judgment has been repudiated, then it is possible to recognize the rational grounds for assessing the merits of alternative theoretical interpretations . . . the stimuli that trigger interpretation limit the class of plausible characterizations without dictating one absolute description. (48–49; emphasis added)

One can consider Hawkesworth’s notion of value in the context of digital methodologies as a measurement of the extent to which these new methodologies lead one to question one’s interpretations.13 In other words, digital methodologies offer not an “invariant empirical test for the truth of a theory” but only plausibly sound interpretations. We have seen that values mediate every aspect of using digital tools, from selecting objects of study and methods of analysis to formulating and validating the knowledge produced. Digital projects are founded on so many small steps of human interaction and input that they inevitably reflect—as do all modes of literary inquiry—gendered, racial, economic, social, historicized, and politicized contexts. When we use simpler approaches—when we use digital methodologies to count large sets of data or to magnify small sets of data—however, it is easier for us to describe how we use this new information to make new readings of our cultural artifacts. The more complex projects require that we articulate the more complex interplay between subjective and objective practices, but literary inquiry prepares us to read these complexities as an approach of alternatives and as plausible interpretations instead of truth.

The notion of plausibility with computational practices will become more productive as we learn to consider the technologies and the layering of representations of representations that make up the digital methodologies we use to look at literary texts. Consider, for example, this description of typical data-mining procedures for developers who generally “have only a superficial understanding” of what is usually numerical data:

They accept what they are given by the domain experts and do not have a deep understanding of the measurements or their relationship with each other. Results are analyzed primarily by empirical analysis. When something goes awry, we may have difficulty in attributing this to problems with the collection process or the specification of the features. (Weiss, Indurkhya, Zhang, and Damerau 51)

The authors maintain, however, that text-mining procedures are easier for developers than more quantitative data mining. “For text mining,” they write, “we are much closer to understanding the data, and we all have some expertise. The document is text. We can read and comprehend it, and we analyze a result by going directly to the documents of interest” (51–52; emphasis added). But what if the text is The Making of Americans or Finnegans Wake or any text that emphasizes the multiple, meaning-making properties of a literary text? How often is there a common, agreed-upon meaning of the words in a literary text? What if the language is code? Sowa notes how programming languages incorporate the “vagueness, uncertainty, randomness, and ignorance” that constitute any human language at work (352). We must consider also the well-documented limitations surrounding practices of text encoding and textual knowledge representation. Julia Flanders views subjectivity as one aspect of humanist inquiry that standardized encoding practices tend to disregard because of precedents set by institutional-level projects such as those produced by libraries and the commercial industry. She offers an alternative vision that includes “understand[ing] XML as a way of expressing perspectival understandings of the text: not as a way of capturing what is timeless and essential,” but rather as a way of reflecting “shifting vantage points from which the text appears to us, the shifting relationships that constrain our understanding of it, the adaptability and strategic positioning of our own readerly motivations” (ch. 2, para. 60). Incorporating a sense of the subjectivity of seemingly objective practices like data mining corresponds to imagining alternative uses for any structured environment that incorporates systems of representation. Literary study and differential reading practices that consistently present us with the same messiness and doubt that make the complexities of concepts like race, gender, class, and culture most immediately relevant also remind us that the products of our technologies, our devices, always fall short of our perceived notions of the real or the authentic.

Ultimately, the rule of plausibility dictates that differently situated eyes panning multiple directions (or realities) not only are more powerful than a small magnifying glass but also serve different purposes and research agendas. This many-eyed perspective might be like “eye vision,” which involves shooting a dynamic event, such as a soccer game, from multiple cameras placed at different angles. A computer combines the video streams from these cameras, and the resulting images duplicate a multidimensional viewpoint. That we are aware it is a virtual reality keeps us mindful of the processes we use to produce it, but the experience of this encompassing vantage point allows for a feeling of justice or authenticity that is based on plausible complexities, not simplified and immutable truths. While computers cannot necessarily do what humanists also cannot do—such as solve literary conundrums—computational practices do allow scholars to experiment with texts in ways that were formerly prohibitive in print culture. Sometimes the view facilitated by digital tools generates the same data human beings (or humanists) could generate by hand, but more quickly—an important advantage when so many literary texts go unread and, essentially, undervalued. At other times, these vantage points are remarkably different from those afforded within print culture and provide us with a new perspective on texts that continue to compel and surprise us by being so provocative and complex—so human.


1. On the history of difficulty in literary study, see Diepeveen; Poirier.

2. Perloff writes, for example, “if theories of poetry-as-rhetoric regard James Joyce and Ezra Pound as key modernists, the theory of poetry-as-philosophy would (and has) put Samuel Beckett or Paul Celan at that center” (7).

3. Some examples of Burrows’s techniques include principal component analysis, used to reduce a data set to its most useful dimensions of variance, and probability distribution tests such as Student’s t-test and the Mann-Whitney test.

4. This is another opportunity to note the subjective nature of how texts can be chosen for comparison. The texts considered for this study were limited by availability primarily because many of the texts published at the same time as The Making of Americans are under copyright and not freely available as full texts. All the texts seen in figure 1 were freely available from Project Gutenberg and were chosen because they are seminal texts of varied lengths from the nineteenth and early twentieth centuries.

5. Youmans found a strong correlation between paragraph boundaries and valleys in VMP 1 visualizations that were constructed with thirty-five-word moving intervals (“New Tool”). In contrast, Youmans used a fifty-five-word interval to investigate the correlation between VMP 1 charts and the boundaries between numbered sections in two short stories by William Faulkner (“Vocabulary Management Profile”).

6. This ratio is based on the type ratio divided by the token ratio for each word, where type equals the single occurrence of each distinct word (or form of lexeme) and tokens are the total number of words. The formula for computing a visualization in VMP 2.2 counts new vocabulary as 1.0 and repeated words as a ratio greater than 0.0 based on how recently the word occurred in the text. “Recently” is determined by “(Number of Current Word minus Number of Previous Occurrence minus 1)/(Total Tokens in the Text minus 1)” (Youmans, “How to Generate VMP 2.2s”). The creators maintain that this procedure mimics a “second reading” of a text because the first thirty-five-word window begins with the seventeenth word from the end of a text and wraps through to the eighteenth word of the beginning of the text. The next ratio is computed for the sixteenth word from the end of the text through to the nineteenth word of the beginning of the text and so on throughout the text, thereby creating a moving window that generates an average of ratios for each word.

7. These collections include the Early American Fiction Collection, Documenting the American South, Nineteenth-Century Fiction, and Wright American Fiction.

8. D2K was developed by the Automated Learning Group at the National Center for Supercomputing Applications.

9. N-grams are sequences of n-length items and are usually used as a basis for analysis in natural language processing and genetic sequence analysis. For the algorithm used in the MONK project, see Pei, Han, and Mao.

10. The list of participants and more information about the project are available at the MONK Workbench Home Test.

11. At the time, Berry was working with Northwestern University’s Academic Technologies, a unit within the Northwestern University Library. They ultimately created the Chicago Homer, which produces the same kind of data from the repetitive patterns in the Iliad and Odyssey as well as in the poems of Hesiod and the Homeric Hymns.

12. Other exclusions are also possible: the CLIR report notes that none of the principal investigators in the first round were women, whereas in the second round, funded in December 2011, “nine of the fourteen funded projects have a woman as a principal investigator” (Williford and Henry, “Case Studies” [introd., n1]).

13. In championing computational modeling, McCarty admits to the historicized scholar’s situatedness; in “raising the level of complexity in the questions we can ask of our sources [we do] better justice to them . . . justice is justice as we now conceive of it, not by an ahistorical, absolute measure” (128).
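The formula quoted in note 6 can be sketched in Python as follows. This is an illustrative reading of the footnote, not Youmans's VMP 2.2 software. The function names `recency_scores` and `moving_average` are hypothetical, the input is assumed to be a list of word tokens, and each word's first appearance is assumed to count as new vocabulary. The wrapped window follows the footnote's description: for the first word, a thirty-five-word window covers the last seventeen words of the text and the first eighteen.

```python
from typing import Dict, List


def recency_scores(tokens: List[str]) -> List[float]:
    """Score each token: 1.0 for new vocabulary, otherwise the quoted ratio
    (current position - previous occurrence - 1) / (total tokens - 1)."""
    total = len(tokens)
    last_seen: Dict[str, int] = {}
    scores: List[float] = []
    for position, word in enumerate(tokens, start=1):
        if word not in last_seen:
            scores.append(1.0)  # assumption: a first occurrence is "new vocabulary"
        else:
            gap = position - last_seen[word] - 1
            scores.append(gap / (total - 1))
        last_seen[word] = position
    return scores


def moving_average(scores: List[float], window: int = 35) -> List[float]:
    """Average the scores in a window centered on each token, wrapping
    around the ends of the text as the footnote describes."""
    n = len(scores)
    half = window // 2
    averaged: List[float] = []
    for i in range(n):
        # Modular indexing wraps from the end of the text back to the beginning.
        values = [scores[(i + offset) % n] for offset in range(-half, half + 1)]
        averaged.append(sum(values) / len(values))
    return averaged
```

Plotting the output of `moving_average` against word position would give a curve of the general kind the essay describes, with valleys where vocabulary repeats heavily; any resemblance to actual VMP 2.2 charts depends on the assumptions above.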

Works Cited

Burrows, John F. Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon, 1987. Print.

Clement, Tanya. “A Digital Regiving: Editing the Sweetest Messages in the Dickinson Electronic Archives.” A Companion to Emily Dickinson. Ed. Martha Nell Smith and Mary Loeffelholz. Oxford: Blackwell, 2008. 415–36. Print. Blackwell Companions to Lit. and Culture.

———. “‘A Thing Not Beginning or Ending’: Using Digital Tools to Distant-Read Gertrude Stein’s The Making of Americans.” Literary and Linguistic Computing 23.3 (2008): 361–82. Print.

Derrida, Jacques. Archive Fever: A Freudian Impression. Trans. Eric Prenowitz. Chicago: U of Chicago P, 1998. Print.

Diepeveen, Leonard. The Difficulties of Modernism. New York: Routledge, 2003. Print.

Fish, Stanley. “Theory’s Hope.” Critical Inquiry 30.2 (2004): 374–78. Web. 19 Sept. 2012.

Flanders, Julia H. “Digital Humanities and the Politics of Scholarly Work.” Diss. Brown U, 2005. Brown University. Web. 5 Jan. 2009.

Gass, William. Foreword. Stein, Making vii–xii.

Haraway, Donna. “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective.” Feminist Studies 14.3 (1988): 575–99. Print.

Harpham, Geoffrey Galt. “Beneath and beyond the ‘Crisis in the Humanities.’” New Literary History: A Journal of Theory and Interpretation 36.1 (2005): 21–36. Print.

Hawkesworth, Mary E. Feminist Inquiry: From Political Conviction to Methodological Innovation. New Brunswick: Rutgers UP, 2006. Print.

Hayles, N. Katherine. How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics. Chicago: U of Chicago P, 1999. Print.

Katz, Leon. The First Making of The Making of Americans. New York: Columbia UP, 1963. Print.

Kirschenbaum, Matthew G. Mechanisms: New Media and the Forensic Imagination. Cambridge: MIT P, 2008. Print.

Liu, Alan. “The State of the Digital Humanities: A Report and a Critique.” Arts and Humanities in Higher Education 11.1-2 (2012): 8–41. Web. 31 July 2012.

McCarty, Willard. Humanities Computing. New York: Palgrave, 2005. Print.

McGann, Jerome J. Radiant Textuality: Literature after the World Wide Web. New York: Palgrave, 2001. Print.

———. “Responses to Ed Folsom’s ‘Database as Genre: The Epic Transformation of Archives?’” PMLA 122.5 (2007): 1580–612. Web. 24 Oct. 2011.

McKenna, Wayne, and Alexis Antonia. “‘A Few Simple Words’ of Interior Monologue in Ulysses: Reconfiguring the Evidence.” Literary and Linguistic Computing 11.2 (1996): 55–66. Print.

Moretti, Franco. “Conjectures on World Literature.” New Left Review 1 (2000): 54–68. Print.

Pei, Jian, Jiawei Han, and Runying Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.” Proceedings of the DMKD 2000 Conference. New York: ACM-SIGMOD, 2000. Print.

Perloff, Marjorie. Differentials: Poetry, Poetics, Pedagogy. Tuscaloosa: U of Alabama P, 2004. Print.

Poirier, Richard. “The Difficulties of Modernism and the Modernism of Difficulty.” Critical Essays on American Modernism. Ed. Michael Hoffman and Patrick D. Murphy. New York: Hall, 1992. 104–14. Print.

Ramsay, Stephen. “In Praise of Pattern.” TEXT Technology 14.2 (2005): 177–90. Print.

———. “Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18.2 (2003): 167–74. Print.

Rockwell, Geoffrey. “What Is Text Analysis, Really?” Literary and Linguistic Computing 18.2 (2003): 209–19. Print.

Sculley, D., and Bradley M. Pasanek. “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing 23.4 (2008): 409–24. Print.

Shneiderman, Ben. “User Controlled Specifications of Interesting Patterns.” University of Maryland. Dept. of Computer Science, U of Maryland, 25 Nov. 2001. Web. 9 Oct. 2007.

Sinclair, Stéfan. “Computer-Assisted Reading: Reconceiving Text Analysis.” Literary and Linguistic Computing 18.2 (2003): 175–84. Print.

Smith, Martha Nell. “Computing: What Has American Literary Study to Do with IT?” American Literature 74.4 (2002): 833–57. Print.

———. “The Importance of a Hypermedia Archive of Dickinson’s Creative Work.” Emily Dickinson Journal 4.1 (1995): n. pag. Web. 1 Oct. 2009.

Sowa, John F. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove: Brooks, 2000. Print.

Stein, Gertrude. The Making of Americans: Being a History of a Family’s Progress. 1925. Normal: Dalkey Archive, 1995. Print.

Unsworth, John M. “Creating Digital Resources: The Work of Many Hands.” Digital Resources for the Humanities. St. Anne’s Coll., Oxford U. 14 Sept. 1997. Address.

———. “Documenting the Reinvention of Text: The Importance of Failure.” Journal of Electronic Publishing 3.2 (1997): n. pag. Web. 17 Sept. 2009.

Van Dyke, Carolynn. “‘Bits of Information and Tender Feeling’: Gertrude Stein and Computer-Generated Prose.” Texas Studies in Literature and Language 35.2 (1993): 168–97. Print.

Voss, Paul J., and Marta L. Werner. “‘Who’s In, Who’s Out’: The Cultural Poetics of Archival Exclusion.” Toward a Poetics of the Archive. Ed. Voss and Werner. Spec. issue of Studies in the Literary Imagination 32.1 (1999): i–viii. Print.

Vuillemot, R., Tanya Clement, Catherine Plaisant, and Amit Kumar. “What’s Being Said Near ‘Martha’? Exploring Name Entities in Literary Text Collections.” Proceedings of IEEE Symposium on Visual Analytics Science and Technology (VAST). Washington: IEEE Computer Soc., 2009. 107–14. Print.

Wald, Priscilla. Constituting Americans: Cultural Anxiety and Narrative Form. Durham: Duke UP, 1995. Print.

Walker, Jayne L. The Making of a Modernist: Gertrude Stein from Three Lives to Tender Buttons. Amherst: U of Massachusetts P, 1984. Print.

Weiss, Sholom M., Nitin Indurkhya, Tong Zhang, and Fred Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer, 2005. Print.

Williford, Christa, and Charles Henry. “Case Studies.” One Culture: Computationally Intensive Research in the Humanities and Social Sciences: A Report on the Experiences of First Respondents to the Digging into Data Challenge. Online supp. to pub. 151. Council on Lib. and Information Resources, June 2012. Web. 31 July 2012.

———. One Culture: Computationally Intensive Research in the Humanities and Social Sciences: A Report on the Experiences of First Respondents to the Digging into Data Challenge. Pub. 151. Council on Lib. and Information Resources, June 2012. Web. 31 July 2012.

Youmans, Gilbert. “How to Generate VMP 2.2s.” Vocabulary Management Profiles. U of Missouri, n.d. Web. 19 Sept. 2012.

———. “A New Tool for Discourse Analysis: The Vocabulary-Management Profile.” Language 67.4 (1991): 763–89. Print.

———. “The Vocabulary Management Profile: Two Stories by William Faulkner.” Empirical Studies of the Arts 12.2 (1994): 113–30. Print.

