David L. Hoover
Computer-assisted textual analysis has a long, rich history, despite the fact that, as has often been noted, it has not been widely adopted in contemporary literary studies. Instead of debating the causes for this neglect, I will concentrate here on computational methods that can be of use in many different kinds of literary research (for two contrasting views, see Ramsay; Hoover, “End”). I would argue that almost any literary study can benefit from at least some modest and basic kinds of computer assistance. For example, it would seem perverse not to use an available digital text of a work to search for a vaguely remembered passage that is important for an argument or to locate every significant example of a word or phrase, and studying a concordance remains an effective method for understanding a text. In these cases, the computer is valuable despite the fact that one could perform the activities without it. When the collection of texts is larger or the items to be investigated occur more frequently, however, it becomes impossible to perform the work without a computer (imagine studying personal pronouns in one hundred Victorian novels). Many kinds of evidence produced by statistical methods are simply not accessible without a computer. I will argue, with John Burrows, that “computer-assisted textual analysis can be of value in many different sorts of literary inquiry, helping to resolve some questions, to carry others forward, and to open entirely new ones” (“Textual Analysis”).
Producing electronic texts and locating and accessing data within them are simple but vital functions the computer can perform, but the computer’s greatest strengths are in storing, counting, comparing, sorting, and performing statistical analysis. This makes computer-assisted textual analysis especially appropriate and effective for investigating textual differences and similarities, either in an exploratory way (examining the novels of an author or a group of authors for unexpected similarities or differences) or in a more directed investigation (studying the shared vocabulary of gothic novels). Some of the many kinds of investigations and questions that can be approached through textual analysis are the following:
- Testing a hunch, hypothesis, or thesis about an author, text, passage, genre, or period. Was Shakespeare’s vocabulary really unusually large? (Apparently not, according to Elliott and Valenza). Did Milton’s use of visual imagery change when he became blind? Do authors’ styles change when they dictate rather than write their texts by hand? What textual characteristics, if any, define the gothic novel?
- Testing the claims of an unsatisfying critical work or supporting and building on a compelling critical work. For example, Wallace Stevens’s “The Snow Man” is not really a “noun-heavy” poem, as Jerome McGann and Lisa Samuels claim (Hoover, “Hot-Air Textuality” 81–87).
- Investigating how and the extent to which authors differentiate the voices of characters or narrators in a novel or play or correspondents in an epistolary novel (Burrows, Computation; Rybicki; Stewart; McKenna and Antonia; Hoover, “Evidence”).
- Investigating how perceived radical shifts in style in a text are accomplished (Hoover, “Multivariate Analysis”).
- Studying how an author’s style changes over time or how genres develop and decay (Hoover, “Corpus Stylistics”; Stamou; Martindale; Pennebaker and Stone; Garrard, Maloney, Hodges, and Patterson; Craig, “Jonsonian Chronology”; Burrows, “Computers”; Moretti).
- Investigating the history of an important word, concept, or group of words or concepts over a long time span (Algee-Hewitt).
- Studying the effects of genre conventions on the language of texts or characterizing or refining how genres are defined (Biber).
- Answering questions like the following: “Do playwrights from different classes, with different education, or brought up in different places write differently? . . . [W]hich playwrights are the most diverse stylistically across their various works? Which show the widest variation across the characters?” (Craig and Kinney 14).
- Exploring the characteristic vocabulary of an author, genre, period, group of texts, a single text, or a part of a text or the similarities and differences between the vocabularies of various writers, genres, periods, or groups of texts (Burrows, “Textual Analysis”; Hoover, “Quantitative Analysis,” “Corpus Stylistics,” “Searching”).
- Studying the extent to which and how gender, sexual orientation, race, nationality, and age of authors are reflected in the language of their texts (Koppel, Argamon, and Shimoni).
- Assessing how similar imitations, pastiches, completions, continuations, prequels, and sequels of texts written by other authors are to the original texts (Hoover, “Authorial Style”; Sigelman and Jacoby; Burrows, “Who Wrote Shamela?,” “Englishing”; Rybicki).
- Investigating questions of authorship attribution (Holmes, Robertson, and Paez; Craig and Kinney; Hoover and Hess; Forsyth, Holmes, and Tse; Juola, “Authorship”; Grieve).
- Exploring thematic language in texts (Fortier).
Before discussing a few kinds of analysis more fully, I must address the problems of planning a project and collecting digital texts. “Planning” may take a very loose form at first for an exploratory project based on a hunch, yet even a hunch has implications for what texts and what methods will be appropriate, and more explicit planning will eventually become necessary to avoid the wasted effort of a poorly conceived study. Conversely, any project may need to take new directions in response to the availability of texts and computational tools, and even well-defined and carefully planned projects often uncover promising and unexpected avenues of exploration and sometimes fail to produce significant results. Thus flexibility is an important virtue in computer-assisted textual analysis, and testing a project on a subset of texts or methods can avoid wasted effort.
Any investigation must begin with a preliminary list of texts and some idea of method, and one good way of preparing is to study previous computational work addressing similar questions or methods.[1] Those new to computational approaches may also benefit from one of the increasingly common university courses on digital humanities, humanities computing, or text analysis or from shorter, specialized workshops, such as those offered at the Digital Humanities Summer Institute.
Once a preliminary set of texts to be investigated has been identified, the first question is whether those texts are available in digital form.[2] The temptation to begin with a Web search should be resisted. Though many electronic texts can be found through such a search, thousands more cannot. The most critical factor determining where to look for an electronic text and whether it is likely to be found at all is its copyright date. In the United States, the crucial date is 1923: texts published earlier are very likely out of copyright. (The term of copyright is different in other countries; in the European Union, for example, copyright generally extends seventy years from the death of the author.)
For texts likely to be out of copyright, The Online Books Page (J. Mark Ockerbloom) and Alex Catalogue of Electronic Texts (Morgan) are especially valuable. Both list texts available at many sites, including most of those available at Project Gutenberg (the oldest electronic text collection), though it is worthwhile searching Gutenberg itself as well, since texts are added continually. Users of Google Books can limit searches to books for which the full text is available or search by author or title. The University of Oxford Text Archive contains many high-quality texts, some still in copyright but available by permission, and The Internet Archive contains a huge number of electronic texts of extremely variable quality, most of which cannot be found by a Web search.
Many university libraries have their own digital collections, and even more subscribe to services like Early English Books Online, Literature Online, Eighteenth Century Collections Online, Orlando: Women’s Writing in the British Isles from the Beginnings to the Present, and many others, most of which are accessible only through a library search. Many also have librarians specializing in digital resources, and, because some electronic resources are not widely known outside their specific subject area, subject librarians are another valuable resource, as are the subject-specific pages of resources on library Web sites.
If these searches fail, a general Web search may locate specialized collections hosting texts or versions of texts not available elsewhere, such as academic sites devoted to a historical period, like The Victorian Web (Landow); to a geographic area, like Documenting the American South; to special interests, like A Celebration of Women Writers (M. Mark Ockerbloom) or The Brown University Women Writers Project; or to extraordinary individual efforts, like The Wilkie Collins Pages (Lewis) and the Henry James site The Ladder (Dover).
Unfortunately, finding the electronic texts is only the first step: they vary so much in nature and quality that it is worthwhile to compare the available versions before selecting one. Consider Henry James’s The Awkward Age (1899), available from Project Gutenberg, The Ladder, Google Books, and The Internet Archive, among other sites. The Gutenberg text, as is usually true, contains no information about its print source, although a comparison shows that it is the revised New York Edition of 1908.[3] Dover’s text matches the first British edition. A Google Books advanced search (with “awkward age” as title and “james” as author) finds one copy of the 1899 British edition and two copies of the New York Edition (one from 1908, one from 1922), but The Internet Archive has links to five other versions available at books.google.com: three of the New York Edition (two from 1908 and one printed later) and two of the first American edition, one much better than the other. The Internet Archive also has four independent versions. No two of these electronic texts are identical, and the best edition to select will depend on what kind of analysis will be performed and what texts, if any, will be compared with this novel.
If the original version is the most appropriate, the first American edition is probably the best choice. The Internet Archive version and the Google Books version have competing strengths and drawbacks. The optical character recognition (OCR) used to digitize the Internet Archive version seems slightly more accurate, and the entire text can be downloaded at once, but it has hundreds of line-end hyphens. The hyphenation has been corrected in the Google Books version, but it seems to have more errors and can only be copied and pasted into a document a few pages at a time. If British spellings are appropriate, Dover’s excellent first British edition at The Ladder is the obvious choice. If the New York Edition is more appropriate, the Project Gutenberg text, with far fewer errors than any of the others, is the obvious choice, unless James’s spaced contractions (e.g., “could n’t,” “they ’re,” “I ’m”) are of interest, in which case, the Google Books version taken from a later printing seems the most accurate.
For texts still in copyright, the search should probably begin at the library, rather than on the Web, since many authors or their estates vigorously protect their copyrights and frequently force Web sites to remove works, especially novels.[4] Literature Online, an expensive but widely held resource, contains a huge number of texts in English from AD 600 to the present. Many of these are still in copyright, including scholarly editions (but not the most recent ones) of works that are out of copyright, similar editions of the collected or complete poems of modern and contemporary poets, collections of modern drama, and national and regional literature (modern fiction, much of it still in print, is not well represented, though there is a large selection of novels in English by African writers). As noted above, The University of Oxford Text Archive also has some texts still in copyright, including novels.
For texts not available in digital form, an electronic text can be created by scanning and OCR. Unfortunately, it is not entirely clear that this is legal for texts in copyright. Although I am not a lawyer and cannot give any legal advice on this subject, creating an electronic text of a work in copyright seems defensible under the “fair use” exception of United States copyright laws (17 USC, sec. 107), provided that the electronic text is not sold or distributed in any way. This view is supported by the fact that both Literature Online and The University of Oxford Text Archive allow authorized users to download copyrighted materials with restrictions on their use and by the exceptions to the prohibition on copying and disseminating copies of such materials for libraries and archives (17 USC, sec. 108). One respected and detailed source for information on copyright is Stanford University Libraries’ Copyright and Fair Use.
There is too great a variety of hardware and software available for scanning and OCR to permit a detailed discussion here. Many university libraries, IT departments, and computer labs have expertise and equipment and may be able to help. The process is not difficult, however, and even inexpensive scanners (under $100) typically come bundled with an OCR program, so that no one who wants to produce electronic texts from printed texts should feel intimidated.[5] Yet even the most accurate OCR produces errors. Many programs boast impressive accuracy rates of 98% or higher, but these must be taken with a grain of salt, and the accuracy is reduced by complex formats, multiple fonts, yellowed paper, stains, underlining, and marginal comments. Even at 98% accuracy, scanning and performing OCR on a typical novel produces several thousand errors. Many can be found using a spell-checker or grammar checker, but the effort is certainly not trivial, and the entire process of scanning, checking, and proofreading a novel can easily occupy many hours of tedious labor. Clearly, scanning and OCR should normally be reserved for small numbers of texts that will be used extensively or for texts out of copyright that will be made available online.
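The scale of the OCR problem is easy to check with a little arithmetic. The sketch below assumes a novel of 100,000 words and treats the advertised accuracy as a per-word rate; both are assumptions made for illustration, since neither the length nor the unit of measurement is specified above. The function name is hypothetical.

```python
def expected_ocr_errors(word_count: int, accuracy: float) -> int:
    """Estimate the number of misrecognized words at a given per-word accuracy."""
    return round(word_count * (1.0 - accuracy))

# Assumed length of a "typical" novel; not a figure taken from the essay.
NOVEL_WORDS = 100_000

for accuracy in (0.98, 0.99, 0.995):
    errors = expected_ocr_errors(NOVEL_WORDS, accuracy)
    print(f"accuracy {accuracy:.1%}: about {errors} errors to find and correct")
```

If accuracy is instead measured per character, the count of errors for the same novel would be several times higher, since a novel contains several times as many characters as words.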
However digital texts are acquired, they almost invariably require some editing. (A copy of the original, unedited text should be kept for reference purposes and for extracting passages to be quoted.) Normally textual analysis is performed only on text actually written by the author, so that introductions, prefaces, footnotes, tables of contents, title pages, indexes, appendixes, quotations, running heads, epigraphs, part and chapter numbers and titles, like “Chapter II,” and any other material not by the author should be removed. Occasionally, even some of the author’s own words, such as prefaces, explanatory footnotes, or poems, are so different in genre that they should be removed. Any header and licensing information (as in Project Gutenberg texts) and other similar markup should be removed.
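Some of this cleanup can be partly automated. The following is a minimal sketch, not a complete solution: the marker wording for Project Gutenberg headers and footers is an assumption (the files have used several variants), the function name is hypothetical, and only bare chapter headings such as “Chapter II” are removed. Prefaces, epigraphs, and quotations still need to be removed by hand.

```python
import re

# Assumed marker wording; check the actual file, since the wording has varied.
START = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)
END = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)

# A line consisting only of "Chapter" plus a roman or arabic numeral.
CHAPTER = re.compile(r"^[ \t]*chapter[ \t]+[ivxlcdm\d]+\.?[ \t]*$", re.I | re.M)

def strip_front_and_back_matter(raw: str) -> str:
    """Return the text between the header and footer markers, minus chapter headings."""
    start = START.search(raw)
    end = END.search(raw)
    begin = start.end() if start else 0       # keep everything if no header is found
    finish = end.start() if end else len(raw)
    body = CHAPTER.sub("", raw[begin:finish])
    return body.strip()

sample = (
    "Title page and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE AWKWARD AGE ***\n"
    "Chapter II\n"
    "She came in.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE AWKWARD AGE ***\n"
    "More licence text"
)
print(strip_front_and_back_matter(sample))
```

The function returns a new string and leaves its input untouched, which fits the advice to keep a copy of the original, unedited text.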
Some typographic elements may need to be addressed. For example, to prevent dashes from being treated as hyphens by some text-analysis software, spaces may need to be added before and after them. In most electronic texts, apostrophes and opening and closing single quotation marks are identical; this is especially problematic for dialect forms, scare quotes, quotation within quotation, and dialogue marked with single quotation marks. It may be necessary to examine every apostrophe and single quotation mark and perhaps delete each one that is not an apostrophe or replace it with a double quotation mark or acute accent. Literature written before about 1800 presents additional problems, such as variant spellings, frequent and variable editorial intervention, and a high proportion of anonymous texts and texts of doubtful authorship.
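The dash problem, at least, is mechanical. As a minimal sketch, the hypothetical function below puts one space on each side of every dash character and every run of two or more hyphens, leaving single hyphens alone. Treating a double hyphen as a dash is an assumption about how the text was keyed. Apostrophes and single quotation marks cannot be handled this simply and still require inspection.

```python
import re

def space_dashes(text: str) -> str:
    """Surround dashes with single spaces so they are not read as hyphens."""
    # \u2014 is the dash character; "--" (or longer) is its usual plain-text stand-in.
    return re.sub(r"\s*(\u2014|--+)\s*", r" \1 ", text)

print(space_dashes("a well-known house--or so it seemed"))
```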
The more detailed the analysis, the more important these editing processes are. They may not be feasible for large collections of texts and may not be necessary for analyses in which precise word frequencies are not at issue. Fortunately, most methods of textual analysis will not be severely affected unless there are a great many errors. It may be wise, therefore, to clear up just the most important problems and then perform some preliminary analysis to test whether the analysis seems likely to be effective before spending a great deal of time cleaning up the texts.
Given the wide variety of literary studies for which textual analysis is appropriate and the many methods that exist for pursuing literary studies, it would be impossible to discuss even a small sample of them in detail here. Rather, I will discuss processes that are common to a large number of methods, suggest some resources for learning about methods, and then discuss a few methods in more detail.
For most literary study, the smallest unit of analysis is the word. There is some evidence that letter sequences and information about parts of speech sometimes work better than words for authorship attribution (Clement and Sharp), but words have the advantage of being meaningful in themselves and in their significance to larger issues like theme, characterization, plot, gender, race, and ideology.[6] Many tools for generating word-frequency lists and concordances exist, including good free programs that can be downloaded: AntConc (Anthony), KWIC Concordance for Windows (Tsukamoto), and Conc. Online collections of tools often allow the user to upload texts to Web-based tools that do not require installation; these seem especially valuable for exploratory work (TAPoR). Finally, there are inexpensive, commercially available programs like WordSmith Tools (Scott), MonoConc Pro (Barlow), and Concordance (Watt), which tend to be more powerful, versatile, and comprehensive than the free programs.[7] These programs and others have various strengths and weaknesses, but they all typically produce word-frequency lists in alphabetic or descending frequency order and concordances that list designated words along with a substantial amount of context. Word lists and concordances are very useful exploratory tools, and concordances can also be used to test hunches about how specific words are used in a text. Many of these programs can also statistically compare word frequencies among several texts and can show which texts have unusual frequencies of words of interest. Many can also generate lists of collocations (words that occur repeatedly near each other); an examination of collocations can be especially useful for thematic studies. Note that many text-analysis programs can only process plain text files; if a word-processing program is used for editing and cleanup of the text, it will probably be necessary to save the file as plain text.
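To make concrete what these programs produce, here is a minimal sketch of a word-frequency list and a keyword-in-context concordance. It is not how any of the programs named above is implemented. The tokenizing rule (lowercase letters with internal apostrophes) and all function names are assumptions chosen for illustration, and real tools offer many more options.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract words, keeping internal apostrophes."""
    normalized = text.lower().replace("\u2019", "'")  # curly apostrophe to straight
    return re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)

def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs in descending frequency order, ties alphabetical."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def concordance(text: str, keyword: str, width: int = 4) -> list[str]:
    """Return one line per occurrence of keyword, with `width` words of context per side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

sample = "The letter lay on the table. She read the letter twice."
print(frequency_list(sample)[:3])
print(concordance(sample, "letter", width=2))
```

Because the concordance is built from the token list, punctuation and capitalization are lost in the context lines; a tool meant for quotation would instead record character positions in the original text.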
Most textual analysis begins with word-frequency lists and compares the frequencies of words across two or more texts. This comparison requires a parallel word-frequency list, consisting of the words listed in descending frequency order for the entire group of texts and the relative frequency of each word in each text, including zero frequencies for texts in which the word does not occur. Unfortunately, not many simple, easy-to-use programs for producing this kind of list are available. I know of only three that can handle large numbers of texts: WordSmith Tools (Scott), The Intelligent Archive, and my own The Parallel Wordlist Spreadsheet.[8]
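The structure of a parallel word-frequency list can be sketched in a few lines. This is an illustration of the data structure described above, not the method used by any of the three programs named. The tokenizer, the function names, and the choice to express relative frequency as a percentage of each text's tokens are assumptions.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower().replace("\u2019", "'"))

def parallel_wordlist(texts: dict[str, str], top_n: int = 100):
    """Return rows of (word, {title: percent of that text's tokens}).

    Words are ranked by their frequency in the whole set of texts.
    Every text is assumed to contain at least one word.
    """
    counts = {title: Counter(tokenize(body)) for title, body in texts.items()}
    totals = {title: sum(c.values()) for title, c in counts.items()}
    combined = Counter()
    for c in counts.values():
        combined.update(c)
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    rows = []
    for word, _ in ranked:
        # A Counter returns 0 for a missing word, so zero frequencies appear explicitly.
        row = {title: 100.0 * counts[title][word] / totals[title] for title in texts}
        rows.append((word, row))
    return rows

texts = {"A": "the cat and the dog", "B": "the bird sang"}
for word, row in parallel_wordlist(texts, top_n=3):
    print(word, {title: round(value, 1) for title, value in row.items()})
```

Relative rather than raw frequencies are used so that texts of different lengths can be compared directly.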
The parallel word lists are processed with general-purpose statistical programs, a variety of which are frequently installed in computer labs, and many IT departments offer instructional sessions on using these programs. I normally use Minitab, which is relatively easy to learn, has a good graphing function and an excellent help function, and is inexpensive enough that most users can afford to purchase a copy. The most frequently used statistical techniques are principal components analysis (PCA) and cluster analysis, but discriminant analysis and other techniques have also been used. Many of the essays cited here give some information on how to perform statistical analysis of word lists, and there are detailed instructions for doing PCA and cluster analysis in Minitab on my site, The Excel Text-Analysis Pages. PCA Online allows the user to experiment with PCA on Shakespeare’s plays without learning a statistical program.
As a demonstration of how PCA and cluster analysis work, consider figure 1, a cluster analysis of ten texts by Walter Besant (five novels and five stories) and sixteen novels by Wilkie Collins, based on the one hundred most frequent words of the entire set (the last two digits of the date of publication precede each abbreviated title; all are from the eighteen hundreds). These texts were collected for a study of Besant’s completion of Collins’s unfinished novel Blind Love, which shows that, despite Besant’s use of the extensive notes Collins provided, the point at which Besant takes over is very clearly marked (Hoover, “Authorial Style”). Cluster analysis compares the frequencies of all one hundred of the most frequent words simultaneously, determines which two texts are most similar to each other in how they use these words, and joins them into a cluster, then proceeds to find the next most similar pair or group of texts until all the texts are joined in a single cluster. The more similar the frequencies of the one hundred most frequent words are in two or more texts, the closer to the left those texts form a cluster. Thus figure 1 shows that “72 PoorF” and “75 LawLady” are much more similar to each other than are “95 Quarantine” and “93 Shrinking.” Similarly, “84 Dorothy” and “82 RevoltMan” are much more similar to each other than they are to the eight texts in the cluster below them. Finally, the ten Besant texts at the top of the graph are much more similar to each other than they are to the sixteen Collins texts at the bottom. Clearly, even the frequencies of the one hundred most frequent words very distinctly separate the styles of these two authors. Equally clearly, Collins’s texts are more similar to each other than are Besant’s, possibly because of the great variation in length in Besant’s texts.
Furthermore, Collins’s texts show some tendency to group by date of publication: the five texts in the bottom cluster were all written after 1880 (the clustering by date becomes more accurate when larger numbers of words are analyzed).
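The joining procedure just described can be sketched in plain Python. This is an illustration under stated assumptions, not a reconstruction of the analysis behind figure 1: the distance measure (Euclidean) and linkage rule (average) are choices made here, the function names are hypothetical, and the four three-word frequency profiles are invented toy data rather than measurements from any text.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two word-frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(group_a, group_b, profiles):
    """Mean distance over all pairs of texts drawn from the two groups."""
    distances = [euclidean(profiles[x], profiles[y]) for x in group_a for y in group_b]
    return sum(distances) / len(distances)

def cluster(profiles):
    """Repeatedly join the two most similar groups; return the joins in order.

    Each join is recorded as (group_a, group_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented relative frequencies of three words in four hypothetical texts.
profiles = {
    "A1": [5.0, 2.0, 1.0],
    "A2": [5.2, 2.1, 0.9],
    "B1": [3.0, 4.0, 2.0],
    "B2": [3.1, 4.1, 2.0],
}
for a, b, distance in cluster(profiles):
    print(a, "+", b, f"at distance {distance:.3f}")
```

The recorded distances correspond to the horizontal positions at which groups join in a graph like figure 1: the smaller the distance, the earlier the join.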
The results of PCA based on the same texts and the same words can be seen in figure 2. Instead of clustering similar texts together on the basis of the frequencies of the one hundred most frequent words, PCA compresses as much of the information about the frequencies of the one hundred most frequent words as possible into a small number of unrelated new variables, or components. The values for the two most important of these variables are then used to locate each text on a two-dimensional graph, with the first component on the horizontal axis and the second on the vertical axis. Figure 2 shows that the first component, which accounts for almost 33% of the variation in the frequencies of the words, is capturing authorship, with all the Collins texts to the right and all the Besant texts to the left. This means that many words are more frequent in all of the texts by Besant than in those by Collins and vice versa. The second component has no clear interpretation, though it is suggestive that later texts by both authors tend to appear toward the top of the graph. (The wider scattering of the texts by Besant reflects the same greater variation among his texts than among those by Collins that is evident in figure 1.)
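For readers who want to see the mechanics, the following is a minimal pure-Python sketch of PCA on a small table of word frequencies. It rests on several assumptions: frequencies are standardized word by word (a correlation-matrix PCA), components are found by power iteration with deflation rather than by a statistics package, every word's frequency varies across the texts, and the six four-word profiles are invented toy data. The function names are hypothetical.

```python
import math

def standardize(rows):
    """Turn each column (one word) into z-scores so every word has equal weight."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def correlation_matrix(z):
    """Covariance of z-scored columns, which is the correlation matrix of the words."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

def leading_eigen(matrix, iterations=500):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    p = len(matrix)
    start = [1.0 / (i + 1) for i in range(p)]  # uneven start vector
    scale = math.sqrt(sum(x * x for x in start))
    v = [x / scale for x in start]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * matrix[i][j] * v[j] for i in range(p) for j in range(p))
    return value, v

def pca(rows, components=2):
    """Return, per component, its share of variance, word weights, and text scores."""
    z = standardize(rows)
    m = correlation_matrix(z)
    p = len(m)
    total = sum(m[i][i] for i in range(p))
    results = []
    for _ in range(components):
        value, vec = leading_eigen(m)
        scores = [sum(x * w for x, w in zip(row, vec)) for row in z]
        results.append({"share": value / total, "weights": vec, "scores": scores})
        # Deflation: subtract the component just found before looking for the next.
        m = [[m[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    return results

# Invented relative frequencies of four words in six hypothetical texts.
names = ["A1", "A2", "A3", "B1", "B2", "B3"]
rows = [
    [1.2, 0.9, 2.0, 1.5], [1.3, 1.0, 2.1, 1.4], [1.1, 0.8, 1.9, 1.6],
    [0.6, 0.4, 3.0, 2.2], [0.7, 0.5, 3.1, 2.3], [0.5, 0.3, 2.9, 2.1],
]
first, second = pca(rows)
print(f"first component: {first['share']:.0%} of the variation")
for name, x, y in zip(names, first["scores"], second["scores"]):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```

The two scores printed for each text are its coordinates on a graph like figure 2. The sign of a component is arbitrary, so which group falls to the left and which to the right has no meaning in itself.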
PCA and cluster analysis are valuable for both exploratory work and in-depth analyses, but they have different strengths and weaknesses. Cluster analysis has the benefit of giving unequivocal results, while PCA graphs are more dependent on judgment, especially where the texts being compared do not separate as clearly as these. But PCA has one great advantage: using the same data as in figure 2, PCA can produce a graph like that in figure 3, in which the words are graphed onto the same two dimensions as the texts, so that it is immediately apparent which words are disproportionately rare or frequent in which texts.
The words most favored by Besant over Collins (on the far left) are but, would, one, man, a, all, or, so, and about, while those most favored by Collins over Besant (on the far right) are to, in, on, had, left, letter, time, and back. Even a cursory examination of figure 3 uncovers other interesting characteristics of the vocabularies of these two authors that suggest further directions for research (see Hoover, “Authorial Style”). For example, Besant favors negatives like no, not, nothing; forms of to be (be, is, are, was, were; only been and am are about equally favored); and the third-person plural pronouns they, them, and their. Collins favors the feminine pronouns she and her, the titles Mrs. and Miss, and the noun lady (together suggesting more emphasis on women); the first-person singular pronouns I, me, my; and several content words (house, room, letter, time, lady, way, first, and looked), compared with only man and know for Besant. Other methods can locate characteristic vocabulary, and PCA graphs quickly become unreadable as more words are analyzed, but the ability to produce graphs like those in figure 2 and figure 3 from a single set of data has made this kind of analysis among the most frequently used in computational studies.
As the incipient chronological differentiation in the texts above suggests, these same techniques can be used to study an author’s stylistic development by treating early and late periods as different authors. Both cluster analysis and PCA easily distinguish the early from the late Henry James (see also Hoover, “Corpus Stylistics”). As can be seen in figure 4, the clustering of his twenty-one major novels matches their chronology extremely closely, except for the unusual late novel, The Outcry (adapted from a 1909 play). This graph does more than dramatically demonstrate the development of James’s style, however. It also casts doubt on the widely held notion that the late style is a result of James’s adoption of dictation because of wrist pain in 1897, during the composition of What Maisie Knew. There is certainly no sign of any radical transformation of James’s style in 1897.
My final example of computer-assisted textual analysis is an exploratory study of differences in poetic vocabulary among a group of twenty-six male and female American poets born between 1911 and 1943.[9] This discussion will not pretend to settle the complex and contentious debate about the existence of feminine writing and will make no global claims about gender theory. Rather, it will demonstrate that textual analysis can produce provocative results that point toward areas where more research is needed, and will argue that interesting results are the norm for such an analysis.
Burrows has shown that it is relatively easy to distinguish male and female writers of the seventeenth and eighteenth centuries even using only a subset of the 150 most frequent words of the texts but that it becomes progressively more difficult with more recent writers (“Computers,” “Textual Analysis”). A 2002 study (Koppel, Argamon, and Shimoni), however, has shown that it remains possible, using more sophisticated methods, to identify the gender of both fiction and nonfiction documents in the British National Corpus (mostly written between 1974 and 1993) at a rate of about 80%.
Here I will use a method that focuses not on the most frequent words of texts that have been the province of so much textual analysis but on words that are neither very common nor very rare. The goal is not primarily to show that authors can be identified by gender on the basis of their characteristic words but rather to explore the vocabularies of the poets. The method is a modification of Burrows’s Zeta (“Who Wrote Shamela?,” “All the Way Through”) developed by Craig (Craig and Kinney) that I call Craig Zeta. This simple method divides two sets of texts into approximately equal-sized sections and compares how many sections for each author contain each word, ignoring the frequencies of words and concentrating on their consistency of appearance across the sections. The sets can be selected on the basis of any perceived contrast, but here the contrast is texts by women versus those by men.[10] Combining the ratio of the sections by women in which each word occurs with the ratio of the sections by men from which it is absent yields a single measure of distinctiveness that ranges from two (words found in every women’s section and absent from every men’s section) to zero (vice versa). Sorting the words on this composite score produces two lists of marker words, one favored by these women and avoided by these men, and one favored by these men and avoided by these women.
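The composite score described above is simple enough to sketch directly. The code below follows the verbal description only. It takes “combining” to mean adding the two proportions, which is what the stated range of zero to two implies. Section size, tokenization, and the handling of a final short section are assumptions, the function names are hypothetical, and the four tiny sections are invented toy data.

```python
def split_sections(tokens: list[str], size: int) -> list[set[str]]:
    """Cut a token list into equal-sized sections, each reduced to its set of words.

    A final section shorter than `size` is dropped (an assumption of this sketch).
    """
    return [set(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]

def zeta_scores(sections_a: list[set[str]], sections_b: list[set[str]]) -> dict[str, float]:
    """Score each word: share of A sections containing it plus share of B sections lacking it."""
    vocabulary = set().union(*sections_a, *sections_b)
    scores = {}
    for word in vocabulary:
        present_in_a = sum(word in s for s in sections_a) / len(sections_a)
        absent_from_b = sum(word not in s for s in sections_b) / len(sections_b)
        scores[word] = present_in_a + absent_from_b
    return scores

# Invented sections: two from set A and two from set B.
sections_a = [{"spin", "the"}, {"spin", "mother", "the"}]
sections_b = [{"beer", "the"}, {"beer", "mother", "the"}]

scores = zeta_scores(sections_a, sections_b)
for word, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(f"{word}: {score:.1f}")
```

Sorting in descending order puts the markers of set A at the top of the list and the markers of set B at the bottom. Words near 1.0, such as a function word found in every section of both sets, distinguish neither.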
Testing fourteen additional poets, seven men and seven women, with these marker words produces the result shown in figure 5, where the vertical axis shows the proportion of all the different words in each text that are among the five hundred most distinctive male marker words, and the horizontal axis shows the proportion of all the different words in each text that are among the five hundred most distinctive female marker words. Despite the limitations of this exploratory study, it does a remarkably good job, correctly identifying the genders of twenty of the twenty-five new sections of poetry by poets who played no part in the selection of the words (the errors are in bold type). These same words produce a similar result for seven male and seven female contemporary novelists, which is further evidence that the method is capturing some kind of genuine difference. Consider now the one hundred most distinctive male and female marker words, shown in table 1 (the lists are identified by gender only in the note below each table, so that readers who want to can try identifying which is which; an informal preliminary survey suggests that most readers can).
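The two coordinates plotted for each new text can be computed as follows. This is a minimal sketch with hypothetical function names and invented marker lists. The decision rule in `closer_set`, which simply compares the two proportions, is an assumption made for illustration and may not match how positions on figure 5 were judged.

```python
def marker_proportions(tokens: list[str], markers_a: set[str], markers_b: set[str]):
    """Return the shares of a text's different words found in each marker list."""
    types = set(tokens)  # "different words": each word counted once
    share_a = len(types & markers_a) / len(types)
    share_b = len(types & markers_b) / len(types)
    return share_a, share_b

def closer_set(tokens: list[str], markers_a: set[str], markers_b: set[str]) -> str:
    """Assumed rule: assign the text to whichever marker list covers more of its words."""
    share_a, share_b = marker_proportions(tokens, markers_a, markers_b)
    return "A" if share_a > share_b else "B"

# Invented marker lists and a four-token test text.
markers_a = {"spin", "mother"}
markers_b = {"beer"}
tokens = ["the", "spin", "mother", "spin"]

print(marker_proportions(tokens, markers_a, markers_b))
print(closer_set(tokens, markers_a, markers_b))
```

Counting different words rather than all occurrences matches the emphasis on consistency of appearance over raw frequency in the method used to select the markers.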
The most distinctive female and male marker words can be distributed variously, so long as there is a great difference between the two genders. Relatively common words like mother are found in twenty women’s sections but only eleven men’s; some less frequent words like cross are found in sixteen men’s sections but only three women’s; others, still less frequent, like spin, are found in nine women’s sections but no men’s sections. Female markers like children and mirrors and male markers like beer and lust seem almost stereotypical, but there are also surprises, like the female marker fist and the male markers song and dancing. Studying a concordance of the entire set of texts is an excellent way to examine these words in context. The fact that Sara Teasdale, H.D., and Edna St. Vincent Millay, whose texts cluster with the men’s, were born about twenty to thirty years before the poets on which the words are based also seems worth investigating, as do some large, distinctive clusters of related words, shown in table 2, which are drawn from among the five hundred most distinctive male and five hundred most distinctive female markers. Any study of the vocabulary of male and female poets would benefit from larger numbers of poets and larger samples, and many other configurations that address different contrasts are possible (e.g., nationality or historical period).
Examples and suggestions could be multiplied almost indefinitely, but I hope to have provided a general idea of the varieties, the challenges, and the benefits of computer-assisted textual analysis and of the opportunities it provides for a wide range of literary studies. Textual analysis can help the literary scholar in the relatively simple but important tasks of collecting, organizing, and evaluating examples and evidence that are relevant to a more traditional study. It can act as a kind of discovery procedure for revealing previously unnoticed trends and suggesting productive and original questions in open-ended, exploratory work. It can inform much more specific and directed kinds of research that test a hypothesis, hunch, thesis, or critical claim. It can provide access to detailed and precise kinds of evidence that would otherwise be impractical to assemble or completely unavailable. It can also help establish or revise an author’s canon by removing spurious works, by adding previously unknown works, or by suggesting or confirming the chronological relations among an author’s works. Computer-assisted textual analysis is neither a panacea nor a substitute for sound literary judgment, but its ability to refine, support, and augment that judgment makes it an important analytic method for literary studies in the digital age.
1. The best resources for this kind of research are the journals Literary and Linguistic Computing and Computers and the Humanities (through 2004), which specialize in computational approaches. Computational work is, however, increasingly appearing in other journals, such as Eighteenth-Century Studies, Ben Jonson Journal, Milton Quarterly, Modern Language Review, Style, Victorian Periodicals Review, and Early Modern Literary Studies.
2. Increasingly sophisticated online resources built around collections of texts provide another opportunity for textual analysis (see, e.g., Cooney, Roe, and Olsen’s essay in this volume). Most of Cather’s texts, for example, can be analyzed online through The Willa Cather Archive (Jewell). The Brown University Women Writers Project allows users to perform textual analysis on a large collection of writings by women, though it requires an institutional subscription or an inexpensive license. ARTFL has a huge collection of French texts with tools for analysis, and MONK offers sophisticated tools that operate on some large, publicly available collections of texts.
4. Faulkner’s novels, for example, are not generally available, and an earlier online electronic version of Light in August is no longer available. The full text of The Sound and the Fury has also recently been removed from an online scholarly edition (Stoicheff et al.).
5. Also available are several free online OCR tools of varying quality and ease of use, and some older versions of Microsoft Office include document imaging, which can scan paper documents and perform OCR.
6. Juola’s JGAAP is a simple but powerful and versatile suite of authorship methods, best used in conjunction with his “Authorship Attribution,” which discusses many of the methods it implements.
8. In WordSmith this function, located under the Detailed Consistency tab of Wordlist, operates on special word-list files produced earlier (the View Column Totals option must also be selected). The Web sites for my The Parallel Wordlist Spreadsheet and The Intelligent Archive provide detailed instructions.
9. The texts come from Literature Online; to simplify the analysis and avoid problems of samples of different sizes, I truncated each poet’s sample at about eight thousand words. Obviously, an analysis based on only eight thousand words by each of twenty-six authors must be considered preliminary.
Algee-Hewitt, Mark Andrew. “The Afterlife of the Sublime: Toward a New History of Aesthetics in the Long Eighteenth Century.” Diss. New York U, 2008. Print.
Anthony, Laurence. AntConc. Anthony, 2012. Web. 9 July 2012.
The ARTFL Project. Division of the Humanities, U of Chicago, n.d. Web. 9 July 2012.
Barlow, Michael. “Concordancer: MonoConc Pro (MP 2.2).” Athelstan, n.d. Web. 9 July 2012.
Biber, Douglas. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge: Cambridge UP, 1995. Print.
The Brown University Women Writers Project. Brown U, n.d. Web. 9 July 2012.
Burrows, John. “All the Way Through: Testing for Authorship in Different Frequency Strata.” Literary and Linguistic Computing 22.1 (2007): 27–47. Web. 9 July 2012.
———. Computation into Criticism. Oxford: Clarendon, 1987. Print.
———. “Computers and the Study of Literature.” Computers and Written Texts. Ed. Christopher S. Butler. Oxford: Blackwell, 1992. 167–204. Print.
———. “The Englishing of Juvenal: Computational Stylistics and Translated Texts.” Style 36.4 (2002): 677–99. Web. 9 July 2012.
———. “Textual Analysis.” A Companion to Digital Humanities. Ed. Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell, 2004. N. pag. The Alliance of Digital Humanities Organizations. Web. 9 July 2012.
———. “Who Wrote Shamela? Verifying the Authorship of a Parodic Text.” Literary and Linguistic Computing 20.4 (2005): 437–50. Web. 9 July 2012.
Clement, Ross, and David Sharp. “Ngram and Bayesian Classification of Documents.” Literary and Linguistic Computing 18.4 (2003): 423–47. Web. 9 July 2012.
Conc. Summer Inst. of Linguistics, 1996. Web. 9 July 2012.
Copyright and Fair Use. Stanford U Libs., 2005–09. Web. 9 July 2012.
Craig, Hugh. “Jonsonian Chronology and the Styles of A Tale of a Tub.” Re-presenting Ben Jonson: Text, History, Performance. Ed. Martin Butler. Houndmills: Macmillan, 1999. 210–32. Print.
Craig, Hugh, and Arthur F. Kinney. Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge UP, 2009. Print.
Davies, Mark. BYU-BNC: The British National Corpus. Brigham Young U, n.d. Web. 9 July 2012.
Documenting the American South. U of North Carolina Lib., U of North Carolina, Chapel Hill, 26 Sept. 2012. Web. 26 Sept. 2012.
Dover, Adrian. The Ladder: A Henry James Website. Dover, 15 Nov. 2011. Web. 9 July 2012.
Elliott, Ward E. Y., and Robert J. Valenza. “Shakespeare’s Vocabulary: Did It Dwarf All Others?” Stylistics and Shakespeare’s Language: Transdisciplinary Approaches. Ed. Mireille Ravassat and Jonathan Culpeper. London: Continuum, 2011. 34–57. Print.
Forsyth, R. S., D. I. Holmes, and E. K. Tse. “Cicero, Sigonio, and Burrows: Investigating the Authenticity of the Consolatio.” Literary and Linguistic Computing 14.3 (1999): 375–400. Web. 9 July 2012.
Fortier, Paul. “Prototype Effect vs. Rarity Effect in Literary Style.” Thematics: Interdisciplinary Studies. Ed. Max Louwerse and Willie van Peer. Amsterdam: Benjamins, 2002. 397–405. Print.
Garrard, Peter, Lisa M. Maloney, John R. Hodges, and Karalyn Patterson. “The Effects of Very Early Alzheimer’s Disease on the Characteristics of Writing by a Renowned Author.” Brain 128.2 (2005): 250–60. Web. 9 July 2012.
Grieve, Jack. “Quantitative Authorship Attribution: An Evaluation of Techniques.” Literary and Linguistic Computing 22.3 (2007): 251–70. Web. 9 July 2012.
Halliday, M. A. K. “Linguistic Function and Literary Style: An Inquiry into the Language of William Golding’s The Inheritors.” Essays in Modern Stylistics. Ed. Donald C. Freeman. London: Methuen, 1981. 325–60. Print.
Holmes, David I., Michael Robertson, and Roxanna Paez. “Stephen Crane and the New York Tribune: A Case Study in Traditional and Non-traditional Authorship Attribution.” Computers and the Humanities 35.3 (2001): 315–31. Web. 9 July 2012.
Hoover, David L. “Authorial Style.” Language and Style: In Honour of Mick Short. Ed. Dan McIntyre and Beatrix Busse. Houndmills: Palgrave, 2010. 250–71. Print.
———. “Corpus Stylistics, Stylometry, and the Styles of Henry James.” Style 41.2 (2007): 160–89. Web. 9 July 2012.
———. “The End of the Irrelevant Text: Electronic Texts, Linguistics, and Literary Theory.” Digital Humanities Quarterly 1.2 (2007): n. pag. Web. 9 July 2012.
———. “Evidence of Value and the Value of Evidence: Some Case Studies.” Evidence of Value. Ed. Harold Short, David Robey, and Lorna Hughes. Aldershot: Ashgate, forthcoming.
———. The Excel Text-Analysis Pages. Hoover, 2009. Web. 9 July 2012.
———. “Hot-Air Textuality: Literature after Jerome McGann.” Text Technology 14.2 (2005): 71–103. Web. 9 July 2012.
———. Language and Style in The Inheritors. Lanham: UP of Amer., 1999. Print.
———. “Multivariate Analysis and the Study of Style Variation.” Literary and Linguistic Computing 18.4 (2003): 341–60. Web. 9 July 2012.
———. The Parallel Wordlist Spreadsheet. Hoover, 2009. Web. 9 July 2012.
———. “Quantitative Analysis and Literary Studies.” A Companion to Digital Literary Studies. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell, 2007. N. pag. The Alliance of Digital Humanities Organizations. Web. 9 July 2012.
———. “Searching for Style in Modern American Poetry.” Directions in Empirical Literary Studies: Essays in Honor of Willie van Peer. Ed. Sonia Zyngier, Marisa Bortolussi, Anna Chesnokova, and Jan Auracher. Amsterdam: Benjamins, 2008. 211–27. Print.
Hoover, David L., and Shervin Hess. “An Exercise in Non-ideal Authorship Attribution: The Mysterious Maria Ward.” Literary and Linguistic Computing 24.4 (2009): 467–89. Web. 9 July 2012.
The Intelligent Archive. Centre for Literary and Linguistic Computing, U of Newcastle, Australia, 21 Sept. 2012. Web. 21 Sept. 2012.
Jewell, Andrew, ed. The Willa Cather Archive. Center for Digital Research in the Humanities, U of Nebraska, Lincoln, n.d. Web. 9 July 2012.
Juola, Patrick. “Authorship Attribution.” Foundations and Trends in Information Retrieval 1.3 (2008): 233–334. Web. 9 July 2012.
———. JGAAP. N.p., 15 Oct. 2012. Web. 12 Nov. 2012. <http://evllabs.com/jgaap/w/index.php/Main_Page>.
Juxta. Applied Research in Patacriticism, U of Virginia, n.d. Web. 9 July 2012.
Koppel, Moshe, Shlomo Argamon, and Anat Rachel Shimoni. “Automatically Categorizing Written Texts by Author Gender.” Literary and Linguistic Computing 17.4 (2002): 401–12. Web. 9 July 2012.
Landow, George, ed. The Victorian Web. N.p., n.d. Web. 9 July 2012.
Lee, David. Bookmarks for Corpus-Based Linguistics. Lee, 15 Aug. 2010. Web. 9 July 2012.
Lewis, Paul. The Wilkie Collins Pages. Lewis, 2009–12. Web. 9 July 2012.
Mark Ockerbloom, John. The Online Books Page. Mark Ockerbloom, 1998– . Web. 9 July 2012.
Mark Ockerbloom, Mary. A Celebration of Women Writers. Mary Mark, 1994–2012. Web. 9 July 2012.
Martindale, Colin. The Clockwork Muse: The Predictability of Artistic Change. New York: Basic, 1990. Print.
McKenna, C. W. F., and A. Antonia. “‘A Few Simple Words’ of Interior Monologue in Ulysses: Reconfiguring the Evidence.” Literary and Linguistic Computing 11.2 (1996): 55–66. Web. 9 July 2012.
MONK: Metadata Offer New Knowledge. N.p., n.d. Web. 9 July 2012.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. London: Verso, 2005. Print.
Morgan, Eric Lease. Alex Catalogue of Electronic Texts. Infomotions, n.d. Web. 9 July 2012.
PCA Online. Centre for Literary and Linguistic Computing, U of Newcastle, 2012. Web. 9 July 2012.
Pennebaker, James W., and Lori D. Stone. “Words of Wisdom: Language Use over the Lifespan.” Journal of Personality and Social Psychology 85.2 (2003): 291–301. Web. 9 July 2012.
Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Urbana: U of Illinois P, 2011. Print.
Rybicki, Jan. “Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and Its Two English Translations.” Literary and Linguistic Computing 21.1 (2006): 91–103. Web. 9 July 2012.
Scott, Mike. WordSmith Tools. Lexical Analysis Software; Oxford UP, 1996– . Web. 9 July 2012.
Sigelman, Lee, and William Jacoby. “The Not-So-Simple Art of Imitation: Pastiche, Literary Style, and Raymond Chandler.” Computers and the Humanities 30.1 (1996): 11–28. Web. 9 July 2012.
Stamou, Constantina. “Stylochronometry: Stylistic Development, Sequence of Composition, and Relative Dating.” Literary and Linguistic Computing 23.2 (2008): 181–99. Web. 9 July 2012.
Stewart, Larry L. “Charles Brockden Brown: Quantitative Analysis and Literary Interpretation.” Literary and Linguistic Computing 18.2 (2003): 129–38. Web. 9 July 2012.
Stoicheff, Peter, et al., eds. The Sound and the Fury. By William Faulkner. N.p., Feb. 2011. Web. 9 July 2012.
TAPoR. TAPoR team, U of Alberta, 2012. Web. 9 July 2012.
Tsukamoto, Satoru. KWIC Concordance for Windows. Nihon U, n.d. Web. 9 July 2012.
Watt, R. J. C. Concordance. Watt, 3 Mar. 2012. Web. 9 July 2012.