Hist 696: Digging into Data

Megan and I are leading discussion this week, so we thought we would blog early to help start conversation about this week’s readings. I have taken a survey approach, briefly describing what I saw as the major points of each reading and posing some questions for discussion. The topic for this week’s adventures in digital humanities is Data Mining and Distant Reading, two ways of engaging with primary source material that represent both a continuation of and a departure from the methods of “traditional” scholarship.

First to data mining. All three of our articles focus on the question of data mining, its usefulness, applications and the like. One thing to note is the date of these articles. The first article by Dan Cohen and the article by Gregory Crane were written in 2006. The second article by Dan Cohen was written in 2010. Both the technology for data mining and the sources available for mining expanded considerably between 2006 and 2010.

In his first article, “From Babel to Knowledge: Data Mining Large Digital Collections,” Dan Cohen outlines some early attempts to automate the gathering of historical information. He argues that “… computational methods, which allow us to find patterns, determine relationships, categorize documents, and extract information from massive corpi, will form the basis for new tools for research in the humanities and other disciplines in the coming decade.” The first tool is “Syllabus Finder,” a now-discontinued resource for finding syllabi on the web.(1) The logic of the search algorithm is quite interesting. Because most syllabi contain certain phrases and elements that can be identified through textual analysis, the algorithm ranks the documents it finds by their likelihood of being syllabi, based on the number of standard syllabus features they exhibit. The second tool is H-Bot, an experiment in question answering that attempts to answer factual questions about the past. Its algorithms appear to use probability calculus and inductive logic to determine the answer that is most likely to be true, based on the information returned from a wide variety of sources. These sorts of searches rely on data mining, on tabulating word-use frequencies, and on determining which word patterns are most likely in which contexts.
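To make the ranking logic described above a little more concrete, here is a minimal Python sketch of scoring documents by how many "syllabus-like" features they contain. This is not Cohen's code. The feature list, the function names (syllabus_score, rank_documents), and the sample documents are all invented for illustration, and a real tool would presumably weight features rather than simply count them.

```python
import re

# Hypothetical syllabus features: illustrative guesses, not Cohen's actual list.
SYLLABUS_FEATURES = [
    r"\bsyllabus\b",
    r"\boffice hours\b",
    r"\brequired (texts?|readings?)\b",
    r"\bweek \d+\b",
    r"\bgrading\b",
    r"\bassignments?\b",
]


def syllabus_score(text):
    """Count how many distinct features appear at least once in the text."""
    lowered = text.lower()
    return sum(1 for pattern in SYLLABUS_FEATURES if re.search(pattern, lowered))


def rank_documents(documents):
    """Return (name, score) pairs sorted from most to least syllabus-like."""
    scored = [(name, syllabus_score(text)) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample pages standing in for search results.
    sample = {
        "course_page": "Syllabus. Office hours: Monday. Week 1: Introduction. Grading policy.",
        "blog_post": "This week I read a book about the history of the novel.",
    }
    for name, score in rank_documents(sample):
        print(name, score)
```

The point of the sketch is only that "likelihood of being a syllabus" can be approximated by something as simple as counting matches, which is worth keeping in mind when we ask how much trust to place in the results.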

What are your reactions to relying on computer searches, like these, to answer historical questions? Is this computational way of determining answers similar to or different from our own?

Gregory Crane’s article, “What do you do with a million books?”, pushes further, asking the reader (and librarians) to move beyond the “book” and to start asking what can be accomplished by creating a digital, machine-readable library of written material. He notes that “the ability to decompose information into smaller, reusable chunks, to learn autonomously from a changing environment, and to accept explicit structured feedback from many human users in real time” distinguish digital from print and argues that the information contained in digital libraries can be made more useful and more accessible than its print counterparts. Despite the promises of digital collections, Crane notes some of the challenges in creating such large resources. The set-up costs are high, copyright is a persistent issue, and OCR both creates substantial “noise” and is not always able to handle the variety of page layouts or the distinction between footnotes and text.(2) He concludes with three things necessary for the creation of a successful digital library and its ability to transform humanities scholarship: enhanced OCR, machine translation, and information extraction.

This sort of study breaks apart the individual books, making the primary unit of study something different, and something dependent upon the user. What might be the benefits and costs of this way of approaching texts? How comfortable are you with this mode of approaching texts? What happens to authorial authority in this context?

The second article by Dan Cohen, “Initial Thoughts on the Google Books Ngram viewer and datasets,” provides more current reflections on where the potential of data mining now stands. Google Ngrams provides a simple visualization of the frequency of word occurrences based upon analysis of the Google Books corpus. As Cohen jokes, it is a digital humanities “gateway drug.” While he acknowledges some of the challenges others have noted about the Google dataset, such as the selection of the corpus and the quality of Google’s OCR and metadata, he and others propose that, with these downsides taken into consideration, interesting information can still be gleaned by mining the corpus of Google Books. One important thing that arises from Cohen’s own qualifications of the Google Books data is the need for contextual information, both in terms of how words are studied and mined – not just frequency but word pairings and associations – and in terms of the need to move from instance to context in order to uncover the significance of the data.
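For those curious what "frequency of word occurrences" means in practice, here is a rough Python sketch of the two ideas in the paragraph above: how often a word appears in each year relative to all the words from that year, and which words appear near it. This is an assumption-laden toy, not Google's method. The function names, the tiny sample corpus, and the crude tokenizer are all my own inventions.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_year(corpus, term):
    """corpus is an iterable of (year, text) pairs.

    Returns {year: occurrences of term / total words in that year}.
    """
    term = term.lower()
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        words = tokenize(text)
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


def neighbors(corpus, term, window=2):
    """Count the words that appear within `window` words of each occurrence of term."""
    term = term.lower()
    counts = Counter()
    for _, text in corpus:
        words = tokenize(text)
        for i, word in enumerate(words):
            if word == term:
                counts.update(words[max(0, i - window):i])
                counts.update(words[i + 1:i + 1 + window])
    return counts


if __name__ == "__main__":
    # Invented two-year "corpus" purely for demonstration.
    toy_corpus = [
        (1900, "the war ended and the peace began"),
        (1910, "peace at last"),
    ]
    print(relative_frequency_by_year(toy_corpus, "peace"))
    print(neighbors(toy_corpus, "peace", window=1))
```

The first function gives the kind of line an Ngram-style chart plots; the second is one simple way of moving from bare frequency toward the word pairings and associations Cohen asks for.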

What are your thoughts about what can be learned through Google’s data from digitizing books? Are the objections raised by Natalie Binder regarding the quality of Google’s OCR and the metadata sufficiently addressed in Dan’s response?

Data mining can be a form of close reading: a study of a particular word or the study of a particular author’s use of language. It can also be a form of distant reading – a way of looking at the larger patterns across texts and across time. Distant reading is covered in detail this week through the work of Franco Moretti.

Graphs, Maps, and Trees: Abstract Models for Literary History is a short and dense book about three modes of studying literature and, by extension, other work in the Humanities. The work is an apology for three forms of what Moretti calls “distant reading,” where “distance is … not an obstacle, but a specific form of knowledge: fewer elements, hence a sharper sense of their overall interconnection.”(3) These are graphs (quantitative analysis), maps (spatial analysis), and trees (morphological analysis – form and structure).

In “Graphs,” Moretti takes as his example the history of the novel, building on the work of others (because working with data is collaborative) to give a quantitative analysis of the history of the novel in Britain, Japan, Italy, Spain, and Nigeria.(4) These studies not only reveal the novel rising in different global regions at different times, but also reveal multiple “rises” of the novel. He then moves to genres in an argument for studying the temporary structures that make up the larger historical patterns. “Epistolary,” “Gothic,” and “Historical” novels are dissected further into forty-four sub-genres that appear and disappear in clusters. The question then arises: what brings about these shifts in genre at relatively predictable intervals? Generational differences and political interventions are his answers. One observation he makes that was particularly interesting to me was the recurrence of gender shifts. No single shift was “the rise” of female authors; each instead seems to reflect which genre of literature was preferred at a given time.

“Maps” introduces a second method of examining texts: spatial analysis through the creation of maps and diagrams of how elements relate to one another. He uses as his example “village stories,” arguing that their “circular” arrangement “reverses the direction of history, making … urban readers… look at the world according to the older, ‘centred’ viewpoint of an enclosed village.”(5) As Moretti describes, literary maps “prepare a text for analysis. You choose a unit – walks, lawsuits, luxury goods, whatever – find its occurrences, place them in space… you reduce the text to a few elements, and abstract them from the narrative flow, and construct a new, artificial object… And with a little luck, these maps will be more than the sum of their parts: they will possess ‘emerging’ qualities, which were not visible at the lower level.”(6) The creation of maps or diagrams does not guarantee the emergence of an interesting pattern, but when those patterns do emerge, they can change our understanding of the piece being examined.

Finally, “Trees” turns to the question of form, to the tracking of changes or differences in forms brought about by divergence and convergence. In his example, Moretti studies those elements of detective novels that made Sir Arthur Conan Doyle successful and his contemporary competition less so. By tracing where others deviate, or where Doyle’s own less successful works deviated, one can arrive at the characteristics of a successful detective story. Moretti describes literary trees as exploring the “microlevel of stylistic mutations,” tracing the divergences and convergences of literary elements.(7)

What are your thoughts as to the connections between these forms of analysis (quantitative, spatial, morphological) and the larger questions historians and Moretti himself want to address? Do the detailed and distant views enable us to make the claims about patterns or causes that historians wish to make? Of these approaches, “trees” seems to me one of the more challenging to apply in history. What sort of questions or areas of inquiry might the creation of trees help historians study?

I am excited to hear your thoughts and opinions on this method of engaging in historical scholarship and textual study.



(1) Discontinued because Google retired its search API.

(2) This is beginning to improve, though it is still an issue.

(3) Moretti, 1.

(4) Moretti, 5.

(5) Moretti, 39.

(6) Moretti, 53.

(7) Moretti, 91.