Updating the Dissertation Description

Dissertations can be fickle things. When I started my A.B.D. journey in 2014, I had a very ambitious project outline and very little understanding of the technical skills I needed to see it through. Since then, through a lot of hard work and a number of false starts, I have greatly expanded my technical skills, learned a good deal about data management (thanks, Wendy!), and discovered how true it is that even (especially?) with computers, less is more when it comes to scholarly projects. While I hope in future iterations of the project to bring network and geospatial analysis to bear on my examination of the evolving relationship between beliefs, health practices, gender dynamics, and temporal imaginaries in the development of the Seventh-day Adventist Church, the dissertation itself will focus on textual analysis of the periodical literature produced by this prolific religious movement.

What follows is my revised dissertation description, submitted to the history faculty at George Mason in pursuit of the department’s completion grant. In the coming months, I will be posting drafts of my technical essays and code notebooks, documenting the computational work that undergirds my historical analysis. This description provides the context for those technical essays.


A Gospel of Health and Salvation: A Digital Study of Seventh-day Adventism, Health, and American Culture, 1843-1920

Graham cakes, corn flakes, and therapeutic baths. These are not the first things that come to mind when thinking about religious practices in American history. Yet in 1866, the newly formed Seventh-day Adventist church opened the Western Health Reform Institute, dedicated to providing water cure treatments and health education within a Sabbath-keeping environment.1 Embracing elements of homeopathy, water cure, diet reform, and dress reform, this health-minded Protestant sect incorporated vegetarianism, health reform, and medicine into their evolving understanding of salvation as they grappled with an ever-anticipated but perpetually delayed millennium.

Seventh-day Adventism began as a community united by a shared belief in the coming return of Christ according to a particular interpretation of biblical literature, rather than as a group tied together by a particular liturgy or by geographic area. As a result, the community organized itself primarily through the publication and distribution of books and periodicals. This reliance on the printed word to orient, build, and maintain a community of faith provides a distinct opportunity for studying the development of this religious movement through their publications. Working with approximately 13,000 periodical issues (nearly 200,000 pages) produced by the denomination in four geographic regions of the United States, I am using text mining to examine the shifting and mutually informing relationship between the denomination’s theological commitments, their gendered health practices, and their evolving conception of time.

Seventh-day Adventism is an apocalyptic and millennialist belief system, meaning that its adherents anticipate the imminent return of Jesus Christ, the start of his thousand-year reign, and the end of the world. The movement’s early followers embraced William Miller’s teaching that Christ would return in the period between 1843 and 1844. In the wake of the Great Disappointment, the small group that would become the Seventh-day Adventist church organized around the belief that October 22, 1844, marked the day Christ entered into the inner sanctuary of the Temple in heaven and there began the work of judging all of humanity.2 They also adopted seventh-day Sabbatarianism, holding that Saturday, rather than Sunday, was the proper day for Christian observance, as decreed in the ten commandments. These beliefs contributed to Seventh-day Adventism’s distinctive and shifting temporal imaginary: a shared understanding of time as ordered by God to rotate around the Jewish Sabbath and as rapidly approaching the end of human history.3 I argue that we can see the contours and effects of this shifting temporal imaginary in the community’s health practices and in the roles women assumed within the denomination, and that attention to the temporal imaginary provides an additional framework for understanding the flourishing of alternative religious movements during the nineteenth century.4

A Gospel of Health and Salvation is a web-based project composed of five modules, one framing, two methodological, and two interpretative, all of which combine narrative elements with interactive visualizations and computational analysis. Module 1, “A Gospel of Health and Salvation,” provides a non-technical overview of the project and situates my discussion of early Seventh-day Adventism and their beliefs and practices around health within the historical context of the mid-nineteenth to early-twentieth centuries. Module 2, “Of Making Many Books,” explores the history of Seventh-day Adventist publishing and outlines my methods for selecting, collecting, and preparing the corpus of materials for analysis. Module 3, “Modeling Time,” is a study of different ways to algorithmically identify and highlight changes in discourse over time, particularly when considering time as fluid, rather than constant or static. Module 4, “Work Out Your Salvation,” examines the ways the changing temporal imaginary of Seventh-day Adventism shaped their health practice, particularly their attitudes toward faith healing and the relationship between the healthy body and salvation. Module 5, “Teachers, Mothers, Healers All,” focuses on the ways the changing temporal imaginary of Seventh-day Adventism shaped gender relations and hierarchies within the denomination, as seen in their health practices and their denominational structures. Together, these modules engage critically with the computational methods of the digital humanities within the context of religious history, and reveal the ways the community’s conception of time shaped the possibilities of their theology and their health practices, providing the framework within which they defined and sought to achieve flourishing human lives.

  1. Numbers, Ronald L. Prophetess of Health: A Study of Ellen G. White. Wm. B. Eerdmans Publishing Co., 2008. pp. 156-7. 

  2. “The Great Disappointment” refers to the period after October 22, 1844, the date the Millerites believed Jesus would return. Malcolm Bull and Keith Lockhart place the number of early Seventh-day Adventist followers in 1849 at approximately 100. Bull, Malcolm, and Keith Lockhart. Seeking a Sanctuary: Seventh-Day Adventism and the American Dream. Indiana University Press, 2006. p. 7. 

  3. I am using “imaginary” in the sense of Charles Taylor’s “modern social imaginary,” namely, that which “enables, through making sense of …” or “that common understanding that makes possible common practices …” Taylor, Charles. Modern Social Imaginaries. Durham; London: Duke University Press, 2004. pp. 2, 23. 

  4. One common framework for interpreting 19th-century revivalism is Carroll Smith-Rosenberg’s economic interpretation. See Smith-Rosenberg, Carroll. “The Cross and the Pedestal: Women, Anti-Ritualism, and the Emergence of the American Bourgeoisie,” In Disorderly Conduct: Visions of Gender in Victorian America, Oxford University Press, 1986. My approach is similar to that of Robert Abzug in Cosmos Crumbling, but is focused on the temporal aspects of the religious cosmology. Abzug, Robert H. Cosmos Crumbling: American Reform and the Religious Imagination. Oxford University Press, 1994. 

I can haz charts!

[Update: Thanks to an excellent suggestion from Chris Sexton, the visualizations are initially image files, and you can click through to the interactive html versions. Browser overload no more!]

[Disclaimer: this page will probably take a while to load because there is a lot of JavaScript further down the page.]

I have so much explaining to do about how I got to this point, but …

I finally made something interesting and I want to share! So the story of how I got here will be told later, via flashbacks.

Who, What Now?

First, a little context. I am working with a series of periodicals which were published in the mid-19th to early-20th centuries and subsequently digitized and OCRed by the Seventh-day Adventist denomination. The periodicals are available to the public at http://documents.adventistarchives.org.1

I will walk through my process for selecting periodicals at a later point, but I am currently working with 30 different titles, from which I have 13,231 individual issues that have been split into 196,761 pages. This is not “BIG” big data … but it is big enough. I did some math: if I were to spend one minute per page, it would take me 409 days at 8 hours a day to look at the whole corpus.
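For anyone who wants to check that arithmetic, here is the back-of-the-envelope version (a throwaway sketch, not part of the dissertation code):

```python
# Rough check of the reading-time estimate above.
pages = 196761
minutes_per_page = 1
hours_per_day = 8

total_hours = pages * minutes_per_page / 60
work_days = total_hours / hours_per_day
print(round(work_days, 1))  # ~409.9 eight-hour days, i.e. well over a year of reading
```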

In order to investigate the development of Seventh-day Adventist health beliefs and practices in relationship to their understanding of salvation, I am applying clustering algorithms to the corpus to identify those documents that contain discourses connected to my topics of interest. But, in order to have confidence in the results of those algorithms, I need an automated way to know more about the quality of the data that I am working with, especially since

  1. I was not responsible for the creation of the data and
  2. I cannot manually examine the corpus and still complete my dissertation on time (or maintain my sanity).

There are different strategies when it comes to dealing with messy OCR and determining how much of it can be ignored, much of which depends on the questions being asked and the types of algorithms in use. However, there has been little discussion about ways to ascertain the quality of the documents one is working with and the processes for improving it, particularly when you do not have a “ground truth” against which to compare. The approach I am pursuing is to take each token in the document and compare that token to a large wordlist of “verified” tokens.2 More on the creation of that wordlist to come. (Spoilers: I read the README for the NLTK words list … and wept.)
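The core of that check is small. Here is a minimal sketch of the idea, using NLTK’s words corpus as a stand-in for my larger verified wordlist (the tokenization pattern and the wordlist choice here are illustrative assumptions, not my final code):

```python
# Minimal sketch: flag tokens that do not appear in a "verified" wordlist.
# Requires: nltk.download('words')
import re
from nltk.corpus import words

VERIFIED = {w.lower() for w in words.words()}  # stand-in for the larger wordlist

def error_rate(text):
    """Return the share of tokens in `text` not found in the wordlist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return None
    errors = [t for t in tokens if t not in VERIFIED]
    return len(errors) / len(tokens)
```

Anything not in the wordlist counts as a “bad token,” which is exactly why the caveats below matter.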

This approach is by no means fool-proof. It is really bad at dealing with

  • errors that happened to create new “verified” tokens,
  • obscure words or figures referenced by the SDA authors (despite my best efforts to capture them),
  • intentional mis-spellings (there are a few editors who … really could not spell),
  • names of SDA community members,

to name a few. This is to say, there will undoubtedly be a number of correct OCR transcriptions that are labeled as errors and a number of incorrect OCR transcriptions that will not be labeled as errors. If I were attempting to compute absolute error rates and discard error words, my approach would not be sufficient.

However, that is not the project at hand. My goal is to get a bird’s eye view of the quality of the OCR I am working with and evaluate the success of different strategies for correcting common OCR mistakes. I am also anticipating that different strategies will be needed for different titles, and possibly over different time periods, as the layout and print methods greatly influence the success of the OCR engines. By comparing the documents at different stages of cleaning to the same wordlist, I can report on the relative improvements, even if not all errors are captured.
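In practice, that means running the same check over each document before and after a given cleaning step and reporting the difference. A minimal sketch, reusing the error_rate function sketched above (the file paths are hypothetical placeholders):

```python
# Compare the same page against the same wordlist at two stages of cleaning.
raw = open("corpus/original/page_0001.txt", encoding="utf-8").read()
cleaned = open("corpus/cleaned/page_0001.txt", encoding="utf-8").read()

print(f"before cleaning: {error_rate(raw):.3f}")
print(f"after cleaning:  {error_rate(cleaned):.3f}")
print(f"relative change: {error_rate(raw) - error_rate(cleaned):.3f}")
```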

All that to say … I did that and I have some pretty-pretty graphs from the first round of results! Here I am showing off two charts that I am using to get an overall picture of what is going on in the individual titles. I am also generating reports that focus in on potential problem areas (documents with high error rates, docs with low token counts, the most frequent errors, very long errors (often linked to decorative elements in the periodical or to words that were smushed together during OCR), etc.).
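To give a sense of what those reports involve, here is a hedged sketch of the kind of filtering behind them, assuming a per-document statistics table and an errors file with hypothetical names and columns (doc_id, num_tokens, error_rate):

```python
# Sketch of per-title problem-area reports (file names and columns are hypothetical).
import json
from collections import Counter
import pandas as pd

stats = pd.read_csv("reports/CUV_statistics.csv")        # doc_id, num_tokens, error_rate
doc_errors = json.load(open("reports/CUV_errors.json"))  # doc_id -> list of error tokens

high_error_docs = stats[stats["error_rate"] > 0.3].sort_values("error_rate", ascending=False)
low_token_docs = stats[stats["num_tokens"] < 100]

all_errors = Counter()
for tokens in doc_errors.values():
    all_errors.update(tokens)

most_frequent_errors = all_errors.most_common(25)
very_long_errors = [t for t in all_errors if len(t) > 20]  # often decorative borders or smushed words
```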

Overall, the initial results are really promising! The only change I made to the documents I am reporting on was to convert all of them to UTF-8 encoding (because Python. And Sanity.)3 The results indicate that the vast majority of documents have “bad token” rates distributed around 10%. As a point of comparison, the Mapping Texts project considers a final “good token” rate of over 80% to be excellent for their corpus. Rather than overwhelm you with charts for every title, I want to highlight a few of the most interesting and point out the types of questions that they raise for me.
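That one preprocessing step is also easy to sketch. The snippet below assumes the source files decode as Latin-1, which is a guess on my part for illustration; the directory names are placeholders too:

```python
# Re-encode every text file as UTF-8 before any analysis.
import os

src_dir, out_dir = "corpus/original", "corpus/utf8"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(src_dir):
    with open(os.path.join(src_dir, name), encoding="latin-1") as f:
        text = f.read()
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        f.write(text)
```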

The results

Notes on the visualizations:

  1. Bokeh, the charting library I am using, insists on formatting the values under 0.1 in scientific notation, which you will see in the tooltips on the scatterplot graphs. I do not know why, and I tried to change it, but that led down a whole rabbit hole of error messages and incomplete documentation, from which I emerged, through struggle and determination, with the values still appearing in scientific notation.
  2. I hid the x-axes on the scatterplots because they were illegible, but the x-axis is by document id, and sorted chronologically, with the earliest publication dates at the left of the chart.
  3. The visualizations have handy interactive capabilities, thanks to Bokeh. They also make the memory load a bit large. So if your browser hasn’t crashed, you can zoom in to see the details on the charts a bit more clearly.
  4. I forced all of the distribution graphs to an x-axis of 0-1 in order to make it easier to compare them visually. This resulted in some of the graphs being very tightly squashed, but you can zoom in to get a better picture of the different bins at play. The same goes for the scatterplot graphs, which are forced to 0-1 on the y-axis. (A rough sketch of this chart setup follows this list.)
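For anyone curious how those constraints translate into Bokeh, here is a minimal sketch under stated assumptions (stand-in data and hypothetical column names), not the actual visualization code linked in the gists below:

```python
# Sketch: a histogram of error rates (0-1 x-axis) and a chronological
# scatterplot of error rate by document (0-1 y-axis, x-axis labels hidden).
import numpy as np
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show

error_rates = np.random.beta(2, 18, size=500)  # stand-in data, one value per document

# Distribution of error rates, forced to a 0-1 x-axis for comparability.
hist, edges = np.histogram(error_rates, bins=50, range=(0, 1))
dist = figure(title="Distribution of error rates", x_range=(0, 1))
dist.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])

# Error rate per document, in chronological order, forced to a 0-1 y-axis.
source = ColumnDataSource(data=dict(doc_id=list(range(len(error_rates))), error_rate=error_rates))
scatter = figure(title="Error rate by document", y_range=(0, 1))
scatter.circle(x="doc_id", y="error_rate", source=source, size=3)
scatter.xaxis.visible = False  # document ids are illegible as tick labels
scatter.add_tools(HoverTool(tooltips=[("doc", "@doc_id"), ("error rate", "@error_rate")]))

show(column(dist, scatter))
```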

Our first victim: The Columbia Union Visitor, known hereafter as CUV.

Click through to the interactive visualization.

Overall, the distribution of error rates is beautiful. It is centered on .1, showing that the majority of the documents have an error (or “bad token”) rate of around 10%. But this graph does not give many clues about where the problem areas might lie.

When we look at the rates over time, we get a more nuanced picture. (Scroll down in the iframe to see the full graph.)

Click through to the interactive visualization.

While most of the results are clustered down between 0 and .2, as expected, there is a curious spike in error rates around 1906.

To explain this spike I can look at the types of errors reported for those years and examine the original PDF documents to see if there is a shift in formatting or document quality that might be at play.

Next: The Health Reformer, known hereafter as HR.

Click through to the interactive visualization.

Here we have an interesting situation where the bundle of documents with an error rate of “0” is abnormally high. Either the OCR for this publication is off-the-charts good, or there is something else going on. Since there is also a spike at “1”, I’m guessing the latter.

Click through to the interactive visualization.

Looking at the data over time corroborates that guess and reveals another interesting pattern. The visualization shows a visible shift at around 1886 in the occurrence and density of high-error documents, which, interestingly, coincides with the rise of documents with “0” errors. Something is happening here!

One last one (because otherwise your browser will never load). This pattern repeats, though at a smaller scale, in the Pacific Health Journal, or PHJ.

Click through to the interactive visualization.

Click through to the interactive visualization.

Notice the uptick in high-error documents at the 1901 mark, with intermittent spikes prior to that.

Not all of the titles have such drastic shifts in their “bad token” rates, and many have a consistent smattering of high error documents. But these are particularly illustrative of the types of information visible when examining the distribution of the data and the data over time.

All of this to say:

  1. the default OCR for these publications is pretty good, but …
  2. there are some interesting and potentially significant problem areas that would skew the data for particular time periods.

This data will serve as the baseline against which any computational corrections to the OCR will be compared. It will enable me to evaluate the effectiveness of the different corrections and to see when the changes have little effect on the overall error rates in the corpus.

And I will report back on what I find in the original scans and text files!

The code

I have uploaded the code I wrote to generate the corpus statistics as a gist at https://gist.github.com/jerielizabeth/97eeac0bf83365af7fd00bc6a0151554.

The code I wrote to create the visualizations is available as a gist at https://gist.github.com/jerielizabeth/c1ccd516681bf311630533be2bdb23d8.

Comments and suggestions are always welcome!

Acknowledgements

Thanks to Fred Gibbs for all the feedback as I have been working through how to evaluate and improve the OCR!


  1. Funny story. As I was writing this, the SDA released a story about how they’re filling in some of the missing years of the digitized periodicals. So, what you see online is probably a bit different from the collection I am currently working from.

  2. As a point of reference, the Mapping Texts project worked with a corpus of approximately 230,000 pages. My approach to identifying errors is similar to the one pursued by the Mapping Texts team. 

  3. The process by which I did this will also be documented at a future point. So much documentation.

Religion and Data: A Presentation for the American Academy of Religion 2016

On November 19th I had the opportunity to be part of the Digital Futures of Religious Studies panel at the American Academy of Religion in lovely San Antonio, Texas. It was wonderful to see the range of digital work being done in the study of religion represented and to participate in making the case for that work to be supported by the academy.

Panel members represented a wide range of digital work, from large funded projects to digital pedagogy. I focused my five minute slot on the question of data and the need to recognize the work of data creation and management as part of the scholarly production of digital projects.

tl;dr: I argued in my presentation that as we create scholarly works that rely on data in their analysis and presentation, we also need to include the production, management, and interpretation of data in our systems for the assessment and evaluation of scholarship.


Two questions to consider: What does data-rich scholarship entail? And how do we evaluate data-rich scholarship?

What is involved when working with data? And how do we evaluate scholarship that includes data-rich analyses? These are the two questions that I want to bring up in my 5 minutes.

My dissertation sits in the intersection of American religious history and the digital humanities.

Screenshot of dissertation homepage, entitled 'A Gospel of Health and Salvation: A Digital Study of Health and Religion in Seventh-day Adventism, 1843-1920'

I am studying the development of Seventh-day Adventism and the interconnections between their theology and their embrace of health reform. To explore these themes, I am using two main groups of sources: published periodicals of the denomination and the church yearbooks and statistical records, supplemented by archival materials.

Screenshot of the downloaded Yearbooks and Statistical Reports.

In recent years the Seventh-day Adventist church has been undertaking an impressive effort to make their past easily available.

Screenshots of the Adventist Archives and the Adventist Digital Library sites.

Many of the official publications and records are available at the denominational site (http://documents.adventistarchives.org/), and a growing collection of primary materials is being released at the new Adventist Digital Library (http://adventistdigitallibrary.org/).

This abundance of digital resources makes it possible to bring together the computational methods of the digital humanities with historical modes of inquiry to ask broad questions about the development of the denomination and the ways American discourses about health and religious salvation are inter-related.

But … one of the problems of working with data is that while it is easy to present data as clean and authoritative, data are incredibly messy and it is very easy to “do data” poorly.

Chart from tylervigen.com of an apparent correlation between per capita cheese consumption and the number of people who die from becoming tangled in bedsheets.

Just google “how to lie with statistics” (or charts, or visualizations). It happens all the time and it’s really easy to do, intentionally or not. And that is when data sets are ready for analysis.

Most data are not; even “born digital” data are messy and must be “cleaned” before they can be put to use. And when we are dealing with information compiled before computers, things get even more complicated.

Let me give you an example from my dissertation work.

First pages of the Adventist yearbooks from 1883 and 1920.

This first page was published in 1883, the first year the denomination dedicated a publication to “such portions of the proceedings of the General Conference, and such other matters, as the Committee may think best to insert therein.” The second is from 1920, the last year in my study.

You can begin to see how this gets complicated quickly. In 1883, the only data recorded for the General Conference were the office holders (President, VP, etc.), and then we go straight into one of the denominational organizations. In 1920, we have a whole column of contextual information on the scope of the denomination, a whole swath of new vice-presidents for the different regions, and so forth.

What happened over those 40 years? The denomination grew, rapidly, and the organization of the denomination matured; new regions were formed, split, and re-formed; the complexity of the bureaucracy increased; and the types of positions available and methods of record keeping changed.

Adding to that complexity of development over time, the information was entered inconsistently, both within a year and across multiple years. For example, George Butler, who was a major figure in the early SDA church, appears in the yearbooks as: G.I. Butler; George I. Butler; Geo. I. Butler; Elder Geo. I. Butler … and it keeps going.

Screenshot of the yearbook data in a Google spreadsheet.

Now what I want is data that looks like this, because I am interested in tracking who is recorded as having leadership positions within the denomination and how that changes over time. But this requires work disambiguating names, addressing changing job titles over time, and deciding which pieces of information, from the abundance of the yearbooks, are useful to my study.
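As a toy illustration of the name-disambiguation piece (the mapping below is a hypothetical hand-built lookup, not my actual cleaning script), the variant spellings get collapsed onto one canonical form before I track positions over time:

```python
# Hypothetical sketch: collapse variant yearbook spellings onto a canonical name.
NAME_VARIANTS = {
    "G.I. Butler": "George I. Butler",
    "Geo. I. Butler": "George I. Butler",
    "Elder Geo. I. Butler": "George I. Butler",
}

def canonical_name(raw_name):
    """Return the canonical form of a yearbook name, defaulting to the raw entry."""
    return NAME_VARIANTS.get(raw_name.strip(), raw_name.strip())

print(canonical_name("Elder Geo. I. Butler"))  # George I. Butler
```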

All of that results in having to make numerous interpretive decisions when working to translate this information into something I can work with computationally.

There are many ways to do this, some of which are better than others. My approach has been to use the data from 1920 to set my categories and project backwards. For example, in 1883 there were no regional conferences, so I made the decision to restrict the data I translated to the states that will eventually make up my four regions of study. That is an interpretive choice, and one that places constraints around my use of the data and the possibilities for reusing the data in other research contexts. These are choices that must be documented and explained.

Screenshot of the Mason Libraries resources on Data Management.

Fortunately, we are not the first to discover that data are messy and complicated to work with. There are many existing practices around data creation, management, and organization that we can draw on. Libraries are increasingly investing in data services and data management training, and I have greatly benefited from taking part in those conversations.

Quote from Laurie Maffly-Kipp, 'I am left to wonder more generally, though, about the tradeoff necessary as we integrate images and spatial representations into our verbal narratives of religion in American history. In the end, we are still limited by data that is partial, ambiguous, and clearly slanted toward things that can be counted and people that traditionally have been seen as significant.'

Working with data is a challenging, interpretive, and scholarly activity. While we may at times tell lies with our data and our visualizations, all scholarly interpretation is an abstraction, a selective focusing on elements to reveal something about the whole. What is important when working with data is to be transparent about the interpretive choices we are making along the way.

My hope is that as the use of data increases in the study of religion, we also create structures to recognize and evaluate the difficult intellectual work involved in the creation, management, and interpretation of that data.