I can haz charts!

[Update: Thanks to an excellent suggestion from Chris Sexton, the visualizations are initially image files, and you can click through to the interactive html versions. Browser overload no more!]

[Disclaimer: this page will probably take a while to load because there is a lot of JavaScript further down the page.]

I have so much explaining to do about how I got to this point, but …

I finally made something interesting and I want to share! So the story of how I got here will be told later, via flashbacks.

Who, What Now?

First, a little context. I am working with a series of periodicals that were published from the mid-19th to the early-20th century and subsequently digitized and OCRed by the Seventh-day Adventist denomination. The periodicals are available to the public at http://documents.adventistarchives.org. [1]

I will walk through my process for selecting periodicals at a later point, but I am currently working with 30 different titles, from which I have 13,231 individual issues that have been split into 196,761 pages. This is not “BIG” big data … but it is big enough. I did some math and if I were to look at each page for one minute a page, it would take me 409 days at 8 hours a day to look at the whole corpus.
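For anyone who wants to check that math, here is a minimal sketch of the arithmetic. The page count, the one minute per page, and the eight-hour day all come from the paragraph above; the function name `reading_days` is just a label for this example.

```python
PAGES = 196_761        # page count reported above
MINUTES_PER_PAGE = 1   # reading speed assumed above
HOURS_PER_DAY = 8      # length of a working day assumed above


def reading_days(pages, minutes_per_page=1, hours_per_day=8):
    """Return the number of working days needed to look at every page."""
    total_minutes = pages * minutes_per_page
    return total_minutes / 60 / hours_per_day


days = reading_days(PAGES, MINUTES_PER_PAGE, HOURS_PER_DAY)
print(f"{days:.1f} working days")
```

The result lands a little under 410 days, so 409 counts only the full days.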

In order to investigate the development of Seventh-day Adventist health beliefs and practices in relationship to their understanding of salvation, I am applying clustering algorithms to the corpus to identify those documents that contain discourses connected to my topics of interest. But, in order to have confidence in the results of those algorithms, I need an automated way to know more about the quality of the data that I am working with, especially since

  1. I was not responsible for the creation of the data and
  2. I cannot manually examine the corpus and still complete my dissertation on time (or maintain my sanity).

There are different strategies when it comes to dealing with messy OCR and determining how much of it can be ignored, some of which depend on the questions being asked and the types of algorithms in use. However, there has been little discussion about ways to ascertain the quality of the documents one is working with and the processes for improving it, particularly when you do not have a “ground truth” against which to compare. The approach I am pursuing is to take each token in the document and compare that token to a large wordlist of “verified” tokens. [2] More on the creation of that wordlist to come. (Spoilers: I read the README for the NLTK words list … and wept.)
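To make the idea concrete, here is a minimal sketch of a token-versus-wordlist check. It is an illustration under stated assumptions rather than the code in the gists: the tokenizer is deliberately naive (lowercase runs of a-z only), the tiny `wordlist` and the sample `page` are made up, and the names `tokenize` and `bad_token_rate` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bad_token_rate(text, wordlist):
    """Return (rate, errors), where errors are the tokens missing from wordlist.

    The rate is None for a text with no tokens, so an empty page is not
    reported as having a perfect score.
    """
    tokens = tokenize(text)
    if not tokens:
        return None, []
    errors = [token for token in tokens if token not in wordlist]
    return len(errors) / len(tokens), errors


# Made-up wordlist and page text, for illustration only.
wordlist = {"the", "health", "reformer", "is", "a", "monthly", "journal"}
page = "The Hcalth Reformer is a monthlv journal"

rate, errors = bad_token_rate(page, wordlist)
print(rate, errors)
```

Keeping the wordlist in a set keeps each lookup cheap, which matters once the same check runs across a couple hundred thousand pages.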

This approach is not by any means foolproof. It is really bad at dealing with

  • errors that happened to create new “verified” tokens,
  • obscure words or figures referenced by the SDA authors (despite my best efforts to capture them),
  • intentional mis-spellings (there are a few editors who … really could not spell),
  • names of SDA community members,

to name a few. This is to say, there will undoubtedly be a number of correct OCR transcriptions that are labeled as errors and a number of incorrect OCR transcriptions that will not be labeled as errors. If I were attempting to compute absolute error rates or to discard error words, my approach would not be sufficient.

However, that is not the project at hand. My goal is to get a bird’s eye view of the quality of the OCR I am working with and evaluate the success of different strategies for correcting common OCR mistakes. I am also anticipating that different strategies will be needed for different titles, and possibly over different time periods, as the layout and print methods greatly influence the success of the OCR engines. By comparing the documents at different stages of cleaning to the same wordlist, I can report on the relative improvements, even if not all errors are captured.
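As a rough sketch of what "relative improvement" means here, the snippet below scores the same text before and after one cleaning step against the same wordlist. Everything in it is an assumption made for illustration: the cleaning step (`rejoin_hyphenated`, which rejoins words split across a line break) is a hypothetical example of a correction, and the wordlist and sample text are made up.

```python
import re


def bad_rate(text, wordlist):
    """Share of tokens not found in wordlist, or None if the text has no tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return None
    return sum(token not in wordlist for token in tokens) / len(tokens)


def rejoin_hyphenated(text):
    """Hypothetical cleaning step: rejoin words that were split across a line break."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


# Made-up wordlist and text, for illustration only.
wordlist = {"the", "sanitarium", "opened", "in", "battle", "creek"}
raw = "The sani-\ntarium opened in Battle Creek"

before = bad_rate(raw, wordlist)
after = bad_rate(rejoin_hyphenated(raw), wordlist)
improvement = before - after
print(before, after, improvement)
```

Because both scores come from the same wordlist, the difference between them is meaningful even though neither number is a true error rate.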

All that to say … I did that and I have some pretty-pretty graphs from the first round of results! Here I am showing off two charts that I am using to get an overall picture of what is going on in the individual titles. I am also generating reports that focus in on potential problem areas: documents with high error rates, documents with low token counts, the most frequent errors, very long errors (often linked to decorative elements in the periodical or to words that were smushed together during OCR), and so on.
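The problem-area reports can be sketched along these lines. This is a minimal illustration, not the code in the gists: the per-document structure (a token count plus a list of error tokens), the document ids, and every threshold (`max_rate`, `min_tokens`, `long_error`) are assumptions chosen for the example.

```python
from collections import Counter


def build_reports(docs, max_rate=0.2, min_tokens=50, long_error=15, top_n=2):
    """Collect simple problem-area reports from per-document error statistics."""
    high_error, low_tokens = [], []
    counts = Counter()
    for doc_id, stats in sorted(docs.items()):
        n_tokens, errors = stats["tokens"], stats["errors"]
        counts.update(errors)
        if n_tokens < min_tokens:
            low_tokens.append(doc_id)
        if n_tokens and len(errors) / n_tokens > max_rate:
            high_error.append(doc_id)
    long_errors = sorted(error for error in counts if len(error) >= long_error)
    return {
        "high_error_docs": high_error,
        "low_token_docs": low_tokens,
        "most_frequent_errors": counts.most_common(top_n),
        "long_errors": long_errors,
    }


# Made-up statistics for three hypothetical documents.
docs = {
    "doc-001": {"tokens": 400, "errors": ["tbe", "tbe", "hcalth"]},
    "doc-002": {"tokens": 20, "errors": ["tbe", "ofthesanitariumandthe"]},
    "doc-003": {"tokens": 100, "errors": ["iii"] * 30},
}

reports = build_reports(docs)
for name, report in reports.items():
    print(name, report)
```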

Overall, the initial results are really promising! The only change I have made to the documents I am reporting on was to convert all of them to UTF-8 encoding (because Python. And Sanity.) [3] The results indicate that the vast majority of documents have a “bad token” rate distributed around 10%. As a point of comparison, the Mapping Texts project considers a final “good token” rate of over 80% to be excellent for their corpus. Rather than overwhelm you with all of the charts, I want to highlight a few of the most interesting and point out the types of questions that they raise for me.

The results

Notes on the visualizations:

  1. Bokeh, the charting library I am using, insists on formatting the values under “0.1” in scientific notation, which you will see in the tooltips on the scatterplot graphs. I do not know why, and I tried to change it, but that led down a whole rabbit hole of error messages and incomplete documentation from which I have emerged only through struggle and determination with the values still appearing in scientific notation.
  2. I hid the x-axes on the scatterplots because they were illegible, but the x-axis is by document id, and sorted chronologically, with the earliest publication dates at the left of the chart.
  3. The visualizations have handy interactive capabilities, thanks to Bokeh. They also make the memory load a bit large. So if your browser hasn’t crashed, you can zoom in to see the details on the charts a bit more clearly.
  4. I forced all of the distribution graphs to an x-axis of 0-1 in order to make it easier to visually compare between them. This resulted in some of the graphs being very tightly squashed, but you can zoom in to get a better picture of the different bins at play. Similarly for the scatterplot graphs, which are forced to 0-1 on the y-axis.
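The point of note 4 is that every title gets binned over the same 0-1 range, so the shapes can be compared directly. Here is a minimal sketch of that idea in plain Python rather than Bokeh; the bin count, the function name `bin_rates`, and the sample rates are assumptions made for illustration.

```python
def bin_rates(rates, n_bins=10):
    """Count error rates in equal-width bins spanning the fixed range 0-1."""
    counts = [0] * n_bins
    for rate in rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate out of range: {rate}")
        # A rate of exactly 1.0 belongs in the last bin, not one past it.
        index = min(int(rate * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up error rates for one hypothetical title.
sample_rates = [0.02, 0.08, 0.11, 0.12, 0.97, 1.0]
histogram = bin_rates(sample_rates)
print(histogram)
```

With shared bin edges, a title whose rates all sit near 0.1 fills only one or two bins, which is the "tightly squashed" effect described above.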

Our first victim: The Columbia Union Visitor, known hereafter as CUV.

Click through to the interactive visualization.

Overall, the distribution of error rates is beautiful. It is centered on .1, showing that the majority of the documents have an error (or “bad token”) rate of around 10%. But this graph does not give many clues about where the problem areas might lie.

When we look at the rates over time, we get a more nuanced picture. (Scroll down in the iframe to see the full graph.)

Click through to the interactive visualization.

While most of the results are clustered down between 0 and .2, as expected, there is a curious spike in error rates around 1906.

To explain this spike I can look at the types of errors reported for those years and examine the original PDF documents to see if there is a shift in formatting or document quality that might be at play.
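One way to narrow down which years deserve a closer look is to summarize the error rates per year and flag the years that stand out. The sketch below assumes each document has been reduced to a (year, error rate) pair; the sample values, the 0.2 threshold, and the function names are all made up for illustration.

```python
from collections import defaultdict
from statistics import median


def yearly_median_rates(records):
    """Group (year, rate) pairs by year and return the median rate for each year."""
    by_year = defaultdict(list)
    for year, rate in records:
        by_year[year].append(rate)
    return {year: median(rates) for year, rates in sorted(by_year.items())}


def flag_years(yearly, threshold=0.2):
    """Return the years whose median rate is above the threshold."""
    return [year for year, rate in yearly.items() if rate > threshold]


# Made-up (year, error rate) pairs, for illustration only.
records = [
    (1905, 0.08), (1905, 0.10), (1905, 0.12),
    (1906, 0.10), (1906, 0.35), (1906, 0.41),
    (1907, 0.09), (1907, 0.11), (1907, 0.13),
]

yearly = yearly_median_rates(records)
flagged = flag_years(yearly)
print(yearly, flagged)
```

The median is used here so that a single terrible document does not flag an otherwise ordinary year; a mean would be the more sensitive choice.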

Next: The Health Reformer, known hereafter as HR.

Click through to the interactive visualization.

Here we have an interesting situation where the bundle of documents with an error rate of “0” is abnormally high. Either the OCR for this publication is off-the-charts good, or there is something else going on. Since there is also a spike at “1”, I’m guessing the latter.

Click through to the interactive visualization.

Looking at the data over time corroborates that guess and reveals another interesting pattern. The visualization shows a visible shift at around 1886 in the occurrence and density of high-error documents, which, interestingly, coincides with the rise of documents with “0” errors. Something is happening here!

One last one (because otherwise your browser will never load). This pattern repeats, though at a smaller scale, in the Pacific Health Journal, or PHJ.

Click through to the interactive visualization.

Click through to the interactive visualization.

Notice the uptick in high-error documents at the 1901 mark, with intermittent spikes prior to that.

Not all of the titles have such drastic shifts in their “bad token” rates, and many have a consistent smattering of high-error documents. But these are particularly illustrative of the types of information visible when examining the distribution of the data and the data over time.

All of this to say:

  1. the default OCR for these publications is pretty good, but …
  2. there are some interesting and potentially significant problem areas that would skew the data for particular time periods.

This data will serve as the baseline against which any computational corrections to the OCR will be compared. It will enable me to evaluate the effectiveness of the different corrections and to see when the changes have little effect on the overall error rates in the corpus.

And I will report back on what I find in the original scans and text files!

The code

I have uploaded the code I wrote to generate the corpus statistics as a gist at https://gist.github.com/jerielizabeth/97eeac0bf83365af7fd00bc6a0151554.

The code I wrote to create the visualizations is available as a gist at https://gist.github.com/jerielizabeth/c1ccd516681bf311630533be2bdb23d8.

Comments and suggestions are always welcome!


And thanks to Fred Gibbs for all the feedback as I have been working through how to evaluate and improve the OCR!

  1. Funny story. As I was writing this, the SDA released a story about how they’re filling in some of the missing years of the digitized periodicals. So, what you see online is probably a bit different from the collection I am currently working from.

  2. As a point of reference, the Mapping Texts project worked with a corpus of approximately 230,000 pages. My approach to identifying errors is similar to the one pursued by the Mapping Texts team.

  3. The process by which I did that is also to be documented at a future point. So much documentation.

Religion and Data: A Presentation for the American Academy of Religion 2016

On November 19th I had the opportunity to be part of the Digital Futures of Religious Studies panel at the American Academy of Religion in lovely San Antonio, Texas. It was wonderful to see the range of digital work being done in the study of religion represented and to participate in making the case for that work to be supported by the academy.

Panel members represented a wide range of digital work, from large funded projects to digital pedagogy. I focused my five-minute slot on the question of data and the need to recognize the work of data creation and management as part of the scholarly production of digital projects.

tl;dr: I argued in my presentation that as we create scholarly works that rely on data in their analysis and presentation, we also need to include the production, management, and interpretation of data in our systems for the assessment and evaluation of scholarship.

Two questions to consider: what does data-rich scholarship entail? and how do we evaluate data-rich scholarship?

What is involved when working with data? And how do we evaluate scholarship that includes data-rich analyses? These are the two questions that I want to bring up in my 5 minutes.

My dissertation sits in the intersection of American religious history and the digital humanities.

Screenshot of dissertation homepage, entitled 'A Gospel of Health and Salvation: A Digital Study of Health and Religion in Seventh-day Adventism, 1843-1920'

I am studying the development of Seventh-day Adventism and the interconnections between their theology and their embrace of health reform. To explore these themes, I am using two main groups of sources: published periodicals of the denomination and the church yearbooks and statistical records, supplemented by archival materials.

Screenshot of the downloaded Yearbooks and Statistical Reports.

In recent years the Seventh-day Adventist church has been undertaking an impressive effort to make their past easily available.

Screenshots of the Adventist Archives and the Adventist Digital Library sites.

Many of the official publications and records are available at the denominational site (http://documents.adventistarchives.org/) and a growing collection of primary materials is being released at the new Adventist Digital Library (http://adventistdigitallibrary.org/).

This abundance of digital resources makes it possible to bring together the computational methods of the digital humanities with historical modes of inquiry to ask broad questions about the development of the denomination and the ways American discourses about health and religious salvation are inter-related.

But … one of the problems of working with data is that while it is easy to present data as clean and authoritative, data are incredibly messy and it is very easy to “do data” poorly.

Chart from tylervigen.com of an apparent correlation between per capita cheese consumption and the number of people who die from becoming tangled in bedsheets.

Just google “how to lie with statistics” (or charts, or visualizations). It happens all the time and it’s really easy to do, intentionally or not. And that is when data sets are ready for analysis.

Most data are not; even “born digital” data are messy and must be “cleaned” before they can be put to use. And when we are dealing with information compiled before computers, things get even more complicated.

Let me give you an example from my dissertation work.

First page of the Adventist yearbooks from 1883 and 1920.

This first page was published in 1883, the first year the denomination dedicated a publication to “such portions of the proceedings of the General Conference, and such other matters, as the Committee may think best to insert therein.” The second is from 1920, the last year in my study.

You can begin to see how this gets complicated quickly. In 1883, the only data recorded for the General Conference was office holders (President, VP, etc.) and we go straight into one of the denominational organizations. In 1920, we have a whole column of contextual information on the scope of the denomination, a whole swath of new vice-presidents for the different regions, and so forth.

What happened over those 40 years? The denomination grew, rapidly, and the organization of the denomination matured; new regions were formed, split, and re-formed; the complexity of the bureaucracy increased; and the types of positions available and methods of record keeping changed.

Adding to that complexity of development over time, the information was entered inconsistently, both within a year and across multiple years. For example, George Butler, who was a major figure in the early SDA church, appears in the yearbooks as G.I. Butler; George I. Butler; Geo. I. Butler; Elder Geo. I. Butler … it keeps going.

Screenshot of the yearbook data in a Google spreadsheet.

Now what I want is data that looks like this, because I am interested in tracking who is recorded as having leadership positions within the denomination and how that changes over time. But this requires work disambiguating names, addressing changing job titles over time, and deciding which pieces of information, from the abundance of the yearbooks, are useful to my study.
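A first pass at the name problem can be sketched in code, with the loud caveat that this is an illustration and not the method used for the dissertation data. It assumes names are written as given names or initials followed by a surname, and the `TITLES` set and the `name_key` function are hypothetical.

```python
import re

# Hypothetical set of honorifics to ignore; a real list would be longer.
TITLES = {"elder", "eld"}


def name_key(raw):
    """Reduce a name to (surname, initials) so that variant spellings group together."""
    parts = [part for part in re.split(r"[\s.]+", raw.strip().lower()) if part]
    parts = [part for part in parts if part not in TITLES]
    if not parts:
        return None
    surname = parts[-1]
    initials = tuple(part[0] for part in parts[:-1])
    return (surname, initials)


variants = ["G.I. Butler", "George I. Butler", "Geo. I. Butler", "Elder Geo. I. Butler"]
keys = {name_key(variant) for variant in variants}
print(keys)
```

All four variants collapse to a single key. The same shortcut would also merge two different people who share a surname and initials, so each grouping is a candidate for review rather than a decision, which is exactly the kind of interpretive choice that needs documenting.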

All of that results in having to make numerous interpretive decisions when working to translate this information into something I can work with computationally.

There are many ways to do this, some of which are better than others. My approach has been to use the data from 1920 to set my categories and project backwards. For example, in 1883 there were no regional conferences, so I made the decision to restrict the data I translated to the states that will eventually make up my four regions of study. That is an interpretive choice, and one that places constraints around my use of the data and the possibilities for reusing the data in other research contexts. These are choices that must be documented and explained.

Screenshot of the Mason Libraries resources on Data Management.

Fortunately, we are not the first to discover that data are messy and complicated to work with. There are many existing practices around data creation, management, and organization that we can draw on. Libraries are increasingly investing in data services and data management training, and I have greatly benefited from taking part in those conversations.

Quote from Laurie Maffly-Kipp, 'I am left to wonder more generally, though, about the tradeoff necessary as we integrate images and spatial representations into our verbal narratives of religion in American history. In the end, we are still limited by data that is partial, ambiguous, and clearly slanted toward things that can be counted and people that traditionally have been seen as significant.'

Working with data is a challenging, interpretive, and scholarly activity. While we may at times tell lies with our data and our visualizations, all scholarly interpretation is an abstraction, a selective focusing on elements to reveal something about the whole. What is important when working with data is to be transparent about the interpretive choices we are making along the way.

My hope is that as the use of data increases in the study of religion, we also create structures to recognize and evaluate the difficult intellectual work involved in the creation, management, and interpretation of that data.

Selecting a Digital Workflow

Disclaimer: workflows are rather infinitely customizable to fit project goals, intellectual patterns, and individual quirks. I am writing mine up because it has been useful for me to see how others are solving such problems. I invite you to take what is useful, discard what is not, and share what you have found.

Writing a dissertation is, among many things, an exercise in data management. Primary materials and their accompanying notes must be organized. Secondary materials and their notes must be organized. One’s own writing must be organized. And, when doing computational analysis, the code, processes, and results must be organized. This, I can attest, can be a daunting mountain of things to organize (and I like to organize).

There are many available solutions for wrangling this data, some of which cost money, some of which are free, and all of which cater to different workflows. For myself, I have found that my reliance on computational methods and my goal of HTML as the final output mean that many of the solutions my colleagues use with great success (such as Scrivener and Evernote) just do not work well for me. In addition, my own restlessness with interfaces has made interoperability a high priority - I need to be able to use many tools on the same document without having to worry about exporting to different formats in between. This has led to my first constraint: my work needs to be done in a plain text format that, eventually, will easily convert to HTML.

The second problem is version control. On a number of projects in the past, written in Word or Pages, I have created many different named versions along the way. When getting feedback, new versions were created and results had to be reincorporated back into the authoritative version. This is all, well, very inefficient. And, while many of these sins can be overcome when writing text for humans, the universe is less forgiving when working with code. And so, constraint number two: my work needs to be under version control and duplication should be limited.

The third problem is one of reproducibility and computational methods. The experiments that I run on my data need to be clearly documented and reproducible. While this is generally good practice, it is especially important for computational research. Reproducible and clearly documented work is important for the consistency of my own work over time, and to enable my scholarship to be interrogated and used by others. And so, the third constraint: my work needs to be done in such a way that my methods are transparent, consistent, and reproducible.

My solution to the first constraint is Markdown. All of my writing is being done in Markdown, including this blogpost! Markdown enables me to designate the information hierarchies and easily convert to HTML, but also to work on files in multiple interfaces. Currently, I am writing in Ulysses, but I also have used Notebooks, iA Writer, and the Sublime Text editor. What I love about Ulysses is that I can have one writing interface, but draw from multiple folders, most of which are tied to git repositories and/or Dropbox. Ulysses has a handy file browser window, so it is easy to move between files and see them all as part of the larger project. But regardless of the application being used, it is the same file that I am working with across multiple platforms, applications, and devices.

My solution to the second constraint is Github and the Github Student Developer Pack. Git and Github enable version control, and multiple branches, and all sorts of powerful ways to keep my computational work organized. In addition, Github’s student pack comes with 5 private repos (private while you are a student), along with a number of other cool perks. While I am a fan of working in the open and plan to open up my work over time, I have also come to value being able to make mistakes in the quiet of my own digital world. Dissertations are hard, and intimidating, and there is something nice about not having to get it right from the very beginning while still being able to share with advisors and select colleagues.

But, Github hasn’t proved the best for the writing side of things. The default wiki is limited in its functionality, especially in terms of enabling commenting, and repos are not really made for reading. My current version control solution for the writing side of the project is Penflip, billed as “Github for Writing”. Writing with Penflip can be done through the web interface, or by cloning down the files (which are in Markdown) and working locally. As such, the platform conforms to the Markdown and the “one version” requirements. The platform is free if your writing is public, and $10/month for private repos, which is what I am trying out. I am using Penflip a bit like a wiki, with general pages for notes on different primary documents and overarching pages that describe lines of inquiry and link the associated note and code files together. [1] This is the central core of my workflow - the place that ties together notes, code, and analysis and lays the groundwork for the final product. So far it is working well for the writing, though using it for distribution and commenting has been a little bumpy as the interface needs a little more refining.

And the code. The code was actually the first of these problems I solved. Of the programming languages to learn, I have found Python to be the most comfortable to work in. It has great libraries for text analysis, and comes with the benefit of Jupyter Notebooks (previously IPython Notebooks). Working in Jupyter Notebooks allows me to integrate descriptive text in Markdown with code and the resulting visualizations. The resulting document is plain text, can be versioned, and displays in HTML for sharing. The platform conforms to my first two requirements (Markdown and version control) while also making my methods transparent and reproducible. I run the Jupyter server locally, and am sharing pieces that have been successfully executed via nbviewer - for example, I recently released the code I used to create my pilot sample of one of my primary corpuses. I am also abstracting code that is run more than once into a local library of functions. Having these functions enables me to reliably follow the same processes over time, thus creating results that are reproducible and comparable.

And that is my current tooling for organizing and writing both my digital experiments and my surrounding explanations and narrative.

  1. I somehow missed the “research wiki” bus a while back, but thanks to a recent comment made by Abby Mullen, I am a happy convert to the whole concept.