[Update: Thanks to an excellent suggestion from Chris Sexton, the visualizations are initially image files, and you can click through to the interactive html versions. Browser overload no more!]

[Disclaimer: this page will probably take a while to load because there is a lot of JavaScript further down the page.]

I have so much explaining to do about how I got to this point, but …

I finally made something interesting and I want to share! So the story of how I got here will be told later, via flashbacks.

Who, What Now?

First, a little context. I am working with a series of periodicals which were published in the mid-19th to early-20th centuries and subsequently digitized and OCRed by the Seventh-day Adventist denomination. The periodicals are available to the public at http://documents.adventistarchives.org.¹

I will walk through my process for selecting periodicals at a later point, but I am currently working with 30 different titles, from which I have 13,231 individual issues that have been split into 196,761 pages. This is not “BIG” big data … but it is big enough. I did some math and if I were to look at each page for one minute a page, it would take me 409 days at 8 hours a day to look at the whole corpus.

In order to investigate the development of Seventh-day Adventist health beliefs and practices in relationship to their understanding of salvation, I am applying clustering algorithms to the corpus to identify those documents that contain discourses connected to my topics of interest. But, in order to have confidence in the results of those algorithms, I need an automated way to know more about the quality of the data that I am working with, especially since

I was not responsible for the creation of the data and
I cannot manually examine the corpus and still complete my dissertation on time (or maintain my sanity).

There are different strategies when it comes to dealing with messy OCR and determining how much it can be ignored, some of which depends on the questions being asked and the types of algorithms in use. However, there has been little discussion about ways to ascertain the quality of the documents one is working with and the processes for improving it, particularly when you do not have a “ground truth” against which to compare. The approach I am pursuing is to take each token in the document and compare that token to a large wordlist of “verified” tokens.² More on the creation of that wordlist to come. (Spoilers: I read the README for the NLTK words list … and wept.)

This approach is not in any means fool-proof. It is really bad at dealing with

errors that happened to create new “verified” tokens,
obscure words or figures referenced by the SDA authors (despite my best efforts to capture them),
intentional mis-spellings (there are a few editors who … really could not spell),
names of SDA community members,

to name a few. This is to say, there will undoubtedly be a number of correct OCR transcriptions that are labeled as errors and a number of incorrect OCR transcriptions that will not be labeled as errors. If I was attempting to compute absolute error rates and discarding error words, my approach would not be sufficient.

However, that is not the project at hand. My goal is to get a bird’s eye view of the quality of the OCR I am working with and evaluate the success of different strategies for correcting common OCR mistakes. I am also anticipating that different strategies will be needed for different titles, and possibly over different time periods, as the layout and print methods greatly influence the success of the OCR engines. By comparing the documents at different stages of cleaning to the same wordlist, I can report on the relative improvements, even if not all errors are captured.

All that to say … I did that and I have some pretty-pretty graphs from the first round of results! Here I am showing off two charts that I am using to get an overall picture of what is going on in the individual titles. I am also generating reports that focus in on potential problem areas (documents with high errors rates, docs with low token counts, the most frequent errors, very long errors (often linked to decorative elements in the periodical or to words that were smushed together during OCR), etc.).

Overall, the initial results are really promising! The only change I have made to the documents I am reporting on was to convert all of them to UTF-8 encoding (because Python. And Sanity.)³ The results indicate that the vast majority of documents have a “bad token” rate distributed around 10%. As a point of comparison, the Mapping Texts project considers a final “good token” rate of over 80% to be excellent for their corpus. Rather than overwhelm you with all of them, I want to highlight a few of the most interesting and point out the types of questions that they raise for me.

The results

Notes on the visualizations:

Bokeh, the charting library I am using, insists on formatting the values under “0.1” in scientific notation, which you will see in the tooltips on the scatterplot graphs. I do not know why, and I tried to change it, but that led down a whole rabbit hole of error messages and incomplete documentation from which I have emerged only through struggle and determination with the values still appearing in scientific notation.
I hid the x-axes on the scatterplots because they were illegible, but the x-axis is by document id, and sorted chronologically, with the earliest publication dates at the left of the chart.
The visualization have handy interactive capabilities, thanks to Bokeh. They also make the memory load a bit large. So if your browser hasn’t crashed, you can zoom in to see the details on the charts a bit more clearly.
I forced all of the distribution graphs to an x-axis of 0-1 in order to make it easier to visually compare between them. This resulted in some of the graphs being very tightly squashed, but you can zoom in to get a better picture of the different bins at play. Similarly for the scatterplot graphs, which are forced to 0-1 on the y-axis.

Our first victim: The Columbia Union Visitor, known hereafter as CUV.