Ways to Compute Topics over Time, Part 1

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the first in a series of posts that constitute a “lit review” of sorts, documenting the range of methods scholars are using to compute the distribution of topics over time.

Graphs of topic prevalence over time are among the most ubiquitous visualizations in digital humanities discussions of topic modeling. They are used as a mechanism for identifying spikes in discourse and for depicting the relationship between the various discourses in a corpus.

Topic prevalence over time is not, however, a measure returned by the standard modeling tools such as MALLET or Gensim. Instead, it is computed after the fact by combining the model data with external metadata and aggregating the model results. And, as it turns out, there are a number of ways that the data can be aggregated and displayed.

In this series of notebooks, I am looking at 4 different strategies for computing topic significance over time. These strategies are:

  • Average of topic weights per year
  • Smoothing or regression analysis
  • Proportion of total weights per year
  • Prevalence of the top topic per year

To identify the different strategies that others have used to compute topic distribution over time, I have gone digging into the code released with various blog posts and topic model browsers. What I have recreated here is my current understanding of the strategies being deployed. I looked at the following sources for examples of ways others have computed topics over time:

I would love to hear if I have misunderstood the calculations taking place or about additional strategies or implementation ideas.

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

I created this model from periodicals produced by the Seventh-day Adventist denomination between 1859 and 1875 using Gensim. I split each periodical into individual pages, so that I can better capture the range of topics in each periodical issue and identify individual pages of interest. In the process of corpus creation, I filtered the tokens on each page, excluding both low frequency (words occurring in fewer than 20 documents) and high frequency (words occurring in more than 30% of all documents) words. In other words, don’t read anything into the fact that “God” is missing from the model. I have modeled the most distinctive words for each page.
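
For readers unfamiliar with this kind of filtering, the sketch below shows how such frequency thresholds are commonly expressed with a Gensim dictionary. This is not my corpus-building code, and tokenized_pages is only a stand-in for the real list of tokenized pages.

# A sketch of frequency filtering with a Gensim dictionary; not the actual corpus-building code.
from gensim.corpora import Dictionary

# `tokenized_pages` stands in for the real list of token lists, one list per page.
tokenized_pages = [['health', 'reform', 'institute'], ['gospel', 'health', 'salvation']]

example_dictionary = Dictionary(tokenized_pages)
# Drop tokens that appear in fewer than 20 documents or in more than 30% of all documents.
example_dictionary.filter_extremes(no_below=20, no_above=0.3)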

For historical context: the early Seventh-day Adventists formally organized as a denomination in 1863. They launched their health reform efforts in 1866 with the establishment of the Western Health Reform Institute (later renamed the Battle Creek Sanitarium) and the start of a health-focused publication, The Health Reformer.

# Load the necessary libraries
import gensim # Note: I am running 1.0.1
from ggplot import *
import json
import logging
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import pyLDAvis.gensim
import seaborn as sns
import warnings
# Enable in-notebook visualizations
%matplotlib inline
pyLDAvis.enable_notebook()
# Temporary fix for persistent warnings of an api change between pandas and seaborn.
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 10
base_dir = ""
period = '1859-to-1875'
directory = "historical_periods"
lda_model = gensim.models.LdaModel.load(os.path.join(base_dir, 'models', directory, '{}.model'.format(period)))
corpus = gensim.corpora.MmCorpus(os.path.join(base_dir, 'corpora', directory, "{}.mm".format(period)))
dictionary = gensim.corpora.Dictionary.load(os.path.join(base_dir, 'corpora', directory, "{}.dict".format(period)))

Visualize the model

To better understand the model that we are analyzing, we can first visualize the whole model using pyLDAvis. This provides us a picture of the overall topic weights and word distributions across all of the documents.

Note: the topic numbers here indicate in descending order the prevalence of the topic in the whole corpus. They are not the same as the topic ids used in the rest of this notebook.

pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

Create the dataframes

While pyLDAvis gives us a good overall picture of the model data, we need to further manipulate the data in order to track change over time across the different publication years.

I preprocessed the model and exported information about the weights per document and topic labels into CSV files for ease of compiling. I will release the code I used to export that information in a later notebook.
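
Purely as a sketch of where those CSV values come from (not the export code itself), per-document topic weights can be pulled from a Gensim model along the following lines; the output filename here is only for illustration.

# A sketch only, not the export code used for this project: pull the per-document
# topic weights from the loaded model and write them to a long-format CSV similar
# to the '{}_dtm.csv' file read in below.
rows = []
for index_pos, bow in enumerate(corpus):
    for topic_id, weight in lda_model.get_document_topics(bow, minimum_probability=0.01):
        rows.append({'index_pos': index_pos, 'topic_id': topic_id, 'topic_weight': weight})

pd.DataFrame(rows).to_csv('example_dtm.csv', index=False)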

metadata_filename = os.path.join(base_dir,'2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe( index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following Andrew Goldstone’s process in his topic model browser, is to normalize the weights for each document so that they sum to 1.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
dtm = (dtm.T/dtm.sum(axis=1)).T

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']
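
For reference, a sketch of the alpha-smoothed variant might look like the following. I do not use it in the rest of this notebook, and it assumes that lda_model.alpha holds one hyperparameter value per topic (which Gensim exposes on a trained model) and that every topic appears as a column in the pivoted dataframe.

# A sketch of Goldstone's alpha-smoothing, not used below. Re-pivot to get the
# un-normalized weights, add the per-topic alpha values, then renormalize.
dtm_raw = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)
dtm_smoothed = dtm_raw + lda_model.alpha
dtm_smoothed = (dtm_smoothed.T / dtm_smoothed.sum(axis=1)).T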

With the document/topic dataframe normalized, we can compile our full dataframe.

df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the docs were streamed to link the model data back to the source files. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.

Summary Statistics

We can get an overall view of our data by computing some basic statistics.

# Dropping the zero values so that we see the larger patterns in the data
df_sum = df[df['norm_topic_weight'] != 0]
print("Max: {}".format(df_sum['norm_topic_weight'].max()))
print("Min: {}".format(df_sum['norm_topic_weight'].min()))
print("Average: {}".format(df_sum['norm_topic_weight'].mean()))
print("Median: {}".format(df_sum['norm_topic_weight'].median()))
print("Most frequent value: {}".format(df_sum['norm_topic_weight'].round(3).value_counts().idxmax()))
Max: 1.0
Min: 0.010027314043872615
Average: 0.09222067775487706
Median: 0.05201414311615885
Most frequent value: 0.011

The topics range from 100% of the tokens in a document to 1% (excluding the zero values), with an average at 9% and a median value of 5%. The most frequent value is near 1%, which indicates that the data predominantly describes topics that have a minor presence in the documents.

One way to visualize the overall structure of the data is to look at all of the topic weights for each topic id using a scatterplot.

p = sns.factorplot(x="year", y='norm_topic_weight', hue="topic_id", col='topic_words', col_wrap=2, 
                   kind='strip', size=2.5, aspect=2, jitter=1, data=df_sum)
p.fig.subplots_adjust(top=0.95)
p.fig.suptitle("Scatterplot of Normalized Topic Weights, Split by Topic; All Weights.", fontsize=16)
[Figure: Scatterplot of Normalized Topic Weights, Split by Topic; All Weights]

While fun, this is too many plots to be useful, so I will move forward with a subset of the topics. However, if you decide to experiment with the notebook, you can switch back to the full dataframe.

dfs = df[(df['topic_id'] >= 15) & (df['topic_id'] <=20)]

Average of topic weights per year

The first strategy I found for aggregating topic weights is to average the topic weights per year.

The average topic weight is computed by adding all of the weights for a given topic in a time period and dividing by the total number of documents in that time period. This gives us the average weight of the topic over all of the documents published in that period.

One complication I encountered while working through this is that, while MALLET reports a weight (be it ever so small) for every topic in every document, Gensim by default drops topic information for a document if the weight is under 0.01. This presents a wrinkle for computing averages from Gensim-produced data, as the data needs to be expanded before we can compute the average weight of a topic over the corpus (rather than the average over its significant appearances). To do that, I added a 0 value for the missing topic weights during the normalizing steps above.
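
Because every document now carries a row for every topic (thanks to that zero-filling), the yearly average can also be computed directly with a pandas groupby. The snippet below is just a quick cross-check of the calculation described above, equivalent to the manual version later in this notebook.

# Cross-check: the mean of the zero-filled weights per (year, topic) equals the
# sum of the weights divided by the number of documents published in that year.
avg_check = (df.groupby(['year', 'topic_id'])['norm_topic_weight']
               .mean()
               .reset_index(name='average_weight'))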

We can use the aggregation functions in Seaborn, a Python graphing library, to get a quick overview of the average topic weight by year. By default, the pointplot and bar plot functions use the mean to compute the “central tendency and confidence interval” of categorical data. For more control of the calculation, we can also compute the averages externally.

order = ['conference, committee, report, president, secretary, resolved',
         'association, publishing, chart, dollar, xxii, sign',
         'disease, physician, tobacco, patient, poison, medicine',
         'quarterly, district, society, send, sept, business',
         'wicked, immortality, righteous, adam, flesh, hell',
         'mother, wife, told, went, young, school' 
        ]
def create_pointplot(df, y_value, hue=None, order=order, col=None, wrap=None, size=5, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
def create_bar(df, y_value, hue=None, order=order, col=None, wrap=None, size=5, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='bar', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
create_pointplot(dfs, 'norm_topic_weight', hue='topic_words', order=order, size=5, aspect=1.5,
                 title="Central Range of Normalized Topic Weights. Computed with Seaborn.")
[Figure: Central Range of Normalized Topic Weights. Computed with Seaborn]

One of the common visualizations is to show each plot separately through a facet grid. While this helps us to isolate the contours of each individual topic, it is harder to track the relationship between the topics.

create_pointplot(dfs, 'norm_topic_weight', col='topic_words', wrap=3, 
                 title="Central Range of Normalized Topic Weights Split by Topic. Computed with Seaborn.")
[Figure: Central Range of Normalized Topic Weights Split by Topic. Computed with Seaborn]

We can also compute the average weight manually.

# Get number of docs per year
total_docs = df.groupby('year')['doc_id'].apply(lambda x: len(x.unique())).reset_index()
total_docs.columns = ['year', 'total_docs']

# Group by year and topic id
df_avg = df.groupby(['year', 'topic_id']).agg({'norm_topic_weight': 'sum'}).reset_index()

# Merge dataframes
df_avg = df_avg.merge(total_docs, on="year", how="left")

# Compute the mean per topic
df_avg['average_weight'] = df_avg['norm_topic_weight'] / df_avg['total_docs']

# Merge the dataframes
df_avg = df_avg.merge(labels, on='topic_id')

# Limit to sample
dfs_avg = df_avg[(df_avg['topic_id'] >= 15) & (df_avg['topic_id'] <= 20)]
create_pointplot(dfs_avg, 'average_weight', hue="topic_words",
                title="Yearly Average of Normalized Weight per Topic")
[Figure: Yearly Average of Normalized Weight per Topic (point plot)]

create_pointplot(dfs_avg, 'average_weight', col="topic_words", wrap=3,
                title="Yearly Average of Normalized Weight per Topic")
[Figure: Yearly Average of Normalized Weight per Topic (point plot, faceted by topic)]

We can also display the averages as a bar chart, which encourages us to think about each year as discrete.

create_bar(dfs_avg, 'average_weight', order=order, hue="topic_words",
          title="Yearly Average of Normalized Weight per Topic")
[Figure: Yearly Average of Normalized Weight per Topic (bar chart)]

create_bar(dfs_avg, 'average_weight', order=order, hue="topic_words", col='topic_words',wrap=3,
          title="Yearly Average of Normalized Weight per Topic")
[Figure: Yearly Average of Normalized Weight per Topic (bar chart, faceted by topic)]

As a sanity check, we can show the importance of the 0 values by excluding those lines from the dataframe and using Seaborn to compute the average topic weight per year.

df_filtered = dfs[dfs['norm_topic_weight'] > 0]
create_pointplot(df_filtered, 'norm_topic_weight', hue='topic_words',
                title="Center Range of Topic Weights with '0' values removed. Computed with Seaborn")
[Figure: Center Range of Topic Weights with '0' values removed. Computed with Seaborn]

If we were to rely on this graph, we would likely postulate that the discourse of the denomination became overwhelmingly inflected with discussions of health starting in 1868. And we would likely fail to appreciate the ways that business and domestic concerns followed similar contours and were nearly as prevalent in the overall literature of the time.

Next I will look at using smoothing functions for charting topic changes over time.

You can download and run the code locally using the Jupyter Notebook version of this post.

Know Your Sources (Part 2)

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources and how to use that corpus in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly output of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed – the impulse is to move on once the code “works” and to focus in the presentation on what that code produced, without attending to why a solution was chosen and what that solution both enables and prevents. These notebooks seek to serve that middle space between code and the final analysis - documenting the computational problem solving that I’ve done on the way to the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the world they seek to model, decisions that shape the outcomes of the computational analysis and are part of the historical argument of the work.


So far in this series I have discussed gathering a collection of texts, extracting the text, and identifying the coverage of the corpus. My next step is to assess the quality of the source base.

Because I am interested in problems of categorization and discourse, it is important that I have a mechanism to verify that the text layer provides a sufficiently accurate representation of the original print. While processes such as keyword search or computing document similarity can be performed so as to be resilient to OCR errors, an abundance of errors in the text used in topic modeling would make the topic models unreliable for identifying documents and for surfacing patterns of discourse.1 Without some external information against which to compare the text, it is impossible to know what the overall “health” of the corpus is, to identify what steps need to be taken to improve the quality of the text, and to determine whether the steps I take are effective.

Unlike the newspapers digitized by the Chronicling America project, which are encouraged by the Library of Congress guidelines to include “confidence level data” with the OCR output,2 the interface provided by the SDA offers very little information about the processes by which the periodicals were digitized or the expected quality of the OCR. As a result, the quality of the resulting text layers is obscured. A brief glance through the text files suggests that the OCR is of average quality – good enough to be legible but with enough errors to raise concerns. At the same time, the age of the periodicals and the idiosyncrasies of 19th-century typography and layout suggest that these documents contain sections that were likely a challenge for the OCR engine.

There are two general methods for evaluating the accuracy of transcribed textual data: comparison of the generated text to some “ground truth” (text that is known to be accurate) or comparison of the generated text to a bank of known words.

The first method is used by Simon Tanner, Trevor Muñoz, and Pich Hemy Ros in their evaluation of the OCR quality of the British Library’s 19th Century Online Newspaper Archive. Working with a sample of 1% of the 2 million pages digitized by the British Library, the team calculated the highest rates of OCR accuracy achieved by comparing the generated XML text to a “double re-keyed” ground-truth version.3 Having a verified ground-truth document enabled the team to provide accurate results about the quality of generated texts (which were generally disappointing, particularly for proper nouns and “significant” or content words). Their approach, however, is labor and time intensive – more than can be taken on by most individual scholars with limited financial resources.

The second approach, which the Mapping Texts team used for their analysis of the OCR accuracy of the Texas newspapers in Chronicling America, is to compare the generated text to an authoritative wordlist and compute the number of words outside the approved set.4 This approach is easier to implement, as it takes much less time to compile a list of relevant words than to re-key even a 1% sample of the text. However, the results are less accurate. The method is blind to places where the OCR engine produced a word that, while in the wordlist, does not match the text on the page or where spelling variations that occur on the page are flagged as OCR errors because they are not included in the word list. Because of the possibility of both false positives and false negatives, the results are an approximate representation of the accuracy of the OCR.
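
In code, the word-bank approach boils down to something like the following sketch; the names here are placeholders rather than anything from my processing pipeline.

# A minimal sketch of the word-bank approach: the error ratio is the share of
# tokens in a document that do not appear in the approved wordlist.
def error_ratio(tokens, wordlist):
    """Return the proportion of tokens not found in the wordlist."""
    if not tokens:
        return 0.0
    errors = [t for t in tokens if t not in wordlist]
    return len(errors) / len(tokens)

error_ratio(['peas', 'beans', 'di-', 'gestive'], {'peas', 'beans'})  # returns 0.5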

Weighing the disadvantages of both approaches, I chose to use the second method, comparing the generated OCR to an authoritative wordlist. This method was the most practical for me to implement within the constraints of the dissertation. In addition, a bank of subject appropriate words has utility for automated spelling correction, as is suggested in the research of Thomas Lasko and Susan Hauser on methods for OCR correction of the text from the National Library of Medicine.5 These factors shaped my decision to evaluate the quality of the extracted text by comparing it to an external word list.

NLTK ‘words’ Corpus

My first choice of wordlists was the NLTK words.words list, as this is a standard word bank used in many projects. The first stop for documentation on published datasets, including wordlists, is the included README file. The README file at the base of the NLTK words corpus contains the following sparse lines:

Wordlists

en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)

The linked Wikipedia article contains only a little more information:

words is a standard file on all Unix and Unix-like operating systems, and is simply a newline-delimited list of dictionary words. It is used, for instance, by spell-checking programs.[1]

The words file is usually stored in /usr/share/dict/words or /usr/dict/words.

On Debian and Ubuntu, the words file is provided by the wordlist package, or its provider packages wbritish, wamerican, etc. On Fedora and Arch, the words file is provided by the words package.

Fortunately, my personal computer runs on a “Unix-like” operating system, so I was able to gather the following additional information from the README in the suggested /usr/share/dict directory.

Welcome to web2 (Webster’s Second International) all 234,936 words worth. The 1934 copyright has lapsed, according to the supplier. The supplemental ‘web2a’ list contains hyphenated terms as well as assorted noun and adverbial phrases.

Here at last we have information about the source of the collection of words so frequently used for American English spell-checking programs. However, it is not clear whether or not this is the correct file, as the original link also referenced the wordlist packages in Debian and Ubuntu systems. To confirm that NLTK is using the list of words derived from Webster’s Second International Dictionary, I used set comparison to identify the overlap between the two lists.

Comparing NLTK words to web2

''' You may receive an error message noting that you need to download the words corpus from NLTK.
Follow the instructions in the error message and then rerun the cell.
'''
from nltk.corpus import words
# Function for pulling words from a txt file
def load_from_txt(file_name):
    with open(file_name, "r") as txt:
        words = txt.read().splitlines()
    return(words)
nltk_list = [w for w in words.words()]

We can get an overview of the list by examining the number of words, and printing out the first 10.

len(set(nltk_list))
235892
nltk_list[:10]
['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']
web2_list = load_from_txt('/usr/share/dict/web2')
len(set(web2_list))
235886
web2_list[:10]
['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

At this point, it seems likely that the two lists are similar, though not quite identical. To identify the differences, I used set comparison to quickly reveal where the lists do not overlap.

'''Use set().symmetric_difference(set()) to return the list of items that are in either list,
but not in both lists. This provides a quick snapshot of the discontinuities between the two lists.
See https://docs.python.org/3.5/library/stdtypes.html#set-types-set-frozenset for more details. 
'''
set(nltk_list).symmetric_difference(set(web2_list))
{'behaviour', 'box', 'colour', 'harbour', 'humour', 'near'}

These results indicate that, with the exceptions of a few British spellings and the curious case of “box” and “near”, the two lists are functionally identical.

Evaluate Corpus Docs with NLTK ‘words’ (web2)

Armed with the knowledge that the NLTK wordlist is, in fact, the list of words from Webster’s Second International Dictionary, I moved next to evaluate how well the wordlist covers the vocabulary of our corpus. To do that, I read in one of the documents, did some minimal cleaning, and then found the words in the document that do not appear in the NLTK wordlist.

from nltk import word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from os import path
import re
def readfile( input_dir, filename ):
    """Reads in file from directory and file name.
    Returns the content of the file.

    Usage::

        >>> text = readfile(input_dir, filename)

    Args:
        input_dir (str): Directory with the input file.
        filename (str): Name of file to be read.

    Returns:
        str: Returns the content of the file as a string.

    """
    with open(path.join(input_dir, filename)) as f:
        return f.read()


def strip_punct( text ):
    """Remove punctuation and numbers.
    Remove select punctuation marks and numbers from a text and replaces them with a space.
    
    Non-permanent changes for evaluation purposes only.
    
    Uses the :mod:`re` library.

    Args:
        text (str): Content to be evaluated.

    Returns:
        str: Returns the content string without the following characters: 0-9,.!?$:;&".
    """
    text_cleaned = re.sub(r"[0-9,.!?$:;&\"]", " ", text)
    
    return text_cleaned


def tokenize_text( text, tokenizer='whitespace' ):
    """Converts file content to a list of tokens. 

    Uses :meth:`nltk.tokenize.regexp.WhitespaceTokenizer`.

    Args:
        text (str): Content to be tokenized.
        tokenizer(str): option of tokenizer. Current options are 'whitespace' and
            'word'.

    Returns:
        list: Returns a list of the tokens in the text, separated by white space.
    """
    if tokenizer == 'whitespace':
        return WhitespaceTokenizer().tokenize(text)
    elif tokenizer == 'word':
        return word_tokenize(text)
    else:
        raise ValueError('Tokenizer value {} is invalid. Must be "whitespace" or "word"'.format(tokenizer))


def to_lower( tokens ):
    """Convert all tokens to lower case.

    Args:
        tokens (list): List of tokens generated using a tokenizer.

    Returns:
        list: List of all tokens converted to lowercase.
    """
    return [w.lower() for w in tokens]


def identify_errors( tokens, dictionary ):
    """Compare words in documents to words in dictionary. 

    Args:
        tokens (list): List of all tokens in the document.
        dictionary (set): The set of approved words.
        
    Returns:
        set : Returns the set of tokens that are not 
            also dictionary words.
    """
    return set(tokens).difference(dictionary)


def create_spelling_dictionary( directory, wordlists ):
    """Compile a spelling dictionary.
    Compiles a spelling dictionary from one or multiple
    wordlist files. Returns results as a set.

    Args:
        directory (str): Location of the wordlist files.
        wordlists (list): List of filenames for the wordlist files.

    Returns:
        set: List of unique words in all the compiled lists.
    """
    spelling_dictionary = []
    for wordlist in wordlists:
        words = readfile(directory, wordlist).splitlines()
        word_list = [w.lower() for w in words]
        for each in word_list:
            spelling_dictionary.append(each)

    return set(spelling_dictionary)


def get_doc_errors( input_dir, filename, dictionary ):
    """Identify words in text that are not in a dictionary set.
    
    Args:
        input_dir (str): Directory containing text files
        filename (str): Name of file to evaluate
        dictionary (set): Set of verified words
    Returns:
        set: Returns set of words in doc that are not in the dictionary.
    """
    text = strip_punct(readfile(input_dir, filename))
    tokens = to_lower(tokenize_text(text, tokenizer='whitespace'))
    return identify_errors(tokens, dictionary)
input_dir = "data/"
filename = "HR18910601-V26-06-page34.txt"

First, because previewing early and often is key to evaluating the effectiveness of the code, I start by printing out the file content as well as the cleaned and tokenized version of the text. Overall, the OCR quality is good. There are a number of line-ending problems and some rogue characters, but the content and general message of the page is very legible.

readfile(input_dir, filename)
' THE LEGUMINOUS SEEDS.\nAbstract of a lesson in cookery given by Mrs. E. E. Kellogg, to the Health and Temperance Training Class.]\nTHE leguminous seeds, or legumes, are those foods better known as peas, beans, and lentils. They are usually served as vegetables, but are very different in their composition, as they contain in their mature state a large excess of the nitrogenous element, while the ordinary vegetables are quite lacking in this important element. In their immature state the legumes are more similar to vegetables. Dried peas are found in the market in two different forms, the green, or Scotch peas, and those which have been divided and the skins removed, called split peas. There are also several varieties of lentils to be found in market. Lentils are somewhat superior to peas or beans in nutritive value, but people usually dislike their taste until they have become used to them.\nThe different varieties of beans vary but little as to nutritive value. The Lima bean is more delicate in flavor, and is generally looked upon with most favor. The Chinese and Japanese make large use of the leguminous foods. They have a kind of bean known as soja, which is almost entirely composed of nitroge- nous material. Peas and beans contain caseine, which is similar in its characteristics to the caseine of milk. The Chinese take advantage of this fact, and manu- facture a kind of cheese from them.\nThe legumes are all extensively used in India, China, Japan, and other Eastern countries. Com- bined with rice, which is used as a staple article of diet in these countries, and which contains an excess of the carbonaceous element, they form an excellent food. In England, peas are largely used by persons who are in training as athletes, because they are con- sidered to be superior strength producers.,\nAlthough these seeds are in themselves such nutri- tious food, they are in this country commonly prepared in some manner which renders them indigestible. Beans, especially, are used in connec- tion with a large amount of fat, and are not, in\ngeneral, properly cooked. Peas and beans are covered with a tough skin, and require a prolonged cooking, in order to soften them. If, in cooking, this skin is left intact, and not broken afterward by careful mastication, the legume may pass through the di- gestive tract without being acted upon by the digestive fluids. Even prolonged cooking will not render the skin digestible. It is best that we prepare these seeds in such a manner that the skins may be rejected. When used with the skins a large per- centage of the nutritive material is wasted, since it is impossible for the digestive processes to free it from the skins.\nIn preparing them for cooking, it is generally best to soak them over night. The soaking aids the process of cooking, and to some extent does away with the strong flavor which to some people is very disagreeable. Soft water is best for cooking all dried seeds. If dry peas or beans are put into hard, boiling water, they will not soften, because the min- eral element of the water acts upon the caseine to harden it. As to the quantity needed, much depends upon the degree of heat used, but in general two quarts of soft water will be quite sufficient for one pint of seeds.\nMASHED PEAS.Ñ Cook one quart of dried Scotch peas until tender, simmering slowly at the last so that the water will be nearly or quite evaporated. Then rub them through a colander, or vegetable press. Add to the sifted peas one half cup of good sweet cream, and salt if desired. 
Pour into a granite or earthen dish, and put into the oven to brown. Some prefer it browned until mealy, some prefer it rather moist. This same dish can be made with split peas by simply mashing them.\nBrans may likewise be put through the same proc- ess. This is one of the most digestible forms in which they can be prepared.\n( I9o)\n'
to_lower(tokenize_text(strip_punct(readfile(input_dir, filename))))
['the',
 'leguminous',
 'seeds',
 'abstract',
 'of',
 'a',
 'lesson',
 'in',
 'cookery',
 'given',
 'by',
 'mrs',
 'e',
 'e',
 'kellogg',
 'to',
 'the',
 'health',
 'and',
 'temperance',
 'training',
 'class',
 ']',
 'the',
 'leguminous',
 'seeds',
 'or',
 'legumes',
 'are',
 'those',
 'foods',
 'better',
 'known',
 'as',
 'peas',
 'beans',
 'and',
 'lentils',
 'they',
 'are',
 'usually',
 'served',
 'as',
 'vegetables',
 'but',
 'are',
 'very',
 'different',
 'in',
 'their',
 'composition',
 'as',
 'they',
 'contain',
 'in',
 'their',
 'mature',
 'state',
 'a',
 'large',
 'excess',
 'of',
 'the',
 'nitrogenous',
 'element',
 'while',
 'the',
 'ordinary',
 'vegetables',
 'are',
 'quite',
 'lacking',
 'in',
 'this',
 'important',
 'element',
 'in',
 'their',
 'immature',
 'state',
 'the',
 'legumes',
 'are',
 'more',
 'similar',
 'to',
 'vegetables',
 'dried',
 'peas',
 'are',
 'found',
 'in',
 'the',
 'market',
 'in',
 'two',
 'different',
 'forms',
 'the',
 'green',
 'or',
 'scotch',
 'peas',
 'and',
 'those',
 'which',
 'have',
 'been',
 'divided',
 'and',
 'the',
 'skins',
 'removed',
 'called',
 'split',
 'peas',
 'there',
 'are',
 'also',
 'several',
 'varieties',
 'of',
 'lentils',
 'to',
 'be',
 'found',
 'in',
 'market',
 'lentils',
 'are',
 'somewhat',
 'superior',
 'to',
 'peas',
 'or',
 'beans',
 'in',
 'nutritive',
 'value',
 'but',
 'people',
 'usually',
 'dislike',
 'their',
 'taste',
 'until',
 'they',
 'have',
 'become',
 'used',
 'to',
 'them',
 'the',
 'different',
 'varieties',
 'of',
 'beans',
 'vary',
 'but',
 'little',
 'as',
 'to',
 'nutritive',
 'value',
 'the',
 'lima',
 'bean',
 'is',
 'more',
 'delicate',
 'in',
 'flavor',
 'and',
 'is',
 'generally',
 'looked',
 'upon',
 'with',
 'most',
 'favor',
 'the',
 'chinese',
 'and',
 'japanese',
 'make',
 'large',
 'use',
 'of',
 'the',
 'leguminous',
 'foods',
 'they',
 'have',
 'a',
 'kind',
 'of',
 'bean',
 'known',
 'as',
 'soja',
 'which',
 'is',
 'almost',
 'entirely',
 'composed',
 'of',
 'nitroge-',
 'nous',
 'material',
 'peas',
 'and',
 'beans',
 'contain',
 'caseine',
 'which',
 'is',
 'similar',
 'in',
 'its',
 'characteristics',
 'to',
 'the',
 'caseine',
 'of',
 'milk',
 'the',
 'chinese',
 'take',
 'advantage',
 'of',
 'this',
 'fact',
 'and',
 'manu-',
 'facture',
 'a',
 'kind',
 'of',
 'cheese',
 'from',
 'them',
 'the',
 'legumes',
 'are',
 'all',
 'extensively',
 'used',
 'in',
 'india',
 'china',
 'japan',
 'and',
 'other',
 'eastern',
 'countries',
 'com-',
 'bined',
 'with',
 'rice',
 'which',
 'is',
 'used',
 'as',
 'a',
 'staple',
 'article',
 'of',
 'diet',
 'in',
 'these',
 'countries',
 'and',
 'which',
 'contains',
 'an',
 'excess',
 'of',
 'the',
 'carbonaceous',
 'element',
 'they',
 'form',
 'an',
 'excellent',
 'food',
 'in',
 'england',
 'peas',
 'are',
 'largely',
 'used',
 'by',
 'persons',
 'who',
 'are',
 'in',
 'training',
 'as',
 'athletes',
 'because',
 'they',
 'are',
 'con-',
 'sidered',
 'to',
 'be',
 'superior',
 'strength',
 'producers',
 'although',
 'these',
 'seeds',
 'are',
 'in',
 'themselves',
 'such',
 'nutri-',
 'tious',
 'food',
 'they',
 'are',
 'in',
 'this',
 'country',
 'commonly',
 'prepared',
 'in',
 'some',
 'manner',
 'which',
 'renders',
 'them',
 'indigestible',
 'beans',
 'especially',
 'are',
 'used',
 'in',
 'connec-',
 'tion',
 'with',
 'a',
 'large',
 'amount',
 'of',
 'fat',
 'and',
 'are',
 'not',
 'in',
 'general',
 'properly',
 'cooked',
 'peas',
 'and',
 'beans',
 'are',
 'covered',
 'with',
 'a',
 'tough',
 'skin',
 'and',
 'require',
 'a',
 'prolonged',
 'cooking',
 'in',
 'order',
 'to',
 'soften',
 'them',
 'if',
 'in',
 'cooking',
 'this',
 'skin',
 'is',
 'left',
 'intact',
 'and',
 'not',
 'broken',
 'afterward',
 'by',
 'careful',
 'mastication',
 'the',
 'legume',
 'may',
 'pass',
 'through',
 'the',
 'di-',
 'gestive',
 'tract',
 'without',
 'being',
 'acted',
 'upon',
 'by',
 'the',
 'digestive',
 'fluids',
 'even',
 'prolonged',
 'cooking',
 'will',
 'not',
 'render',
 'the',
 'skin',
 'digestible',
 'it',
 'is',
 'best',
 'that',
 'we',
 'prepare',
 'these',
 'seeds',
 'in',
 'such',
 'a',
 'manner',
 'that',
 'the',
 'skins',
 'may',
 'be',
 'rejected',
 'when',
 'used',
 'with',
 'the',
 'skins',
 'a',
 'large',
 'per-',
 'centage',
 'of',
 'the',
 'nutritive',
 'material',
 'is',
 'wasted',
 'since',
 'it',
 'is',
 'impossible',
 'for',
 'the',
 'digestive',
 'processes',
 'to',
 'free',
 'it',
 'from',
 'the',
 'skins',
 'in',
 'preparing',
 'them',
 'for',
 'cooking',
 'it',
 'is',
 'generally',
 'best',
 'to',
 'soak',
 'them',
 'over',
 'night',
 'the',
 'soaking',
 'aids',
 'the',
 'process',
 'of',
 'cooking',
 'and',
 'to',
 'some',
 'extent',
 'does',
 'away',
 'with',
 'the',
 'strong',
 'flavor',
 'which',
 'to',
 'some',
 'people',
 'is',
 'very',
 'disagreeable',
 'soft',
 'water',
 'is',
 'best',
 'for',
 'cooking',
 'all',
 'dried',
 'seeds',
 'if',
 'dry',
 'peas',
 'or',
 'beans',
 'are',
 'put',
 'into',
 'hard',
 'boiling',
 'water',
 'they',
 'will',
 'not',
 'soften',
 'because',
 'the',
 'min-',
 'eral',
 'element',
 'of',
 'the',
 'water',
 'acts',
 'upon',
 'the',
 'caseine',
 'to',
 'harden',
 'it',
 'as',
 'to',
 'the',
 'quantity',
 'needed',
 'much',
 'depends',
 'upon',
 'the',
 'degree',
 'of',
 'heat',
 'used',
 'but',
 'in',
 'general',
 'two',
 'quarts',
 'of',
 'soft',
 'water',
 'will',
 'be',
 'quite',
 'sufficient',
 'for',
 'one',
 'pint',
 'of',
 'seeds',
 'mashed',
 'peas',
 'ñ',
 'cook',
 'one',
 'quart',
 'of',
 'dried',
 'scotch',
 'peas',
 'until',
 'tender',
 'simmering',
 'slowly',
 'at',
 'the',
 'last',
 'so',
 'that',
 'the',
 'water',
 'will',
 'be',
 'nearly',
 'or',
 'quite',
 'evaporated',
 'then',
 'rub',
 'them',
 'through',
 'a',
 'colander',
 'or',
 'vegetable',
 'press',
 'add',
 'to',
 'the',
 'sifted',
 'peas',
 'one',
 'half',
 'cup',
 'of',
 'good',
 'sweet',
 'cream',
 'and',
 'salt',
 'if',
 'desired',
 'pour',
 'into',
 'a',
 'granite',
 'or',
 'earthen',
 'dish',
 'and',
 'put',
 'into',
 'the',
 'oven',
 'to',
 'brown',
 'some',
 'prefer',
 'it',
 'browned',
 'until',
 'mealy',
 'some',
 'prefer',
 'it',
 'rather',
 'moist',
 'this',
 'same',
 'dish',
 'can',
 'be',
 'made',
 'with',
 'split',
 'peas',
 'by',
 'simply',
 'mashing',
 'them',
 'brans',
 'may',
 'likewise',
 'be',
 'put',
 'through',
 'the',
 'same',
 'proc-',
 'ess',
 'this',
 'is',
 'one',
 'of',
 'the',
 'most',
 'digestible',
 'forms',
 'in',
 'which',
 'they',
 'can',
 'be',
 'prepared',
 '(',
 'i',
 'o)']

Comparing the text to the NLTK list, however, returns a number of “real” words identified as errors.

get_doc_errors( input_dir, filename, set(to_lower(nltk_list)))
{'(',
 ']',
 'acted',
 'aids',
 'athletes',
 'beans',
 'bined',
 'brans',
 'browned',
 'called',
 'caseine',
 'characteristics',
 'com-',
 'con-',
 'connec-',
 'contains',
 'cooked',
 'countries',
 'depends',
 'di-',
 'england',
 'evaporated',
 'fluids',
 'foods',
 'forms',
 'gestive',
 'kellogg',
 'lacking',
 'legumes',
 'lentils',
 'looked',
 'manu-',
 'mashed',
 'min-',
 'needed',
 'nitroge-',
 'nutri-',
 'o)',
 'peas',
 'per-',
 'persons',
 'preparing',
 'proc-',
 'processes',
 'producers',
 'prolonged',
 'quarts',
 'rejected',
 'renders',
 'seeds',
 'served',
 'sidered',
 'simmering',
 'skins',
 'tion',
 'tious',
 'varieties',
 'vegetables',
 'ñ'}

With words such as “vegetables” and “cooked” missing from the NLTK wordlist, it is clear that this list is insufficient for computing the ratio of words to errors in the documents.

SCOWL (Spell Checker Oriented Word Lists)

With the default Unix wordlist insufficient as a base for evaluating the corpus, I looked next at the wordlist package included with Ubuntu, which is also mentioned in the Wikipedia entry for words. This package links out to the SCOWL (Spell Checker Oriented Word Lists) project.6 Rather than use one of the provided word lists, I used the web application to create a custom list. I used the following parameters, as documented in the README:

  • diacritic: keep
  • max_size: 70
  • max_variant: 1
  • special: roman-numerals
  • spelling: US GBs

These parameters provide a wordlist that includes both American and British spellings, including common variant spellings. max_size of “70” refers to the “large” collection of words which are contained in most dictionaries, and which is, according to the author, the largest recommended set for spelling, while the larger “80” and “95” sets contain obscure and questionable words.

scowl = load_from_txt('data/SCOWL-wl/words.txt')
len(set(scowl))
171131
scowl[:10]
['A', "A's", 'AA', "AA's", 'AAA', 'AAM', 'AB', "AB's", 'ABA', 'ABC']

When we compare the difference between the two lists, we can see that the majority of words in the SCOWL list are not in the NLTK words list.

len(set(scowl).difference(set(nltk_list)))
104524

And when we run the file content through our error checking function, we get results that are much closer to what we would expect.

get_doc_errors( input_dir, filename, set(to_lower(scowl)))
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

Filtering the SCOWL List

While it may seem valuable to have an all-encompassing list, one that contains all possible words, a list that is too expansive is actually detrimental to the task of identifying OCR errors. Unusual words or single letters are more likely to be the result of a transcription error than an intentional choice by the author or editor, but an overly expansive wordlist would capture these as “words,” thus under-reporting the transcription errors. In constructing the word list, I am looking to strike a delicate balance between false negatives (words flagged as errors that are in fact words) and false positives (strings accepted as words that are in fact errors).

The SCOWL list contains a few elements that increase the risk of false positive identification. First, the list contains a collection of abbreviations, which are mostly 20th-century in origin and which, once converted to lowercase for comparison, are unlikely to match any of the content produced by the SDA authors.

Second, one of the lists used in the SCOWL application includes the letter section headers as words, so instances of any single letter, such as “e”, are identified as a word. Two of the most frequent problems in the OCR text are “burst” words, where a space has been inserted between each of the letters (I N S T R U C T O R), and words that have been split by line endings (di- gestion). As a result, a number of the errors appear in the form of individual letters and two-letter words. In order to capture these transcription errors, the wordlist needs to be particularly conservative with regard to single-letter and two-letter words.
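
As a small illustration of the fragments these two problems produce, here is a contrived example run through the helper functions defined above.

# A contrived example: a "burst" word and a line-ending split both tokenize into
# fragments that a permissive wordlist would silently accept as valid words.
to_lower(tokenize_text(strip_punct("I N S T R U C T O R and di- gestion")))
# ['i', 'n', 's', 't', 'r', 'u', 'c', 't', 'o', 'r', 'and', 'di-', 'gestion']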

To improve my ability to capture these errors, I filtered out the acronyms and all words less than 3 characters in length from the SCOWL list. I added back common words, such as “I”, “in”, and “be,” through the subject-specific lists, which I cover below.

# Using regex to remove all abbreviations and their possessive forms
s_filtered = [x for x in scowl if not re.search(r"\b[A-Z]{1,}'s\b|\b[A-Z]{1,}\b", x)]
s_filtered[:10]
['ABCs',
 'ABMs',
 'Aachen',
 "Aachen's",
 'Aalborg',
 'Aalesund',
 'Aaliyah',
 "Aaliyah's",
 'Aalst',
 "Aalst's"]
s_filtered = [x for x in s_filtered if len(x) > 2]

To ensure that I have not removed too many words, I preview the error report once again.

get_doc_errors( input_dir, filename, set(to_lower(s_filtered)))
{'(',
 ']',
 'a',
 'an',
 'as',
 'at',
 'be',
 'bined',
 'brans',
 'by',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'i',
 'if',
 'in',
 'is',
 'it',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'of',
 'or',
 'per-',
 'proc-',
 'sidered',
 'so',
 'soja',
 'tion',
 'tious',
 'to',
 'we',
 'ñ'}

These results are what I expected. There are a number of additional one- and two-letter “errors,” but otherwise the list appears unchanged.

I then saved the filtered word list to the data folder for later use.
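
The save step is a simple write of the filtered list to a newline-delimited text file, along the lines of the sketch below; the filename matches the one loaded later in this notebook.

# Write the filtered SCOWL list out as a newline-delimited wordlist file.
with open('data/2017-05-05-base-scowl-list.txt', 'w') as out:
    out.write('\n'.join(s_filtered))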

Adding subject-specific words

The SCOWL wordlist provides a broad base of English language words against which to compare the words identified in the SDA periodicals. We could proceed with just this list, but that strategy would give us a high number of false negatives (words identified as errors that are in fact words). Because my corpus is derived from a particular community with a particular set of concerns, it is possible to expand the wordlists to capture some of the specialized language of the community. I chose to store each specialized wordlist separately, and to pull all the lists together into a single list of unique values within the corpus processing steps. This enables me to use the lists modularly, and to isolate a particular subset if a problem arises.

King James Bible

One of the most ubiquitous influences on the language of the SDA writers was the Bible, particularly in the King James translation. From scriptural quotations to references to individuals and the use of biblical metaphors, the Bible saturated the consciousness of this nascent religious movement. To capture that language, and to reintroduce the relevant one and two letter words that I removed from the base SCOWL list, I (somewhat sacrilegiously) converted the text of the King James Bible from the Christian Classics Ethereal Library into a list of unique words.

The code I used to create the wordlist from the text file is included in the module appendix (a rough sketch of the transformation follows below). Here I am previewing the results of that transformation.
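
That appendix code aside, a rough sketch of the transformation, reusing the helper functions defined earlier, might look like the following; the source filename is a placeholder.

# A sketch of the KJV wordlist creation (the actual code is in the module appendix).
# 'kjv-full-text.txt' is a placeholder for the plain-text Bible downloaded from CCEL.
kjv_text = strip_punct(readfile('data/', 'kjv-full-text.txt'))
kjv_words = sorted(set(to_lower(tokenize_text(kjv_text))))

with open('data/2017-05-24-kjv-wordlist.txt', 'w') as out:
    out.write('\n'.join(kjv_words))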

kjv = readfile('data/', '2017-05-24-kjv-wordlist.txt').splitlines()
len(kjv)
14238
kjv[:10]
['spider',
 'rocks',
 'biatas',
 'apparently',
 'strip',
 'anchor',
 'dinaites',
 'entangled',
 'adida',
 'salchah']
word_lists = ['2017-05-05-base-scowl-list.txt','2017-05-24-kjv-wordlist.txt']
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors(input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

Seventh-day Adventist People and Places

A second set of subject-specific language comes from within the SDA. Using the Yearbooks produced by the denomination starting in 1883, I have compiled lists of the people serving in leadership roles within the denomination, from Sabbath School teachers to the President of the General Conference. Browsing through the reported errors, I found that many are people and place names. Since I had the data available from earlier work, I compiled the SDA last names and place names into wordlists.

The code for creating wordlists from my datafiles, which are in Google Docs, is documented in the module appendix.

Note: There are duplicates in the list of people because I forgot to limit the list to the unique names before saving to a file. While unsightly, the only consequence of this is a little more computation work when I make the combined spelling dictionary.

sda_people = readfile('data/', '2016-12-07-SDA-last-names.txt').splitlines()
sda_people[:10]
['Butler',
 'Haskell',
 'Henry',
 'Kellogg',
 'Kellogg',
 'Oyen',
 'Sisley',
 'Fargo',
 'Hall',
 'Hall']
sda_places = readfile('data/', '2016-12-07-SDA-place-names.txt').splitlines()
sda_places[:10]
['Gray',
 'Oroville',
 'Shreveport',
 'Sāo',
 'Alto',
 'Central',
 'Castlereagh',
 'Warrenton',
 'Alexandria',
 'Cannelton']
word_lists = ['2017-05-05-base-scowl-list.txt',
              '2017-05-24-kjv-wordlist.txt',
              '2016-12-07-SDA-last-names.txt',
              '2016-12-07-SDA-place-names.txt'
             ]
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors( input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

USGS Locations and Roman Numerals

The final two external datasets I used to minimize false identification of errors were a list of U.S. city and town names from the USGS and a list of Roman numerals, since a number of the numerals were removed from the base list when I filtered out words under three characters long. The code for converting the shape file from the USGS into a list of place names is included in the module appendix.

us_places = readfile('data/', '2017-01-03-US-place-names.txt').splitlines()
us_places[:10]
['attu',
 'point',
 'hope',
 'point',
 'lay',
 'diomede',
 'gambell',
 'tin',
 'city',
 'savoonga']
word_lists = ['2017-05-05-base-scowl-list.txt',
              '2017-05-24-kjv-wordlist.txt',
              '2016-12-07-SDA-last-names.txt',
              '2016-12-07-SDA-place-names.txt',
              '2017-01-03-US-place-names.txt',
              '2017-02-14-roman-numerals.txt'
             ]
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors(input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

Next Steps

With these lists in hand, I went through the different titles in the corpus and identified frequently used words that were escaping the above wordlists. As I will outline in the next notebook, many of the words that had not been captured by these wordlists were related to religious persons and theological ideas; were alternative spellings, created either intentionally or through editor error; or could be attributed to John H. Kellogg and the health reform writings of the denomination.

Final wordlists

The final wordlists are included in the data folder for this module. They are:

You can run this code locally using the Jupyter notebook available via Github.

  1. Recent studies on text reuse have effectively used ngram comparison to find reuse despite differences attributable to errors in character recognition or minor editorial rephrasing. (See Mullen (2016), http://americaspublicbible.org/methods.html and Smith, Cordell, Dillon (2013), “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,” 2013 IEEE International Conference on Big Data. doi: http://dx.doi.org/10.1109/BigData.2013.6691675) However, research in the field of information retrieval has found that algorithms that factor in the relative prevalence of words in a corpus are negatively affected by high OCR error rates. (See Taghva, Kazem, Thomas Nartker, and Julie Borsack. “Information Access in the Presence of OCR Errors.” Proceedings of the 1st ACM Workshop on Hardcopy Document Processing (2004): 1–8. doi: http://dx.doi.org/10.1145/1031442.1031443). 

  2. “The National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants.” Library of Congress (2016): p. 11. http://www.loc.gov/ndnp/guidelines/NDNP_201517TechNotes.pdf 

  3. Tanner, Simon, Trevor Muñoz, and Pich Hemy Ros. “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned From Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive.” D-Lib Magazine 15, no. 7/8 (2009): §5. doi: http://dx.doi.org/10.1045/july2009-munoz 

  4. Torget, Andrew J., Mihalcea, Rada, Christensen, Jon, and McGhee, Geoff. “Mapping Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers.” Mapping Texts (2011): Accessed March 29, 2017. http://mappingtexts.org/whitepaper/MappingTexts_WhitePaper.pdf

  5. Lasko, Thomas A., and Susan E. Hauser. “Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction.” Proceedings of SPIE 4307 (2001): 232–40. doi: http://dx.doi.org/10.1117/12.410841 

  6. Atkinson, Kevin. “Spell Checking Oriented Word Lists (SCOWL).” (2016): Accessed Nov 18, 2016. http://wordlist.aspell.net/scowl-readme/

Know Your Sources (Part 1)

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly argument of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed, why solutions were chosen, and what those solutions enable and prevent. In these notebooks I seek to engage that middle space between code and the final analysis - documenting the computational problem solving that I’ve done as part of the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the worlds they seek to model, distinctions that shape the outcomes of the computational analysis and are part of the historical argument of the work.


Now that we have selected and downloaded a collection of texts and extracted their contents, we need to spend some time identifying what the corpus contains, both in terms of coverage and quality. As I describe in the project overview, I will be using these texts to make arguments about the development of the community’s discourses around health and salvation. While the corpus makes that analysis possible, it also sets the limits of what we can claim from text analysis alone. Without an understanding of what those limits are, we run the risk of claiming more than the sources can sustain and, in doing so, of minimizing the very complexities that historical research seeks to reveal.

"""My usual practice for gathering the filenames is to read 
them in from a directory. So that this code can be run locally 
without the full corpus downloaded, I exported the list 
of filenames to an index file for use in this notebook.
"""
with open("data/2017-05-05-corpus-index.txt", "r") as f:
    corpus = f.read().splitlines()
len(corpus)
197943

To create an overview of the corpus, I will use the document filenames along with some descriptive metadata that I created.

Filenames are an often underestimated feature of digital files, but one that can be used to great effect. For my corpus, the team that digitized the periodicals did an excellent job of providing the files with descriptive names. Overall, the files conform to the following pattern:

PrefixYYYYMMDD-V00-00.pdf

I discovered a few files that deviated from the pattern, but renamed those so that the pattern held throughout the corpus. When splitting the PDF documents into pages, I preserved the structure, adding -page0.txt to the end.

The advantage of this format is that the filenames contain the metadata I need to place each file within its context. By isolating the different sections of the filename, I can quickly place any file with reference to the periodical title and the publication date.
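To make the pattern concrete, here is a minimal sketch of how such a filename splits apart on the hyphens. The filename itself is hypothetical and only illustrates the convention; the extraction function below relies only on the first segment.

# A hypothetical page-level filename following the PrefixYYYYMMDD-V00-00 pattern.
sample = "HR18660801-V01-02-page3.txt"

prefix_and_date, volume, rest = sample.split('-', 2)
print(prefix_and_date)  # 'HR18660801' -> title prefix plus publication date
print(volume)           # 'V01'        -> volume number
print(rest)             # '02-page3.txt' -> issue number and page suffix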

import pandas as pd
import re
def extract_pub_info(doc_list):
    """Use regex to extract metadata from filename.
    
    Note:
        Assumes that the filename is formatted as::
            
            `PrefixYYYYMMDD-V00-00.pdf`
    
    Args:
        doc_list (list): List of the filenames in the corpus.
    Returns:
        dict: Dictionary with the year and title abbreviation for each filename.
    """
    
    corpus_info = {}
    
    for doc_id in doc_list:
        
        # Split the ID into three parts on the '-'
        split_doc_id = doc_id.split('-')
        
        # Get the prefix by matching the first set of letters 
        # in the first part of the filename.
        title = re.match("[A-Za-z]+", split_doc_id[0])
        # Get the dates by grabbing all of the number elements 
        # in the first part of the filename.
        dates = re.search(r'[0-9]+', split_doc_id[0])
        # The first four digits are the publication year.
        year = dates.group()[:4]
        
        # Update the dictionary with the title and year 
        # for the filename.
        corpus_info[doc_id] = {'title': title.group(), 'year': year}
    
    return corpus_info
corpus_info = extract_pub_info(corpus)

One of the most useful libraries in Python for working with data is Pandas. With Pandas, Python users gain much of the dataframe functionality that our colleagues working in R have long celebrated as a benefit of that domain-specific language.

By transforming our corpus_info dictionary into a dataframe, we can quickly filter and tabulate a number of different statistics on our corpus.

df = pd.DataFrame.from_dict(corpus_info, orient='index')
df.index.name = 'docs'
df = df.reset_index()

You can preview the initial dataframe by uncommenting the cell below.

# df
df = df.groupby(["title", "year"], as_index=False).docs.count()
df
title year docs
0 ADV 1898 26
1 ADV 1899 674
2 ADV 1900 463
3 ADV 1901 389
4 ADV 1902 440
5 ADV 1903 428
6 ADV 1904 202
7 ADV 1905 20
8 ARAI 1909 64
9 ARAI 1919 32
10 AmSn 1886 96
11 AmSn 1887 96
12 AmSn 1888 105
13 AmSn 1889 386
14 AmSn 1890 403
15 AmSn 1891 398
16 AmSn 1892 401
17 AmSn 1893 402
18 AmSn 1894 402
19 AmSn 1895 400
20 AmSn 1896 408
21 AmSn 1897 800
22 AmSn 1898 804
23 AmSn 1899 801
24 AmSn 1900 800
25 CE 1909 104
26 CE 1910 312
27 CE 1911 312
28 CE 1912 306
29 CE 1913 420
... ... ... ...
465 YI 1885 192
466 YI 1886 220
467 YI 1887 244
468 YI 1888 104
469 YI 1889 208
470 YI 1890 208
471 YI 1895 416
472 YI 1898 28
473 YI 1899 408
474 YI 1900 408
475 YI 1901 408
476 YI 1902 408
477 YI 1903 408
478 YI 1904 288
479 YI 1905 408
480 YI 1906 408
481 YI 1907 432
482 YI 1908 832
483 YI 1909 844
484 YI 1910 850
485 YI 1911 852
486 YI 1912 868
487 YI 1913 848
488 YI 1914 852
489 YI 1915 852
490 YI 1916 852
491 YI 1917 840
492 YI 1918 850
493 YI 1919 856
494 YI 1920 45

495 rows × 3 columns

Nearly 500 rows is too much data to get a good sense of the corpus’s coverage by reading the table directly, so it is necessary to create some visualizations of the records. For a quick prototyping tool, I am using the Bokeh library.

from bokeh.charts import Bar, show
from bokeh.charts import defaults
from bokeh.io import output_notebook
from bokeh.palettes import viridis
output_notebook()
defaults.width = 900
defaults.height = 950

In this first graph, I am showing the total number of pages per title, per year in the corpus.

p = Bar(df, 
        'year', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Title per Year")
show(p)

This graph of the corpus reflects the historical development of the denomination’s publication efforts. Starting with a single publication in 1849, the publishing efforts of the denomination expand in the 1860s as they launch their health reform efforts, expand again in the 1880s as they start a publishing house in California and address concerns about Sunday observance laws, and again at the turn of the century as the denomination reorganizes and regional publications expand. The chart also reveals some holes in the corpus. The Youth’s Instructor (shown here in yellow) is one of the oldest continuous denominational publications, but the pages available for the years from 1850 to 1899 are inconsistent.

In interpreting the results of mining these texts, it will be important to factor in the relative difference in size and diversity of publication venues between the early years of the denomination and the later years of this study.
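The gaps in the Youth’s Instructor can also be listed directly. The sketch below, which reuses the df dataframe built above, reports only the years for which a title has no pages at all; it will not flag years with thin but non-zero coverage.

# List the years with no pages at all for a given title,
# here the Youth's Instructor ("YI").
yi_years = df[df.title == 'YI'].year.astype(int)
expected = set(range(yi_years.min(), yi_years.max() + 1))
print(sorted(expected - set(yi_years)))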

by_title = df.groupby(["title"], as_index=False).docs.sum()
p = Bar(by_title, 
        'title', 
        values='docs', 
        color='title', 
        palette=viridis(30), 
        title="Total Pages by Title"
       )
show(p)

Another way to view the coverage of the corpus is by total pages per periodical title. The Advent Review and Sabbath Herald dominates the corpus in number of pages, with The Health Reformer, Signs of the Times, and the Youth’s Instructor making up the next largest share of the corpus. In terms of scale, these publications will have (and had) a prominent role in shaping the discourse of the SDA community. At the same time, it will be informative to look to the smaller publications to see if we can surface alternative and dissonant ideas.
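To put that relative scale in numbers, a quick sketch using the by_title totals computed above expresses each title’s page count as a share of the corpus:

# Each title's page count as a percentage of all pages in the corpus.
title_share = by_title.assign(pct=100 * by_title.docs / by_title.docs.sum())
title_share.sort_values('pct', ascending=False).head(10)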

topic_metadata = pd.read_csv('data/2017-05-05-periodical-topics.csv')
topic_metadata
periodicalTitle title startYear endYear initialPubLocation topic
0 Training School Advocate ADV 1898 1905 Battle Creek, MI Education
1 American Sentinel AmSn 1886 1900 Oakland, CA Religious Liberty
2 Advent Review and Sabbath Herald ARAI 1909 1919 Washington, D.C. Denominational
3 Christian Education CE 1909 1920 Washington, D.C. Education
4 Welcome Visitor (Columbia Union Visitor) CUV 1901 1920 Academia, OH Regional
5 Christian Educator EDU 1897 1899 Battle Creek, MI Education
6 General Conference Bulletin GCB 1863 1918 Battle Creek, MI Denominational
7 Gospel Herald GH 1898 1920 Yazoo City, MS Regional
8 Gospel of Health GOH 1897 1899 Battle Creek, MI Health
9 Gospel Sickle GS 1886 1888 Battle Creek, MI Missions
10 Home Missionary HM 1889 1897 Battle Creek, MI Missions
11 Health Reformer HR 1866 1907 Battle Creek, MI Health
12 Indiana Reporter IR 1901 1910 Indianapolis, IN Regional
13 Life Boat LB 1898 1920 Chicago, IL Missions
14 Life and Health LH 1904 1920 Washington, D.C. Health
15 Liberty LibM 1906 1920 Washington, D.C. Religious Liberty
16 Lake Union Herald LUH 1908 1920 Berrien Springs, MI Regional
17 North Michigan News Sheet NMN 1907 1910 Petoskey, MI Regional
18 Pacific Health Journal and Temperance Advocate PHJ 1885 1904 Oakland, CA Health
19 Present Truth (Advent Review) PTAR 1849 1850 Middletown, CT Denominational
20 Pacific Union Recorder PUR 1901 1920 Oakland, CA Regional
21 Review and Herald RH 1850 1920 Paris, ME Denominational
22 Sligonian Sligo 1916 1920 Washington, D.C. Regional
23 Sentinel of Liberty SOL 1900 1904 Chicago, IL Religious Liberty
24 Signs of the Times ST 1874 1920 Oakland, CA Denominational
25 Report of Progress, Southern Union Conference SUW 1907 1920 Nashville, TN Regional
26 Church Officer's Gazette TCOG 1914 1920 Washington, D.C. Denominational
27 The Missionary Magazine TMM 1898 1902 Philadelphia, PA Missions
28 West Michigan Herald WMH 1903 1908 Grand Rapids, MI Regional
29 Youth's Instructor YI 1852 1920 Rochester, NY Denominational

We can generate another view by adding some external metadata for the titles. The “topics” listed here are ones I assigned while skimming the different titles. “Denominational” refers to centrally produced publications covering a wide array of subjects; “Education” to periodicals focused on education; “Health” to publications focused on health; “Missions” to publications focused on outreach and evangelism; and “Religious Liberty” to those addressing governmental concerns over Sabbath laws. Finally, “Regional” refers to periodicals produced by local union conferences, which, like the denominational titles, cover a wide range of topics.
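Before plotting, a one-line tabulation of the metadata shows how many titles fall under each label (a quick check on the assignments, separate from the charts that follow):

# Number of periodical titles assigned to each topic label.
topic_metadata.topic.value_counts()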

by_topic = pd.merge(topic_metadata, df, on='title')
p = Bar(by_topic, 
        'year', 
        values='docs',
        agg='sum', 
        stack='topic',
        palette= viridis(6), 
        title="Pages per Topic per Year")
show(p)

Here we can see the diversification of periodical subjects over time, especially around the turn of the century.
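That diversification can also be expressed as a count of how many topic categories have at least one page in a given year. This is a minimal sketch reusing the merged by_topic dataframe:

# Number of distinct topic categories with pages in each year.
by_topic.groupby('year').topic.nunique().tail(15)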

p = Bar(by_topic, 
        'topic', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Topic per Year")
p.left[0].formatter.use_scientific = False
p.legend.location = "top_right"
show(p)

Grouping by category allows us to see that our corpus is dominated by the denominational, health, and regionally focused publications. These topics match our research concerns, increasing our confidence that we will have enough information to identify meaningful patterns around them. But, given the focus of the corpus, we should proceed cautiously before making any claims about the relative importance of those topics within the community.
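The same comparison can be read off as totals rather than from the stacked bars; a short sketch using the by_topic dataframe:

# Total pages per topic category, largest first.
by_topic.groupby('topic').docs.sum().sort_values(ascending=False)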

Now that we have a sense of the temporal and topical coverage of our corpus, we will next turn our attention to evaluating the quality of the data that we have gathered from the scanned PDF files.

You can run this code locally using the Jupyter notebook available via Github.