Ways to Compute Topics over Time, Part 2

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the second in a series of posts which constitute a “lit review” of sorts, documenting the range of methods scholars are using to compute the distribution of topics over time.

Graphs of topic prevalence over time are some of the most ubiquitous in digital humanities discussions of topic modeling. They are used as a mechanism for identifying spikes in discourse and for depicting the relationship between the various discourses in a corpus.

Topic prevalence over time is not, however, a measure that is returned with the standard modeling tools such as MALLET or Gensim. Instead, it is computed after the fact by combining the model data with external metadata and aggregating the model results. And, as it turns out, there are a number of ways that the data can be aggregated and displayed. In this series of notebooks, I am looking at 4 different strategies for computing topic significance over time. These strategies are:

  • Average of topic weights per year
  • Smoothing or regression analysis
  • Proportion of total weights per year
  • Prevalence of the top topic per year

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

# Load the necessary libraries
from ggplot import *
import json
import logging
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import pyLDAvis.gensim
import seaborn as sns
import warnings
# Enable in-notebook visualizations
%matplotlib inline
pyLDAvis.enable_notebook()
# Temporary fix for persistent warnings of an api change between pandas and seaborn.
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 10
base_dir = "data"
period = "1859-to-1875"
directory = "historical_periods"

Create the dataframes

metadata_filename = os.path.join(base_dir,'2017-05-corpus-stats/2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe( index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following the pattern Andrew Goldstone uses in his topic model browser, is to normalize the weights for each document, so that they total to “1”.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
dtm = (dtm.T/dtm.sum(axis=1)).T

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']
df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the docs were streamed to link back the data and the source file. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.
order = ['conference, committee, report, president, secretary, resolved',
         'quarterly, district, society, send, sept, business',
         'association, publishing, chart, dollar, xxii, sign',
         'mother, wife, told, went, young, school',
         'disease, physician, tobacco, patient, poison, medicine',
         'wicked, immortality, righteous, adam, flesh, hell',
        ]
def create_plotpoint(df, y_value, hue=None, order=order, col=None, wrap=None, size=6, aspect=2, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', 
                        hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, 
                       size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p

Smoothing or Regression Analysis

The second popular strategy for visualizing topics over time is to use a smoothing function to highlight trends in the data. I found this strategy was most commonly executed within the graphing library (rather than computed prior to graphing).

The prime example is using the geom_smooth option in ggplot. This is a little harder to do with Seaborn, but fortunately there is a Python port of ggplot so I can demonstrate on my model. For the sake of simplicity, and because faceting with the Python version is buggy, I will work with only one topic at a time for this experiment.

t17 = df[df['topic_id']== 17]
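As an aside, Seaborn can produce a comparable locally weighted (lowess) fit without the ggplot port, through the lowess option of regplot. This is a minimal sketch rather than the approach used below, and it assumes the statsmodels package is installed for the lowess calculation.

# Lowess-smoothed scatterplot of topic 17 drawn directly with Seaborn.
# Requires statsmodels for the lowess fit; not used further in this notebook.
sns.regplot(x='year', y='norm_topic_weight', data=t17, lowess=True,
            scatter_kws={'alpha': 0.3})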

The loess smoother is particularly suited to highlighting trends in noisy time series data, and is the default smoother in ggplot for data samples with fewer than 1000 observations. The gam smoother is the default for larger datasets.1 While I have more than 1000 observations, I will use both to illustrate the differences.

ggplot(t17, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (287572516)>
ggplot(t17, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (-9223372036567940983)>

The collection of “0” values appears to have caused the function to flatline at “0”, so let’s filter those out (cringing as we do, because now the total number of observations each year will vary). If you are using data produced by MALLET, you might not have this problem, but someone else can experiment with that.

# Drop all rows where the topic weight is 0
t17f = t17[t17['norm_topic_weight'] > 0]
ggplot(t17f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (287168092)>
ggplot(t17f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (287167152)>

While not quite as pretty as the graphs from R, these graphs help illustrate the calculations.

The first observation is that the two lines are functionally equivalent: both smoothing functions found a similar trend line.

Looking at the line, we see a gradual increase over the first 7 years of the dataset, a more rapid increase between 1866 and 1869, and then leveling off at around 25% for the rest of the period. That matches what we saw when charting averages in the last post, though it hides the variations from year to year.

Topic 17, or “disease, physician, tobacco, patient, poison, medicine”, was a topic with a large shift in our corpus, so for comparison, let us look at a more typical topic.

Topic 20: “mother, wife, told, went, young, school”

t20 = df[df['topic_id'] == 20]
ggplot(t20, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (-9223372036567088787)>

My best guess as to why the unfiltered topic 20 does not flatline at “0” where topic 17 did is that while health topics are either present or not on a page (and so have a high proportion of “0” value documents), “mother, wife, told, went, young, school” is a recurring feature of most pages in a given year.

This could indicate that we need more topics in our model. But it also suggests something interesting about the language of the denomination: this topic is consistently part of the conversation.

t20f = t20[t20['norm_topic_weight'] > 0]
ggplot(t20f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (288187824)>
ggplot(t20f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (-9223372036566588149)>

While removing the “0” values did result in moving the line up ever so slightly, the overall pattern is the same.

Rolling Averages

While smoothing uses regression to identify trend lines, another strategy for computing the overall trajectory of a topic is to visualize the average over a rolling time window (such as a five-year average). This approach emphasizes longer-term patterns in topic weight. An example of this can be seen in the work of John Laudun and Jonathan Goodwin in “Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics.”2

Computing the average over time is one calculation I struggled to work with. Back in Part 1, I mentioned that one of the big differences between the output from MALLET and Gensim is that while MALLET returns a weight for every topic in every document, Gensim filters out the weights under 1%. Depending on how you handle the data, that can drastically affect the way averages are computed. If you have a weight for all topics in all documents, then the average is over the whole corpus. If you only have weights for the significant topic assignments, things get wonky really fast.

Since we “inflated” our dataframe to include a “0” topic weight for all missing values in the Gensim output, the denominator for all of our averages is the same. That being the case, I believe we can compute the rolling mean by taking the average each year and then using the rolling_mean function in Pandas to compute the average across them.

# Group the data by year and compute the mean of the topic weights
t17grouped = t17.groupby('year')['norm_topic_weight'].mean()

# Compute the 3-year rolling mean and rename the columns for graphing.
t17rolling = pd.rolling_mean(t17grouped, 3).reset_index()
t17rolling.columns = ['year', 'rolling_mean']
create_plotpoint(t17rolling, 'rolling_mean')
<seaborn.axisgrid.FacetGrid at 0x112577c18>
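As a side note, pd.rolling_mean is deprecated in more recent pandas releases; newer versions express the same calculation through the .rolling() method. A sketch of the equivalent call, assuming the grouped series from above:

# Equivalent 3-year rolling mean using the newer pandas API.
t17rolling = t17grouped.rolling(window=3).mean().reset_index()
t17rolling.columns = ['year', 'rolling_mean']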

One thing to keep in mind is that both the smoothing functions and the rolling mean are strategies developed particularly for time series data, or data produced at regular intervals by some recording instrument. While an individual journal might be released on a semi-regular schedule, once we are working across multiple publications, this gets much more complicated. Because I don’t have “real” time series data, I am hesitant about this approach.

Both the smoothing function in ggplot and the use of a rolling mean are methods for capturing the overall trend of a topic over time. In some ways, this feels like the holy grail metric — we are computing the trajectory of a particular discourse. But it is also an even more abstracted depiction of the topic weights than we drew when computing the averages. In describing his choice of bars to represent topic prevalence in a given year, Goldstone notes “I have chosen bars to emphasize that the model does not assume a smooth evolution in topics, and neither should we; this does make it harder to see a time trend, however.”3

Both the smoothing function and the rolling mean are strategies for minimizing the dips and spikes of a particular year. If your question relates to the long-term trend of a topic (assuming your document base can support it), that is their advantage. But if your question is about discourse at a particular point or over a shorter period of time, it is a weakness.

Next up: topic prevalence, in an epidemiology sort of way.

You can download and run the code locally using the Jupyter Notebook version of this post.


  1. http://ggplot2.tidyverse.org/reference/geom_smooth.html 

  2. John Laudun and Jonathan Goodwin. “Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics.” Journal of American Folklore 126, no. 502 (2013): 455-475. https://muse.jhu.edu/ (accessed June 23, 2017). 

  3. https://agoldst.github.io/dfr-browser/ 

Ways to Compute Topics over Time, Part 1

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the first in a series of posts which constitute a “lit review” of sorts to document the range of methods scholars are using to compute the distribution of topics over time.

Graphs of topic prevalence over time are some of the most ubiquitous in digital humanities discussions of topic modeling. They are used as a mechanism for identifying spikes in discourse and for depicting the relationship between the various discourses in a corpus.

Topic prevalence over time is not, however, a measure that is returned with the standard modeling tools such as MALLET or Gensim. Instead, it is computed after the fact by combining the model data with external metadata and aggregating the model results. And, as it turns out, there are a number of ways that the data can be aggregated and displayed.

In this series of notebooks, I am looking at 4 different strategies for computing topic significance over time. These strategies are:

  • Average of topic weights per year
  • Smoothing or regression analysis
  • Proportion of total weights per year
  • Prevalence of the top topic per year

To identify the different strategies that others have used to compute topic distribution, I have gone digging into the code released with various blog posts and topic model browsers for examples of ways others have computed topics over time. What I have recreated here is my current understanding of the strategies being deployed.

I would love to hear if I have misunderstood the calculations taking place or about additional strategies or implementation ideas.

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

I created this model from periodicals produced by the Seventh-day Adventist denomination between 1859 and 1875 using Gensim. I split each periodical into individual pages, so that I can better capture the range of topics in each periodical issue and identify individual pages of interest. In the process of corpus creation, I filtered the tokens on each page, excluding both low frequency (words occurring in fewer than 20 documents) and high frequency (words occurring in more than 30% of all documents) words. In other words, don’t read anything into the fact that “God” is missing from the model. I have modeled the most distinctive words for each page.
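For reference, this kind of frequency filtering is usually expressed in Gensim through the dictionary's filter_extremes method. The sketch below mirrors the thresholds described above, but it is an illustration rather than the exact code used to build this corpus; tokenized_pages is a hypothetical stand-in for the per-page token lists.

import gensim

# Illustrative only: build a dictionary from hypothetical per-page token lists
# and drop words appearing in fewer than 20 pages or in more than 30% of pages.
dictionary = gensim.corpora.Dictionary(tokenized_pages)
dictionary.filter_extremes(no_below=20, no_above=0.3)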

For historical context, during this period early Seventh-day Adventists formally organized as a denomination in 1863. They also launched their health reform efforts with the establishment of the Western Health Reform Institute (later renamed the Battle Creek Sanitarium) and started a health focused publication, The Health Reformer, both in 1866.

# Load the necessary libraries
import gensim # Note: I am running 1.0.1
from ggplot import *
import json
import logging
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import pyLDAvis.gensim
import seaborn as sns
import warnings
# Enable in-notebook visualizations
%matplotlib inline
pyLDAvis.enable_notebook()
# Temporary fix for persistent warnings of an api change between pandas and seaborn.
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 10
base_dir = ""
period = '1859-to-1875'
directory = "historical_periods"
lda_model = gensim.models.LdaModel.load(os.path.join(base_dir, 'models', directory, '{}.model'.format(period)))
corpus = gensim.corpora.MmCorpus(os.path.join(base_dir, 'corpora', directory, "{}.mm".format(period)))
dictionary = gensim.corpora.Dictionary.load(os.path.join(base_dir, 'corpora', directory, "{}.dict".format(period)))

Visualize the model

To better understand the model that we are analyzing, we can first visualize the whole model using pyLDAvis. This provides us a picture of the overall topic weights and word distributions across all of the documents.

Note: the topic numbers here indicate in descending order the prevalence of the topic in the whole corpus. They are not the same as the topic ids used in the rest of this notebook.

pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

Create the dataframes

While pyLDAvis gives us a good overall picture of the model data, we need to further manipulate the data in order to track change over time across the different publication years.

I preprocessed the model and exported information about the weights per document and topic labels into CSV files for ease of compiling. I will release the code I used to export that information in a later notebook.

metadata_filename = os.path.join(base_dir,'2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe( index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following Andrew Goldstone’s process in his topic model browser, is to normalize the weights for each document, so that they total to “1”.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.
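If you did want to mirror that smoothing step, a minimal sketch might add the model's per-topic alpha values to the wide document/topic matrix before normalizing. This assumes lda_model.alpha holds one value per topic and that the topic_id column matches Gensim's internal topic indices; it is shown only to illustrate the idea and is not applied in this notebook.

# Sketch of alpha smoothing (not used below): add the per-topic alpha to each
# document's weights, then renormalize so each row sums to 1.
dtm_smoothed = dt.pivot(index='index_pos', columns='topic_id',
                        values='topic_weight').fillna(0)
dtm_smoothed = dtm_smoothed + lda_model.alpha
dtm_smoothed = (dtm_smoothed.T / dtm_smoothed.sum(axis=1)).T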

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
dtm = (dtm.T/dtm.sum(axis=1)).T

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']

With the document/topic dataframe normalized, we can compile our full dataframe.

df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the docs were streamed to link back the data and the source file. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.

Summary Statistics

We can get an overall view of our data by computing some basic statistics.

# Dropping the zero values so that we see the larger patterns in the data
df_sum = df[df['norm_topic_weight'] != 0]
print("Max: {}".format(df_sum['norm_topic_weight'].max()))
print("Min: {}".format(df_sum['norm_topic_weight'].min()))
print("Average: {}".format(df_sum['norm_topic_weight'].mean()))
print("Median: {}".format(df_sum['norm_topic_weight'].median()))
print("Most frequent value: {}".format(df_sum['norm_topic_weight'].round(3).value_counts().idxmax()))
Max: 1.0
Min: 0.010027314043872615
Average: 0.09222067775487706
Median: 0.05201414311615885
Most frequent value: 0.011

The topics range from 100% of the tokens in a document to 1% (excluding the zero values), with an average at 9% and a median value of 5%. The most frequent value is near 1%, which indicates that the data predominantly describes topics that have a minor presence in the documents.

One way to visualize the overall structure of the data is to look at all of the topic weights for each topic id using a scatterplot.

p = sns.factorplot(x="year", y='norm_topic_weight', hue="topic_id", col='topic_words', col_wrap=2, 
                   kind='strip', size=2.5, aspect=2, jitter=1, data=df_sum)
p.fig.subplots_adjust(top=0.95)
p.fig.suptitle("Scatterplot of Normalized Topic Weights, Split by Topic; All Weights.", fontsize=16)
<matplotlib.text.Text at 0x1047bc400>

While fun, this is too many plots to be useful, so I will move forward with a subset of the topics. However, if you decide to experiment with the notebook, you can switch back to the full dataframe.

dfs = df[(df['topic_id'] >= 15) & (df['topic_id'] <=20)]

Average of topic weights per year

The first strategy I found for aggregating topic weights is to average the topic weights per year.

The average topic weight is computed by adding all of the weights for a given topic in a time period and dividing by the total number of documents in that time period. This gives us the average weight of the topic over all documents in the corpus.

One complication I encountered while working through this is that while MALLET reports a weight (be it ever so small) for each topic in every document, Gensim by default drops topic information for a document if the weight is under 0.01. This presents a wrinkle for computing averages from Gensim-produced data, as the data needs to be expanded before we can compute the average weight of a topic over the corpus (rather than the average over its significant appearances). To do that, I added a 0 value for the missing topic values during the normalizing steps above.
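As an alternative to reinflating the dataframe, Gensim can be asked for the full set of topic weights up front. The get_document_topics method takes a minimum_probability argument; a sketch, with the caveat that Gensim still clamps extremely small probabilities:

# Sketch: request (nearly) all topic weights per document from the model,
# rather than relying on the default 0.01 cutoff.
doc_topics = [lda_model.get_document_topics(bow, minimum_probability=0)
              for bow in corpus]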

We can use the aggregation functions in Seaborn, a Python graphing library, to get a quick overview of the average topic weight by year. By default, the pointplot and bar plot functions use the mean to compute the “central tendency and confidence interval” of categorical data. For more control of the calculation, we can also compute the averages externally.

order = ['conference, committee, report, president, secretary, resolved',
         'association, publishing, chart, dollar, xxii, sign',
         'disease, physician, tobacco, patient, poison, medicine',
         'quarterly, district, society, send, sept, business',
         'wicked, immortality, righteous, adam, flesh, hell',
         'mother, wife, told, went, young, school' 
        ]
def create_pointplot(df, y_value, hue=None, order=order, col=None, wrap=None, size=5, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
def create_bar(df, y_value, hue=None, order=order, col=None, wrap=None, size=5, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='bar', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
create_pointplot(dfs, 'norm_topic_weight', hue='topic_words', order=order, size=5, aspect=1.5,
                 title="Central Range of Normalized Topic Weights. Computed with Seaborn.")
<seaborn.axisgrid.FacetGrid at 0x11a0da588>

One of the common visualizations is to show each plot separately through a facet grid. While this helps us to isolate the contours of each individual topic, it is harder to track the relationship between the topics.

create_pointplot(dfs, 'norm_topic_weight', col='topic_words', wrap=3, 
                 title="Central Range of Normalized Topic Weights Split by Topic. Computed with Seaborn.")
<seaborn.axisgrid.FacetGrid at 0x121d5e128>

We can also compute the average weight manually.

# Get number of docs per year
total_docs = df.groupby('year')['doc_id'].apply(lambda x: len(x.unique())).reset_index()
total_docs.columns = ['year', 'total_docs']

# Group by year and topic id
df_avg = df.groupby(['year', 'topic_id']).agg({'norm_topic_weight': 'sum'}).reset_index()

# Merge dataframes
df_avg = df_avg.merge(total_docs, on="year", how="left")

# Compute the mean per topic
df_avg['average_weight'] = df_avg['norm_topic_weight'] / df_avg['total_docs']

# Merge the dataframes
df_avg = df_avg.merge(labels, on='topic_id')

# Limit to sample
dfs_avg = df_avg[(df_avg['topic_id'] >=15 ) & (df_avg['topic_id']<= 20)]
create_pointplot(dfs_avg, 'average_weight', hue="topic_words",
                title="Yearly Average of Normalized Weight per Topic")
<seaborn.axisgrid.FacetGrid at 0x121d5e470>

create_pointplot(dfs_avg, 'average_weight', col="topic_words", wrap=3,
                title="Yearly Average of Normalized Weight per Topic")
<seaborn.axisgrid.FacetGrid at 0x11e391400>

We can also display the averages as a bar chart, which encourages us to think about each year as discrete.

create_bar(dfs_avg, 'average_weight', order=order, hue="topic_words",
          title="Yearly Average of Normalized Weight per Topic")
<seaborn.axisgrid.FacetGrid at 0x121bbe518>

create_bar(dfs_avg, 'average_weight', order=order, hue="topic_words", col='topic_words',wrap=3,
          title="Yearly Average of Normalized Weight per Topic")
<seaborn.axisgrid.FacetGrid at 0x122085898>

As a sanity check, we can show the importance of the 0 values by excluding those lines from the dataframe and using Seaborn to compute the average topic weight per year.

df_filtered = dfs[dfs['norm_topic_weight'] > 0]
create_pointplot(df_filtered, 'norm_topic_weight', hue='topic_words',
                title="Center Range of Topic Weights with '0' values removed. Computed with Seaborn")
<seaborn.axisgrid.FacetGrid at 0x1224bdbe0>

If we were to rely on this graph, we would likely postulate that the discourse of the denomination became overwhelmingly inflected with discussions of health starting in 1868. And we would likely fail to appreciate the ways that business and domestic concerns followed similar contours and were nearly as prevalent in the overall literature of the time.

Next I will look at using smoothing functions for charting topic changes over time.

You can download and run the code locally using the Jupyter Notebook version of this post.

Know Your Sources (Part 2)

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources and to use that corpus in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly output of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed – the impulse is to move on once the code “works” and to focus in the presentation on what that code produced, without attending to why a solution was chosen and what that solution both enables and prevents. These notebooks seek to serve that middle space between code and the final analysis - documenting the computational problem solving that I’ve done on the way to the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the world they seek to model, decisions that shape the outcomes of the computational analysis and are part of the historical argument of the work.


So far in this series I have discussed gathering a collection of texts, extracting the text, and identifying the coverage of the corpus. My next step is to assess the quality of the source base.

Because I am interested in problems of categorization and discourse, it is important that I have a mechanism to verify that the text layer provides a sufficiently accurate representation of the original print. While processes such as keyword search or computing document similarity can be performed so as to be resilient to OCR errors, an abundance of errors in the text used in topic modeling would make the topic models unreliable for identifying documents and for surfacing patterns of discourse.1 Without some external information against which to compare the text, it is impossible to know what the overall “health” of the corpus is, to identify what steps need to be taken to improve the quality of the text, and to determine whether the steps I take are effective.

Unlike the newspapers digitized by the Chronicling America project, which are encouraged by the Library of Congress guidelines to include “confidence level data” with the OCR output,2 the interface provided by the SDA offers very little information about the processes by which the periodicals were digitized and the expected quality of the OCR. As a result, the quality of the resulting text layers is obscured. A brief glance through the text files suggests that the OCR is of average quality – good enough to be legible but with enough errors to raise concerns. At the same time, the age of the periodicals and the idiosyncrasies of 19th century typography and layout suggest that these documents contain sections that were likely a challenge for the OCR engine.

There are two general methods for evaluating the accuracy of transcribed textual data: comparison of the generated text to some “ground truth” (text that is known to be accurate) or comparison of the generated text to a bank of known words.

The first method is used by Simon Tanner, Trevor Muñoz, and Pich Hemy Ros in their evaluation of the OCR quality of the British Library’s 19th Century Online Newspaper Archive. Working with a sample of 1% of the 2 million pages digitized by the British Library, the team calculated the highest rates of OCR accuracy achieved by comparing the generated XML text to a “double re-keyed” ground-truth version.3 Having a verified ground-truth document enabled the team to provide accurate results about the quality of generated texts (which were generally disappointing, particularly for proper nouns and “significant” or content words). Their approach, however, is labor and time intensive – more than can be taken on by most individual scholars with limited financial resources.

The second approach, which the Mapping Texts team used for their analysis of the OCR accuracy of the Texas newspapers in Chronicling America, is to compare the generated text to an authoritative wordlist and compute the number of words outside the approved set.4 This approach is easier to implement, as it takes much less time to compile a list of relevant words than to re-key even a 1% sample of the text. However, the results are less accurate. The method is blind to places where the OCR engine produced a word that, while in the wordlist, does not match the text on the page, and it flags spelling variations that occur on the page as OCR errors when they are not included in the word list. Because of the possibility of both false positives and false negatives, the results are an approximate representation of the accuracy of the OCR.

Weighing the disadvantages of both approaches, I chose to use the second method, comparing the generated OCR to an authoritative wordlist. This method was the most practical for me to implement within the constraints of the dissertation. In addition, a bank of subject appropriate words has utility for automated spelling correction, as is suggested in the research of Thomas Lasko and Susan Hauser on methods for OCR correction of the text from the National Library of Medicine.5 These factors shaped my decision to evaluate the quality of the extracted text by comparing it to an external word list.

NLTK ‘words’ Corpus

My first choice of wordlists was the NLTK words.words list, as this is a standard word bank used in many projects. The first stop for documentation on published datasets, including wordlists, is the included README file. The README file at the base of the NLTK words corpus contains the following sparse lines:

Wordlists

en: English, http://en.wikipedia.org/wiki/Words_(Unix) en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)

The linked Wikipedia article contains only a little more information:

words is a standard file on all Unix and Unix-like operating systems, and is simply a newline-delimited list of dictionary words. It is used, for instance, by spell-checking programs.[1]

The words file is usually stored in /usr/share/dict/words or /usr/dict/words.

On Debian and Ubuntu, the words file is provided by the wordlist package, or its provider packages wbritish, wamerican, etc. On Fedora and Arch, the words file is provided by the words package.

Fortunately, my personal computer runs on a “Unix-like” operating system, so I was able to gather the following additional information from the README in the suggested /usr/dict/words directory.

Welcome to web2 (Webster’s Second International) all 234,936 words worth. The 1934 copyright has lapsed, according to the supplier. The supplemental ‘web2a’ list contains hyphenated terms as well as assorted noun and adverbial phrases.

Here at last we have information about the source of the collection of words so frequently used for American English spell-checking programs. However, it is not clear whether or not this is the correct file, as the original link also referenced the wordlist packages in Debian and Ubuntu systems. To confirm that NLTK is using the list of words derived from Webster’s Second International Dictionary, I used set comparison to identify the overlap between the two lists.

Comparing NLTK words to web2

''' You may receive an error message noting that you need to download the words corpus from NTLK. 
Follow the instructions in the error message and then rerun the cell.
'''
from nltk.corpus import words
# Function for pulling words from a txt file
def load_from_txt(file_name):
    with open(file_name, "r") as txt:
        words = txt.read().splitlines()
    return(words)
nltk_list = [w for w in words.words()]

We can get an overview of the list by examining the number of words, and printing out the first 10.

len(set(nltk_list))
235892
nltk_list[:10]
['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']
web2_list = load_from_txt('/usr/share/dict/web2')
len(set(web2_list))
235886
web2_list[:10]
['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

At this point, it seems likely that the two lists are similar, though not quite identical. To identify the differences, I used set comparison to quickly reveal where the lists do not overlap.

'''Use set().symmetric_difference(set()) to return the list of items that are in either list,
but not in both lists. This provides a quick snapshot of the discontinuities between the two lists.
See https://docs.python.org/3.5/library/stdtypes.html#set-types-set-frozenset for more details. 
'''
set(nltk_list).symmetric_difference(set(web2_list))
{'behaviour', 'box', 'colour', 'harbour', 'humour', 'near'}

These results indicate that, with the exceptions of a few British spellings and the curious case of “box” and “near”, the two lists are functionally identical.

Evaluate Corpus Docs with NLTK ‘words’ (web2)

Armed with the knowledge that the NLTK wordlist is, in fact, the words from Webster’s Second International Dictionary, I moved next to evaluate how well the wordlist covers the vocabulary of our corpus. To do that, I read in one of the documents, did some minimal cleaning, and then found the difference between the corpus words and the NLTK wordlist.

from nltk import word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from os import path
import re
def readfile( input_dir, filename ):
    """Reads in file from directory and file name.
    Returns the content of the file.

    Usage::

        >>> text = readfile(input_dir, filename)

    Args:
        input_dir (str): Directory with the input file.
        filename (str): Name of file to be read.

    Returns:
        str: Returns the content of the file as a string.

    """
    with open(path.join(input_dir, filename)) as f:
        return f.read()


def strip_punct( text ):
    """Remove punctuation and numbers.
    Remove select punctuation marks and numbers from a text and replaces them with a space.
    
    Non-permanent changes for evaluation purposes only.
    
    Uses the :mod:`re` library.

    Args:
        text (str): Content to be evaluated.

    Returns:
        str: Returns the content string without the following characters: 0-9,.!?$:;&".
    """
    text_cleaned = re.sub(r"[0-9,.!?$:;&\"]", " ", text)
    
    return text_cleaned


def tokenize_text( text, tokenizer='whitespace' ):
    """Converts file content to a list of tokens. 

    Uses :meth:`nltk.tokenize.regexp.WhitespaceTokenizer`.

    Args:
        text (str): Content to be tokenized.
        tokenizer(str): option of tokenizer. Current options are 'whitespace' and
            'word'.

    Returns:
        list: Returns a list of the tokens in the text, separated by white space.
    """
    if tokenizer == 'whitespace' or tokenizer == 'word':
        if tokenizer == 'whitespace':
            return WhitespaceTokenizer().tokenize(text)
        elif tokenizer == 'word':
            return word_tokenize(text)
    else:
        raise ValueError('Tokenizer value {} is invalid. Must be "whitespace" or "word"'.format(tokenizer))


def to_lower( tokens ):
    """Convert all tokens to lower case.

    Args:
        tokens (list): List of tokens generated using a tokenizer.

    Returns:
        list: List of all tokens converted to lowercase.
    """
    return [w.lower() for w in tokens]


def identify_errors( tokens, dictionary ):
    """Compare words in documents to words in dictionary. 

    Args:
        tokens (list): List of all tokens in the document.
        dictionary (set): The set of approved words.
        
    Returns:
        set : Returns the set of tokens that are not 
            also dictionary words.
    """
    return set(tokens).difference(dictionary)


def create_spelling_dictionary( directory, wordlists ):
    """Compile a spelling dictionary.
    Compiles a spelling dictionary from one or multiple
    wordlist files. Returns results as a set.

    Args:
        directory (str): Location of the wordlist files.
        wordlists (list): List of filenames for the wordlist files.

    Returns:
        set: List of unique words in all the compiled lists.
    """
    spelling_dictionary = []
    for wordlist in wordlists:
        words = readfile(directory, wordlist).splitlines()
        word_list = [w.lower() for w in words]
        for each in word_list:
            spelling_dictionary.append(each)

    return set(spelling_dictionary)


def get_doc_errors( input_dir, filename, dictionary ):
    """Identify words in text that are not in a dictionary set.
    
    Args:
        input_dir (str): Directory containing text files
        filename (str): Name of file to evaluate
        dictionary (set): Set of verified words
    Returns:
        set: Returns set of words in doc that are not in the dictionary.
    """
    text = strip_punct(readfile(input_dir, filename))
    tokens = to_lower(tokenize_text(text, tokenizer='whitespace'))
    return identify_errors(tokens, dictionary)
input_dir = "data/"
filename = "HR18910601-V26-06-page34.txt"
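The create_spelling_dictionary helper is not exercised on this page, but its intended use is to merge several wordlist files into a single lowercased set. A hypothetical example (the directory and filenames here are placeholders, not files from this project):

# Hypothetical usage of create_spelling_dictionary: the directory and filenames
# below are placeholders for whatever wordlist files have been assembled.
spelling_dictionary = create_spelling_dictionary('data/wordlists',
                                                 ['scowl_custom.txt', 'sda_terms.txt'])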

First, because previewing early and often is key to evaluating the effectiveness of the code, I start by printing out the file content as well as the cleaned and tokenized version of the text. Overall, the OCR quality is good. There are a number of line-ending problems and some rogue characters, but the content and general message of the page is very legible.

readfile(input_dir, filename)
' THE LEGUMINOUS SEEDS.\nAbstract of a lesson in cookery given by Mrs. E. E. Kellogg, to the Health and Temperance Training Class.]\nTHE leguminous seeds, or legumes, are those foods better known as peas, beans, and lentils. They are usually served as vegetables, but are very different in their composition, as they contain in their mature state a large excess of the nitrogenous element, while the ordinary vegetables are quite lacking in this important element. In their immature state the legumes are more similar to vegetables. Dried peas are found in the market in two different forms, the green, or Scotch peas, and those which have been divided and the skins removed, called split peas. There are also several varieties of lentils to be found in market. Lentils are somewhat superior to peas or beans in nutritive value, but people usually dislike their taste until they have become used to them.\nThe different varieties of beans vary but little as to nutritive value. The Lima bean is more delicate in flavor, and is generally looked upon with most favor. The Chinese and Japanese make large use of the leguminous foods. They have a kind of bean known as soja, which is almost entirely composed of nitroge- nous material. Peas and beans contain caseine, which is similar in its characteristics to the caseine of milk. The Chinese take advantage of this fact, and manu- facture a kind of cheese from them.\nThe legumes are all extensively used in India, China, Japan, and other Eastern countries. Com- bined with rice, which is used as a staple article of diet in these countries, and which contains an excess of the carbonaceous element, they form an excellent food. In England, peas are largely used by persons who are in training as athletes, because they are con- sidered to be superior strength producers.,\nAlthough these seeds are in themselves such nutri- tious food, they are in this country commonly prepared in some manner which renders them indigestible. Beans, especially, are used in connec- tion with a large amount of fat, and are not, in\ngeneral, properly cooked. Peas and beans are covered with a tough skin, and require a prolonged cooking, in order to soften them. If, in cooking, this skin is left intact, and not broken afterward by careful mastication, the legume may pass through the di- gestive tract without being acted upon by the digestive fluids. Even prolonged cooking will not render the skin digestible. It is best that we prepare these seeds in such a manner that the skins may be rejected. When used with the skins a large per- centage of the nutritive material is wasted, since it is impossible for the digestive processes to free it from the skins.\nIn preparing them for cooking, it is generally best to soak them over night. The soaking aids the process of cooking, and to some extent does away with the strong flavor which to some people is very disagreeable. Soft water is best for cooking all dried seeds. If dry peas or beans are put into hard, boiling water, they will not soften, because the min- eral element of the water acts upon the caseine to harden it. As to the quantity needed, much depends upon the degree of heat used, but in general two quarts of soft water will be quite sufficient for one pint of seeds.\nMASHED PEAS.Ñ Cook one quart of dried Scotch peas until tender, simmering slowly at the last so that the water will be nearly or quite evaporated. Then rub them through a colander, or vegetable press. Add to the sifted peas one half cup of good sweet cream, and salt if desired. 
Pour into a granite or earthen dish, and put into the oven to brown. Some prefer it browned until mealy, some prefer it rather moist. This same dish can be made with split peas by simply mashing them.\nBrans may likewise be put through the same proc- ess. This is one of the most digestible forms in which they can be prepared.\n( I9o)\n'
to_lower(tokenize_text(strip_punct(readfile(input_dir, filename))))
['the',
 'leguminous',
 'seeds',
 'abstract',
 'of',
 'a',
 'lesson',
 'in',
 'cookery',
 'given',
 'by',
 'mrs',
 'e',
 'e',
 'kellogg',
 'to',
 'the',
 'health',
 'and',
 'temperance',
 'training',
 'class',
 ']',
 'the',
 'leguminous',
 'seeds',
 'or',
 'legumes',
 'are',
 'those',
 'foods',
 'better',
 'known',
 'as',
 'peas',
 'beans',
 'and',
 'lentils',
 'they',
 'are',
 'usually',
 'served',
 'as',
 'vegetables',
 'but',
 'are',
 'very',
 'different',
 'in',
 'their',
 'composition',
 'as',
 'they',
 'contain',
 'in',
 'their',
 'mature',
 'state',
 'a',
 'large',
 'excess',
 'of',
 'the',
 'nitrogenous',
 'element',
 'while',
 'the',
 'ordinary',
 'vegetables',
 'are',
 'quite',
 'lacking',
 'in',
 'this',
 'important',
 'element',
 'in',
 'their',
 'immature',
 'state',
 'the',
 'legumes',
 'are',
 'more',
 'similar',
 'to',
 'vegetables',
 'dried',
 'peas',
 'are',
 'found',
 'in',
 'the',
 'market',
 'in',
 'two',
 'different',
 'forms',
 'the',
 'green',
 'or',
 'scotch',
 'peas',
 'and',
 'those',
 'which',
 'have',
 'been',
 'divided',
 'and',
 'the',
 'skins',
 'removed',
 'called',
 'split',
 'peas',
 'there',
 'are',
 'also',
 'several',
 'varieties',
 'of',
 'lentils',
 'to',
 'be',
 'found',
 'in',
 'market',
 'lentils',
 'are',
 'somewhat',
 'superior',
 'to',
 'peas',
 'or',
 'beans',
 'in',
 'nutritive',
 'value',
 'but',
 'people',
 'usually',
 'dislike',
 'their',
 'taste',
 'until',
 'they',
 'have',
 'become',
 'used',
 'to',
 'them',
 'the',
 'different',
 'varieties',
 'of',
 'beans',
 'vary',
 'but',
 'little',
 'as',
 'to',
 'nutritive',
 'value',
 'the',
 'lima',
 'bean',
 'is',
 'more',
 'delicate',
 'in',
 'flavor',
 'and',
 'is',
 'generally',
 'looked',
 'upon',
 'with',
 'most',
 'favor',
 'the',
 'chinese',
 'and',
 'japanese',
 'make',
 'large',
 'use',
 'of',
 'the',
 'leguminous',
 'foods',
 'they',
 'have',
 'a',
 'kind',
 'of',
 'bean',
 'known',
 'as',
 'soja',
 'which',
 'is',
 'almost',
 'entirely',
 'composed',
 'of',
 'nitroge-',
 'nous',
 'material',
 'peas',
 'and',
 'beans',
 'contain',
 'caseine',
 'which',
 'is',
 'similar',
 'in',
 'its',
 'characteristics',
 'to',
 'the',
 'caseine',
 'of',
 'milk',
 'the',
 'chinese',
 'take',
 'advantage',
 'of',
 'this',
 'fact',
 'and',
 'manu-',
 'facture',
 'a',
 'kind',
 'of',
 'cheese',
 'from',
 'them',
 'the',
 'legumes',
 'are',
 'all',
 'extensively',
 'used',
 'in',
 'india',
 'china',
 'japan',
 'and',
 'other',
 'eastern',
 'countries',
 'com-',
 'bined',
 'with',
 'rice',
 'which',
 'is',
 'used',
 'as',
 'a',
 'staple',
 'article',
 'of',
 'diet',
 'in',
 'these',
 'countries',
 'and',
 'which',
 'contains',
 'an',
 'excess',
 'of',
 'the',
 'carbonaceous',
 'element',
 'they',
 'form',
 'an',
 'excellent',
 'food',
 'in',
 'england',
 'peas',
 'are',
 'largely',
 'used',
 'by',
 'persons',
 'who',
 'are',
 'in',
 'training',
 'as',
 'athletes',
 'because',
 'they',
 'are',
 'con-',
 'sidered',
 'to',
 'be',
 'superior',
 'strength',
 'producers',
 'although',
 'these',
 'seeds',
 'are',
 'in',
 'themselves',
 'such',
 'nutri-',
 'tious',
 'food',
 'they',
 'are',
 'in',
 'this',
 'country',
 'commonly',
 'prepared',
 'in',
 'some',
 'manner',
 'which',
 'renders',
 'them',
 'indigestible',
 'beans',
 'especially',
 'are',
 'used',
 'in',
 'connec-',
 'tion',
 'with',
 'a',
 'large',
 'amount',
 'of',
 'fat',
 'and',
 'are',
 'not',
 'in',
 'general',
 'properly',
 'cooked',
 'peas',
 'and',
 'beans',
 'are',
 'covered',
 'with',
 'a',
 'tough',
 'skin',
 'and',
 'require',
 'a',
 'prolonged',
 'cooking',
 'in',
 'order',
 'to',
 'soften',
 'them',
 'if',
 'in',
 'cooking',
 'this',
 'skin',
 'is',
 'left',
 'intact',
 'and',
 'not',
 'broken',
 'afterward',
 'by',
 'careful',
 'mastication',
 'the',
 'legume',
 'may',
 'pass',
 'through',
 'the',
 'di-',
 'gestive',
 'tract',
 'without',
 'being',
 'acted',
 'upon',
 'by',
 'the',
 'digestive',
 'fluids',
 'even',
 'prolonged',
 'cooking',
 'will',
 'not',
 'render',
 'the',
 'skin',
 'digestible',
 'it',
 'is',
 'best',
 'that',
 'we',
 'prepare',
 'these',
 'seeds',
 'in',
 'such',
 'a',
 'manner',
 'that',
 'the',
 'skins',
 'may',
 'be',
 'rejected',
 'when',
 'used',
 'with',
 'the',
 'skins',
 'a',
 'large',
 'per-',
 'centage',
 'of',
 'the',
 'nutritive',
 'material',
 'is',
 'wasted',
 'since',
 'it',
 'is',
 'impossible',
 'for',
 'the',
 'digestive',
 'processes',
 'to',
 'free',
 'it',
 'from',
 'the',
 'skins',
 'in',
 'preparing',
 'them',
 'for',
 'cooking',
 'it',
 'is',
 'generally',
 'best',
 'to',
 'soak',
 'them',
 'over',
 'night',
 'the',
 'soaking',
 'aids',
 'the',
 'process',
 'of',
 'cooking',
 'and',
 'to',
 'some',
 'extent',
 'does',
 'away',
 'with',
 'the',
 'strong',
 'flavor',
 'which',
 'to',
 'some',
 'people',
 'is',
 'very',
 'disagreeable',
 'soft',
 'water',
 'is',
 'best',
 'for',
 'cooking',
 'all',
 'dried',
 'seeds',
 'if',
 'dry',
 'peas',
 'or',
 'beans',
 'are',
 'put',
 'into',
 'hard',
 'boiling',
 'water',
 'they',
 'will',
 'not',
 'soften',
 'because',
 'the',
 'min-',
 'eral',
 'element',
 'of',
 'the',
 'water',
 'acts',
 'upon',
 'the',
 'caseine',
 'to',
 'harden',
 'it',
 'as',
 'to',
 'the',
 'quantity',
 'needed',
 'much',
 'depends',
 'upon',
 'the',
 'degree',
 'of',
 'heat',
 'used',
 'but',
 'in',
 'general',
 'two',
 'quarts',
 'of',
 'soft',
 'water',
 'will',
 'be',
 'quite',
 'sufficient',
 'for',
 'one',
 'pint',
 'of',
 'seeds',
 'mashed',
 'peas',
 'ñ',
 'cook',
 'one',
 'quart',
 'of',
 'dried',
 'scotch',
 'peas',
 'until',
 'tender',
 'simmering',
 'slowly',
 'at',
 'the',
 'last',
 'so',
 'that',
 'the',
 'water',
 'will',
 'be',
 'nearly',
 'or',
 'quite',
 'evaporated',
 'then',
 'rub',
 'them',
 'through',
 'a',
 'colander',
 'or',
 'vegetable',
 'press',
 'add',
 'to',
 'the',
 'sifted',
 'peas',
 'one',
 'half',
 'cup',
 'of',
 'good',
 'sweet',
 'cream',
 'and',
 'salt',
 'if',
 'desired',
 'pour',
 'into',
 'a',
 'granite',
 'or',
 'earthen',
 'dish',
 'and',
 'put',
 'into',
 'the',
 'oven',
 'to',
 'brown',
 'some',
 'prefer',
 'it',
 'browned',
 'until',
 'mealy',
 'some',
 'prefer',
 'it',
 'rather',
 'moist',
 'this',
 'same',
 'dish',
 'can',
 'be',
 'made',
 'with',
 'split',
 'peas',
 'by',
 'simply',
 'mashing',
 'them',
 'brans',
 'may',
 'likewise',
 'be',
 'put',
 'through',
 'the',
 'same',
 'proc-',
 'ess',
 'this',
 'is',
 'one',
 'of',
 'the',
 'most',
 'digestible',
 'forms',
 'in',
 'which',
 'they',
 'can',
 'be',
 'prepared',
 '(',
 'i',
 'o)']

Comparing the text to the NLTK list, however, returns a number of “real” words identified as errors.

get_doc_errors( input_dir, filename, set(to_lower(nltk_list)))
{'(',
 ']',
 'acted',
 'aids',
 'athletes',
 'beans',
 'bined',
 'brans',
 'browned',
 'called',
 'caseine',
 'characteristics',
 'com-',
 'con-',
 'connec-',
 'contains',
 'cooked',
 'countries',
 'depends',
 'di-',
 'england',
 'evaporated',
 'fluids',
 'foods',
 'forms',
 'gestive',
 'kellogg',
 'lacking',
 'legumes',
 'lentils',
 'looked',
 'manu-',
 'mashed',
 'min-',
 'needed',
 'nitroge-',
 'nutri-',
 'o)',
 'peas',
 'per-',
 'persons',
 'preparing',
 'proc-',
 'processes',
 'producers',
 'prolonged',
 'quarts',
 'rejected',
 'renders',
 'seeds',
 'served',
 'sidered',
 'simmering',
 'skins',
 'tion',
 'tious',
 'varieties',
 'vegetables',
 'ñ'}

With words such as “vegetables” and “cooked” missing from the NLTK wordlist, it is clear that this list is insufficient for computing the ratio of words to errors in the documents.

SCOWL (Spell Checker Oriented Word Lists)

With the default Unix wordlist insufficient as a base for evaluating the corpus, I looked next at the wordlist package included with Ubuntu, which is also mentioned in the Wikipedia entry for the Unix words file. This package links out to the SCOWL (Spell Checker Oriented Word Lists) project.6 Rather than use one of the provided word lists, I used the web application to create a custom list, with the following parameters, as documented in the README:

  • diacritic: keep
  • max_size: 70
  • max_variant: 1
  • special: roman-numerals
  • spelling: US GBs

These parameters provide a wordlist that includes both American and British spellings, including common variant spellings. A max_size of "70" refers to the "large" collection of words contained in most dictionaries, which is, according to the author, the largest set recommended for spell checking; the larger "80" and "95" sets contain obscure and questionable words.

scowl = load_from_txt('data/SCOWL-wl/words.txt')
len(set(scowl))
171131
scowl[:10]
['A', "A's", 'AA', "AA's", 'AAA', 'AAM', 'AB', "AB's", 'ABA', 'ABC']

When we take the difference between the two lists, we can see that the majority of words in the SCOWL list are not in the NLTK words list.

len(set(scowl).difference(set(nltk_list)))
104524

And when we run the file content through our error-checking function, we get results that are much closer to what we would expect.

get_doc_errors( input_dir, filename, set(to_lower(scowl)))
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

Filtering the SCOWL List

While it may seem valuable to have an all-encompassing list, one that contains all possible words, a list that is too expansive is actually detrimental to the task of identifying OCR errors. Unusual words or single letters are more likely to be the result of a transcription error than an intentional choice by the author or editor, yet an overly expansive wordlist would accept them as "words," under-reporting the transcription errors. In constructing the word list, I am looking to strike a delicate balance between false negatives (real words that are flagged as errors) and false positives (errors that are accepted as words).

The SCOWL list contains a few elements that increase the risk of false positive identification. First, the list contains a collection of abbreviations, most of which are 20th-century in origin and which, once converted to lowercase for comparison, are unlikely to match anything the SDA authors intended but are well positioned to match the short, garbled strings produced by OCR errors.

Second, one of the lists used in the SCOWL application includes the letter section headers as words, so any single letter, such as "e," is counted as a word. Two of the most frequent problems in the OCR text are "burst" words, where a space has been inserted between each of the letters (I N S T R U C T O R), and words that have been split across line endings (di- gestion). As a result, a number of the errors appear as individual letters and two-letter fragments. In order to capture these transcription errors, the wordlist needs to be particularly conservative with regard to single-letter and two-letter words.
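To make the stakes concrete, here is a contrived example (the tiny wordlists are purely for illustration) of how a list that accepts single letters hides burst words, while a more conservative list flags every fragment:

# Contrived illustration of the two most common OCR error patterns.
burst = 'I N S T R U C T O R'.lower().split()   # ['i', 'n', 's', 't', 'r', 'u', 'c', 't', 'o', 'r']
line_split = 'di- gestive'.split()               # ['di-', 'gestive']
fragments = burst + line_split

# If single letters count as words, the burst word disappears from the
# error report entirely; only the line-split fragments are caught.
single_letters = set('instruco')
print([t for t in fragments if t not in single_letters])

# A list restricted to longer words flags every fragment as an error.
longer_words = {'instructor', 'digestive'}
print([t for t in fragments if t not in longer_words])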

To improve my ability to capture these errors, I filtered out the acronyms and all words less than 3 characters in length from the SCOWL list. I added back common words, such as “I”, “in”, and “be,” through the subject-specific lists, which I cover below.

import re

# Use a regular expression to remove the all-uppercase abbreviations and their possessive forms
s_filtered = [x for x in scowl if not re.search(r"\b[A-Z]{1,}'s\b|\b[A-Z]{1,}\b", x)]
s_filtered[:10]
['ABCs',
 'ABMs',
 'Aachen',
 "Aachen's",
 'Aalborg',
 'Aalesund',
 'Aaliyah',
 "Aaliyah's",
 'Aalst',
 "Aalst's"]
s_filtered = [x for x in s_filtered if len(x) > 2]

To ensure that I have not removed too many words, I preview the error report once again.

get_doc_errors( input_dir, filename, set(to_lower(s_filtered)))
{'(',
 ']',
 'a',
 'an',
 'as',
 'at',
 'be',
 'bined',
 'brans',
 'by',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'i',
 'if',
 'in',
 'is',
 'it',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'of',
 'or',
 'per-',
 'proc-',
 'sidered',
 'so',
 'soja',
 'tion',
 'tious',
 'to',
 'we',
 'ñ'}

These results are what I expected. There are a number of additional one- and two-letter "errors," but otherwise the list appears unchanged.

I then saved the filtered word list to the data folder for later use.
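That save is a plain text write, one word per line (the same format as the other lists), to the filename that reappears below when the lists are combined:

with open('data/2017-05-05-base-scowl-list.txt', 'w') as f:
    f.write('\n'.join(s_filtered))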

Adding subject-specific words

The SCOWL wordlist provides a broad base of English-language words against which to compare the words identified in the SDA periodicals. We could proceed with just this list, but that strategy would give us a high number of false negatives (words that are identified as errors but are in fact words). Because my corpus is derived from a particular community with a particular set of concerns, it is possible to expand the wordlists to capture some of the specialized language of the community. I chose to store each specialized wordlist separately and to pull all the lists together into a single list of unique values within the corpus processing steps. This enables me to use the lists modularly, and to isolate a particular subset if a problem arises.
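The combining is handled by the create_spelling_dictionary helper used throughout the rest of this notebook, which takes the data directory and a list of wordlist filenames and returns the unique, lowercased words from all of them. A simplified sketch of that behavior (rendered here as returning a set):

import os

def create_spelling_dictionary_sketch(directory, wordlist_filenames):
    """Simplified sketch: merge wordlist files into one set of unique, lowercased words."""
    words = set()
    for filename in wordlist_filenames:
        with open(os.path.join(directory, filename)) as f:
            words.update(word.lower() for word in f.read().splitlines())
    return words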

King James Bible

One of the most ubiquitous influences on the language of the SDA writers was the Bible, particularly in the King James translation. From scriptural quotations to references to individuals and the use of biblical metaphors, the Bible saturated the consciousness of this nascent religious movement. To capture that language, and to reintroduce the relevant one- and two-letter words that I removed from the base SCOWL list, I (somewhat sacrilegiously) converted the text of the King James Bible from the Christian Classics Ethereal Library into a list of unique words.

The code I used to create the wordlist from the text file is included in the module appendix. Here I am previewing the results of that transformation.

kjv = readfile('data/', '2017-05-24-kjv-wordlist.txt').splitlines()
len(kjv)
14238
kjv[:10]
['spider',
 'rocks',
 'biatas',
 'apparently',
 'strip',
 'anchor',
 'dinaites',
 'entangled',
 'adida',
 'salchah']
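The transformation itself amounts, roughly, to lowercasing the text, stripping out punctuation and numbers, and keeping the unique tokens. Something along these lines, where the source filename and the cleaning regex are illustrative rather than the exact code from the appendix:

import re

raw = readfile('data/', 'kjv-full-text.txt')    # illustrative filename for the CCEL text
tokens = re.findall(r"[a-z]+", raw.lower())     # keep letter sequences only
kjv_words = sorted(set(tokens))

With the KJV list added to the base SCOWL list, I recompute the compiled dictionary and recheck the errors.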
word_lists = ['2017-05-05-base-scowl-list.txt','2017-05-24-kjv-wordlist.txt']
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors(input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

Seventh-day Adventist People and Places

A second set of subject-specific language comes from within the SDA denomination itself. Using the Yearbooks produced by the denomination starting in 1883, I have compiled lists of the people serving in leadership roles within the denomination, from Sabbath School teachers to the President of the General Conference. Browsing through the reported errors, I found that many are personal and place names. Since I had the data available from earlier work, I compiled the SDA last names and place names into wordlists.

The code for creating wordlists from my datafiles, which are in Google Docs, is documented in the module appendix.

Note: There are duplicates in the list of people because I forgot to limit the list to the unique names before saving it to a file. While unsightly, the only consequence is a little extra computation when I build the combined spelling dictionary.

sda_people = readfile('data/', '2016-12-07-SDA-last-names.txt').splitlines()
sda_people[:10]
['Butler',
 'Haskell',
 'Henry',
 'Kellogg',
 'Kellogg',
 'Oyen',
 'Sisley',
 'Fargo',
 'Hall',
 'Hall']
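Deduplicating before saving would have been a one-liner, for example:

sda_people_unique = sorted(set(sda_people))

but since create_spelling_dictionary reduces everything to unique values anyway, I have left the file as it is.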
sda_places = readfile('data/', '2016-12-07-SDA-place-names.txt').splitlines()
sda_places[:10]
['Gray',
 'Oroville',
 'Shreveport',
 'Sāo',
 'Alto',
 'Central',
 'Castlereagh',
 'Warrenton',
 'Alexandria',
 'Cannelton']
word_lists = ['2017-05-05-base-scowl-list.txt',
              '2017-05-24-kjv-wordlist.txt',
              '2016-12-07-SDA-last-names.txt',
              '2016-12-07-SDA-place-names.txt'
             ]
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors( input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}

USGS Locations and Roman Numerals

The final two external datasets I used to minimize false identification of errors were a list of U.S. city and town names from the USGS and a list of Roman numerals, since a number of Roman numerals were removed from the base list when I filtered out words under three characters long. The code for converting the shapefile from the USGS into a list of place names is included in the module appendix.
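The general shape of that conversion, assuming geopandas for reading the shapefile, with the path and attribute column named here purely for illustration, is:

import geopandas as gpd

gdf = gpd.read_file('data/USGS-places/us-places.shp')   # illustrative path to the USGS shapefile
us_place_words = set()
for name in gdf['NAME'].dropna():                        # 'NAME' is an assumed column name
    us_place_words.update(name.lower().split())          # multi-word names become separate words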

us_places = readfile('data/', '2017-01-03-US-place-names.txt').splitlines()
us_places[:10]
['attu',
 'point',
 'hope',
 'point',
 'lay',
 'diomede',
 'gambell',
 'tin',
 'city',
 'savoonga']
word_lists = ['2017-05-05-base-scowl-list.txt',
              '2017-05-24-kjv-wordlist.txt',
              '2016-12-07-SDA-last-names.txt',
              '2016-12-07-SDA-place-names.txt',
              '2017-01-03-US-place-names.txt',
              '2017-02-14-roman-numerals.txt'
             ]
compiled_dict = create_spelling_dictionary('data', word_lists)
get_doc_errors(input_dir, filename, compiled_dict)
{'(',
 ']',
 'bined',
 'brans',
 'caseine',
 'centage',
 'com-',
 'con-',
 'connec-',
 'di-',
 'e',
 'eral',
 'ess',
 'gestive',
 'manu-',
 'min-',
 'nitroge-',
 'nutri-',
 'o)',
 'per-',
 'proc-',
 'sidered',
 'soja',
 'tion',
 'tious',
 'ñ'}
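With a compiled dictionary in hand, any ratio of words to errors for a document comes down to two counts, the total tokens and the flagged tokens. A minimal sketch that reports the error fraction (not the reporting code used against the full corpus):

import os

def doc_error_rate(input_dir, doc_name, wordset):
    """Sketch: fraction of a document's tokens that are not in the wordlist."""
    with open(os.path.join(input_dir, doc_name)) as f:
        tokens = f.read().lower().split()
    flagged = [t for t in tokens if t not in wordset]
    return len(flagged) / len(tokens)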

Next Steps

With these lists in hand, I went through the different titles in the corpus and identified frequently used words that were escaping the above wordlists. As I will outline in the next notebook, many of the words that had not been captured by these wordlists were related to religious persons and theological ideas, were alternative spellings created either intentionally or through editorial error, or could be attributed to John H. Kellogg and the health reform writings of the denomination.

Final wordlists

The final wordlists are included in the data folder for this module. They are:

  • 2017-05-05-base-scowl-list.txt (the filtered SCOWL base list)
  • 2017-05-24-kjv-wordlist.txt (King James Bible)
  • 2016-12-07-SDA-last-names.txt (SDA last names)
  • 2016-12-07-SDA-place-names.txt (SDA place names)
  • 2017-01-03-US-place-names.txt (USGS city and town names)
  • 2017-02-14-roman-numerals.txt (Roman numerals)

You can run this code locally using the Jupyter notebook available via Github.

  1. Recent studies on text reuse have effectively used ngram comparison to find reuse despite differences attributable to errors in character recognition or minor editorial rephrasing. (See Mullen (2016), http://americaspublicbible.org/methods.html and Smith, Cordell, Dillon (2013), “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,” 2013 IEEE International Conference on Big Data. doi: http://dx.doi.org/10.1109/BigData.2013.6691675) However, research in the field of information retrieval has found that algorithms that factor in the relative prevalence of words in a corpus are negatively affected by high OCR error rates. (See Taghva, Kazem, Thomas Nartker, and Julie Borsack. “Information Access in the Presence of OCR Errors.” Proceedings of the 1st ACM Workshop on Hardcopy Document Processing (2004): 1–8. doi: http://dx.doi.org/10.1145/1031442.1031443) 

  2. “The National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants.” Library of Congress (2016): p. 11. http://www.loc.gov/ndnp/guidelines/NDNP_201517TechNotes.pdf 

  3. Tanner, Simon, Trevor Muñoz, and Pich Hemy Ros. “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned From Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive.” D-Lib Magazine 15, no. 7/8 (2009): §5. doi: http://dx.doi.org/10.1045/july2009-munoz 

  4. Torget, Andrew J., Mihalcea, Rada, Christensen, Jon, and McGhee, Geoff. “Mapping Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers.” Mapping Texts (2011): Accessed March 29, 2017. http://mappingtexts.org/whitepaper/MappingTexts_WhitePaper.pdf

  5. Lasko, Thomas A., and Susan E. Hauser. “Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction.” Proceedings of SPIE 4307 (2001): 232–40. doi: http://dx.doi.org/10.1117/12.410841 

  6. Atkinson, Kevin. “Spell Checking Oriented Word Lists (SCOWL).” (2016): Accessed Nov 18, 2016. http://wordlist.aspell.net/scowl-readme/