Know Your Sources (Part 1)

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description; the Jupyter notebooks are available on Github.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly argument of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed, why solutions were chosen, and what those solutions enable and prevent. In these notebooks I seek to engage that middle space between code and the final analysis - documenting the computational problem solving that I’ve done as part of the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the worlds they seek to model, distinctions that shape the outcomes of the computational analysis and are part of the historical argument of the work.

Now that we have a collection of texts selected and downloaded, and have extracted the text, we need to spend some time identifying what the corpus contains, both in terms of coverage and quality. As I describe in the project overview, I will be using these texts to make arguments about the development of the community’s discourses around health and salvation. While the corpus makes that analysis possible, it also sets the limits of what we can claim from text analysis alone. Without an understanding of what those limits are, we run the risk of claiming more than the sources can sustain, and in doing so, minimizing the very complexities that historical research seeks to reveal.

"""My usual practice for gathering the filenames is to read 
them in from a directory. So that this code can be run locally 
without the full corpus downloaded, I exported the list 
of filenames to an index file for use in this notebook.
with open("data/2017-05-05-corpus-index.txt", "r") as f:
    corpus =

To create an overview of the corpus, I will use the document filenames along with some descriptive metadata that I created.

Filenames are an often underestimated feature of digital files, but one that can be used to great effect. For my corpus, the team that digitized the periodicals did an excellent job of providing the files with descriptive names. Overall, the files conform to the following pattern:


I discovered a few files that deviated from the pattern, but renamed those so that the pattern held throughout the corpus. When splitting the PDF documents into pages, I preserved the structure, adding -page0.txt to the end.

The advantage of this format is that the filenames contain the metadata I need to place each file within its context. By isolating the different sections of the filename, I can quickly place any file with reference to the periodical title and the publication date.
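To make the parsing concrete, here is a quick sketch of pulling a filename apart with standard string and regex operations. The filename below is a hypothetical example constructed to follow the title-abbreviation-plus-date shape described above; it is not a real file from the corpus.

```python
import re

# Hypothetical filename: title abbreviation, then date digits,
# then volume/issue parts, then the page suffix.
doc_id = "ADV18990104-V01-01-page3.txt"

first_part = doc_id.split('-')[0]          # "ADV18990104"
title = re.match("[A-Za-z]+", first_part)  # leading letters -> title abbreviation
dates = re.findall("[0-9]+", first_part)   # digit runs -> date information

print(title.group(0))   # ADV
print(dates[0][:4])     # 1899
```

The same two regex calls, `re.match` for the leading letters and `re.findall` for the digit runs, are what the extraction function below relies on.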

import pandas as pd
import re

def extract_pub_info(doc_list):
    """Use regex to extract metadata from filename.

    Assumes that the filename is formatted as::

    Args:
        doc_list (list): List of the filenames in the corpus.

    Returns:
        dict: Dictionary with the year and title abbreviation for each filename.
    """
    corpus_info = {}
    for doc_id in doc_list:
        # Split the ID into three parts on the '-'
        split_doc_id = doc_id.split('-')
        # Get the title abbreviation by matching the first set of letters
        # in the first part of the filename.
        title = re.match("[A-Za-z]+", split_doc_id[0])
        # Get the dates by grabbing all of the number elements
        # in the first part of the filename.
        dates = re.findall('[0-9]+', split_doc_id[0])
        # The first four numbers are the publication year.
        year = dates[0][:4]
        # Update the dictionary with the title and year
        # for the filename.
        corpus_info[doc_id] = {'title': title.group(0), 'year': year}
    return corpus_info

corpus_info = extract_pub_info(corpus)

One of the most useful libraries in Python for working with data is Pandas. With Pandas, Python users gain much of the functionality that our colleagues who work with R have long celebrated as the benefits of that domain-specific language.

By transforming our corpus_info dictionary into a dataframe, we can quickly filter and tabulate a number of different statistics on our corpus.

df = pd.DataFrame.from_dict(corpus_info, orient='index')
df.index.name = 'docs'
df = df.reset_index()

You can preview the initial dataframe by uncommenting the cell below.

# df
df = df.groupby(["title", "year"], as_index=False).docs.count()
title year docs
0 ADV 1898 26
1 ADV 1899 674
2 ADV 1900 463
3 ADV 1901 389
4 ADV 1902 440
5 ADV 1903 428
6 ADV 1904 202
7 ADV 1905 20
8 ARAI 1909 64
9 ARAI 1919 32
10 AmSn 1886 96
11 AmSn 1887 96
12 AmSn 1888 105
13 AmSn 1889 386
14 AmSn 1890 403
15 AmSn 1891 398
16 AmSn 1892 401
17 AmSn 1893 402
18 AmSn 1894 402
19 AmSn 1895 400
20 AmSn 1896 408
21 AmSn 1897 800
22 AmSn 1898 804
23 AmSn 1899 801
24 AmSn 1900 800
25 CE 1909 104
26 CE 1910 312
27 CE 1911 312
28 CE 1912 306
29 CE 1913 420
... ... ... ...
465 YI 1885 192
466 YI 1886 220
467 YI 1887 244
468 YI 1888 104
469 YI 1889 208
470 YI 1890 208
471 YI 1895 416
472 YI 1898 28
473 YI 1899 408
474 YI 1900 408
475 YI 1901 408
476 YI 1902 408
477 YI 1903 408
478 YI 1904 288
479 YI 1905 408
480 YI 1906 408
481 YI 1907 432
482 YI 1908 832
483 YI 1909 844
484 YI 1910 850
485 YI 1911 852
486 YI 1912 868
487 YI 1913 848
488 YI 1914 852
489 YI 1915 852
490 YI 1916 852
491 YI 1917 840
492 YI 1918 850
493 YI 1919 856
494 YI 1920 45

495 rows × 3 columns

Nearly 500 rows of data is too large to have a good sense of the coverage of the corpus from reading the data table, so it is necessary to create some visualizations of the records. For a quick prototyping tool, I am using the Bokeh library.

from bokeh.charts import Bar, show
from bokeh.charts import defaults
from bokeh.io import output_notebook
from bokeh.palettes import viridis
defaults.width = 900
defaults.height = 950

In this first graph, I am showing the total number of pages per title, per year in the corpus.

p = Bar(df,
        label='year',
        values='docs',
        stack='title',
        palette=viridis(30),
        title="Pages per Title per Year")
show(p)

This graph of the corpus reflects the historical development of the denomination's publishing efforts. Starting with a single publication in 1849, those efforts expand in the 1860s as the denomination launches its health reform work, expand again in the 1880s as it starts a publishing house in California and addresses concerns about Sunday observance laws, and again at the turn of the century as the denomination reorganizes and regional publications expand. The chart also reveals some holes in the corpus. The Youth's Instructor (shown here in yellow) is one of the oldest continuous denominational publications, but the pages available for the years from 1850 to 1899 are inconsistent.

In interpreting the results of mining these texts, it will be important to factor in the relative difference in size and diversity of publication venues between the early years of the denomination and the later years of this study.

by_title = df.groupby(["title"], as_index=False).docs.sum()
p = Bar(by_title,
        label='title',
        values='docs',
        title="Total Pages by Title")
show(p)
Another way to view the coverage of the corpus is by total pages per periodical title. The Advent Review and Sabbath Herald dominates the corpus in number of pages, with The Health Reformer, Signs of the Times, and the Youth's Instructor making up the next major percentage of the corpus. In terms of scale, these publications will have (and had) a prominent role in shaping the discourse of the SDA community. At the same time, it will be informative to look to the smaller publications to see if we can surface alternative and dissonant ideas.

topic_metadata = pd.read_csv('data/2017-05-05-periodical-topics.csv')
periodicalTitle title startYear endYear initialPubLocation topic
0 Training School Advocate ADV 1898 1905 Battle Creek, MI Education
1 American Sentinel AmSn 1886 1900 Oakland, CA Religious Liberty
2 Advent Review and Sabbath Herald ARAI 1909 1919 Washington, D.C. Denominational
3 Christian Education CE 1909 1920 Washington, D.C. Education
4 Welcome Visitor (Columbia Union Visitor) CUV 1901 1920 Academia, OH Regional
5 Christian Educator EDU 1897 1899 Battle Creek, MI Education
6 General Conference Bulletin GCB 1863 1918 Battle Creek, MI Denominational
7 Gospel Herald GH 1898 1920 Yazoo City, MS Regional
8 Gospel of Health GOH 1897 1899 Battle Creek, MI Health
9 Gospel Sickle GS 1886 1888 Battle Creek, MI Missions
10 Home Missionary HM 1889 1897 Battle Creek, MI Missions
11 Health Reformer HR 1866 1907 Battle Creek, MI Health
12 Indiana Reporter IR 1901 1910 Indianapolis, IN Regional
13 Life Boat LB 1898 1920 Chicago, IL Missions
14 Life and Health LH 1904 1920 Washington, D.C. Health
15 Liberty LibM 1906 1920 Washington, D.C. Religious Liberty
16 Lake Union Herald LUH 1908 1920 Berrien Springs, MI Regional
17 North Michigan News Sheet NMN 1907 1910 Petoskey, MI Regional
18 Pacific Health Journal and Temperance Advocate PHJ 1885 1904 Oakland, CA Health
19 Present Truth (Advent Review) PTAR 1849 1850 Middletown, CT Denominational
20 Pacific Union Recorder PUR 1901 1920 Oakland, CA Regional
21 Review and Herald RH 1850 1920 Paris, ME Denominational
22 Sligonian Sligo 1916 1920 Washington, D.C. Regional
23 Sentinel of Liberty SOL 1900 1904 Chicago, IL Religious Liberty
24 Signs of the Times ST 1874 1920 Oakland, CA Denominational
25 Report of Progress, Southern Union Conference SUW 1907 1920 Nashville, TN Regional
26 Church Officer's Gazette TCOG 1914 1920 Washington, D.C. Denominational
27 The Missionary Magazine TMM 1898 1902 Philadelphia, PA Missions
28 West Michigan Herald WMH 1903 1908 Grand Rapids, MI Regional
29 Youth's Instructor YI 1852 1920 Rochester, NY Denominational

We can generate another view by adding some external metadata for the titles. The “topics” listed here are ones I assigned when skimming the different titles. “Denominational” refers to centrally produced publications covering a wide array of topics. “Education” refers to periodicals focused on education and “Health” to publications focused on health. “Missions” titles focus on outreach and evangelism, and “Religious Liberty” titles on governmental concerns over Sabbath laws. Finally, “Regional” refers to periodicals produced by local union conferences, which, like the denominational titles, cover a wide range of topics.

by_topic = pd.merge(topic_metadata, df, on='title')
p = Bar(by_topic,
        label='year',
        values='docs',
        stack='topic',
        palette=viridis(6),
        title="Pages per Topic per Year")
show(p)

Here we can see the diversification of periodical subjects over time, especially around the turn of the century.

p = Bar(by_topic,
        label='topic',
        values='docs',
        stack='title',
        palette=viridis(30),
        title="Pages per Topic per Year")
p.left[0].formatter.use_scientific = False
p.legend.location = "top_right"
show(p)

Grouping by category allows us to see that our corpus is dominated by the denominational, health, and regionally focused publications. These topics match with our research concerns, increasing our confidence that we will have enough information to determine meaningful patterns about those topics. But, due to the focus of the corpus, we should proceed cautiously before making any claims about the relative importance of those topics within the community.
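As a quick sketch of the kind of tabulation involved, topic shares can be computed directly with pandas. The numbers below are stand-ins, not the corpus counts; only the column names (`title`, `topic`, `docs`) match those used above.

```python
import pandas as pd

# Small stand-in for the merged by_topic dataframe used above.
by_topic = pd.DataFrame({
    'title': ['RH', 'RH', 'LH', 'PUR'],
    'topic': ['Denominational', 'Denominational', 'Health', 'Regional'],
    'docs':  [800, 900, 400, 300],
})

# Total pages per topic, and each topic's share of the corpus as a percentage.
topic_totals = by_topic.groupby('topic').docs.sum()
topic_share = (topic_totals / topic_totals.sum() * 100).round(1)
print(topic_share)
```

Shares computed this way describe the composition of the corpus, not the relative importance of the topics within the community, which is precisely the distinction urged above.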

Now that we have a sense of the temporal and topical coverage of our corpus, we will next turn our attention to evaluating the quality of the data that we have gathered from the scanned PDF files.

You can run this code locally using the Jupyter notebook available via Github.

Extracting Text from PDFs

This is part of a series of first drafts of the technical essays documenting the computational work that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description; the Jupyter notebooks are available on Github.

My goals in sharing the notebooks and technical essays are two-fold. First, I hope that they might prove useful to others interested in taking on similar projects. Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.”

With the PDF files downloaded, my next challenge was to extract the text. Here my choice of source base offered some advantages and some additional challenges. It is not uncommon when downloading books scanned to PDFs from providers such as Google to discover that they have only made the page images available. As many people want textual data, and preferably good textual data, for a variety of potentially lucrative computational tasks, it makes sense for companies to withhold the text layer. But for the researcher, this necessitates adding a text recognition step to the gathering process, running the pages through OCR software to generate the needed text layer. One advantage of this is that you then have control over the OCR software, but it significantly increases the time and complexity of the text gathering process.

The PDF files produced by the Office of Archives and Statistics include the OCR-generated text layer. But unlike the newspapers scanned as part of the Chronicling America project, these files embed very little information about the source and estimated quality of that OCR. That lack of information sets up the challenge for the next section of this module, which documents my work to assess and clean the corpus, previewed in an earlier blog post.

In extracting the text, I also had to determine my unit of analysis for text mining – the article, the page, or the issue. I quickly dismissed using the “issue” because it is too large and too irregular a unit. With issues ranging in length from 8 pages to 100 pages, and including a variety of elements from long essays to letters to the editor and field reports, I would only be able to surface summary patterns using the issue as a whole. Since I am interested in identifying shifts in discourse over time, a more fine-grained unit was necessary. For this, the “article” seemed like a very useful unit, enabling each distinct piece to be examined on its own. But the boundaries of “articles” in a newspaper-type publication are actually rather hard to define, and the length of the candidate sections ranges from multiple-page essays to one-paragraph letters or poems. In addition, the publications contain a number of article “edge cases”, such as advertisements, notices of letters received, and subscription information, which would either need to be identified and separated into their own articles or identified and excluded.

In the end, I chose the middle-ground solution of using the page as the document unit. While not all pages are created equal (early issues of the Review and Herald made great use of space and small font size to squeeze about 3000 words on a page), on average the pages contain about 1000 words, placing them in line with the units Matthew Jockers has found to be most useful when modeling novels. Splitting on the page is also computationally and analytically simple, which is valuable when working at the scale of this project.

In addition, using the page as the unit of analysis is more reflective of the print reading experience. Rather than interacting with each article in isolation (as is modeled in many database editions of historical newspapers), the newspaper readers would experience an article within the context of the other stories on the page. This juxtaposition of items creates what Marshall McLuhan refers to as the “human interest” element of news print, constructed through the “mosaic” of the page layout.1 Using the page as the unit of analysis enables me to interact with the articles as well as the community that the collection of articles creates.
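The per-page word counts mentioned above are straightforward to estimate once the pages are plain text. A minimal sketch, where `pages` stands in for a list of page-level strings from the corpus (a whitespace split is a rough but serviceable word count):

```python
# Rough word counts per page; `pages` stands in for the extracted page texts.
pages = [
    "The words of one page of extracted text",
    "Another page with its own run of words",
]

# Splitting on whitespace approximates a word count for each page.
word_counts = [len(page.split()) for page in pages]
average = sum(word_counts) / len(word_counts)
print(word_counts, average)
```

Run over the full corpus, a summary of this distribution is what lets one compare page lengths against the roughly 1000-word chunks discussed above.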

Having determined the unit of analysis, the technical challenge was how to split the PDF documents and extract the text. It is worth noting that not all of the methods I used to prepare my corpus are ones that I would recommend. For reasons I don’t entirely recall, but related to struggling to conceptualize how to write a function that would separate the PDFs and extract the text, I chose Automator, a default Mac utility, to separate the pages and extract the text from the PDF files. While this reduced the programming load, it was memory and storage intensive — through the process I managed to destroy a hard-drive by failing to realize that 100-page PDFs take up even more space when split into 100 1-page PDFs. Another downside of using Automator was that I did not have control over the encoding of the generated text files. Upon attempting to run the files through Python scripts, a plethora of encoding error messages quickly suggested that not all was well with my text corpus. I used the following bash script to check and report the file encodings:

# FILES should expand to the corpus text files, e.g. FILES=path/to/corpus/*.txt
for f in $FILES
do
  encoding=`file -I $f | cut -f 2 -d";" | cut -f 2 -d=`
  if ! [ $encoding = "utf-8" -o $encoding = "us-ascii" ]; then
    echo "Check $f: encoding reported as $encoding"
  fi
done
The report revealed non-utf-8 encodings from latin-1 to binary on a majority of files. My attempts to use iconv to convert the files to utf-8 raised their own collection of errors. Thanks to a suggestion from a very wise friend, I bypassed these problems by using vim within a bash script to open each file and re-encode it in utf-8.

for f in $FILES
do
  echo "$f"
  vim -es '+set fileencoding=utf-8' '+wq' $f
  encoding=`file -bi $f | cut -f 2 -d";" | cut -f 2 -d=`
  echo "$encoding"
done

Although perhaps not an elegant solution, this process worked sufficiently to produce a directory of 197,943 text files that could be read by my Python scripts without trouble.
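The same verification can be made from within Python, without shelling out to `file`: attempt to decode each file as UTF-8 and collect the names of any that fail. This is a sketch rather than the script I used; the directory path is whatever holds your text files.

```python
import os

def find_non_utf8(directory):
    """Return the names of files in `directory` that fail to decode as UTF-8."""
    bad_files = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            # Reading with an explicit encoding raises UnicodeDecodeError
            # on any file that is not valid UTF-8.
            with open(path, encoding='utf-8') as f:
                f.read()
        except UnicodeDecodeError:
            bad_files.append(name)
    return bad_files
```

An empty return value is the confirmation that the whole directory can be read by downstream Python scripts without encoding errors.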

Below I outline a better way, which I use on later additions to the corpus, to extract the text from a PDF document and save each page to its own file using PyPDF2. Using a PDF library has a number of advantages. First, it is much less resource intensive, with no intermediary documents created and, with the use of a generator expression, no need to load the entire list of filenames into memory. Second, by placing the extraction within a more typical Python workflow, I can set the encoding when writing the extracted text to a file. This removes the complication I encountered with the Automator-generated text of relying on a program to assign the “most likely” encoding.

import os
from os.path import isfile, join
import PyPDF2

"""If running locally, set these variables to your local directories."""
pdf_dir = "../../corpus/incoming"
txt_dir = "../../corpus/txt_dir"

"""Note: Uses a generator expression.
Rerun the cell if you restart the loop below.
"""
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

"""The documentation for PyPDF2 is minimal.
For this pattern, I followed the syntax in examples found online.
"""
for filename in corpus:
    # Open the PDF and load as PyPDF2 Reader object.
    pdf = open(join(pdf_dir, filename), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdf)
    # Loop through the pages, extract the text, and write each page to an individual file.
    for page in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        text = pageObj.extractText()
        # Compile the page name. Add one because Python counts from 0.
        page_name = "{}-page{}.txt".format(filename[:-4], page + 1)

        # Write each page to its own file.
        with open(join(txt_dir, page_name), mode="w", encoding='utf-8') as o:
            o.write(text)
You can run this code locally using the Jupyter Notebook. Setup instructions are available in the project README.

  1. Marshall McLuhan, Understanding Media: The Extensions of Man. MIT Press Edition. Cambridge, MA: The MIT Press, 1994. p. 204. 

Downloading Corpus Files

This is part of a series of first drafts of the technical essays documenting the computational work that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description; the Jupyter notebooks are available on Github.

My goals in sharing the notebooks and technical essays are two-fold. First, I hope that they might prove useful to others interested in taking on similar projects. Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.”

The source base for A Gospel of Health and Salvation is the collection of scanned periodicals produced by the Office of Archives, Statistics, and Research of the Seventh-day Adventist Church (SDA). That this collection of documents is openly available on the web has been fundamental to the success of this project. One of the greatest challenges for historical scholarship that seeks to leverage large digital collections is access to the relevant materials. While projects such as Chronicling America and resources such as the Digital Public Library of America are indispensable, many specialized resources are available only through proprietary databases and library subscriptions that impose limits on the ways scholars can interact with their resources.1

The publishing of the digital periodicals on the open web by the SDA made it unnecessary for me to navigate through the firewalls (and legal land-mines) of using text from major library databases, a major boon for the digital project.2 And, although the site does not provide an API for accessing the documents, the structure of the pages is regular, making the site a good candidate for web scraping. However, relying on an organization to provide its own historical documents raises its own challenges. Due to the interests of the hosting organization, in this case the Seventh-day Adventist Church, the collection is shaped by and shapes a particular narrative of the denomination’s history and development. For example, issues of Good Health, which was published by John Harvey Kellogg, are (almost entirely) dropped from the SDA’s collection after 1907, which corresponds to the point when Kellogg was disfellowshipped from the denomination, even though Kellogg continued its publication into the 1940s.3 Such interests do not invalidate the usefulness of the collection, as all archives have limitations and goals, but those interests need to be acknowledged and taken into account in the analysis.

To determine the list of titles that applied to my time and regions of study, I browsed through all of the titles in the periodicals section of the site and compiled a list of titles that fit my geographic and temporal constraints. These are:

As this was my first technical task for the dissertation, my initial methods for identifying the URLs for the documents I wanted to download was rather manual. I saved an .html file for each index page that contained documents I wanted to download. I then passed those .html files to a script (similar to that recorded here) that used BeautifulSoup to extract the PDF ids, reconstruct the URLs, and write the URLs to a new text file, scrapeList.txt. After manually deleting the URLs to any documents that were out of range, I then passed the scrapeList.txt file to wget using the following syntax:4

wget -i scrapeList.txt -w 2 --limit-rate=200k

I ran this process for each of the periodical titles included in this study. It took approximately a week to download all 13,000 files to my local machine. The resulting corpus takes up 27.19 GB of space.

This notebook reflects a more automated version of that process, created in 2017 to download missing documents. The example recorded here is for downloading the Sabbath School Quarterly collection, which I missed during my initial collection phase.

In these scripts I use the requests library to retrieve the HTML from the document directory pages and BeautifulSoup4 to locate the filenames. I use wget to download the files.

from bs4 import BeautifulSoup
from os.path import join
import re
import requests
import wget
def check_year(pdfID):
    """Use regex to check the year from the PDF filename.

    Args:
        pdfID (str): The filename of the PDF object, formatted as
            described above.
    """
    split_title = pdfID.split('-')
    title_date = split_title[0]
    date = re.findall(r'[0-9]+', title_date)
    year = date[0][:4]
    if int(year) < 1921:
        return True
    else:
        return False

def filename_from_html(content):
    """Use Beautiful Soup to extract the PDF ids from the HTML page.

    This script is customized to the structure of the archive pages
    on the denomination's document site.

    Args:
        content (str): Content is retrieved from a URL using the
            `get_html_page` function.
    """
    soup = BeautifulSoup(content, "lxml")
    buttons = soup.find_all('td', class_="ms-vb-title")

    pdfIDArray = []

    for each in buttons:
        links = each.find('a')
        pdfID = links.get_text()
        pdfIDArray.append(pdfID)

    return pdfIDArray

def get_html_page(url):
    """Use the requests library to get HTML content from a URL.

    Args:
        url (str): URL of webpage with content to download.
    """
    r = requests.get(url)

    return r.text

The first step is to set the directory where I want to save the downloaded documents, as well as the root URL for the location of the PDF documents.

This example is set up for the Sabbath School Quarterly.

"""If running locally, you will need to create the `corpus` folder or 
update the path to the location of your choice.
download_directory = "../../corpus/"
baseurl = ""

My next step is to generate a list of the IDs for the documents I want to download.

Here I download the HTML from the index page URLs and extract the document IDs. To avoid downloading any files outside of my study, I check the year in the ID before adding the document ID to my list of documents to download.

index_page_urls = ["{44c9b385-7638-47af-ba03-cddf16ec3a94}&SortField=DateTag&SortDir=Asc",
                   ]

docs_to_download = []

for url in index_page_urls:
    content = get_html_page(url)
    pdfs = filename_from_html(content)
    for pdf in pdfs:
        if check_year(pdf):
            print("Adding {} to download list".format(pdf))
            docs_to_download.append(pdf)

Finally, I loop through all of the filenames, create the URL to the PDF, and use wget to download a copy of the document into my directory for processing.

for doc_name in docs_to_download:
    url = join(baseurl, "{}.pdf".format(doc_name))
    print(url)
    wget.download(url, download_directory)

You can run this code locally using the Jupyter notebook. Setup instructions are available in the project README.

  1. The library at Northeastern University provides a useful overview of the different elements one must consider when seeking to mine content in subscription databases. Increasingly, library vendors are seeking to build “research environments” that provide “workflow and analytical technology that breathes new life into the study of the social sciences and humanities.” It is my opinion that granting vendors control over both the documentary materials and the methods of analysis, however, would be a concerning development for the long-term health of scholarship in the humanities. 

  2. In recognition of the labor involved in creating the digital assets and to drive traffic back to their site, I am not redistributing any of the PDF files in my dissertation. Copies can be retrieved directly from the denomination’s websites. 

  3. The University of Michigan is the archival home of the Good Health magazine and many have been digitized by Google. The Office of Archives and Research does include a few issues from 1937 and one from 1942, which offers evidence of Kellogg’s ongoing editorial work on the title. 

  4. In this syntax I was guided by Ian Milligan’s wget lesson. Ian Milligan, “Automated Downloading with Wget,” Programming Historian, (2012-06-27),