Extracting Text from PDFs

This is part of a series of first drafts of the technical essays documenting the technical work that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.

My goals in sharing the notebooks and technical essays are two-fold. First, I hope that they might prove useful to others interested in taking on similar projects. Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.”

With the PDF files downloaded, my next challenge was to extract the text. Here my choice of source base offered some advantages and some additional challenges. It is not uncommon when downloading books scanned to PDFs from providers such as Google to discover that they have only made the page images available. As many people want textual data, and preferably good textual data, for a variety of potentially lucrative computational tasks, it makes sense for companies to withhold the text layer. But for the researcher, this necessitates adding a text recognition step to the gathering process, running the pages through OCR software to generate the needed text layer. One advantage of this is that you then have control over the OCR software, but it significantly increases the time and complexity of the text gathering process.

The PDF files produced by the Office of Archives and Statistics already include an OCR-generated text layer. But unlike the newspapers scanned as part of the Chronicling America project, these files embed very little information about the source and estimated quality of that OCR. That lack of information sets up the challenge for the next section of this module, which documents my work to assess and clean the corpus, previewed in an earlier blog post.

In extracting the text, I also had to determine my unit of analysis for text mining – the article, the page, or the issue. I quickly dismissed using the “issue” because it is too large and too irregular a unit. With issues ranging in length from 8 pages to 100 pages, and including a variety of elements from long essays to letters to the editor and field reports, I would only be able to surface summary patterns using the issue as a whole. Since I am interested in identifying shifts in discourse over time, a more fine-grained unit was necessary. For this, the “article” seemed like a very useful unit, enabling each distinct piece to be examined on its own. But the boundaries of “articles” in a newspaper-type publication are actually rather hard to define, and the length of the candidate sections ranges from multiple-page essays to one-paragraph letters or poems. In addition, the publications contain a number of article “edge cases”, such as advertisements, notices of letters received, and subscription information, which would either need to be separated into their own articles or identified and excluded.

In the end, I chose the middle-ground solution of using the page as the document unit. While not all pages are created equal (early issues of the Review and Herald made great use of space and small font size to squeeze about 3000 words on a page), on average the pages contain about 1000 words, placing them in line with the units Matthew Jockers has found to be most useful when modeling novels. Splitting on the page is also computationally and analytically simple, which is valuable when working at the scale of this project.

In addition, using the page as the unit of analysis is more reflective of the print reading experience. Rather than interacting with each article in isolation (as is modeled in many database editions of historical newspapers), the newspaper readers would experience an article within the context of the other stories on the page. This juxtaposition of items creates what Marshall McLuhan refers to as the “human interest” element of news print, constructed through the “mosaic” of the page layout.1 Using the page as the unit of analysis enables me to interact with the articles as well as the community that the collection of articles creates.

Having determined the unit of analysis, the technical challenge was how to split the PDF documents and extract the text. It is worth noting that not all of the methods I used to prepare my corpus are ones that I would recommend. For reasons I don’t entirely recall, but related to struggling to conceptualize how to write a function that would separate the PDFs and extract the text, I chose Automator, a default Mac utility, to separate the pages and extract the text from the PDF files. While this reduced the programming load, it was memory and storage intensive: through the process I managed to destroy a hard drive by failing to realize that 100-page PDFs take up even more space when split into 100 1-page PDFs. Another downside of using Automator was that I did not have control over the encoding of the generated text files. Upon attempting to run the files through Python scripts, a plethora of encoding error messages quickly suggested that not all was well with my text corpus. I used the following bash script to check and report the file encodings:

# $FILES is assumed to hold the list of text files to check.
for f in $FILES; do
  encoding=`file -I $f | cut -f 2 -d";" | cut -f 2 -d=`
  if ! [ $encoding = "utf-8" -o $encoding = "us-ascii" ]; then
    echo "Check $f: encoding reported as $encoding"
  fi
done
The report revealed non-utf-8 encodings from latin-1 to binary on a majority of files. My attempts to use iconv to convert the files to utf-8 raised their own collection of errors. Thanks to a suggestion from a very wise friend, I bypassed these problems by using vim within a bash script to open each file and re-encode it in utf-8.

for f in $FILES; do
  echo "$f"
  vim -es '+set fileencoding=utf-8' '+wq' $f
  encoding=`file -bi $f | cut -f 2 -d";" | cut -f 2 -d=`
  echo "$encoding"
done

Although perhaps not an elegant solution, this process worked sufficiently to produce a directory of 197,943 text files that could be read by my Python scripts without trouble.
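The same fallback strategy can be expressed directly in Python, without shelling out to vim. Below is a minimal sketch of a hypothetical helper, assuming latin-1 is the only other encoding in play; since the `file` report above suggested a mix of encodings, a real pass might try several candidates or use a detection library such as chardet.

```python
def to_utf8(raw):
    """Decode raw bytes as UTF-8, falling back to Latin-1 on failure.

    Latin-1 maps every possible byte to a character, so the fallback
    never raises -- though it may misread other 8-bit encodings.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

Reading each file in binary mode, passing the bytes through `to_utf8`, and writing the result back out with `encoding='utf-8'` would normalize the directory in one pass.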

Below I outline a better way, which I use on later additions to the corpus, to extract the text from a PDF document and save each page to its own file using PyPDF2. Using a PDF library has a number of advantages. First, it is much less resource intensive, with no intermediary documents created and, with the use of a generator expression, no need to load the entire list of filenames into memory. Second, by placing the extraction within a more typical Python workflow, I can set the encoding when writing the extracted text to a file. This removes the complication I encountered with the Automator-generated text of relying on a program to assign the “most likely” encoding.

import os
from os.path import isfile, join

import PyPDF2

# If running locally, set these variables to your local directories.
pdf_dir = "../../corpus/incoming"
txt_dir = "../../corpus/txt_dir"

# Note: Uses a generator expression.
# Rerun the cell if you restart the loop below.
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

# The documentation for PyPDF2 is minimal.
# For this pattern, I followed the syntax at
# https://automatetheboringstuff.com/chapter13/.
for filename in corpus:
    # Open the PDF and load as a PyPDF2 Reader object.
    pdf = open(join(pdf_dir, filename), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdf)
    # Loop through the pages, extract the text, and write each page to an individual file.
    for page in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        text = pageObj.extractText()
        # Compile the page name. Add one because Python counts from 0.
        page_name = "{}-page{}.txt".format(filename[:-4], page+1)

        # Write each page to its own file.
        with open(join(txt_dir, page_name), mode="w", encoding='utf-8') as o:
            o.write(text)
    pdf.close()

You can run this code locally using the Jupyter Notebook. Setup instructions are available in the project README.
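The page-naming step in the loop above can be pulled out and checked on its own, which is useful when verifying that split files map back to their source issues. A sketch of that step in isolation, with a hypothetical filename:

```python
def page_filename(pdf_name, page_index):
    """Build the output name for one page: strip ".pdf" and append a 1-based page number."""
    return "{}-page{}.txt".format(pdf_name[:-4], page_index + 1)

# With a hypothetical source file:
# page_filename("ADV19000101.pdf", 0) -> "ADV19000101-page1.txt"
```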

  1. Marshall McLuhan, Understanding Media: The Extensions of Man. MIT Press Edition. Cambridge, MA: The MIT Press, 1994. p. 204. 

Downloading Corpus Files

The source base for A Gospel of Health and Salvation is the collection of scanned periodicals produced by the Office of Archives, Statistics, and Research of the Seventh-day Adventist Church (SDA). That this collection of documents is openly available on the web has been fundamental to the success of this project. One of the greatest challenges for historical scholarship that seeks to leverage large digital collections is access to the relevant materials. While projects such as Chronicling America and resources such as the Digital Public Library of America are indispensable, many specialized resources are available only through proprietary databases and library subscriptions that impose limits on the ways scholars can interact with their resources.1

The publishing of the digital periodicals on the open web by the SDA made it unnecessary for me to navigate through the paywalls (and legal land-mines) of using text from major library databases, a major boon for the digital project.2 And, although the site does not provide an API for accessing the documents, the structure of the pages is regular, making the site a good candidate for web scraping. However, relying on an organization to provide its own historical documents raises its own challenges. Due to the interests of the hosting organization, in this case the Seventh-day Adventist Church, the collection is shaped by and shapes a particular narrative of the denomination’s history and development. For example, issues of Good Health, which was published by John Harvey Kellogg, are (almost entirely) dropped from the SDA’s collection after 1907, which corresponds to the point when Kellogg was disfellowshipped from the denomination, even though Kellogg continued its publication into the 1940s.3 Such interests do not invalidate the usefulness of the collection, as all archives have limitations and goals, but those interests need to be acknowledged and taken into account in the analysis.

To determine the list of titles that applied to my time and regions of study, I browsed through all of the titles in the periodicals section of the site and compiled a list of titles that fit my geographic and temporal constraints. These are:

As this was my first technical task for the dissertation, my initial method for identifying the URLs of the documents I wanted to download was rather manual. I saved an .html file for each index page that contained documents I wanted to download. I then passed those .html files to a script (similar to that recorded here) that used BeautifulSoup to extract the PDF ids, reconstruct the URLs, and write the URLs to a new text file, scrapeList.txt. After manually deleting the URLs to any documents that were out of range, I then passed the scrapeList.txt file to wget using the following syntax:4

wget -i scrapeList.txt -w 2 --limit-rate=200k

I ran this process for each of the periodical titles included in this study. It took approximately a week to download all 13,000 files to my local machine. The resulting corpus takes up 27.19 GB of space.

This notebook reflects a more automated version of that process, created in 2017 to download missing documents. The example recorded here is for downloading the Sabbath School Quarterly collection, which I missed during my initial collection phase.

In these scripts I use the requests library to retrieve the HTML from the document directory pages and BeautifulSoup4 to locate the filenames. I use wget to download the files.

from bs4 import BeautifulSoup
from os.path import join
import re
import requests
import wget
def check_year(pdfID):
    """Use regex to check the year from the PDF filename.

    Args:
        pdfID (str): The filename of the PDF object.
    """
    split_title = pdfID.split('-')
    title_date = split_title[0]
    date = re.findall(r'[0-9]+', title_date)
    year = date[0][:4]
    if int(year) < 1921:
        return True
    else:
        return False

def filename_from_html(content):
    """Use Beautiful Soup to extract the PDF ids from the HTML page.

    This script is customized to the structure of the archive pages.

    Args:
        content (str): Content retrieved from a URL using the `get_html_page`
            function.
    """
    soup = BeautifulSoup(content, "lxml")
    buttons = soup.find_all('td', class_="ms-vb-title")

    pdfIDArray = []

    for each in buttons:
        links = each.find('a')
        pdfID = links.get_text()
        pdfIDArray.append(pdfID)

    return pdfIDArray

def get_html_page(url):
    """Use the requests library to get HTML content from a URL.

    Args:
        url (str): URL of the webpage with content to download.
    """
    r = requests.get(url)

    return r.text

The first step is to set the directory where I want to save the downloaded documents, as well as the root URL for the location of the PDF documents.

This example is set up for the Sabbath School Quarterly.

# If running locally, you will need to create the `corpus` folder or
# update the path to the location of your choice.
download_directory = "../../corpus/"
baseurl = "http://documents.adventistarchives.org/SSQ/"

My next step is to generate a list of the IDs for the documents I want to download.

Here I download the HTML from the index page URLs and extract the document IDs. To avoid downloading any files outside of my study, I check the year in the ID before adding the document ID to my list of documents to download.

index_page_urls = ["http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?View={44c9b385-7638-47af-ba03-cddf16ec3a94}&SortField=DateTag&SortDir=Asc"]

docs_to_download = []

for url in index_page_urls: 
    content = get_html_page(url)
    pdfs = filename_from_html(content)
    for pdf in pdfs:
        if check_year(pdf):
            print("Adding {} to download list".format(pdf))
            docs_to_download.append(pdf)

Finally, I loop through all of the filenames, create the URL to the PDF, and use wget to download a copy of the document into my directory for processing.

for doc_name in docs_to_download:
    url = join(baseurl, "{}.pdf".format(doc_name))
    wget.download(url, download_directory)
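One caveat on the loop above: `os.path.join` happens to produce valid URLs on macOS and Linux, but on Windows it would insert backslashes. The standard library’s `urllib.parse.urljoin` is the portable alternative; a sketch, using the base URL from above and a hypothetical document id:

```python
from urllib.parse import urljoin

baseurl = "http://documents.adventistarchives.org/SSQ/"
# The trailing slash on baseurl matters: without it, urljoin would
# replace the last path segment instead of appending to it.
url = urljoin(baseurl, "SS18890101-01.pdf")
# -> "http://documents.adventistarchives.org/SSQ/SS18890101-01.pdf"
```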

You can run this code locally using the Jupyter notebook. Setup instructions are available in the project README.

  1. The library at Northeastern University provides a useful overview of the different elements one must consider when seeking to mine content in subscription databases. Increasingly, library vendors are seeking to build “research environments” that provide “workflow and analytical technology that breathes new life into the study of the social sciences and humanities.” It is my opinion that granting vendors control over both the documentary materials and the methods of analysis, however, would be a concerning development for the long-term health of scholarship in the humanities. 

  2. In recognition of the labor involved in creating the digital assets and to drive traffic back to their site, I am not redistributing any of the PDF files in my dissertation. Copies can be retrieved directly from the denomination’s websites. 

  3. The University of Michigan is the archival home of the Good Health magazine and many have been digitized by Google. The Office of Archives and Research does include a few issues from 1937 and one from 1942, which offers evidence of Kellogg’s ongoing editorial work on the title. 

  4. In this syntax I was guided by Ian Milligan’s wget lesson. Ian Milligan, “Automated Downloading with Wget,” Programming Historian, (2012-06-27), http://programminghistorian.org/lessons/automated-downloading-with-wget 

Updating the Dissertation Description

Dissertations can be fickle things. When I started my A.B.D. journey in 2014, I had a very ambitious project outline, and very little understanding of the technical skills I needed to see it through. Since then, through a lot of hard work and a number of false starts, I have greatly expanded my technical skills, learned a good deal about data management (thanks, Wendy!), and discovered how true it is that even (especially?) with computers, less is more when it comes to scholarly projects. While I hope in future iterations of the project to bring network and geospatial analysis to bear on my examination of the evolving relationship between beliefs, health practices, gender dynamics, and temporal imaginaries in the development of the Seventh-day Adventist Church, the dissertation itself will focus on textual analysis of the periodical literature produced by this prolific religious movement.

What follows is my revised dissertation description, submitted to the history faculty at George Mason in pursuit of the department’s completion grant. In the coming months, I will be posting drafts of my technical essays and code notebooks, documenting the computational work that undergirds my historical analysis. This description provides the context for those technical essays.

A Gospel of Health and Salvation: A Digital Study of Seventh-day Adventism, Health, and American Culture, 1843 - 1920

Graham cakes, corn flakes, and therapeutic baths. These are not the first things that come to mind when thinking about religious practices in American history. Yet in 1866, the newly-formed Seventh-day Adventist church opened the Western Health Reform Institute dedicated to providing water cure treatments and health education within a Sabbath-keeping environment.1 Embracing elements of homeopathy, water cure, diet reform, and dress reform, this health-minded Protestant sect incorporated vegetarianism, health reform, and medicine into their evolving understanding of salvation as they grappled with an ever-anticipated but perpetually-delayed millennium.

Seventh-day Adventism began as a community united by their shared belief in the coming return of Christ according to a particular interpretation of biblical literature, rather than as a group tied together by a particular liturgy or by geographic area. As a result, the community organized itself primarily through the publication and distribution of books and periodicals. This reliance on the printed word to orient, build, and maintain a community of faith provides a distinct opportunity for studying the development of this religious movement through their publications. Working with approximately 13,000 periodical issues (nearly 200,000 pages) produced in four geographic regions in the United States by the denomination, I am using text mining to examine the shifting and mutually informing relationship between the theological commitments of the denomination and their gendered health practices in relation to their shifting conception of time.

Seventh-day Adventism is an apocalyptic and millennialist belief system, meaning that they anticipate the imminent return of Jesus Christ, the start of his thousand-year reign, and the end of the world. The movement’s early followers embraced William Miller’s teaching that Christ would return in the period between 1843 and 1844. In the wake of the Great Disappointment, the small group that would become the Seventh-day Adventist church organized around the belief that October 22, 1844, marked the day Christ entered into the inner sanctuary of the Temple in heaven and there began the work of judging all of humanity.2 They also adopted seventh-day Sabbatarianism, holding that Saturday, rather than Sunday, was the proper day for Christian observance, as decreed in the ten commandments. These beliefs contributed to Seventh-day Adventism’s distinctive and shifting temporal imaginary, their shared understanding of time as ordered by God to rotate around the Jewish Sabbath and as rapidly approaching the end of human history.3 I argue that we can see the contours and effects of this shifting temporal imaginary in their health practices and the roles women assumed within the denomination, and that attention to the temporal imaginary provides an additional framework for understanding the flourishing of alternative religious movements during the nineteenth century.4

A Gospel of Health and Salvation is a web-based project composed of five modules, one framing, two methodological, and two interpretative, all of which combine narrative elements with interactive visualizations and computational analysis. Module 1, “A Gospel of Health and Salvation,” provides a non-technical overview of the project and situates my discussion of early Seventh-day Adventism and their beliefs and practices around health within the historical context of the mid-nineteenth to early-twentieth centuries. Module 2, “Of Making Many Books,” explores the history of Seventh-day Adventist publishing and outlines my methods for selecting, collecting, and preparing the corpus of materials for analysis. Module 3, “Modeling Time,” is a study of different ways to algorithmically identify and highlight changes in discourse over time, particularly when considering time as fluid, rather than constant or static. Module 4, “Work Out Your Salvation,” examines the ways the changing temporal imaginary of Seventh-day Adventism shaped their health practice, particularly their attitudes toward faith healing and the relationship between the healthy body and salvation. Module 5, “Teachers, Mothers, Healers All,” focuses on the ways the changing temporal imaginary of Seventh-day Adventism shaped gender relations and hierarchies within the denomination, as seen in their health practices and their denominational structures. Together, these modules engage critically with the computational methods of the digital humanities within the context of religious history, and reveal the ways the community’s conception of time shaped the possibilities of their theology and their health practices, providing the framework within which they defined and sought to achieve flourishing human lives.

  1. Numbers, Ronald L. Prophetess of Health: A Study of Ellen G. White. Wm. B. Eerdmans Publishing Co., 2008. pp. 156-7. 

  2. “The Great Disappointment” refers to the period after October 22, 1844, the date the Millerites believed Jesus would return. Malcolm Bull and Keith Lockhart place the number of early Seventh-day Adventist followers in 1849 at approximately 100. Bull, Malcolm, and Keith Lockhart. Seeking a Sanctuary: Seventh-Day Adventism and the American Dream. Indiana University Press, 2006. p. 7. 

  3. I am using “imaginary” in the sense of Charles Taylor’s “modern social imaginary,” namely, that which “enables, through making sense of …” or “that common understanding that makes possible common practices …” Taylor, Charles. Modern Social Imaginaries. Durham; London: Duke University Press, 2004. pp. 2, 23. 

  4. One common framework for interpreting 19th-century revivalism is Carroll Smith-Rosenberg’s economic interpretation. See Smith-Rosenberg, Carroll. “The Cross and the Pedestal: Women, Anti-Ritualism, and the Emergence of the American Bourgeoisie,” In Disorderly Conduct: Visions of Gender in Victorian America, Oxford University Press, 1986. My approach is similar to that of Robert Abzug in Cosmos Crumbling, but is focused on the temporal aspects of the religious cosmology. Abzug, Robert H. Cosmos Crumbling: American Reform and the Religious Imagination. Oxford University Press, 1994.