Know Your Sources (Part 1)

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly argument of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed, why solutions were chosen, and what those solutions enable and prevent. In these notebooks I seek to engage that middle space between code and the final analysis - documenting the computational problem solving that I’ve done as part of the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the worlds they seek to model, distinctions that shape the outcomes of the computational analysis and are part of the historical argument of the work.


Now that we have a collection of texts selected and downloaded, and have extracted the text, we need to spend some time identifying what the corpus contains, both in terms of coverage and quality. As I describe in the project overview, I will be using these texts to make arguments about the development of the community’s discourses around health and salvation. While the corpus makes that analysis possible, it also sets the limits of what we can claim from text analysis alone. Without an understanding of what those limits are, we run the risk of claiming more than the sources can sustain, and in doing so, minimizing the very complexities that historical research seeks to reveal.

"""My usual practice for gathering the filenames is to read 
them in from a directory. So that this code can be run locally 
without the full corpus downloaded, I exported the list 
of filenames to an index file for use in this notebook.
"""
with open("data/2017-05-05-corpus-index.txt", "r") as f:
    corpus = f.read().splitlines()
len(corpus)
197943

To create an overview of the corpus, I will use the document filenames along with some descriptive metadata that I created.

Filenames are an often underestimated feature of digital files, but one that can be used to great effect. For my corpus, the team that digitized the periodicals did an excellent job of providing the files with descriptive names. Overall, the files conform to the following pattern:

PrefixYYYYMMDD-V00-00.pdf

I discovered a few files that deviated from the pattern, but renamed those so that the pattern held throughout the corpus. When splitting the PDF documents into pages, I preserved the structure, adding -page0.txt to the end.

The advantage of this format is that the filenames contain the metadata I need to place each file within its context. By isolating the different sections of the filename, I can quickly place any file with reference to the periodical title and the publication date.

import pandas as pd
import re
def extract_pub_info(doc_list):
    """Use regex to extract metadata from filename.
    
    Note:
        Assumes that the filename is formatted as::
            
            `PrefixYYYYMMDD-V00-00.pdf`
    
    Args:
        doc_list (list): List of the filenames in the corpus.
    Returns:
        dict: Dictionary with the year and title abbreviation for each filename.
    """
    
    corpus_info = {}
    
    for doc_id in doc_list:
        
        # Split the ID into three parts on the '-'
        split_doc_id = doc_id.split('-')
        
        # Get the prefix by matching the first set of letters 
        # in the first part of the filename.
        title = re.match("[A-Za-z]+", split_doc_id[0])
        # Get the dates by grabbing all of the number elements 
        # in the first part of the filename.
        dates = re.search(r'[0-9]+', split_doc_id[0])
        # The first four numbers is the publication year.
        year = dates.group()[:4]
        
        # Update the dictionary with the title and year 
        # for the filename.
        corpus_info[doc_id] = {'title': title.group(), 'year': year}
    
    return corpus_info
corpus_info = extract_pub_info(corpus)

One of the most useful libraries in Python for working with data is Pandas. With Pandas, Python users gain much of the functionality that our colleagues who work with R have long celebrated as the benefits of that domain-specific language.

By transforming our corpus_info dictionary into a dataframe, we can quickly filter and tabulate a number of different statistics on our corpus.

df = pd.DataFrame.from_dict(corpus_info, orient='index')
df.index.name = 'docs'
df = df.reset_index()

You can preview the initial dataframe by uncommenting the cell below.

# df
df = df.groupby(["title", "year"], as_index=False).docs.count()
df
title year docs
0 ADV 1898 26
1 ADV 1899 674
2 ADV 1900 463
3 ADV 1901 389
4 ADV 1902 440
5 ADV 1903 428
6 ADV 1904 202
7 ADV 1905 20
8 ARAI 1909 64
9 ARAI 1919 32
10 AmSn 1886 96
11 AmSn 1887 96
12 AmSn 1888 105
13 AmSn 1889 386
14 AmSn 1890 403
15 AmSn 1891 398
16 AmSn 1892 401
17 AmSn 1893 402
18 AmSn 1894 402
19 AmSn 1895 400
20 AmSn 1896 408
21 AmSn 1897 800
22 AmSn 1898 804
23 AmSn 1899 801
24 AmSn 1900 800
25 CE 1909 104
26 CE 1910 312
27 CE 1911 312
28 CE 1912 306
29 CE 1913 420
... ... ... ...
465 YI 1885 192
466 YI 1886 220
467 YI 1887 244
468 YI 1888 104
469 YI 1889 208
470 YI 1890 208
471 YI 1895 416
472 YI 1898 28
473 YI 1899 408
474 YI 1900 408
475 YI 1901 408
476 YI 1902 408
477 YI 1903 408
478 YI 1904 288
479 YI 1905 408
480 YI 1906 408
481 YI 1907 432
482 YI 1908 832
483 YI 1909 844
484 YI 1910 850
485 YI 1911 852
486 YI 1912 868
487 YI 1913 848
488 YI 1914 852
489 YI 1915 852
490 YI 1916 852
491 YI 1917 840
492 YI 1918 850
493 YI 1919 856
494 YI 1920 45

495 rows × 3 columns

Nearly 500 rows of data is too large to have a good sense of the coverage of the corpus from reading the data table, so it is necessary to create some visualizations of the records. For a quick prototyping tool, I am using the Bokeh library.

from bokeh.charts import Bar, show
from bokeh.charts import defaults
from bokeh.io import output_notebook
from bokeh.palettes import viridis
output_notebook()
defaults.width = 900
defaults.height = 950

In this first graph, I am showing the total number of pages per title, per year in the corpus.

p = Bar(df, 
        'year', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Title per Year")
show(p)

This graph of the corpus reflects the historical development of the publication efforts denomination. Starting with a single publication in 1849, the publishing efforts of the denomination expand in the 1860s as they launch their health reform efforts, expand again in the 1880s as they start a publishing house in California and address concerns about Sunday observance laws, and again at the turn of the century as the denomination reorganizes and regional publications expand. The chart also reveals some holes in the corpus. The Youth’s Instructor (shown here in yellow) in one of the oldest continuous denominational publications, but the pages available for the years from 1850 - 1899 are inconsistent.

In interpreting the results of mining these texts, it will be important to factor in the relative difference in size and diversity of publication venues between the early years of the denomination and the later years of this study.

by_title = df.groupby(["title"], as_index=False).docs.sum()
p = Bar(df, 
        'title', 
        values='docs', 
        color='title', 
        palette=viridis(30), 
        title="Total Pages by Title"
       )
show(p)

Another way to view the coverage of the corpus is by total pages per periodical title. The Advent Review and Sabbath Herald dominates the corpus in number of pages, with The Health Reformer, Signs of the Times, and the Youth’s Instructor, making up the next major percentage of the corpus. In terms of scale, these publications will have (and had) a prominent role in shaping the discourse of the SDA community. At the same time, it will be informative to look to the smaller publications to see if we can surface alternative and dissonant ideas.

topic_metadata = pd.read_csv('data/2017-05-05-periodical-topics.csv')
topic_metadata
periodicalTitle title startYear endYear initialPubLocation topic
0 Training School Advocate ADV 1898 1905 Battle Creek, MI Education
1 American Sentinel AmSn 1886 1900 Oakland, CA Religious Liberty
2 Advent Review and Sabbath Herald ARAI 1909 1919 Washington, D.C. Denominational
3 Christian Education CE 1909 1920 Washington, D.C. Education
4 Welcome Visitor (Columbia Union Visitor) CUV 1901 1920 Academia, OH Regional
5 Christian Educator EDU 1897 1899 Battle Creek, MI Education
6 General Conference Bulletin GCB 1863 1918 Battle Creek, MI Denominational
7 Gospel Herald GH 1898 1920 Yazoo City, MS Regional
8 Gospel of Health GOH 1897 1899 Battle Creek, MI Health
9 Gospel Sickle GS 1886 1888 Battle Creek, MI Missions
10 Home Missionary HM 1889 1897 Battle Creek, MI Missions
11 Health Reformer HR 1866 1907 Battle Creek, MI Health
12 Indiana Reporter IR 1901 1910 Indianapolis, IN Regional
13 Life Boat LB 1898 1920 Chicago, IL Missions
14 Life and Health LH 1904 1920 Washington, D.C. Health
15 Liberty LibM 1906 1920 Washington, D.C. Religious Liberty
16 Lake Union Herald LUH 1908 1920 Berrien Springs, MI Regional
17 North Michigan News Sheet NMN 1907 1910 Petoskey, MI Regional
18 Pacific Health Journal and Temperance Advocate PHJ 1885 1904 Oakland, CA Health
19 Present Truth (Advent Review) PTAR 1849 1850 Middletown, CT Denominational
20 Pacific Union Recorder PUR 1901 1920 Oakland, CA Regional
21 Review and Herald RH 1850 1920 Paris, ME Denominational
22 Sligonian Sligo 1916 1920 Washington, D.C. Regional
23 Sentinel of Liberty SOL 1900 1904 Chicago, IL Religious Liberty
24 Signs of the Times ST 1874 1920 Oakland, CA Denominational
25 Report of Progress, Southern Union Conference SUW 1907 1920 Nashville, TN Regional
26 Church Officer's Gazette TCOG 1914 1920 Washington, D.C. Denominational
27 The Missionary Magazine TMM 1898 1902 Philadelphia, PA Missions
28 West Michigan Herald WMH 1903 1908 Grand Rapids, MI Regional
29 Youth's Instructor YI 1852 1920 Rochester, NY Denominational

We can generate another view by adding some external metadata for the titles. The “topics” listed here are ones I assigned when skimming the different titles. “Denominational” refers to centrally produced publications, covering a wide array of topics. “Education” refers to periodicals focused on education. “Health” to publications focused on health. “Missions” titles are focused on outreach and evangelism focused publications and “Religious Liberty” on governmental concerns over Sabbath laws. Finally, “Regional” refers to periodicals produced by local union conferences, which like the denominational titles cover a wide range of topics.

by_topic = pd.merge(topic_metadata, df, on='title')
p = Bar(by_topic, 
        'year', 
        values='docs',
        agg='sum', 
        stack='topic',
        palette= viridis(6), 
        title="Pages per Topic per Year")
show(p)

Here we can see the diversification of periodical subjects over time, especially around the turn of the century.

p = Bar(by_topic, 
        'topic', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Topic per Year")
p.left[0].formatter.use_scientific = False
p.legend.location = "top_right"
show(p)

Grouping by category allows us to see that our corpus is dominated by the denominational, health, and regionally focused publications. These topics match with our research concerns, increasing our confidence that we will have enough information to determine meaningful patterns about those topics. But, due to the focus of the corpus, we should proceed cautiously before making any claims about the relative importance of those topics within the community.

Now that we have a sense of the temporal and topical coverage of our corpus, we will next turn our attention to evaluating the quality of the data that we have gathered from the scanned PDF files.

You can run this code locally using the Jupyter notebook available via Github.