Know Your Sources (Part 1) | Jeri E. Wieringa

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on GitHub.

My goals in sharing the notebooks and technical essays are three-fold. First, I hope that they might prove useful to others interested in taking on similar projects. In these notebooks I describe and model how to approach a large corpus of sources in the production of historical scholarship.

Second, I am sharing them in hopes that “given enough eyeballs, all bugs are shallow.” If you encounter any bugs, if you see an alternative way to solve a problem, or if the code does not achieve the goals I have set out for it, please let me know!

Third, these notebooks make an argument for methodological transparency and for the discussion of methods as part of the scholarly argument of digital history. Often the technical work in digital history is done behind the scenes, with publications favoring the final research products, usually in article form with interesting visualizations. While there is a growing culture in digital history of releasing source code, there is little discussion of how that code was developed, why solutions were chosen, and what those solutions enable and prevent. In these notebooks I seek to engage that middle space between code and the final analysis - documenting the computational problem solving that I’ve done as part of the analysis. As these essays attest, each step in the processing of the corpus requires the researcher to make a myriad of distinctions about the worlds they seek to model, distinctions that shape the outcomes of the computational analysis and are part of the historical argument of the work.

Now that we have a collection of texts selected and downloaded, and have extracted the text, we need to spend some time identifying what the corpus contains, both in terms of coverage and quality. As I describe in the project overview, I will be using these texts to make arguments about the development of the community’s discourses around health and salvation. While the corpus makes that analysis possible, it also sets the limits of what we can claim from text analysis alone. Without an understanding of what those limits are, we run the risk of claiming more than the sources can sustain, and in doing so, minimizing the very complexities that historical research seeks to reveal.

"""My usual practice for gathering the filenames is to read 
them in from a directory. So that this code can be run locally 
without the full corpus downloaded, I exported the list 
of filenames to an index file for use in this notebook.
"""
with open("data/2017-05-05-corpus-index.txt", "r") as f:
    corpus = f.read().splitlines()

len(corpus)
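For readers who do have the full corpus on disk, the directory-based approach mentioned in the cell above might look something like the following. This is a minimal sketch rather than the code used for the dissertation: the function names `list_corpus_files` and `write_index`, the `.txt` suffix filter, and any directory path you pass in are assumptions made for illustration.

```python
from pathlib import Path


def list_corpus_files(directory, suffix=".txt"):
    """Return the sorted names of files in `directory` ending in `suffix`."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == suffix
    )


def write_index(filenames, index_path):
    """Write one filename per line so the list can be reloaded later."""
    Path(index_path).write_text("\n".join(filenames) + "\n", encoding="utf-8")
```

An index written this way can be read back with the same `f.read().splitlines()` call used above.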

To create an overview of the corpus, I will use the document filenames along with some descriptive metadata that I created.

Filenames are an often underestimated feature of digital files, but one that can be used to great effect. For my corpus, the team that digitized the periodicals did an excellent job of providing the files with descriptive names. Overall, the files conform to the following pattern:

PrefixYYYYMMDD-V00-00.pdf

I discovered a few files that deviated from the pattern, but renamed those so that the pattern held throughout the corpus. When splitting the PDF documents into pages, I preserved the structure, adding -page0.txt to the end.

The advantage of this format is that the filenames contain the metadata I need to place each file within its context. By isolating the different sections of the filename, I can quickly place any file with reference to the periodical title and the publication date.

import pandas as pd
import re

def extract_pub_info(doc_list):
    """Use regex to extract metadata from filename.
    
    Note:
        Assumes that the filename is formatted as::
            
            `PrefixYYYYMMDD-V00-00.pdf`
    
    Args:
        doc_list (list): List of the filenames in the corpus.
    Returns:
        dict: Dictionary with the year and title abbreviation for each filename.
    """
    
    corpus_info = {}
    
    for doc_id in doc_list:
        
        # Split the ID into three parts on the '-'
        split_doc_id = doc_id.split('-')
        
        # Get the prefix by matching the first set of letters 
        # in the first part of the filename.
        title = re.match("[A-Za-z]+", split_doc_id[0])
        # Get the dates by grabbing all of the number elements 
        # in the first part of the filename.
        dates = re.search(r'[0-9]+', split_doc_id[0])
        # The first four digits are the publication year.
        year = dates.group()[:4]
        
        # Update the dictionary with the title and year 
        # for the filename.
        corpus_info[doc_id] = {'title': title.group(), 'year': year}
    
    return corpus_info

corpus_info = extract_pub_info(corpus)
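Because the function above assumes every filename follows the pattern, it can be worth checking that assumption before extracting metadata. The sketch below shows one way to flag filenames that deviate. The full regular expression, including the optional `-page<N>` suffix and the `.pdf`/`.txt` extensions, is my reading of the pattern described above, and the example filenames are invented for illustration.

```python
import re

# Assumed full pattern: PrefixYYYYMMDD-V00-00, optionally followed by
# -page<N>, then a .pdf or .txt extension. The handling of the page
# suffix and extension is an assumption, not taken from the corpus.
FILENAME_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z]+)"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"-V(?P<volume>\d+)-(?P<issue>\d+)"
    r"(?:-page(?P<page>\d+))?"
    r"\.(?:pdf|txt)$"
)


def find_nonconforming(filenames):
    """Return the filenames that do not match FILENAME_PATTERN."""
    return [name for name in filenames if FILENAME_PATTERN.match(name) is None]


# Hypothetical filenames, invented for illustration.
examples = [
    "ADV18981201-V01-01.pdf",
    "ADV18981201-V01-01-page1.txt",
    "ADV1898-V01-01.pdf",  # date is too short
    "notes.txt",           # no prefix/date structure at all
]

print(find_nonconforming(examples))
# ['ADV1898-V01-01.pdf', 'notes.txt']
```

An empty list from `find_nonconforming(corpus)` would suggest that the split-and-match approach in `extract_pub_info` is safe to apply to every file.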

One of the most useful libraries in Python for working with data is Pandas. With Pandas, Python users gain much of the functionality that our colleagues who work with R have long celebrated as the benefits of that domain-specific language.

By transforming our corpus_info dictionary into a dataframe, we can quickly filter and tabulate a number of different statistics on our corpus.

df = pd.DataFrame.from_dict(corpus_info, orient='index')

df.index.name = 'docs'
df = df.reset_index()

You can preview the initial dataframe by uncommenting the cell below.

# df

df = df.groupby(["title", "year"], as_index=False).docs.count()

df

	title	year	docs
0	ADV	1898	26
1	ADV	1899	674
2	ADV	1900	463
3	ADV	1901	389
4	ADV	1902	440
5	ADV	1903	428
6	ADV	1904	202
7	ADV	1905	20
8	ARAI	1909	64
9	ARAI	1919	32
10	AmSn	1886	96
11	AmSn	1887	96
12	AmSn	1888	105
13	AmSn	1889	386
14	AmSn	1890	403
15	AmSn	1891	398
16	AmSn	1892	401
17	AmSn	1893	402
18	AmSn	1894	402
19	AmSn	1895	400
20	AmSn	1896	408
21	AmSn	1897	800
22	AmSn	1898	804
23	AmSn	1899	801
24	AmSn	1900	800
25	CE	1909	104
26	CE	1910	312
27	CE	1911	312
28	CE	1912	306
29	CE	1913	420
...	...	...	...
465	YI	1885	192
466	YI	1886	220
467	YI	1887	244
468	YI	1888	104
469	YI	1889	208
470	YI	1890	208
471	YI	1895	416
472	YI	1898	28
473	YI	1899	408
474	YI	1900	408
475	YI	1901	408
476	YI	1902	408
477	YI	1903	408
478	YI	1904	288
479	YI	1905	408
480	YI	1906	408
481	YI	1907	432
482	YI	1908	832
483	YI	1909	844
484	YI	1910	850
485	YI	1911	852
486	YI	1912	868
487	YI	1913	848
488	YI	1914	852
489	YI	1915	852
490	YI	1916	852
491	YI	1917	840
492	YI	1918	850
493	YI	1919	856
494	YI	1920	45

495 rows × 3 columns
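To make the reshaping steps above easier to follow, here is the same sequence applied to a three-entry dictionary. The filenames are hypothetical and exist only to show how a dictionary keyed by filename becomes a table of page counts per title and year.

```python
import pandas as pd

# Hypothetical filenames, shaped like the output of extract_pub_info().
toy_info = {
    "ADV18981201-V01-01-page1.txt": {"title": "ADV", "year": "1898"},
    "ADV18981201-V01-01-page2.txt": {"title": "ADV", "year": "1898"},
    "YI18990105-V47-01-page1.txt": {"title": "YI", "year": "1899"},
}

# Each dictionary key becomes a row label; the inner keys become columns.
toy_df = pd.DataFrame.from_dict(toy_info, orient="index")

# Move the filenames out of the index and into a regular 'docs' column.
toy_df.index.name = "docs"
toy_df = toy_df.reset_index()

# Count the number of page files for each (title, year) pair.
toy_counts = toy_df.groupby(["title", "year"], as_index=False)["docs"].count()

print(toy_counts)
```

After the `groupby`, the `docs` column no longer holds filenames but the number of pages for each title and year, which is the shape of the table shown above.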

With nearly 500 rows, the data table is too large to give a good sense of the coverage of the corpus from reading alone, so it is necessary to create some visualizations of the records. For quick prototyping, I am using the Bokeh library.

from bokeh.charts import Bar, show
from bokeh.charts import defaults
from bokeh.io import output_notebook
from bokeh.palettes import viridis

output_notebook()

defaults.width = 900
defaults.height = 950

In this first graph, I am showing the total number of pages per title, per year in the corpus.

p = Bar(df, 
        'year', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Title per Year")

show(p)
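Note that the `bokeh.charts` interface imported above was, as far as I am aware, split out of Bokeh into a separate `bkcharts` package in later releases, so this cell may not run on a recent install. The aggregation behind the stacked bar chart does not depend on the plotting library, however. The sketch below reproduces it with a pandas pivot table, using a handful of rows copied from the table above. The variable names are my own.

```python
import pandas as pd

# A small excerpt of the (title, year, docs) counts shown above.
sample = pd.DataFrame({
    "title": ["ADV", "AmSn", "YI", "ADV", "AmSn", "YI"],
    "year": ["1898", "1898", "1898", "1899", "1899", "1899"],
    "docs": [26, 804, 28, 674, 801, 408],
})

# One row per year, one column per title; each cell is a page count.
# Title/year combinations with no pages are filled with 0.
pages_wide = sample.pivot_table(
    index="year", columns="title", values="docs", aggfunc="sum", fill_value=0
)

# The height of each stacked bar is the row total.
pages_per_year = pages_wide.sum(axis=1)

print(pages_wide)
print(pages_per_year)
```

A wide table like `pages_wide` can be passed to most plotting libraries to draw a stacked bar chart, with each column supplying one band of the stack.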

This graph of the corpus reflects the historical development of the denomination’s publication efforts. Starting with a single publication in 1849, the publishing efforts of the denomination expand in the 1860s as they launch their health reform efforts, expand again in the 1880s as they start a publishing house in California and address concerns about Sunday observance laws, and again at the turn of the century as the denomination reorganizes and regional publications expand. The chart also reveals some holes in the corpus. The Youth’s Instructor (shown here in yellow) is one of the oldest continuous denominational publications, but the pages available for the years from 1850 to 1899 are inconsistent.

In interpreting the results of mining these texts, it will be important to factor in the relative difference in size and diversity of publication venues between the early years of the denomination and the later years of this study.
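The holes in coverage can also be listed directly rather than read off the chart. The following is a minimal sketch, assuming a table shaped like `df` above, with `year` stored as a string. The function name `missing_years` is my own, and the sample rows are copied from the Youth’s Instructor counts shown earlier.

```python
import pandas as pd

# Rows copied from the Youth's Instructor (YI) counts in the table above.
sample = pd.DataFrame({
    "title": ["YI"] * 6,
    "year": ["1889", "1890", "1895", "1898", "1899", "1900"],
    "docs": [208, 208, 416, 28, 408, 408],
})


def missing_years(counts):
    """Map each title to the years between its first and last year
    in `counts` for which there is no row at all."""
    gaps = {}
    for title, group in counts.groupby("title"):
        years = [int(year) for year in group["year"]]
        expected = set(range(min(years), max(years) + 1))
        gaps[title] = sorted(expected - set(years))
    return gaps


print(missing_years(sample))
# {'YI': [1891, 1892, 1893, 1894, 1896, 1897]}
```

This only finds years that are entirely absent. Years with unusually few pages, such as 1898 in this excerpt, would need a separate threshold check.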

by_title = df.groupby(["title"], as_index=False).docs.sum()

p = Bar(by_title, 
        'title', 
        values='docs', 
        color='title', 
        palette=viridis(30), 
        title="Total Pages by Title"
       )

show(p)

Another way to view the coverage of the corpus is by total pages per periodical title. The Advent Review and Sabbath Herald dominates the corpus in number of pages, with The Health Reformer, Signs of the Times, and the Youth’s Instructor making up the next largest share of the corpus. In terms of scale, these publications will have (and had) a prominent role in shaping the discourse of the SDA community. At the same time, it will be informative to look to the smaller publications to see if we can surface alternative and dissonant ideas.

topic_metadata = pd.read_csv('data/2017-05-05-periodical-topics.csv')

topic_metadata

	periodicalTitle	title	startYear	endYear	initialPubLocation	topic
0	Training School Advocate	ADV	1898	1905	Battle Creek, MI	Education
1	American Sentinel	AmSn	1886	1900	Oakland, CA	Religious Liberty
2	Advent Review and Sabbath Herald	ARAI	1909	1919	Washington, D.C.	Denominational
3	Christian Education	CE	1909	1920	Washington, D.C.	Education
4	Welcome Visitor (Columbia Union Visitor)	CUV	1901	1920	Academia, OH	Regional
5	Christian Educator	EDU	1897	1899	Battle Creek, MI	Education
6	General Conference Bulletin	GCB	1863	1918	Battle Creek, MI	Denominational
7	Gospel Herald	GH	1898	1920	Yazoo City, MS	Regional
8	Gospel of Health	GOH	1897	1899	Battle Creek, MI	Health
9	Gospel Sickle	GS	1886	1888	Battle Creek, MI	Missions
10	Home Missionary	HM	1889	1897	Battle Creek, MI	Missions
11	Health Reformer	HR	1866	1907	Battle Creek, MI	Health
12	Indiana Reporter	IR	1901	1910	Indianapolis, IN	Regional
13	Life Boat	LB	1898	1920	Chicago, IL	Missions
14	Life and Health	LH	1904	1920	Washington, D.C.	Health
15	Liberty	LibM	1906	1920	Washington, D.C.	Religious Liberty
16	Lake Union Herald	LUH	1908	1920	Berrien Springs, MI	Regional
17	North Michigan News Sheet	NMN	1907	1910	Petoskey, MI	Regional
18	Pacific Health Journal and Temperance Advocate	PHJ	1885	1904	Oakland, CA	Health
19	Present Truth (Advent Review)	PTAR	1849	1850	Middletown, CT	Denominational
20	Pacific Union Recorder	PUR	1901	1920	Oakland, CA	Regional
21	Review and Herald	RH	1850	1920	Paris, ME	Denominational
22	Sligonian	Sligo	1916	1920	Washington, D.C.	Regional
23	Sentinel of Liberty	SOL	1900	1904	Chicago, IL	Religious Liberty
24	Signs of the Times	ST	1874	1920	Oakland, CA	Denominational
25	Report of Progress, Southern Union Conference	SUW	1907	1920	Nashville, TN	Regional
26	Church Officer's Gazette	TCOG	1914	1920	Washington, D.C.	Denominational
27	The Missionary Magazine	TMM	1898	1902	Philadelphia, PA	Missions
28	West Michigan Herald	WMH	1903	1908	Grand Rapids, MI	Regional
29	Youth's Instructor	YI	1852	1920	Rochester, NY	Denominational

We can generate another view by adding some external metadata for the titles. The “topics” listed here are ones I assigned when skimming the different titles. “Denominational” refers to centrally produced publications, covering a wide array of topics. “Education” refers to periodicals focused on education, and “Health” to publications focused on health. “Missions” refers to titles focused on outreach and evangelism, and “Religious Liberty” to those focused on governmental concerns over Sabbath laws. Finally, “Regional” refers to periodicals produced by local union conferences, which, like the denominational titles, cover a wide range of topics.

by_topic = pd.merge(topic_metadata, df, on='title')

p = Bar(by_topic, 
        'year', 
        values='docs',
        agg='sum', 
        stack='topic',
        palette= viridis(6), 
        title="Pages per Topic per Year")

show(p)

Here we can see the diversification of periodical subjects over time, especially around the turn of the century.
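Because `pd.merge` performs an inner join by default, any title abbreviation that appears in only one of the two tables is silently dropped from `by_topic`. A quick way to check for that is an outer join with `indicator=True`, sketched below on toy data. The `XX` title is invented to show what a mismatch looks like, and the variable names are my own.

```python
import pandas as pd

# Toy stand-ins for topic_metadata and df. 'XX' is an invented title
# with no metadata row; 'HR' here has metadata but no page counts.
toy_metadata = pd.DataFrame({
    "title": ["ADV", "HR"],
    "topic": ["Education", "Health"],
})
toy_counts = pd.DataFrame({
    "title": ["ADV", "XX"],
    "year": ["1898", "1900"],
    "docs": [26, 10],
})

# An outer join keeps unmatched rows from both sides; the added
# '_merge' column records where each row came from.
check = pd.merge(toy_metadata, toy_counts, on="title", how="outer", indicator=True)

# Rows that are not 'both' would be lost in the default inner join.
unmatched = check.loc[check["_merge"] != "both", ["title", "_merge"]]

print(unmatched)
```

If `unmatched` is empty when run on the real tables, every title in the corpus has a topic and the merged counts match the original totals.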

p = Bar(by_topic, 
        'topic', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Title per Topic")

p.left[0].formatter.use_scientific = False
p.legend.location = "top_right"

show(p)

Grouping by category allows us to see that our corpus is dominated by the denominational, health, and regionally focused publications. These topics match with our research concerns, increasing our confidence that we will have enough information to determine meaningful patterns about those topics. But, due to the focus of the corpus, we should proceed cautiously before making any claims about the relative importance of those topics within the community.
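The relative weight of each topic can also be expressed as a share of total pages. The sketch below shows the calculation on three rows taken from the 1898 counts above, joined to their topics. Since it covers only three titles in a single year, the resulting proportions illustrate the method and say nothing about the corpus as a whole.

```python
import pandas as pd

# Three rows from the 1898 counts above, with their assigned topics.
# This excerpt is far too small to represent the full corpus.
excerpt = pd.DataFrame({
    "title": ["ADV", "AmSn", "YI"],
    "topic": ["Education", "Religious Liberty", "Denominational"],
    "year": ["1898", "1898", "1898"],
    "docs": [26, 804, 28],
})

# Total pages per topic, then each topic's fraction of all pages.
pages_by_topic = excerpt.groupby("topic")["docs"].sum()
share_by_topic = (pages_by_topic / pages_by_topic.sum()).sort_values(ascending=False)

print(share_by_topic.round(3))
```

Run against the full `by_topic` table, the same two lines would give the proportion of the corpus devoted to each topic.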

Now that we have a sense of the temporal and topical coverage of our corpus, we will next turn our attention to evaluating the quality of the data that we have gathered from the scanned PDF files.

You can run this code locally using the Jupyter notebook available via Github.
