Religion & Data: A Presentation for the American Academy of Religion 2016

On November 19th I had the opportunity to be part of the Digital Futures of Religious Studies panel at the American Academy of Religion in lovely San Antonio, Texas. It was wonderful to see the range of digital work being done in the study of religion represented and to participate in making the case for that work to be supported by the academy.

Panel members represented a wide range of digital work, from large funded projects to digital pedagogy. I focused my five-minute slot on the question of data and the need to recognize the work of data creation and management as part of the scholarly production of digital projects.

tl;dr: I argued in my presentation that as we create scholarly works that rely on data in their analysis and presentation, we also need to include the production, management, and interpretation of data in our systems for the assessment and evaluation of scholarship.


Two questions to consider: what does data-rich scholarship entail? and how do we evaluate data-rich scholarship?

What is involved when working with data? And how do we evaluate scholarship that includes data-rich analyses? These are the two questions that I want to bring up in my 5 minutes.

My dissertation sits in the intersection of American religious history and the digital humanities.

Screenshot of dissertation homepage, entitled 'A Gospel of Health and Salvation: A Digital Study of Health and Religion in Seventh-day Adventism, 1843-1920'

I am studying the development of Seventh-day Adventism and the interconnections between their theology and their embrace of health reform. To explore these themes, I am using two main groups of sources: published periodicals of the denomination and the church yearbooks and statistical records, supplemented by archival materials.

Screenshot of the downloaded Yearbooks and Statistical Reports.

In recent years the Seventh-day Adventist church has been undertaking an impressive effort to make their past easily available.

Screenshots of the Adventist Archives and the Adventist Digital Library sites.

Many of the official publications and records are available at the denominational site (http://documents.adventistarchives.org/) and a growing collection of primary materials is being released at the new Adventist Digital Library (http://adventistdigitallibrary.org/).

This abundance of digital resources makes it possible to bring together the computational methods of the digital humanities with historical modes of inquiry to ask broad questions about the development of the denomination and the ways American discourses about health and religious salvation are inter-related.

But … one of the problems of working with data is that while it is easy to present data as clean and authoritative, data are incredibly messy and it is very easy to “do data” poorly.

Chart from tylervigen.com of an apparent correlation between per capita cheese consumption and the number of people who die from becoming tangled in bedsheets.

Just google “how to lie with statistics” (or charts, or visualizations). It happens all the time and it’s really easy to do, intentionally or not. And that is when data sets are ready for analysis.

Most data are not; even “born digital” data are messy and must be “cleaned” before they can be put to use. And when we are dealing with information compiled before computers, things get even more complicated.

Let me give you an example from my dissertation work.

First page of the Adventist yearbooks from 1883 and 1920.

This first page was published in 1883, the first year the denomination dedicated a publication to “such portions of the proceedings of the General Conference, and such other matters, as the Committee may think best to insert therein.” The second is from 1920, the last year in my study.

You can begin to see how this gets complicated quickly. In 1883, the only data recorded for the General Conference were the office holders (President, VP, etc.), and then we go straight into one of the denominational organizations. In 1920, we have a whole column of contextual information on the scope of the denomination, a whole swath of new vice-presidents for the different regions, and so forth.

What happened over those 40 years? The denomination grew, rapidly, and the organization of the denomination matured; new regions were formed, split, and re-formed; the complexity of the bureaucracy increased; and the types of positions available and methods of record keeping changed.

Adding to that complexity of development over time, the information was entered inconsistently, both within a year and across multiple years. For example, George Butler, who was a major figure in the early SDA church, appears in the yearbooks as G.I. Butler; George I. Butler; Geo. I. Butler; Elder Geo. I. Butler … and the list keeps going.

Screenshot of the yearbook data in a Google spreadsheet.

Now what I want is data that looks like this, because I am interested in tracking who is recorded as having leadership positions within the denomination and how that changes over time. But this requires work disambiguating names, addressing changing job titles over time, and deciding which pieces of information, from the abundance of the yearbooks, are useful to my study.

All of that results in having to make numerous interpretive decisions when working to translate this information into something I can work with computationally.

There are many ways to do this, some of which are better than others. My approach has been to use the data from 1920 to set my categories and project backwards. For example, in 1883 there were no regional conferences, so I made the decision to restrict the data I translated to the states that would eventually make up my four regions of study. That is an interpretive choice, and one that places constraints around my use of the data and the possibilities for reusing the data in other research contexts. These are choices that must be documented and explained.
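
To make the name disambiguation a bit more concrete, here is a minimal sketch of the kind of normalization step this work involves. The variant table and the honorific pattern are illustrative only, not my actual cleaning code; a real pass over the yearbooks requires many more rules and a lot of manual checking.

```python
import re

# Hand-built lookup of known variants -> canonical name. The entries here are
# just the Butler variants mentioned above; the real table is much longer.
CANONICAL_NAMES = {
    "g. i. butler": "George I. Butler",
    "g.i. butler": "George I. Butler",
    "george i. butler": "George I. Butler",
    "geo. i. butler": "George I. Butler",
}

def normalize_name(raw_name):
    """Collapse whitespace, drop honorifics, and map known variants to one canonical form."""
    key = re.sub(r"\s+", " ", raw_name.strip().lower())
    key = re.sub(r"^(elder|eld\.)\s+", "", key)  # strip honorifics such as "Elder"
    return CANONICAL_NAMES.get(key, raw_name)    # unknown names pass through unchanged

print(normalize_name("Elder Geo. I. Butler"))  # -> George I. Butler
print(normalize_name("G.I. Butler"))           # -> George I. Butler
```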

Screenshot of the Mason Libraries resources on Data Management.

Fortunately, we are not the first to discover that data are messy and complicated to work with. There are many existing practices around data creation, management, and organization that we can draw on. Libraries are increasingly investing in data services and data management training, and I have greatly benefited from taking part in those conversations.

Quote from Laurie Maffly-Kipp, 'I am left to wonder more generally, though, about the tradeoff necessary as we integrate images and spatial representations into our verbal narratives of religion in American history. In the end, we are still limited by data that is partial, ambiguous, and clearly slanted toward things that can be counted and people that traditionally have been seen as significant.'

Working with data is a challenging, interpretive, and scholarly activity. While we may at times tell lies with our data and our visualizations, all scholarly interpretation is an abstraction, a selective focusing on elements to reveal something about the whole. What is important when working with data is to be transparent about the interpretive choices we are making along the way.

My hope is that as the use of data increases in the study of religion, we also create structures to recognize and evaluate the difficult intellectual work involved in the creation, management, and interpretation of that data.

Selecting a Digital Workflow

Disclaimer: workflows are almost infinitely customizable to fit project goals, intellectual patterns, and individual quirks. I am writing mine up because it has been useful for me to see how others are solving such problems. I invite you to take what is useful, discard what is not, and share what you have found.

Writing a dissertation is, among many things, an exercise in data management. Primary materials and their accompanying notes must be organized. Secondary materials and their notes must be organized. One's own writing must be organized. And, when doing computational analysis, the code, processes, and results must be organized. This, I can attest, can be a daunting mountain of things to organize (and I like to organize).

There are many available solutions for wrangling this data, some of which cost money, some of which are free, and all of which cater to different workflows. For myself, I have found that my reliance on computational methods and my goal of HTML as the final output mean that many of the solutions my colleagues use with great success (such as Scrivener and Evernote) just do not work well for me. In addition, my own restlessness with interfaces has made interoperability a high priority - I need to be able to use many tools on the same document without having to worry about exporting to different formats in-between. This has led to my first constraint: my work needs to be done in a plain text format that, eventually, will easily convert to HTML.

The second problem is version control. On a number of past projects, written in Word or Pages, I created many different named versions along the way. Whenever I got feedback, new versions were created and the changes had to be reincorporated back into the authoritative version. This is all, well, very inefficient. And, while many of these sins can be overcome when writing text for humans, the universe is less forgiving when working with code. And so, constraint number two: my work needs to be under version control and duplication should be limited.

The third problem is one of reproducibility and computational methods. The experiments that I run on my data need to be clearly documented and reproducible. While this is generally good practice, it is especially important for computational research. Reproducible and clearly documented work is important for the consistency of my own work over time, and to enable my scholarship to be interrogated and used by others. And so, the third constraint: my work needs to be done in such a way that my methods are transparent, consistent, and reproducible.

My solution to the first constraint is Markdown. All of my writing is being done in Markdown — including this blog post! Markdown enables me to designate the information hierarchies and easily convert to HTML, but also to work on files in multiple interfaces. Currently, I am writing in Ulysses, but I have also used Notebooks, iA Writer, and Sublime Text. What I love about Ulysses is that I can have one writing interface, but draw from multiple folders, most of which are tied to git repositories and/or Dropbox. Ulysses has a handy file browser window, so it is easy to move between files and see them all as part of the larger project. But regardless of the application being used, it is the same file that I am working with across multiple platforms, applications, and devices.
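
To give a sense of how cheap the final conversion step becomes once everything lives in plain text, here is one way to turn a Markdown file into HTML with the Python markdown package. This is just one common option, and the file names are hypothetical placeholders, not necessarily the toolchain I will end up using for the dissertation.

```python
# Minimal sketch: convert a Markdown chapter to HTML (pip install markdown).
# The file names are hypothetical placeholders.
import markdown

with open("chapter-01.md", encoding="utf-8") as f:
    text = f.read()

html = markdown.markdown(text)  # headings, lists, links, etc. become HTML tags

with open("chapter-01.html", "w", encoding="utf-8") as f:
    f.write(html)
```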

My solution to the second constraint is Github and the Github Student Developer Pack. Git and Github enable version control, multiple branches, and all sorts of powerful ways to keep my computational work organized. In addition, Github’s student pack comes with 5 private repos (private while you are a student), along with a number of other cool perks. While I am a fan of working in the open and plan to open up my work over time, I have also come to value being able to make mistakes in the quiet of my own digital world. Dissertations are hard, and intimidating, and there is something nice about not having to get it right from the very beginning while still being able to share with advisors and select colleagues.

But Github hasn’t proved the best for the writing side of things. The default wiki is limited in its functionality, especially in terms of enabling commenting, and repos are not really made for reading. My current version control solution for the writing side of the project is Penflip, billed as “Github for Writing”. Writing with Penflip can be done through the web interface, or by cloning down the files (which are in Markdown) and working locally. As such, the platform conforms to the Markdown and “one version” requirements. The platform is free if your writing is public, and $10/month for private repos, which is what I am trying out. I am using Penflip a bit like a wiki, with general pages for notes on different primary documents and overarching pages that describe lines of inquiry and link the associated note and code files together.1 This is the core of my workflow - the place that ties together notes, code, and analysis and lays the groundwork for the final product. So far it is working well for the writing, though using it to circulate drafts for comment has been a little bumpy, as the interface needs more refining.

And the code. The code was actually the first of these problems I solved. Of the programming languages I have tried, Python has been the most comfortable to work in. It has great libraries for text analysis, and comes with the benefit of Jupyter Notebooks (previously IPython Notebooks). Working in these notebooks allows me to integrate descriptive text in Markdown with code and the resulting visualizations. The resulting document is plain text, can be versioned, and displays in HTML for sharing. The platform conforms to my first two requirements — Markdown and version control — while also making my methods transparent and reproducible. I run the Jupyter server locally, and am sharing pieces that have been successfully executed via nbviewer - for example, I recently released the code I used to create my pilot sample of one of my primary corpuses. I am also abstracting code that is run more than once into a local library of functions. Having these functions enables me to reliably follow the same processes over time, thus creating results that are reproducible and comparable.
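
As an illustration of what I mean by a local library of functions, here is a minimal sketch of the kind of helper I keep in one place and import into notebooks. The module name, corpus layout, and sample size are hypothetical; the point is that sorting the file list and fixing the random seed means the same sample comes back every time the notebook is re-run.

```python
# corpus_utils.py: a sketch of a small, reusable helper (names are hypothetical).
import random
from pathlib import Path

def sample_corpus(corpus_dir, sample_size, seed=42):
    """Return a reproducible random sample of document paths from a corpus directory.

    A stable population order plus a fixed seed keeps the pilot sample
    identical across runs, so results stay comparable over time.
    """
    docs = sorted(Path(corpus_dir).glob("*.txt"))  # stable population order
    rng = random.Random(seed)
    return rng.sample(docs, min(sample_size, len(docs)))

# In a notebook cell, something like:
# from corpus_utils import sample_corpus
# pilot = sample_corpus("periodicals/", 250)
```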

And that is my current tooling for organizing and writing both my digital experiments and my surrounding explanations and narrative.


  1. I somehow missed the “research wiki” bus a while back, but thanks to a recent comment made by Abby Mullen, I am a happy convert to the whole concept.

Bridging the Gap

I am so excited that Celeste Sharpe and I have been awarded an ACH microgrant for “Bridging the Gap: Women, Code, and the Digital Humanities”. This grant will help us create a curriculum for workshops aimed at bridging the gender gap between those who code and those who do not in the Digital Humanities.

While our first “Rails Girls” event was quite successful, one of the most repeated pieces of feedback we received was that participants were unsure how to connect what they had learned about building a Rails application to their own scholarly work. This is not surprising - the Rails Girls curriculum is aimed at helping women develop the skills necessary to land technical jobs and so focuses on instrumental understandings of code. As our participants reported, this is not the most obviously applicable type of knowledge in the context of humanities research.

What is necessary instead, and what we have been given the grant to develop, is a curriculum that focuses both on technical skills and on computational thinking, the intellectual skill of breaking complex problems into discrete, algorithmically solvable parts.1 It is computational thinking that is necessary to answer the “but what can I do with it?” question. And it is computational thinking, even more than technical know-how, that is necessary for developing interesting research questions that can be solved with computers.
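
To make that concrete with a toy example of my own (not something drawn from the Rails Girls curriculum or the one we are developing): a humanities question like "how often does a periodical use a given term each year?" decomposes into discrete steps a computer can carry out, such as reading each document, counting matches, and grouping the counts by year.

```python
# Toy illustration of computational decomposition; the documents are invented.
import re
from collections import Counter

def count_term_by_year(documents, term):
    """documents: iterable of (year, text) pairs; returns {year: count of term}."""
    counts = Counter()
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    for year, text in documents:
        counts[year] += len(pattern.findall(text))
    return dict(counts)

docs = [(1883, "Health reform and the health of the body..."),
        (1920, "The department of health and temperance...")]
print(count_term_by_year(docs, "health"))  # {1883: 2, 1920: 1}
```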

This emphasis on computational thinking, as opposed to instrumental knowledge, is also a response to a growing concern regarding the emphasis on tools within DH. This concern has already been articulated very well by Matthew Lincoln in his post “Tool Trouble” and by Ted Underwood in his talk “Beyond Tools”. To echo some of Matthew’s points, the focus on tool-use in DH runs the risk of hindering rather than developing computational thinking, hiding the computational work behind a (hopefully) pretty interface and deceptively clear and visually interesting results. The emphasis on “tool-use” reinforces a sense of digital humanities methods as ways to get answers to particular questions along the way to constructing traditional humanities arguments. As a result, digital work undertaken in this manner often fails to grapple with the complex, and often problematic, theoretically-inflected models that digital tools produce.

(As a grand sweeping side note, I think it is this tool-centric approach to digital humanities that is most likely to be adopted by the various disciplines. When computational methods are merely tools, they pose little challenge to established modes of thinking, making it reasonable to say that we will all be digital humanists eventually.)

Tools are useful - they provide a means of addressing particular problems efficiently and consistently. But tools are packaged ways of solving particular problems in specific computational ways, designed according to particular theoretical and philosophical assumptions. And they must be interacted with as such, not as “black boxes” or answer generators that tell us something about our stuff.

Moving beyond tool-use requires computational thinking, and it is the intentional combining of computational and humanities modes of thinking that, I think, produces the most innovative work in the field. In learning to think computationally, one learns to conceptualize problems differently, to view the object of study in multiple ways, and to develop research questions driven both by humanities concerns and computational modes of problem solving. One also becomes able to reflect back on those computational modes and evaluate the ways the epistemological assumptions at work shape the models being created.

Rather than teaching coding as one tool among many for answering humanistic questions, it is increasingly clear to us that we need to teach computational thinking through learning to code - to learn to think through problems in computational ways and to ask new questions as a result. It is this pattern of thinking computationally that we are interested in fostering in our workshops.


  1. This is my reformulation of Cuny, Snyder, and Wing’s definition of computational thinking as “… the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” http://www.cs.cmu.edu/~CompThink/