All Models are Wrong

This piece is cross-posted on the ACH Blog

“Essentially, all models are wrong but some are useful.” This phrase, attributed to George Box, was the mantra for the Data Mining for Digital Humanists course at the Digital Humanities Summer Institute last week. Throughout the course, we worked to understand the mathematical concepts behind some of the most common algorithms used to solve classification and clustering problems. In data mining, the goal is to create a heuristic that allows one to predict with a high degree of accuracy how to classify new data based on the data one has seen in the past. From Amazon and Netflix recommendations to Pandora and even self-driving cars, the results of data mining work are present all across our digital lives.

And yet all models are wrong. In thinking about how to use data mining in my own work, it was helpful to learn that the goal of data mining is not to perfectly fit the model to the data at hand. Data mining is predictive: based on the results of the past, you create a model that predicts the label or classification to be assigned to future data. If you match your model too closely to the original data, if you overfit, the predictive power of the model decreases. Models are created to generalize. There are always outliers and items that are misclassified, but the payoff is the ability to abstract and to get a reliable prediction when faced with new data.
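To make the point about overfitting concrete, here is a minimal sketch in Python. It is an illustration only, not material from the course: the data are synthetic, and every name in it (`make_data`, `fit_threshold`, the 20% noise rate, and so on) is an assumption made for the example. One model memorizes the training data by copying the label of the nearest point it has already seen. The other reduces the data to a single cut-off value.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_data(n, noise=0.2):
    """Hypothetical data: x in [0, 1]; the underlying rule is
    label = (x > 0.5), but each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # a mislabeled item, i.e. an outlier
        data.append((x, label))
    return data


def nearest_neighbor_predict(train, x):
    """Overfit model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def fit_threshold(train):
    """Simple model: one cut-off halfway between the two class means."""
    ones = [x for x, label in train if label]
    zeros = [x for x, label in train if not label]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2


def accuracy(predict, data):
    """Fraction of items whose predicted label matches the recorded one."""
    return sum(predict(x) == label for x, label in data) / len(data)


train = make_data(200)   # the data we have seen
test = make_data(1000)   # new data the models have not seen
threshold = fit_threshold(train)


def memorizer(x):
    return nearest_neighbor_predict(train, x)


def generalizer(x):
    return x > threshold


memorizer_train = accuracy(memorizer, train)
memorizer_test = accuracy(memorizer, test)
generalizer_train = accuracy(generalizer, train)
generalizer_test = accuracy(generalizer, test)

print("memorizer   train/test:", memorizer_train, memorizer_test)
print("generalizer train/test:", generalizer_train, generalizer_test)
```

The memorizing model fits the training data perfectly because every point is its own nearest neighbor, mislabeled items included. The cut-off model is visibly "wrong" about the training data, yet in this setup it should do better on the new data, because it has not also learned the noise.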

All models are wrong but some are useful. This should not be surprising to scholars in the humanities. In history, we frequently work with models to help us elucidate processes and explain events. We use economic models, political models, and models of race, class, and gender to explain actions and causes. While useful for drawing attention to the ways particular factors shaped events in the past, these models are necessarily incomplete. We know that there is more than economics at play, more than politics, more than religion, more than individual genius.

One area of concern surrounding computational approaches to humanities scholarship is the question of truth claims. Will algorithmic approaches fool us into claiming that we have uncovered the “truth” of the text, the “true” cause of a historical event, the “truth” of the human experience? Are computational approaches at odds with humanistic thinking that stresses contingency, subjectivity, and irreducibility? These are open questions that we need to continue to explore.

But all models are wrong. Data mining that seeks the “truth” is data mining done poorly. I think computational approaches can be very useful in the context of the humanities when we are looking for useful patterns to shed light on particular questions. Different models of the same data can answer different questions or suggest different conclusions. This makes data mining not an answer-generating machine but a way of modeling data, and the model is itself the result of theoretical and interpretive choices.

Remembering that all models are wrong is very useful for my own work. I am working to model and discuss the ways health reform ideology, social constructions that delineate power relationships, and religious and theological commitments were in conversation in the development of Seventh-day Adventism. There are many ways to model language similarity and use, some of which will be more useful than others for exploring the interrelationship between multiple cultural systems. My goal, however, is not to create the perfect model, which is impossible. Instead, my goal is to create useful models that help elucidate particular aspects or patterns, models that will necessarily fall short and make no claims to expressing the “truth” of the historical processes I study.

What I am taking away from the data mining course is a better understanding of how those in computer science, statistics, and applied mathematics view the intellectual space where they work. Rather than black boxes that give “truth,” the algorithms developed and used for data mining problems constitute a variety of models that work with different ranges of accuracy on different problems. There is no one right model, but there are a variety of algorithms that one can use to isolate aspects of a given dataset. This allows one to create useful models of the data in order to describe the material at hand, make predictions about it, and understand it better. And this is an understanding that I think has many possibilities for productive overlap with research in the humanities.