Ways to Compute Topics over Time, Part 4

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the last in a series of posts which constitute a “lit review” of sorts, documenting the range of methods scholars are using to compute the distribution of topics over time. The strategies I am considering are:

  • Average of topic weights per year
  • Smoothing or regression analysis of topic weights over time
  • Prevalence of the top topic per year
  • Proportion of total topic weights per year

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

# Load the necessary libraries
import gensim # Note: I am running 1.0.1
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import pyLDAvis.gensim
import seaborn as sns
import warnings
# Enable in-notebook visualizations
%matplotlib inline
# Temporary fix for persistent warnings of an api change between pandas and seaborn.
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 10
base_dir = "data"
period = "1859-to-1875"
directory = "historical_periods"

Create the dataframes

I preprocessed the model to export various aspects of the model information into CSV files for ease of compiling.
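That export code is not shown in this post. As a rough illustration only, a minimal sketch of how the document-topic weights might be pulled from a trained Gensim model into a long CSV, assuming a saved LdaModel and a serialized .mm corpus at hypothetical paths, could look like this:

import gensim
import pandas as pd

# Hypothetical paths; substitute the locations of your own model and corpus files.
lda = gensim.models.LdaModel.load('data/models/1859-to-1875.lda')
corpus = gensim.corpora.MmCorpus('data/corpora/1859-to-1875.mm')

rows = []
for index_pos, bow in enumerate(corpus):
    # lda[bow] reports only the topics above the model's minimum probability
    # (roughly 1% by default), which is why the notebook later fills the
    # missing topic weights with 0.
    for topic_id, weight in lda[bow]:
        rows.append({'index_pos': index_pos, 'topic_id': topic_id, 'topic_weight': weight})

pd.DataFrame(rows).to_csv('1859-to-1875_dtm.csv', index=False)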

metadata_filename = os.path.join(base_dir,'2017-05-corpus-stats/2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe(index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
order = ['conference, committee, report, president, secretary, resolved',
         'quarterly, district, society, send, sept, business',
         'association, publishing, chart, dollar, xxii, sign',
         'mother, wife, told, went, young, school',
         'disease, physician, tobacco, patient, poison, medicine',
         'wicked, immortality, righteous, adam, flesh, hell',
        ]
def create_pointplot(df, y_value, hue=None, order=order, col=None, wrap=None, size=6, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following the pattern of Andrew Goldstone for his topic model browser, is to normalize the weights for each document so that they sum to 1.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.
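For reference only, a minimal sketch of that kind of smoothing appears below. It assumes a wide document-by-topic dataframe (like the dtm built in the next cell) and a hypothetical per-topic alpha array, such as the one stored on a trained Gensim model (lda.alpha); it is not part of the workflow in this notebook.

import numpy as np

def smooth_doc_topics(dtm_wide, alpha):
    """Add the per-topic alpha to each weight, then re-normalize each row to sum to 1."""
    smoothed = dtm_wide + np.asarray(alpha)
    return smoothed.div(smoothed.sum(axis=1), axis=0)

# Example (not run here): smoothed_dtm = smooth_doc_topics(dtm, lda.alpha)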

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
# https://stackoverflow.com/questions/18594469/normalizing-a-pandas-dataframe-by-row
dtm = dtm.div(dtm.sum(axis=1), axis=0)

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']
df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the documents were streamed to link the model data back to the source files. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.

Normalizing Weights to Proportion of Total

The final method for computing topic weights over time is calculating a normalized or proportional weight. If I read the code correctly, this is the approach that Goldstone uses in his dfrtopics library. Rather than computing the average weight, in this approach we sum up all of the values for each topic in a given time frame and then normalize those values by dividing by the sum of all the weights in that period. In the normalized data, the sum of all the normalized weights for each time period is 1.

This allows us to see the proportion of the total weight that is held by each individual topic in each period. As a topic increases in a corpus, it takes up a larger proportion of the total weight.
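Because the groupby-apply step in the next cell can be a little opaque, here is a toy example (with made-up numbers) of what dividing each group by its own sum does:

import pandas as pd

# Made-up weights for two topics across two years, just to illustrate the operation.
toy = pd.DataFrame({'year': [1863, 1863, 1864, 1864],
                    'topic_id': [0, 1, 0, 1],
                    'topic_weight': [30.0, 90.0, 20.0, 60.0]}).set_index(['year', 'topic_id'])

# Within each year, divide each topic's summed weight by that year's total weight.
toy_norm = toy.groupby(level=0).apply(lambda x: x / x.sum())
print(toy_norm)
# Topic 0 accounts for 0.25 of each year's total weight; topic 1 for 0.75.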

y_df = df.groupby(['year', 'topic_id']).agg({'norm_topic_weight': 'sum'})

# This achieves a similar computation as the normalizing function above, but without converting into a matrix first.
# For each group (in this case year), we are dividing the values by the sum of the values. 
ny_df = y_df.groupby(level=0).apply(lambda x: x / x.sum()).reset_index()

ny_df.columns = ['year', 'topic_id', 'normalized_weight']
ny_df = ny_df.merge(labels, on="topic_id")
ny_df
year topic_id normalized_weight topic_words
0 1859 0 0.127749 satan, salvation, sinner, righteousness, peace...
1 1860 0 0.119807 satan, salvation, sinner, righteousness, peace...
2 1861 0 0.132227 satan, salvation, sinner, righteousness, peace...
3 1862 0 0.109252 satan, salvation, sinner, righteousness, peace...
4 1863 0 0.109100 satan, salvation, sinner, righteousness, peace...
... ... ... ... ...
420 1871 24 0.001333 jerusalem, thess, parable, lazarus, thou_hast,...
421 1872 24 0.001614 jerusalem, thess, parable, lazarus, thou_hast,...
422 1873 24 0.001359 jerusalem, thess, parable, lazarus, thou_hast,...
423 1874 24 0.002465 jerusalem, thess, parable, lazarus, thou_hast,...
424 1875 24 0.001354 jerusalem, thess, parable, lazarus, thou_hast,...

425 rows × 4 columns

# Limit the data to our 6 topic sample
ny_df_filtered = ny_df[(ny_df['topic_id'] >= 15) & (ny_df['topic_id'] <= 20)]
create_pointplot(ny_df_filtered, "normalized_weight", hue="topic_words", 
                 title="Proportion of total topic weight by topic per year.")
<seaborn.axisgrid.FacetGrid at 0x116f21588>

This graph is very similar to what we see when we use Seaborn to compute the mean topic weight per year.

create_pointplot(df, 'norm_topic_weight', hue='topic_words',
                 title='Central range of topic weights by topic per year.'
                )
<seaborn.axisgrid.FacetGrid at 0x113bf4160>

One advantage of this method is that it is easy to aggregate by a factor other than time, such as periodical title, to see the overall distribution of topics within different subsets of the corpus.

titles = df.groupby(['title', 'topic_id']).agg({'norm_topic_weight': 'sum'})
titles = titles.groupby(level=0).apply(lambda x: x / x.sum()).reset_index()
titles = titles.merge(labels, on='topic_id')
titles.pivot('title','topic_words', 'norm_topic_weight')\
.plot(kind='bar', stacked=True, colormap='Paired', 
      figsize=(8,7), title='Normalized proportions of topic weights per title.')\
.legend(bbox_to_anchor=(1.75, 1))
<matplotlib.legend.Legend at 0x106eb4358>

Another advantage of this method is that it makes it easier to accommodate the default Gensim data output (where values less than 1% are dropped). We can see this by charting the normalized weights after dropping the zero values and comparing that to a chart of the average weights after dropping the zero values.

# Remove the zero values we added in when normalizing each document
df2 = df[df['norm_topic_weight'] > 0]
p_df2 = df2.groupby(['year', 'topic_id']).agg({'norm_topic_weight': 'sum'})

# Normalize each group by dividing by the sum of the group
np_df2 = p_df2.groupby(level=0).apply(lambda x: x / x.sum()).reset_index()
np_df2.columns = ['year', 'topic_id', 'normalized_weight']
np_df2 = np_df2.merge(labels, on="topic_id")
np_df2_filtered = np_df2[(np_df2['topic_id'] >= 15) & (np_df2['topic_id'] <= 20)]
create_pointplot(np_df2_filtered, "normalized_weight", 
hue="topic_words", 
title="Normalized topic weights by topic per year, zero values excluded.")
<seaborn.axisgrid.FacetGrid at 0x11434df60>

By contrast, if we compute the mean values on the default Gensim output, we get a very different (and skewed) picture of the topic weights per year.

create_pointplot(df2, 'norm_topic_weight', hue='topic_words',
                 title='Central range of topic weights by topic per year, zero values excluded.'
                )
<seaborn.axisgrid.FacetGrid at 0x115797d68>

This series of posts came about as I went searching for models of how to aggregate the topic weights per year and, in no small part because I forgot to account for the different ways Gensim and MALLET handle low topic weights, produced very different graphs depending on the aggregation method. Each method I found is useful for highlighting different aspects of the data: computing the mean tells us about the average weight of each topic in a given group; computing smoothing lines and rolling averages points out broader trends for a particular topic; computing prevalence helps us see when a topic is prominent; and computing normalized or proportional weights helps us see the topics in relationship to one another.

If I have understood these methods correctly, then I find the proportional weight to be most useful for my particular questions and my data. I am interested in when and where different topics spike relative to other topics as a way to guide further research into the documents, a line of inquiry that the proportional weights are useful for exploring. Computing the proportional or normalized weights also helps me handle the default Gensim data more smoothly (by which I mean, with fewer opportunities for me to make mistakes).

A note on the visualizations: The charts in these notebooks are not my favorite in terms of design, readability, and accessibility. My apologies to anyone for whom they are illegible. I am still working out how best to create these visualizations for a range of users and am happy to provide an alternative if needed.

A special thanks to Amanda Regan for talking me through the code for her visualizations of Eleanor Roosevelt’s My Day.


You can download and run the code locally using the Jupyter Notebook version of this post

Ways to Compute Topics over Time, Part 3

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the third in a series of posts which constitute a “lit review” of sorts, documenting the range of methods scholars are using to compute the distribution of topics over time.

Graphs of topic prevalence over time are some of the most ubiquitous in digital humanities discussions of topic modeling. They are used as a mechanism for identifying spikes in discourse and for depicting the relationship between the various discourses in a corpus.

Topic prevalence over time is not, however, a measure that is returned with the standard modeling tools such as MALLET or Gensim. Instead, it is computed after the fact by combining the model data with external metadata and aggregating the model results. And, as it turns out, there are a number of ways that the data can be aggregated and displayed. In this series of notebooks, I am looking at 4 different strategies for computing topic significance over time. These strategies are:

  • Average of topic weights per year
  • Smoothing or regression analysis of topic weights over time
  • Prevalence of the top topic per year
  • Proportion of total topic weights per year

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

If you cloned the notebooks from one of the earlier posts, please pull down the latest version and update your environment (see the README for help). I forgot to add a few libraries that are needed to run these notebooks.

# Load the necessary libraries
import json
import logging
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import seaborn as sns
# Enable in-notebook visualizations
%matplotlib inline
pd.options.display.max_rows = 10
base_dir = "data"
period = '1859-to-1875'
directory = "historical_periods"

Create the dataframes

I preprocessed the model to export various aspects of the model information into CSV files for ease of compiling. I will be releasing the code I used to export that information in a later notebook.

metadata_filename = os.path.join(base_dir,'2017-05-corpus-stats/2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe(index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
order = ['conference, committee, report, president, secretary, resolved',
         'quarterly, district, society, send, sept, business',
         'association, publishing, chart, dollar, xxii, sign',
         'mother, wife, told, went, young, school',
         'disease, physician, tobacco, patient, poison, medicine',
         'wicked, immortality, righteous, adam, flesh, hell',
        ]
def create_plotpoint(df, y_value, hue=None, order=order, col=None, wrap=None, size=6, aspect=1.5, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following the pattern of Andrew Goldstone for his topic model browser, is to normalize the weights for each document so that they sum to 1.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
# Since last week I have found a cleaner way to normalize the rows.
# https://stackoverflow.com/questions/18594469/normalizing-a-pandas-dataframe-by-row
dtm = dtm.div(dtm.sum(axis=1), axis=0)

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']
df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the documents were streamed to link the model data back to the source files. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.

Computing Topic Prevalence

The third approach I found for calculating topic significance over time is computing topic prevalence. The primary example I found of this approach is Adrien Guille’s TOM (TOpic Modeling) library for Python. Rather than averaging the weights, his approach is to set a baseline for determining whether a topic is significantly present (in this case, the topic with the highest weight for a document) and then to compute the percentage of documents in a given year where the topic is significantly present.

Following the pattern in the TOM library, we can compute the prevalence of the topics by identifying the topic with the highest weight per document, grouping the results by year, adding up the number of top occurrences of each topic per year and dividing by the total number of documents per year.
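The filtering idiom in the first line of the next cell, which keeps only the row where each document's topic weight equals its maximum, can be easier to see on a toy dataframe with made-up values:

import pandas as pd

# Made-up weights: two documents, three topics each.
toy = pd.DataFrame({'index_pos': [0, 0, 0, 1, 1, 1],
                    'topic_id': [0, 1, 2, 0, 1, 2],
                    'norm_topic_weight': [0.1, 0.7, 0.2, 0.5, 0.3, 0.2]})

# transform('max') broadcasts each document's maximum weight back onto its rows,
# so the comparison keeps only the top topic per document.
top = toy[toy.groupby('index_pos')['norm_topic_weight'].transform('max') == toy['norm_topic_weight']]
print(top)
# Keeps topic 1 for document 0 and topic 0 for document 1.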

# Group by document and take the row with max topic weight for each document
max_df = df[df.groupby(['index_pos'])['norm_topic_weight'].transform(max) == df['norm_topic_weight']]

# Group by year and topic, counting the number of documents per topic per year.
max_counts = max_df[['doc_id', 'year', 'topic_id']].groupby(['year', 'topic_id']).agg({'doc_id' : 'count'}).reset_index()
max_counts.columns = ['year', 'topic_id', 'max_count']
# Count the number of individual documents per year
total_docs = max_df[['year', 'doc_id']].groupby('year').agg({'doc_id' : 'count'}).reset_index()
total_docs.columns = ['year', 'total_docs']
# Combine the two dataframes
max_counts = max_counts.merge(total_docs, on='year', how='left')

# Create a new column with the count per topic divided by the total docs per year
max_counts['prevalence'] = max_counts['max_count']/max_counts['total_docs']

# Add the topic labels to make human-readable
max_counts = max_counts.merge(labels, on="topic_id")
max_counts
year topic_id max_count total_docs prevalence topic_words
0 1859 0 90 512 0.175781 satan, salvation, sinner, righteousness, peace...
1 1860 0 79 512 0.154297 satan, salvation, sinner, righteousness, peace...
2 1861 0 79 408 0.193627 satan, salvation, sinner, righteousness, peace...
3 1862 0 67 514 0.130350 satan, salvation, sinner, righteousness, peace...
4 1863 0 59 424 0.139151 satan, salvation, sinner, righteousness, peace...
... ... ... ... ... ... ...
352 1867 2 1 951 0.001052 animal, fruit, horse, flesh, sheep, gardner
353 1872 2 7 904 0.007743 animal, fruit, horse, flesh, sheep, gardner
354 1874 2 3 883 0.003398 animal, fruit, horse, flesh, sheep, gardner
355 1869 16 3 682 0.004399 association, publishing, chart, dollar, xxii, ...
356 1872 16 1 904 0.001106 association, publishing, chart, dollar, xxii, ...

357 rows × 6 columns

# Limit to our 6 test topics
mc_s = max_counts[(max_counts['topic_id'] >= 15) & (max_counts['topic_id'] <= 20)]
create_plotpoint(mc_s, 'prevalence', hue='topic_words',
                 title='Percentage of documents where topic is most significant per year'
                )
<seaborn.axisgrid.FacetGrid at 0x114eebf98>

If we look back at our chart of the average topic weights per year, we can see that the two sets of lines are similar, but not the same. If we rely on prevalence, we see a larger spike of interest in health in 1867, in terms of pages dedicated to the topic. We also see more dramatic spikes in our topic on “mother, wife, told, went, young, school.”

create_plotpoint(df, 'norm_topic_weight', hue='topic_words',
                 title='Central range of topic weights by year.'
                )
<seaborn.axisgrid.FacetGrid at 0x1154bc400>

While not apparent from the data discussed here, the spikes in “mother, wife, told, went, young, school” correspond with the years where The Youth’s Instructor is part of the corpus (1859, 1860, 1862, 1870, 1871, 1872). While the topic is clearly capturing language that occurs in multiple publications, the presence or absence of the title has a noticeable effect. We can use smoothing to adjust for the missing data (if we’re interested in the overall trajectory of the topic), or use the information to frame our exploration of what the topic is capturing.


You can download and run the code locally using the Jupyter Notebook version of this post

Ways to Compute Topics over Time, Part 2

This is part of a series of technical essays documenting the computational analysis that undergirds my dissertation, A Gospel of Health and Salvation. For an overview of the dissertation project, you can read the current project description at jeriwieringa.com. You can access the Jupyter notebooks on Github.


This is the second in a series of posts which constitute a “lit review” of sorts, documenting the range of methods scholars are using to compute the distribution of topics over time.

Graphs of topic prevalence over time are some of the most ubiquitous in digital humanities discussions of topic modeling. They are used as a mechanism for identifying spikes in discourse and for depicting the relationship between the various discourses in a corpus.

Topic prevalence over time is not, however, a measure that is returned with the standard modeling tools such as MALLET or Gensim. Instead, it is computed after the fact by combining the model data with external metadata and aggregating the model results. And, as it turns out, there are a number of ways that the data can be aggregated and displayed. In this series of notebooks, I am looking at 4 different strategies for computing topic significance over time. These strategies are:

  • Average of topic weights per year
  • Smoothing or regression analysis of topic weights over time
  • Prevalence of the top topic per year
  • Proportion of total topic weights per year

To explore a range of strategies for computing and visualizing topics over time from a standard LDA model, I am using a model I created from my dissertation materials. You can download the files needed to follow along from https://www.dropbox.com/s/9uf6kzkm1t12v6x/2017-06-21.zip?dl=0.

# Load the necessary libraries
from ggplot import *
import json
import logging
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd # Note: I am running 0.19.2
import pyLDAvis.gensim
import seaborn as sns
import warnings
# Enable in-notebook visualizations
%matplotlib inline
pyLDAvis.enable_notebook()
# Temporary fix for persistent warnings of an api change between pandas and seaborn.
warnings.filterwarnings('ignore')
pd.options.display.max_rows = 10
base_dir = "data"
period = "1859-to-1875"
directory = "historical_periods"

Create the dataframes

metadata_filename = os.path.join(base_dir,'2017-05-corpus-stats/2017-05-Composite-OCR-statistics.csv')
index_filename = os.path.join(base_dir, 'corpora', directory, '{}.txt'.format(period))
labels_filename = os.path.join(base_dir, 'dataframes', directory, '{}_topicLabels.csv'.format(period))
doc_topic_filename = os.path.join(base_dir, 'dataframes', directory, '{}_dtm.csv'.format(period))
def doc_list(index_filename):
    """
    Read in from a json document with index position and filename. 
    File was created during the creation of the corpus (.mm) file to document
    the filename for each file as it was processed.
    
    Returns the index information as a dataframe.
    """
    with open(index_filename) as data_file:    
        data = json.load(data_file)
    docs = pd.DataFrame.from_dict(data, orient='index').reset_index()
    docs.columns = ['index_pos', 'doc_id']
    docs['index_pos'] = docs['index_pos'].astype(int)
  
    return docs


def compile_dataframe(index, dtm, labels, metadata):
    """
    Combines a series of dataframes to create a large composite dataframe.
    """
    doc2metadata = index.merge(metadata, on='doc_id', how="left")
    topics_expanded = dtm.merge(labels, on='topic_id')
    
    df = topics_expanded.merge(doc2metadata, on="index_pos", how="left")
    
    return df
metadata = pd.read_csv(metadata_filename, usecols=['doc_id', 'year','title'])
docs_index = doc_list(index_filename)
dt = pd.read_csv(doc_topic_filename)
labels = pd.read_csv(labels_filename)

The first step, following the pattern of Andrew Goldstone for his topic model browser, is to normalize the weights for each document so that they sum to 1.

As a note, Goldstone first smooths the weights by adding the alpha hyperparameter to each of the weights, which I am not doing here.

# Reorient from long to wide
dtm = dt.pivot(index='index_pos', columns='topic_id', values='topic_weight').fillna(0)

# Divide each value in a row by the sum of the row to normalize the values
dtm = (dtm.T/dtm.sum(axis=1)).T

# Shift back to a long dataframe
dt_norm = dtm.stack().reset_index()
dt_norm.columns = ['index_pos', 'topic_id', 'norm_topic_weight']
df = compile_dataframe(docs_index, dt_norm, labels, metadata)
df
index_pos topic_id norm_topic_weight topic_words doc_id year title
0 0 0 0.045525 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page1.txt 1863 GCB
1 1 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page2.txt 1863 GCB
2 2 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page3.txt 1863 GCB
3 3 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page4.txt 1863 GCB
4 4 0 0.000000 satan, salvation, sinner, righteousness, peace... GCB186305XX-VXX-XX-page5.txt 1863 GCB
... ... ... ... ... ... ... ...
288595 11539 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page4.txt 1872 YI
288596 11540 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page5.txt 1872 YI
288597 11541 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page6.txt 1872 YI
288598 11542 24 0.012192 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page7.txt 1872 YI
288599 11543 24 0.000000 jerusalem, thess, parable, lazarus, thou_hast,... YI18721201-V20-12-page8.txt 1872 YI

288600 rows × 7 columns

Data dictionary:

  • index_pos : Gensim uses the order in which the documents were streamed to link the model data back to the source files. index_pos refers to the index id for the individual doc, which I used to link the resulting model information with the document name.
  • topic_id : The numerical id for each topic. For this model, I used 25 topics to classify the periodical pages.
  • norm_topic_weight : The proportion of the tokens in the document that are part of the topic, normalized per doc.
  • topic_words : The top 6 words in the topic.
  • doc_id : The file name of the document. The filename contains metadata information about the document, such as the periodical title, date of publication, volume, issue, and page number.
  • year : Year the document was published (according to the filename)
  • title : Periodical that the page was published in.
order = ['conference, committee, report, president, secretary, resolved',
         'quarterly, district, society, send, sept, business',
         'association, publishing, chart, dollar, xxii, sign',
         'mother, wife, told, went, young, school',
         'disease, physician, tobacco, patient, poison, medicine',
         'wicked, immortality, righteous, adam, flesh, hell',
        ]
def create_plotpoint(df, y_value, hue=None, order=order, col=None, wrap=None, size=6, aspect=2, title=""):
    p = sns.factorplot(x="year", y=y_value, kind='point', 
                        hue_order=order, hue=hue, 
                       col=col, col_wrap=wrap, col_order=order, 
                       size=size, aspect=aspect, data=df)
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle(title, fontsize=16)
    return p

Smoothing or Regression Analysis

The second popular strategy I found for visualizing topics over time is using a smoothing function to highlight trends in the data. This strategy is most commonly executed through the graphing library, rather than computed before graphing.

The prime example is the geom_smooth option in ggplot. This is a little harder to do with Seaborn, but fortunately there is a Python port of ggplot, so I can demonstrate the approach on my model. For the sake of simplicity, and because faceting with the Python version is buggy, I will work with only one topic at a time for this experiment.

t17 = df[df['topic_id']== 17]
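
As an aside, a comparable smoothed fit can be drawn directly in Seaborn with regplot. This is only a rough sketch, not the approach taken in this post, and the lowess option assumes the statsmodels package is installed:

import seaborn as sns

# A rough Seaborn analogue to geom_point() + stat_smooth(method='loess').
sns.regplot(x='year', y='norm_topic_weight', data=t17,
            lowess=True, scatter_kws={'alpha': 0.2})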

The loess smoother is particularly suited to surfacing trends in timeseries and noisy data, and is the default smoother in ggplot for samples with fewer than 1000 observations; the gam smoother is the default for larger datasets.1 While I have more than 1000 observations, I will use both to illustrate the differences.

ggplot(t17, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (287572516)>
ggplot(t17, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (-9223372036567940983)>

The collection of “0” values appears to have caused the function to flatline at “0” so let’s filter those out (cringing as we do, because now the total number of observations each year will vary). If you are using data produced by MALLET you might not have this problem. But someone else can experiment with that.

# Drop all rows where the topic weight is 0
t17f = t17[t17['norm_topic_weight'] > 0]
ggplot(t17f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (287168092)>
ggplot(t17f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (287167152)>

While not quite as pretty as the graphs from R, these graphs help illustrate the calculations.

The first observation is that the lines are functionally equivalent, so both smoothing functions found a similar line.

Looking at the line, we see a gradual increase over the first 7 years of the dataset, a more rapid increase between 1866 and 1869, and then leveling off at around 25% for the rest of the period. That matches what we saw when charting averages in the last post, though it hides the variations from year to year.

Topic 17, or “disease, physician, tobacco, patient, poison, medicine” was a topic with a large shift in our corpus, so for comparison, let us look at a more typical topic.

Topic 20: “mother, wife, told, went, young, school”

t20 = df[df['topic_id'] == 20]
ggplot(t20, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (-9223372036567088787)>

My best guess as to why the unfiltered topic 20 does not flatline at “0” where topic 17 did is that while health topics are either present or not on a page (and so have a high proportion of “0” value documents), “mother, wife, told, went, young, school” is a recurring feature of most pages in a given year.

This could indicate that we need more topics in our model. But it also suggests something interesting about the language of the denomination: this topic is consistently part of the conversation.

t20f = t20[t20['norm_topic_weight'] > 0]
ggplot(t20f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='loess')

<ggplot: (288187824)>
ggplot(t20f, aes(x='year', y='norm_topic_weight')) + geom_point() + stat_smooth(method='gam')

<ggplot: (-9223372036566588149)>

While removing the “0” values did result in moving the line up ever so slightly, the overall pattern is the same.

Rolling Averages

While smoothing uses regression to identify trend lines, another strategy for computing the overall trajectory of a topic is to visualize the average over a rolling time window (such as a five-year average). This approach emphasizes longer-term patterns in topic weight. An example of this can be seen in the work of John Laudun and Jonathan Goodwin in “Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics.”2

Computing the average over time is one calculation I struggled to work with. Back in Part 1, I mentioned that one of the big differences between the output from MALLET and Gensim is that while MALLET returns a weight for every topic in every document, Gensim filters out the weights under 1%. Depending on how you handle the data, that can drastically affect the way averages are computed. If you have a weight for all topics in all documents, then the average is over the whole corpus. If you only have weights for the significant topic assignments, things get wonky really fast.

Since we “inflated” our dataframe to include a “0” topic weight for all missing values in the Gensim output, the denominator for all of our averages is the same. That being the case, I believe we can compute the rolling mean by taking the average each year and then using the rolling_mean function in Pandas to compute the average across them.

# Group the data by year and compute the mean of the topic weights
t17grouped = t17.groupby('year')['norm_topic_weight'].mean()

# Compute the rolling mean over a 3-year window and rename the columns for graphing.
t17rolling = pd.rolling_mean(t17grouped, 3).reset_index()
t17rolling.columns = ['year', 'rolling_mean']
create_plotpoint(t17rolling, 'rolling_mean')
<seaborn.axisgrid.FacetGrid at 0x112577c18>
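
As a side note for anyone on a newer version of pandas than the 0.19.2 used here: pd.rolling_mean has since been removed, and the same calculation is written with the rolling method instead.

# Equivalent 3-year rolling mean using the newer pandas API.
t17rolling = t17grouped.rolling(window=3).mean().reset_index()
t17rolling.columns = ['year', 'rolling_mean']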

One thing to keep in mind is that both the smoothing functions and the rolling mean are strategies developed particularly for timeseries data, that is, data produced at regular intervals by some recording instrument. While an individual journal might be released on a semi-regular schedule, once we are working across multiple publications, this gets much more complicated. Because I don’t have “real” timeseries data, I am hesitant about this approach.

Both the smoothing function in ggplot and the use of a rolling mean are methods for capturing the overall trend of a topic over time. In some ways, this feels like the holy grail metric — we are computing the trajectory of a particular discourse. But it is also an even more abstracted depiction of the topic weights than we drew when computing the averages. In describing his choice of bars to represent topic prevalence in a given year, Goldstone notes “I have chosen bars to emphasize that the model does not assume a smooth evolution in topics, and neither should we; this does make it harder to see a time trend, however.”3

Both the smoothing function and the rolling mean are strategies for minimizing the dips and spikes of a particular year. If your question relates to the long-term trend of a topic (assuming your document base can support it), that is their advantage. But if your question is about discourse at a particular point or over a shorter period of time, that is their weakness.

Next up: topic prevalence, in an epidemiology sort of way.

You can download and run the code locally using the Jupyter Notebook version of this post


  1. http://ggplot2.tidyverse.org/reference/geom_smooth.html 

  2. John Laudun and Jonathan Goodwin. “Computing Folklore Studies: Mapping over a Century of Scholarly Production through Topics.” Journal of American Folklore 126, no. 502 (2013): 455-475. https://muse.jhu.edu/ (accessed June 23, 2017). 

  3. https://agoldst.github.io/dfr-browser/