Using pyLDAvis with Mallet

Update: I think I have finally figured out the trick to smoothing for the document-topic matrix - really I was making things more difficult than I needed to. Because I optimized the topic distribution, the alpha value IS the list of values, which can be added to the matrix of document-topic counts without any additional trouble or manipulation. I’ve updated the code and the visualization below to reflect this change. As a result, there is more variation in topic size (which makes sense) and the number match the doc-topics output from MALLET (which I should have been using to verify my results all along.)

One useful library for viewing a topic model is LDAvis, an R package for creating interactive web visualizations of topic models, and its Python port, PyLDAvis. This library is focused on visualizing a topic model, using PCA to chart the relationship between topics and between topics and words in the topic model. It is also agnostic about library you use to create the topic model, so long as you extract the necessary data in the correct formats.

While the python version of the library works very smoothly with Gensim, which I have discussed before, there is little documentation for how to move from a topic model created using MALLET to data that can be processed by the LDAvis library. For reasons that require their own blog post, I have shifted from using Gensim for my topic model to using MALLET (spoilers: better documentation of output formats, more widespread use in the humanities so better documentation and code examples generally). But I still wanted to use this library to visualize the full model as a way of generating an overall view of the relationship between the 250 topics it contains.

The documentation for both LDAvis and PyLDAvis relies primarily on code examples to demonstrate how to use the libraries. My primary sources were a python example and two R examples, one focused on manipulating the model data and one on the full model to visualization process. The “details” documentation for the R library also proved key for trouble-shooting when the outputs did not match my expectations. (Pro tip: word order matters.)

Looking at the examples, the data required for the visualization library are:

One challenge is that the order of the data needs to be managed, so that the terms columns in phi, the topic-term matrix, are in the same order as the vocab vector, which is in the same order as the frequencies vector, and the documents index of theta, the document-topic matrix, is in the same order of the document lengths vector.

I was able to isolate and generate all of the data necessary for creating the visualization from the statefile produced by MALLET. For the document lengths, vocabulary, and frequencies, I computed the data on the tokens remaining after stopword removal, so only those tokens considered in the creation of the model. I used the optimize interval in creating the model to allow some topics to be more prominent than others. As a result, the alpha value is a list of values (weights per topic). Here I have relied on the documentation for the R MALLET package to use the sum of these values as the smoothing value for theta, the document-topic matrix.

The first step was to extract the data from the MALLET statefile and into a pandas dataframe.

import gzip
import os
import pandas as pd

dataDir = "../../data/"

def extract_params(statefile):
    """Extract the alpha and beta values from the statefile.

    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
        tuple: alpha (list), beta    
    """
    with gzip.open(statefile, 'r') as state:
        params = [x.decode('utf8').strip() for x in state.readlines()[1:3]]
    return (list(params[0].split(":")[1].split(" ")), float(params[1].split(":")[1]))


def state_to_df(statefile):
    """Transform state file into pandas dataframe.
    The MALLET statefile is tab-separated, and the first two rows contain the alpha and beta hypterparamters.
    
    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
        datframe: topic assignment for each token in each document of the model
    """
    return pd.read_csv(statefile,
                       compression='gzip',
                       sep=' ',
                       skiprows=[1,2]
                       )
params = extract_params(os.path.join(dataDir, 'target_300_10.18497.state.gz'))
alpha = [float(x) for x in params[0][1:]]
beta = params[1]
print("{}, {}".format(alpha, beta))
[0.029812924118304992, 0.009271861921224648, 0.036602792963554454, 0.03036346204521996, 0.020332718571530326, 0.10351890596388356, 0.012755495950966892, 0.04290588069973869, 0.01901706336256544, 0.010000562544823921, 0.008051737793126377, 0.010478767490879254, 0.05102267459262269, 0.017735208419080477, 0.040884777680179944, 0.02240422753511921, 0.04633390719823591, 0.009648564936291307, 0.01221798611153162, 0.00518484471239066, 0.026017510181194876, 0.012790226620751563, 0.01623316505431329, 0.020333139411403224, 0.00825133910195449, 0.009324396700901893, 0.009969352687469672, 0.014058852097669477, 0.014327024486392018, 0.007030687242074, 0.013643807447897123, 0.014522948963225387, 0.04471494487932859, 0.014164304504351164, 0.0039162429342773745, 0.014532343795810298, 0.025680848757922235, 0.01983317219162416, 0.03957426890109219, 0.032049504012664, 0.023936849755631102, 0.020045882246671455, 0.0163365826757628, 0.003664273571444934, 0.01244363813012928, 0.006379376114561163, 0.03183838413167178, 0.0228654378800433, 0.02589730566107523, 0.002725534892222803, 0.020321045426697708, 0.04147171080143742, 0.013214378648651782, 0.01545986360120152, 0.00472439590963874, 0.009571643835736842, 0.011852134455363792, 0.01849484576655846, 0.010789697738530953, 0.010931933129577746, 0.01277599570543656, 0.02865829649526358, 0.025531595810921456, 0.013319388129360477, 0.013671904052002564, 0.03664274901715288, 0.027764002912708176, 0.03798684207498981, 0.010093692241401983, 0.02086852756014119, 0.027915681744247706, 0.007111062378759468, 0.011931852473693062, 0.060321392110113205, 0.010631585925131115, 0.02277232852533824, 0.022347319541299926, 0.020016904891511495, 0.008478233613906505, 0.028181890315513807, 0.01027271825929225, 0.015904757954617755, 0.015791178498843768, 0.0191566503478927, 0.032057954637453895, 0.01993928018855794, 0.009610904469008625, 0.024155275595540187, 0.009383503065703933, 0.011349218953292111, 0.009017549066160908, 0.015974717687080673, 0.01437943660784478, 0.012044976507002564, 0.03732865170330231, 0.013679921501071392, 0.010056445695427858, 0.013700151106363787, 0.015527525242800691, 0.01838062814387135, 0.01003220363012177, 0.011726330465516797, 0.06279021293851472, 0.011306076304927044, 0.022169791301728952, 0.09052698056050601, 0.013273622774595808, 0.016757843984888256, 0.02834518580348568, 0.014328529132486114, 0.008869888943969785, 0.010620834978804661, 0.02856840703138199, 0.026545113171383437, 0.022202598341928788, 0.021789883322435685, 0.02315912700902329, 0.018211012444135744, 0.012913523595199006, 0.010350788910319165, 0.004451048502122248, 0.030355975269286625, 0.003492783830246674, 0.01192034374170573, 0.027700222580153786, 0.018313502769780502, 0.019356895619519046, 0.021516916814583894, 0.0056654564371284525, 0.018829713295067545, 0.01022461760210033, 0.006279341483060046, 0.014364541807704207, 0.001573700008969004, 0.006719003656189242, 0.012605179443199417, 0.024460251523987342, 0.0182499475845198, 0.006206344325333511, 0.012599453549275916, 0.012249376897371305, 0.025768677091400388, 0.013565663314185872, 0.020135876708538898, 0.012443595650252689, 0.010798991574291462, 0.0026960940570410906, 0.01728421395376024, 0.03200629341384553, 0.041563976680458144, 0.011842193878494717, 0.00743704889720417, 0.021797155342851707, 0.019973408255262923, 0.015055290919369782, 0.009117163319745271, 0.01852047216755867, 0.003047546912966365, 0.021264102592926326, 0.007691689543153175, 0.02903386621021987, 0.020374995078762736, 0.019529189458843053, 0.011777296769065458, 0.04174199906345287, 0.01135337310597628, 0.014894292146825408, 0.006795363369957581, 0.012535637546649925, 0.028644526374704934, 0.013160328274689539, 0.022578250880233568, 0.012537369482130595, 0.007340171510908071, 0.014817972919966718, 0.008735804073555823, 0.010983914354572858, 0.032069764330134996, 0.042922790555120396, 0.019530089993512335, 0.03137052494962982, 0.011943734231251193, 0.015418531760651456, 0.01712478478340123, 0.0156949411259937, 0.02785009209993766, 0.0019321257841679888, 0.02615931178010513, 0.014755658587172392, 0.019110932033433007, 0.005800365350818801, 0.010607553788835274, 0.019259346572074787, 0.005043020126756468, 0.012751723193498746, 0.03148187376811713, 0.021164622984691635, 0.02229774988021518, 0.009324050784836864, 0.024370846468811384, 0.004160642339659546, 0.027753906937651587, 0.02283299944435053, 0.007850441485466118, 0.013456229294301635, 0.016590206403546366, 0.020321136161409868, 0.020930630364992903, 0.02339327930121348, 0.015586366826561743, 0.012089620923028578, 0.011029800508113282, 0.006129150966787204, 0.04337105416984723, 0.01483093991200465, 0.005704777036833407, 0.01726502161173706, 0.010514963086575535, 0.02905089499349202, 0.027591588675477965, 0.01397053997745691, 0.01536830959628985, 0.011105366008182125, 0.004900591851721317, 0.05799364001945829, 0.02540643660947351, 0.014525869409298599, 0.02627828623581746, 0.010761810837253307, 0.039146127532474206, 0.01008898466006184, 0.020787880178054812, 0.008164830108028306, 0.009003942798165172, 0.01720164006487483, 0.027123841707823098, 0.0241269324440368, 0.0482433167498602, 0.008091609956704575, 0.00811884016017425, 0.011260778354946063, 0.01764141856430096, 0.014453246410351622, 0.030266455307113138, 0.011962138258688833, 0.012938776484910453, 0.01614189286227352, 0.017650677445550104, 0.016027167728970906, 0.011036778274787372], 0.013712186605802602
df = state_to_df(os.path.join(dataDir, 'target_300_10.18497.state.gz'))

One small complication I found with my data was there is a token nan in the list of “types.” Because nan is used in Pandas to indicate missing integer values, Pandas assumed it was an integer, rather than a string. To address this misidentification problem, I explicitly set the type for the entire column “type” to string. (Two meanings of type. Also confusing.)

df['type'] = df.type.astype(str)
df[:10]
#doc source pos typeindex type topic
0 0 NaN 0 0 uncommon 165
1 0 NaN 1 1 thing 224
2 0 NaN 2 2 member 224
3 0 NaN 3 3 boarding 136
4 0 NaN 4 4 school 14
5 0 NaN 5 5 form 237
6 0 NaN 6 6 clique 94
7 0 NaN 7 7 pair 11
8 0 NaN 8 8 tend 136
9 0 NaN 9 9 thank 224

Above is a preview of the data that I have from the statefile, and which I am going to use generate the data needed for the LDAvis library. It contains the document id, the word position index, the word index, the word, and its topic assignment.

The first bit of data to gather is the length of the documents. To do this, I grouped the data by the document id and counted the tokens in the doc. This data is sorted by the document id, so is in the correct order for the visualization preprocessing.

# Get document lengths from statefile
docs = df.groupby('#doc')['type'].count().reset_index(name ='doc_length')

docs[:10]
#doc doc_length
0 0 116
1 1 102
2 2 108
3 3 97
4 4 121
5 5 113
6 6 123
7 7 137
8 8 121
9 9 132

The second bit of data is the vocabulary and frequencies. Here I used pandas to generate a new frame with the counts for each word. I then sorted this dataframe so that it is alphabetical by type, a step I will repeat in creating the topic-term matrix.

# Get vocab and term frequencies from statefile
vocab = df['type'].value_counts().reset_index()
vocab.columns = ['type', 'term_freq']
vocab = vocab.sort_values(by='type', ascending=True)

vocab[:10]
type term_freq
31901 aaa 77
54240 aaaa 18
48406 aaaaa 26
22633 aad 180
50081 aai 24
52882 aal 20
31404 aalborg 80
48729 aan 26
44793 aand 32
24704 aar 146

Clearly some OCR problems have managed to make their way through all of my preprocessing efforts. With the low term frequencies, however, they should be only minimally present in the final representation of the topic model.

The third step was to create the matrix files. Here is where things get a bit tricky, as there is the adding of smoothing values and normalizing the data so that each row sums to 1. To do the normalizing, I am used sklearn because these are large matrices that require a more optimized function than dividing by the sum of the row with pandas.

# Topic-term matrix from state file
# https://ldavis.cpsievert.me/reviews/reviews.html

import sklearn.preprocessing

def pivot_and_smooth(df, smooth_value, rows_variable, cols_variable, values_variable):
    """
    Turns the pandas dataframe into a data matrix.
    Args:
        df (dataframe): aggregated dataframe 
        smooth_value (float): value to add to the matrix to account for the priors
        rows_variable (str): name of dataframe column to use as the rows in the matrix
        cols_variable (str): name of dataframe column to use as the columns in the matrix
        values_variable(str): name of the dataframe column to use as the values in the matrix
    Returns:
        dataframe: pandas matrix that has been normalized on the rows.
    """
    matrix = df.pivot(index=rows_variable, columns=cols_variable, values=values_variable).fillna(value=0)
    matrix = matrix.values + smooth_value
    
    normed = sklearn.preprocessing.normalize(matrix, norm='l1', axis=1)
    
    return pd.DataFrame(normed)

First, we need to aggregate the data from the statefile dataframe to get the number of topic assignments for word in the documents. For phi, the topic-term matrix, I aggregated by topic and word, counted the number of times each word was assigned to each topic, and then sorted the resulting dataframe alphabetically by word, so that it matches the order of the vocabulary frame. Here I used the beta hyperparamter as the smoothing value.

phi_df = df.groupby(['topic', 'type'])['type'].count().reset_index(name ='token_count')
phi_df = phi_df.sort_values(by='type', ascending=True)

phi_df[:10]
topic type token_count
902220 200 aaa 2
220727 45 aaa 1
802489 177 aaa 10
484128 104 aaa 1
104943 19 aaa 5
930031 207 aaa 7
633014 138 aaa 3
715361 157 aaa 1
983211 219 aaa 4
35018 7 aaa 31
phi = pivot_and_smooth(phi_df, beta, 'topic', 'type', 'token_count')

# phi[:10]

Then we do this again, but focused on the documents and topics, to generate the theta document-topic matrix. Here I used alpha as the smoothing value.

theta_df = df.groupby(['#doc', 'topic'])['topic'].count().reset_index(name ='topic_count')

theta_df[:10]
#doc topic topic_count
0 0 11 10
1 0 14 7
2 0 94 14
3 0 102 2
4 0 113 23
5 0 136 12
6 0 144 1
7 0 165 1
8 0 171 12
9 0 175 1
theta = pivot_and_smooth(theta_df, alpha , '#doc', 'topic', 'topic_count')

# theta[:10]

These data processing steps, as always, represent about 90% of the work needed to use the LDAvis library with a MALLET topic model. Now that I have all of the data in place, I can queue that data up and pass it to the visualization library.

import pyLDAvis

data = {'topic_term_dists': phi, 
        'doc_topic_dists': theta,
        'doc_lengths': list(docs['doc_length']),
        'vocab': list(vocab['type']),
        'term_frequency': list(vocab['term_freq'])
       }

I used the code examples I mentioned earlier for setting the keys of the data dictionary. Keeping the data in sorted dataframes helps ensure that the order is consistent and preserved as it is moved into a list for analysis.

This data is then passed the visualization library, first for preparation and then for display. In preparing the data, the library computes the distances between topics and then projects those distances into a two-dimensional space using “multidimensional scaling” or principle component analysis. In the resulting visualization, the overlap in words of the topic is represented by shared location in space, and the greater the distance between topics, the larger the dissimilarity between the weight of the words that comprise the topic. The second element that the library computes is the most relevant terms for each topic with a sliding scale that either ranks on the overall frequency of the word (the default) or the distinctiveness of the word (lambda = 0.0 on the resulting slider). This provides a richer view of the topic assignments and is useful in labeling for distinguishing between topics.

vis_data = pyLDAvis.prepare(**data)
pyLDAvis.display(vis_data)