Using pyLDAvis with Mallet | Jeri E. Wieringa

Update: I think I have finally figured out the trick to smoothing for the document-topic matrix - really I was making things more difficult than I needed to. Because I optimized the topic distribution, the alpha value IS the list of values, which can be added to the matrix of document-topic counts without any additional trouble or manipulation. I’ve updated the code and the visualization below to reflect this change. As a result, there is more variation in topic size (which makes sense) and the number match the doc-topics output from MALLET (which I should have been using to verify my results all along.)

One useful library for viewing a topic model is LDAvis, an R package for creating interactive web visualizations of topic models, and its Python port, PyLDAvis. This library is focused on visualizing a topic model, using PCA to chart the relationship between topics and between topics and words in the topic model. It is also agnostic about library you use to create the topic model, so long as you extract the necessary data in the correct formats.

While the python version of the library works very smoothly with Gensim, which I have discussed before, there is little documentation for how to move from a topic model created using MALLET to data that can be processed by the LDAvis library. For reasons that require their own blog post, I have shifted from using Gensim for my topic model to using MALLET (spoilers: better documentation of output formats, more widespread use in the humanities so better documentation and code examples generally). But I still wanted to use this library to visualize the full model as a way of generating an overall view of the relationship between the 250 topics it contains.

The documentation for both LDAvis and PyLDAvis relies primarily on code examples to demonstrate how to use the libraries. My primary sources were a python example and two R examples, one focused on manipulating the model data and one on the full model to visualization process. The “details” documentation for the R library also proved key for trouble-shooting when the outputs did not match my expectations. (Pro tip: word order matters.)

Looking at the examples, the data required for the visualization library are:

topic-term distributions (matrix, phi)
document-topic distributions (matrix, theta)
document lengths (number vector)
vocab (character vector)
term frequencies (number vector)

One challenge is that the order of the data needs to be managed, so that the terms columns in phi, the topic-term matrix, are in the same order as the vocab vector, which is in the same order as the frequencies vector, and the documents index of theta, the document-topic matrix, is in the same order of the document lengths vector.

I was able to isolate and generate all of the data necessary for creating the visualization from the statefile produced by MALLET. For the document lengths, vocabulary, and frequencies, I computed the data on the tokens remaining after stopword removal, so only those tokens considered in the creation of the model. I used the optimize interval in creating the model to allow some topics to be more prominent than others. As a result, the alpha value is a list of values (weights per topic). ~~Here I have relied on the documentation for the R MALLET package to use the sum of these values as the smoothing value for theta, the document-topic matrix.~~

The first step was to extract the data from the MALLET statefile and into a pandas dataframe.

import gzip
import os
import pandas as pd

dataDir = "../../data/"

def extract_params(statefile):
    """Extract the alpha and beta values from the statefile.

    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
        tuple: alpha (list), beta    
    """
    with gzip.open(statefile, 'r') as state:
        params = [x.decode('utf8').strip() for x in state.readlines()[1:3]]
    return (list(params[0].split(":")[1].split(" ")), float(params[1].split(":")[1]))


def state_to_df(statefile):
    """Transform state file into pandas dataframe.
    The MALLET statefile is tab-separated, and the first two rows contain the alpha and beta hypterparamters.
    
    Args:
        statefile (str): Path to statefile produced by MALLET.
    Returns:
        datframe: topic assignment for each token in each document of the model
    """
    return pd.read_csv(statefile,
                       compression='gzip',
                       sep=' ',
                       skiprows=[1,2]
                       )

params = extract_params(os.path.join(dataDir, 'target_300_10.18497.state.gz'))

alpha = [float(x) for x in params[0][1:]]
beta = params[1]
print("{}, {}".format(alpha, beta))

[0.029812924118304992, 0.009271861921224648, 0.036602792963554454, 0.03036346204521996, 0.020332718571530326, 0.10351890596388356, 0.012755495950966892, 0.04290588069973869, 0.01901706336256544, 0.010000562544823921, 0.008051737793126377, 0.010478767490879254, 0.05102267459262269, 0.017735208419080477, 0.040884777680179944, 0.02240422753511921, 0.04633390719823591, 0.009648564936291307, 0.01221798611153162, 0.00518484471239066, 0.026017510181194876, 0.012790226620751563, 0.01623316505431329, 0.020333139411403224, 0.00825133910195449, 0.009324396700901893, 0.009969352687469672, 0.014058852097669477, 0.014327024486392018, 0.007030687242074, 0.013643807447897123, 0.014522948963225387, 0.04471494487932859, 0.014164304504351164, 0.0039162429342773745, 0.014532343795810298, 0.025680848757922235, 0.01983317219162416, 0.03957426890109219, 0.032049504012664, 0.023936849755631102, 0.020045882246671455, 0.0163365826757628, 0.003664273571444934, 0.01244363813012928, 0.006379376114561163, 0.03183838413167178, 0.0228654378800433, 0.02589730566107523, 0.002725534892222803, 0.020321045426697708, 0.04147171080143742, 0.013214378648651782, 0.01545986360120152, 0.00472439590963874, 0.009571643835736842, 0.011852134455363792, 0.01849484576655846, 0.010789697738530953, 0.010931933129577746, 0.01277599570543656, 0.02865829649526358, 0.025531595810921456, 0.013319388129360477, 0.013671904052002564, 0.03664274901715288, 0.027764002912708176, 0.03798684207498981, 0.010093692241401983, 0.02086852756014119, 0.027915681744247706, 0.007111062378759468, 0.011931852473693062, 0.060321392110113205, 0.010631585925131115, 0.02277232852533824, 0.022347319541299926, 0.020016904891511495, 0.008478233613906505, 0.028181890315513807, 0.01027271825929225, 0.015904757954617755, 0.015791178498843768, 0.0191566503478927, 0.032057954637453895, 0.01993928018855794, 0.009610904469008625, 0.024155275595540187, 0.009383503065703933, 0.011349218953292111, 0.009017549066160908, 0.015974717687080673, 0.01437943660784478, 0.012044976507002564, 0.03732865170330231, 0.013679921501071392, 0.010056445695427858, 0.013700151106363787, 0.015527525242800691, 0.01838062814387135, 0.01003220363012177, 0.011726330465516797, 0.06279021293851472, 0.011306076304927044, 0.022169791301728952, 0.09052698056050601, 0.013273622774595808, 0.016757843984888256, 0.02834518580348568, 0.014328529132486114, 0.008869888943969785, 0.010620834978804661, 0.02856840703138199, 0.026545113171383437, 0.022202598341928788, 0.021789883322435685, 0.02315912700902329, 0.018211012444135744, 0.012913523595199006, 0.010350788910319165, 0.004451048502122248, 0.030355975269286625, 0.003492783830246674, 0.01192034374170573, 0.027700222580153786, 0.018313502769780502, 0.019356895619519046, 0.021516916814583894, 0.0056654564371284525, 0.018829713295067545, 0.01022461760210033, 0.006279341483060046, 0.014364541807704207, 0.001573700008969004, 0.006719003656189242, 0.012605179443199417, 0.024460251523987342, 0.0182499475845198, 0.006206344325333511, 0.012599453549275916, 0.012249376897371305, 0.025768677091400388, 0.013565663314185872, 0.020135876708538898, 0.012443595650252689, 0.010798991574291462, 0.0026960940570410906, 0.01728421395376024, 0.03200629341384553, 0.041563976680458144, 0.011842193878494717, 0.00743704889720417, 0.021797155342851707, 0.019973408255262923, 0.015055290919369782, 0.009117163319745271, 0.01852047216755867, 0.003047546912966365, 0.021264102592926326, 0.007691689543153175, 0.02903386621021987, 0.020374995078762736, 0.019529189458843053, 0.011777296769065458, 0.04174199906345287, 0.01135337310597628, 0.014894292146825408, 0.006795363369957581, 0.012535637546649925, 0.028644526374704934, 0.013160328274689539, 0.022578250880233568, 0.012537369482130595, 0.007340171510908071, 0.014817972919966718, 0.008735804073555823, 0.010983914354572858, 0.032069764330134996, 0.042922790555120396, 0.019530089993512335, 0.03137052494962982, 0.011943734231251193, 0.015418531760651456, 0.01712478478340123, 0.0156949411259937, 0.02785009209993766, 0.0019321257841679888, 0.02615931178010513, 0.014755658587172392, 0.019110932033433007, 0.005800365350818801, 0.010607553788835274, 0.019259346572074787, 0.005043020126756468, 0.012751723193498746, 0.03148187376811713, 0.021164622984691635, 0.02229774988021518, 0.009324050784836864, 0.024370846468811384, 0.004160642339659546, 0.027753906937651587, 0.02283299944435053, 0.007850441485466118, 0.013456229294301635, 0.016590206403546366, 0.020321136161409868, 0.020930630364992903, 0.02339327930121348, 0.015586366826561743, 0.012089620923028578, 0.011029800508113282, 0.006129150966787204, 0.04337105416984723, 0.01483093991200465, 0.005704777036833407, 0.01726502161173706, 0.010514963086575535, 0.02905089499349202, 0.027591588675477965, 0.01397053997745691, 0.01536830959628985, 0.011105366008182125, 0.004900591851721317, 0.05799364001945829, 0.02540643660947351, 0.014525869409298599, 0.02627828623581746, 0.010761810837253307, 0.039146127532474206, 0.01008898466006184, 0.020787880178054812, 0.008164830108028306, 0.009003942798165172, 0.01720164006487483, 0.027123841707823098, 0.0241269324440368, 0.0482433167498602, 0.008091609956704575, 0.00811884016017425, 0.011260778354946063, 0.01764141856430096, 0.014453246410351622, 0.030266455307113138, 0.011962138258688833, 0.012938776484910453, 0.01614189286227352, 0.017650677445550104, 0.016027167728970906, 0.011036778274787372], 0.013712186605802602

df = state_to_df(os.path.join(dataDir, 'target_300_10.18497.state.gz'))

One small complication I found with my data was there is a token nan in the list of “types.” Because nan is used in Pandas to indicate missing integer values, Pandas assumed it was an integer, rather than a string. To address this misidentification problem, I explicitly set the type for the entire column “type” to string. (Two meanings of type. Also confusing.)

df['type'] = df.type.astype(str)
df[:10]

	source	pos	typeindex	type	topic
0	NaN	0	0	uncommon	165
1	NaN	1	1	thing	224
2	NaN	2	2	member	224
3	NaN	3	3	boarding	136
4	NaN	4	4	school	14
5	NaN	5	5	form	237
6	NaN	6	6	clique	94
7	NaN	7	7	pair	11
8	NaN	8	8	tend	136
9	NaN	9	9	thank	224

Above is a preview of the data that I have from the statefile, and which I am going to use generate the data needed for the LDAvis library. It contains the document id, the word position index, the word index, the word, and its topic assignment.

The first bit of data to gather is the length of the documents. To do this, I grouped the data by the document id and counted the tokens in the doc. This data is sorted by the document id, so is in the correct order for the visualization preprocessing.

# Get document lengths from statefile
docs = df.groupby('#doc')['type'].count().reset_index(name ='doc_length')

docs[:10]

	#doc	doc_length
0	0	116
1	1	102
2	2	108
3	3	97
4	4	121
5	5	113
6	6	123
7	7	137
8	8	121
9	9	132

The second bit of data is the vocabulary and frequencies. Here I used pandas to generate a new frame with the counts for each word. I then sorted this dataframe so that it is alphabetical by type, a step I will repeat in creating the topic-term matrix.

# Get vocab and term frequencies from statefile
vocab = df['type'].value_counts().reset_index()
vocab.columns = ['type', 'term_freq']
vocab = vocab.sort_values(by='type', ascending=True)

vocab[:10]

	type	term_freq
31901	aaa	77
54240	aaaa	18
48406	aaaaa	26
22633	aad	180
50081	aai	24
52882	aal	20
31404	aalborg	80
48729	aan	26
44793	aand	32
24704	aar	146

Clearly some OCR problems have managed to make their way through all of my preprocessing efforts. With the low term frequencies, however, they should be only minimally present in the final representation of the topic model.

The third step was to create the matrix files. Here is where things get a bit tricky, as there is the adding of smoothing values and normalizing the data so that each row sums to 1. To do the normalizing, I am used sklearn because these are large matrices that require a more optimized function than dividing by the sum of the row with pandas.

# Topic-term matrix from state file
# https://ldavis.cpsievert.me/reviews/reviews.html

import sklearn.preprocessing

def pivot_and_smooth(df, smooth_value, rows_variable, cols_variable, values_variable):
    """
    Turns the pandas dataframe into a data matrix.
    Args:
        df (dataframe): aggregated dataframe 
        smooth_value (float): value to add to the matrix to account for the priors
        rows_variable (str): name of dataframe column to use as the rows in the matrix
        cols_variable (str): name of dataframe column to use as the columns in the matrix
        values_variable(str): name of the dataframe column to use as the values in the matrix
    Returns:
        dataframe: pandas matrix that has been normalized on the rows.
    """
    matrix = df.pivot(index=rows_variable, columns=cols_variable, values=values_variable).fillna(value=0)
    matrix = matrix.values + smooth_value
    
    normed = sklearn.preprocessing.normalize(matrix, norm='l1', axis=1)
    
    return pd.DataFrame(normed)

First, we need to aggregate the data from the statefile dataframe to get the number of topic assignments for word in the documents. For phi, the topic-term matrix, I aggregated by topic and word, counted the number of times each word was assigned to each topic, and then sorted the resulting dataframe alphabetically by word, so that it matches the order of the vocabulary frame. Here I used the beta hyperparamter as the smoothing value.

phi_df = df.groupby(['topic', 'type'])['type'].count().reset_index(name ='token_count')
phi_df = phi_df.sort_values(by='type', ascending=True)

phi_df[:10]

	topic	type	token_count
902220	200	aaa	2
220727	45	aaa	1
802489	177	aaa	10
484128	104	aaa	1
104943	19	aaa	5
930031	207	aaa	7
633014	138	aaa	3
715361	157	aaa	1
983211	219	aaa	4
35018	7	aaa	31

phi = pivot_and_smooth(phi_df, beta, 'topic', 'type', 'token_count')

# phi[:10]

Then we do this again, but focused on the documents and topics, to generate the theta document-topic matrix. Here I used alpha as the smoothing value.

theta_df = df.groupby(['#doc', 'topic'])['topic'].count().reset_index(name ='topic_count')

theta_df[:10]

	topic	topic_count
0	11	10
1	14	7
2	94	14
3	102	2
4	113	23
5	136	12
6	144	1
7	165	1
8	171	12
9	175	1

theta = pivot_and_smooth(theta_df, alpha , '#doc', 'topic', 'topic_count')

# theta[:10]

These data processing steps, as always, represent about 90% of the work needed to use the LDAvis library with a MALLET topic model. Now that I have all of the data in place, I can queue that data up and pass it to the visualization library.

import pyLDAvis

data = {'topic_term_dists': phi, 
        'doc_topic_dists': theta,
        'doc_lengths': list(docs['doc_length']),
        'vocab': list(vocab['type']),
        'term_frequency': list(vocab['term_freq'])
       }

I used the code examples I mentioned earlier for setting the keys of the data dictionary. Keeping the data in sorted dataframes helps ensure that the order is consistent and preserved as it is moved into a list for analysis.

This data is then passed the visualization library, first for preparation and then for display. In preparing the data, the library computes the distances between topics and then projects those distances into a two-dimensional space using “multidimensional scaling” or principle component analysis. In the resulting visualization, the overlap in words of the topic is represented by shared location in space, and the greater the distance between topics, the larger the dissimilarity between the weight of the words that comprise the topic. The second element that the library computes is the most relevant terms for each topic with a sliding scale that either ranks on the overall frequency of the word (the default) or the distinctiveness of the word (lambda = 0.0 on the resulting slider). This provides a richer view of the topic assignments and is useful in labeling for distinguishing between topics.

vis_data = pyLDAvis.prepare(**data)

pyLDAvis.display(vis_data)