Latent Dirichlet Allocation and Topic Modelling

Ruma Sinha
Published in Analytics Vidhya · Dec 21, 2020

The objective of this article is to understand the intuition behind LDA, its use cases, and its implementation.


A topic model is a type of statistical model for discovering the abstract topics that occur in a collection of documents. Topic models provide a simple way to analyze large volumes of unlabeled text. A topic consists of a cluster of words that frequently occur together.

The input to the algorithm is a set of documents; the output is a specified number of topics, where each topic comes with its most relevant terms and with the distribution of topics over all the documents.

Some of the commonly used algorithms are:

Latent Dirichlet Allocation (LDA)

Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)

Non Negative Matrix Factorization (NMF)

One way to do this is with the Latent Dirichlet Allocation (LDA) algorithm, which can be used to find meaning in text documents. A document contains a set of topics and each topic is a set of words. LDA uses statistical analysis to figure out what each document is about.

Use Cases

Topic discovery

Sentiment analysis

Automated document tagging

Documents are seen as distributions over topics, and topics are seen as distributions over words. A topic is represented as a weighted list of words.

The above diagram shows how LDA tries to classify documents. Documents are represented as a distribution of topics. Topics, in turn, are represented by a distribution of all tokens in the vocabulary. But we do not know the number of topics that are present in the corpus and the documents that belong to each topic. In other words, we want to treat the assignment of the documents to topics as a random variable itself which is estimated from the data.

The LDA training process broadly consists of:

A corpus, the collection of documents converted into a bag of words.

Data preprocessing steps such as removing stop words, applying stemming to reduce each word to its root form, and tokenizing. Choosing the number of topics, say k; that is, we need to specify the number of topics before training.

Randomly assign a topic to each word. Count the words by topic and the topics by document. Reassign the words to topics based on these counts, and iterate over all the words in the documents N times; a toy sketch of this loop is given below.
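To make the counting-and-reassignment idea concrete, here is a minimal toy sketch of that loop in the style of a collapsed Gibbs sampler. It is purely illustrative and not how gensim trains its LDA model (gensim uses online variational Bayes); the function name, hyperparameter values and iteration count below are assumptions.

import numpy as np

def toy_lda_gibbs(docs, vocab_size, k=3, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Toy LDA sketch: docs is a list of lists of integer word ids."""
    rng = np.random.default_rng(seed)
    # randomly assign a topic to every word occurrence
    z = [rng.integers(k, size=len(doc)) for doc in docs]
    doc_topic = np.zeros((len(docs), k)) + alpha        # topic counts per document
    topic_word = np.zeros((k, vocab_size)) + beta       # word counts per topic
    topic_total = np.zeros(k) + beta * vocab_size
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d, t] += 1; topic_word[t, w] += 1; topic_total[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the current assignment from the counts
                doc_topic[d, t] -= 1; topic_word[t, w] -= 1; topic_total[t] -= 1
                # reassign in proportion to (topics per document) x (word weight per topic)
                p = doc_topic[d] * topic_word[:, w] / topic_total
                t = rng.choice(k, p=p / p.sum())
                z[d][i] = t
                doc_topic[d, t] += 1; topic_word[t, w] += 1; topic_total[t] += 1
    return doc_topic, topic_word

Normalizing each row of doc_topic gives that document's topic distribution, and each row of topic_word gives a topic's word distribution.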

The 3 main parameters of the LDA model are the number of topics, the number of words per topic, and the number of topics per document.

A low alpha value places more weight on having each document composed of only a few dominant topics.

A low beta value places more weight on having each topic composed of only a few dominant words.
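As a hedged illustration of where these two priors show up in practice, gensim's LdaModel exposes them as alpha and eta (eta corresponds to beta above); the values below are illustrative only, and bow_corpus and dictionary are the objects we build later in this article.

from gensim.models import LdaModel

# Illustrative values only: a low alpha biases each document toward a few
# dominant topics; a low eta (gensim's name for beta) biases each topic
# toward a few dominant words.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3,
               alpha=0.1, eta=0.01, passes=10, random_state=42)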

Implementation of a Gensim LDA model with the Spooky Authors dataset from Kaggle.

Dataset link:

In this notebook we will apply topic modelling with three topics, since the dataset consists of excerpts from public-domain horror fiction by three authors: Edgar Allan Poe, Mary Shelley, and HP Lovecraft. Edgar Allan Poe's work centres on tales of mystery and the macabre, Mary Shelley's on science fiction, and HP Lovecraft is best known as a writer of weird fiction.


The pipeline will be

Documents ===> PreProcessing ===> Vector Space (Bag of words, Dictionaries) ===> Model (lda/nmf) ===> Evaluation (metrics, visualizations)

We read the dataset and inspect it. The dataset contains 19579 observations and 3 feature columns: id, text and author.
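A minimal sketch of the loading and inspection step, assuming the Kaggle train.csv has been downloaded locally (the file path is an assumption):

import pandas as pd

# Load the Spooky Authors training data (path is an assumption)
authors_data_df = pd.read_csv('train.csv')

print(authors_data_df.shape)           # (19579, 3)
print(authors_data_df.columns)         # id, text, author
print(authors_data_df.isnull().sum())  # confirm there are no null values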

There are no null values in this dataset. We will drop the id column.

# for the topic modelling we will focus only on the text data
authors_data_df = authors_data_df.drop(columns=['id'])

The distribution of the excerpts among the three authors is shown below.
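A small snippet to look at that distribution, assuming the authors_data_df frame from above (the bar chart needs matplotlib installed):

# Count of excerpts per author: EAP, HPL, MWS
print(authors_data_df['author'].value_counts())

# Optional bar chart of the same counts
authors_data_df['author'].value_counts().plot(kind='bar')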

In the data cleaning and preparation steps we remove unwanted numbers and characters, strip punctuation, and convert the text to lower case. We also remove stopwords: words occurring commonly across all documents that add no value in distinguishing the three subsets of the collection.
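A minimal cleaning sketch along these lines, assuming NLTK's English stopword list; the clean_text function and the clean_text column name are assumptions, not the exact code used in the notebook.

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                    # lower case
    text = re.sub(r'[^a-z\s]', ' ', text)  # drop numbers and punctuation
    tokens = [w for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

authors_data_df['clean_text'] = authors_data_df['text'].apply(clean_text)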

Once text cleaning and stopword removal are done, we can do exploratory data analysis visually with a wordcloud.

The wordcloud of the entire corpus

Words like time, thing, one, came, year, upon, eye, day, man, night and old can be observed in this wordcloud.

The top 30 words from the corpus are shown below.
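One way to produce both the wordcloud and the top-30 list, assuming the clean_text column from the sketch above and the wordcloud and matplotlib packages:

from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_text = ' '.join(authors_data_df['clean_text'])

# Wordcloud over the entire corpus
wc = WordCloud(width=800, height=400, background_color='white').generate(all_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Top 30 most frequent tokens
print(Counter(all_text.split()).most_common(30))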

We will subset the entire corpus into three different corpora based on the distinct authors and display the wordcloud as well as the top 30 occurring words for each.

hplDatadf = authors_data_df[authors_data_df.author == "HPL"]

mwsDatadf = authors_data_df[authors_data_df.author == "MWS"]

eapDatadf = authors_data_df[authors_data_df.author == "EAP"]

For HP Lovecraft, the wordcloud displays words like night, death, dream, dead, fear, horror, strange, window and ancient, which resonate with the themes the author was famous for.

We also looked into the top 30 occurring words from the works of HP Lovecraft.

For Mary Shelley, the wordcloud displays words such as fear, heart, raymond, mind, soul, power, hope, feeling, death, spirit and friend: positive as well as negative words.

We also looked into the top 30 occurring words from the works of Mary Shelley.

For Edgar Allan Poe, the wordcloud displays words such as life, end, friend, night, far, open, eye, great, one, little, time, good, manner and moment.

We also looked into the top 30 occurring words from the works of Edgar Allan Poe.

The next step is preparing the data for training the model.

Step 1: Split the processed text into words or tokens

Step 2: Build n-grams

Step 3: Lemmatize, i.e. convert each word to its root form

Step 4: Build the corpora

Steps 1 to 4 above convert the text into a bag of words.
Prior to creating the model, we tokenize and lemmatize the text and then build a gensim Dictionary, which maps each unique token in the corpus to an integer id.
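A sketch of steps 1 to 3 with gensim and spaCy, producing the processed_text list used below; the spaCy model, the phrase-detection thresholds and the clean_text column are assumptions rather than the exact notebook code.

import gensim
from gensim.models.phrases import Phrases, Phraser
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Step 1: tokenize each cleaned document
tokenized = [gensim.utils.simple_preprocess(doc, deacc=True)
             for doc in authors_data_df['clean_text']]

# Step 2: merge frequent bigrams into single tokens
bigram = Phraser(Phrases(tokenized, min_count=5, threshold=100))
tokenized = [bigram[doc] for doc in tokenized]

# Step 3: lemmatize, keeping only content words
processed_text = [[token.lemma_ for token in nlp(' '.join(doc))
                   if token.pos_ in ('NOUN', 'ADJ', 'VERB', 'ADV')]
                  for doc in tokenized]

With processed_text in hand, we build the dictionary: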
dictionary = gensim.corpora.Dictionary(processed_text)

Now, for each pre-processed document, we use the dictionary object just created to convert that document into a bag of words, i.e. for each document we create a list of (token id, count) pairs reporting which words appear and how many times each appears.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_text]

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]]
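To make these (token id, count) pairs readable, we can map each id back to its token through the dictionary (a small illustrative snippet):

# Human-readable view of the first document's bag of words
print([(dictionary[token_id], count) for token_id, count in bow_corpus[0]])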

Training the LDA model

We keep the number of topics at 3 since we know there are three authors, and we are interested in seeing whether topic modelling can group the given text by author.
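A hedged sketch of the training call; LdaMulticore, the number of passes and the random seed are choices of this sketch rather than necessarily what the original notebook used.

from gensim.models import LdaMulticore

lda_model = LdaMulticore(corpus=bow_corpus,
                         id2word=dictionary,
                         num_topics=3,   # one topic per author, as discussed above
                         passes=10,
                         random_state=42)

# Each topic is printed as a weighted list of its top words
for topic_id, words in lda_model.print_topics(num_words=10):
    print(topic_id, words)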

The statistical analysis results are displayed as above, but how do we interpret these 3 topics, and can we map them to the 3 authors?

A topic is represented as a weighted list of words.

Not all topics are semantically coherent or interpretable. How do we measure the interpretability and goodness of the model? The coherence score is a statistical measure of the goodness of a topic model.

Topic coherence

The degree of semantic similarity between a topic's high-scoring words can be measured, and based on this we can quantify the coherence of the topic.

The higher the topic coherence, the more human-interpretable the topic.

The coherence measure can be used to compare different topic models based on their human-interpretability.
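A sketch of computing the c_v coherence score with gensim's CoherenceModel, reusing the lda_model, processed_text and dictionary names from the sketches above:

from gensim.models import CoherenceModel

coherence_model = CoherenceModel(model=lda_model,
                                 texts=processed_text,
                                 dictionary=dictionary,
                                 coherence='c_v')
print('Coherence score:', coherence_model.get_coherence())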

When the model was trained with 2 topics, the coherence score dipped to 0.20, because, logically, one of the three authors' work will overlap with the specified two topics.

A good model has non-overlapping topics; the distance between topics shows how semantically different they are, and the further apart they are, the better the topics.

Visualization of the topics with pyLDAvis: each bubble represents one topic. When we click on a topic, we get its most relevant terms and their distributions, and when we hover over a word we can see in which topics that word occurs and with what frequency.
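A sketch of producing the interactive pyLDAvis view; note that the module path differs between pyLDAvis versions, as the comment indicates.

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # in older pyLDAvis versions: pyLDAvis.gensim

vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis)                       # inside a notebook
# pyLDAvis.save_html(vis, 'lda_vis.html')   # or save to a standalone HTML file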

When we compare the top 30 words of each topic with the top 30 words from each author's work, based on the maximum overlap between the two sets we conclude that Topic1 is probably EAP, Topic2 is probably HPL and Topic3 is probably MWS. These correspond to gensim topic ids 0, 1 and 2 respectively.

Note: pyLDAvis topic IDs do not correspond to gensim topic IDs.

Topic#1

Topic#2

Topic#3

GitHub code: LDA and Topic Modelling.ipynb in https://github.com/rumsinha/NLP

Conclusion

In this article we tried to understand what topic modelling is and how it works, along with the role of topic modelling in NLP and its industry use cases.

We looked at the various algorithms available for topic modelling and implemented one of them, LDA. In future articles we will explore the other algorithms as well.

We saw in detail the various steps in training an LDA model, from reading the text data through preprocessing steps such as tokenization and stemming. Next we created the bag of words as input to the model. The output is the number of topics that we specify.

We check the coherence score and also perform a visual evaluation with pyLDAvis plots to make sure the topics are human interpretable.

Mapping the statistical results to human-interpretable topics may not be easy. The hard part of unsupervised learning is evaluation, since we do not have predefined labels; at this point we may verify with domain experts that the topics generated by the statistical analysis map correctly to human-interpretable topics.

References

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

https://www.coursera.org/projects/introduction-to-topic-modeling-for-business

https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

https://www.slideshare.net/g33ktalk/dataengconf-building-the-next-new-york-times-recommendation-engine
