December 4, 2024
In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network (and, therefore, a generative statistical model) for modeling automatically extracted topics in textual corpora… In this, observations (e.g., words) are collected into documents, and each word’s presence is attributable to one of the document’s topics. Each document will contain a small number of topics.
From https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
How can we figure out (unsupervised!) the underlying topic of each of a set of documents?1
Start with documents that each contain words:
As a human, assign topics: Science, Politics, Sports
What if you don’t have any idea what the words mean (i.e., what if you are the computer)?
Without using the definitions of the words, assign topics: Topic 1, Topic 2, Topic 3
Color each word red, green, or blue such that
Property 1: articles are as homogeneous as possible (each article uses as few colors as possible)
Property 2: words are as homogeneous as possible (each word keeps the same color wherever it appears, as far as possible)
Organize the words one at a time, trying to make the articles (Property 1) and words (Property 2) as consistent as possible.
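Before any word can be reorganized, every word instance needs a starting color, and the two consistency criteria need something to be measured against. The sketch below is one possible setup, not the slides' own code: the toy corpus, the function names `random_coloring` and `count_tables`, and the choice of a purely random start are all assumptions made for illustration.

```python
import random

# Toy corpus, invented for illustration (not the documents from the slides).
docs = {
    "Doc A": ["ball", "planet", "ball", "galaxy"],
    "Doc B": ["referendum", "planet", "ball"],
    "Doc C": ["ball", "referendum", "galaxy"],
}
N_TOPICS = 3  # think of them as red, green, blue


def random_coloring(docs, n_topics, seed=0):
    """Give every word instance a random topic (color) as a starting point."""
    rng = random.Random(seed)
    return {name: [rng.randrange(n_topics) for _ in words]
            for name, words in docs.items()}


def count_tables(docs, coloring, n_topics):
    """Tally the two tables that the coloring procedure consults.

    doc_topic[name][k]  = number of words in document `name` colored k
    word_topic[word][k] = number of times `word` is colored k in the corpus
    """
    doc_topic = {name: [0] * n_topics for name in docs}
    word_topic = {}
    for name, words in docs.items():
        for word, k in zip(words, coloring[name]):
            doc_topic[name][k] += 1
            word_topic.setdefault(word, [0] * n_topics)[k] += 1
    return doc_topic, word_topic


coloring = random_coloring(docs, N_TOPICS)
doc_topic, word_topic = count_tables(docs, coloring, N_TOPICS)
print(doc_topic["Doc A"])   # how Doc A's four words are spread over the topics
print(word_topic["ball"])   # how the four instances of "ball" are colored
```

A homogeneous article has most of its `doc_topic` row concentrated in one entry, and a homogeneous word has most of its `word_topic` row concentrated in one entry.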
Do we color ball red, green, or blue?
Count | Topic 1 | Topic 2 | Topic 3 |
---|---|---|---|
Words in Doc A assigned to this topic | 2 | 0 | 2 |
Count | Topic 1 | Topic 2 | Topic 3 |
---|---|---|---|
Times ball is assigned to this topic | 3 | 1 | 0 |
Count | Topic 1 | Topic 2 | Topic 3 |
---|---|---|---|
Words in Doc A assigned to this topic | 2 | 0 | 2 |
Times ball is assigned to this topic | 3 | 1 | 0 |
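One common way to combine the two rows of counts above is to multiply them topic by topic, which is the core of the collapsed Gibbs sampling update for LDA. The slides do not show the formula, so treat the following as a sketch under that assumption. The function names and the smoothing parameters `alpha` and `beta` are hypothetical, and the full update also divides by the total number of words assigned to each topic, which is omitted here because those totals are not in the tables.

```python
import random


def topic_weights(doc_counts, word_counts, alpha=0.0, beta=0.0):
    """Weight per topic: (words in the doc with that topic) x (times the word has that topic).

    alpha and beta are optional smoothing constants (an assumption, not from
    the slides); with both at 0 this is the plain product of the two counts.
    """
    return [(d + alpha) * (w + beta) for d, w in zip(doc_counts, word_counts)]


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


doc_a_counts = [2, 0, 2]   # words in Doc A assigned to Topics 1, 2, 3
ball_counts = [3, 1, 0]    # times "ball" is assigned to Topics 1, 2, 3

weights = topic_weights(doc_a_counts, ball_counts)
print(weights)             # [6.0, 0.0, 0.0]: only Topic 1 fits both Doc A and "ball"

# With a little smoothing, no topic is ruled out completely.
smoothed = normalize(topic_weights(doc_a_counts, ball_counts, alpha=0.1, beta=0.1))

# Sample the new color in proportion to the smoothed weights.
rng = random.Random(0)
new_topic = rng.choices([1, 2, 3], weights=smoothed)[0]
```

Without smoothing the product is zero for Topics 2 and 3, so this instance of ball would be colored as Topic 1. Sampling from smoothed weights, rather than always taking the largest, keeps a small chance of exploring the other colors.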
Update the first instance of ball and move on to the second instance of ball. (Keep iterating!)
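Repeating that single-word update over every word instance, for many passes, is the idea behind collapsed Gibbs sampling for LDA. The sketch below is a minimal illustration under stated assumptions, not the slides' implementation: the function name `gibbs_lda`, the toy corpus, the smoothing values `alpha` and `beta`, and the number of sweeps are all hypothetical choices.

```python
import random


def gibbs_lda(docs, n_topics, n_sweeps=50, alpha=0.1, beta=0.1, seed=0):
    """Tiny collapsed Gibbs sampler for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for words in docs for w in words})
    n_vocab = len(vocab)

    # Start from a random color for every word instance.
    z = [[rng.randrange(n_topics) for _ in words] for words in docs]

    # Count tables: per document, per word, and per topic overall.
    doc_topic = [[0] * n_topics for _ in docs]
    word_topic = {w: [0] * n_topics for w in vocab}
    topic_total = [0] * n_topics
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            doc_topic[d][k] += 1
            word_topic[w][k] += 1
            topic_total[k] += 1

    for _ in range(n_sweeps):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                # Take the current instance out of the counts.
                k_old = z[d][i]
                doc_topic[d][k_old] -= 1
                word_topic[w][k_old] -= 1
                topic_total[k_old] -= 1

                # Weight each topic by how well it fits the document and the word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (word_topic[w][k] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]

                # Recolor the instance and put it back into the counts.
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k_new
                doc_topic[d][k_new] += 1
                word_topic[w][k_new] += 1
                topic_total[k_new] += 1

    return z, doc_topic, word_topic


# Toy corpus, invented for illustration.
toy_docs = [
    ["ball", "planet", "ball", "galaxy"],
    ["referendum", "planet", "ball"],
    ["ball", "referendum", "galaxy"],
]
z, doc_topic, word_topic = gibbs_lda(toy_docs, n_topics=3)
for d, row in enumerate(doc_topic):
    print(f"doc {d}: words per topic = {row}")
```

Each sweep visits every word once. The topic numbers themselves are arbitrary labels, so two runs with different seeds may find similar groupings under different numbers.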
where
For each word in the document, you get a probability for each topic, based on the learned topic-word distribution
The probability of a topic
LDA on TSL
In fall 2015, Benji Lu, Kai Fukutaki, and Ziqi Xiong performed LDA on The Student Life articles for their computational statistics project: https://ziqixiong.shinyapps.io/TopicModeling/
Who does statistics?
connecting, uplifting, and recognizing voices – a database of statisticians and data scientists.