In this homework you will apply your expertise in the EM algorithm to Probabilistic Latent Semantic Analysis (pLSA).
Recall that our data model for pLSA consists of a set of documents \(D\), where each document is modeled as a bag of words over a dictionary \(W\). We denote by \(x_{w,d}\) the number of times word \(w \in W\) appears in document \(d \in D\).
Before we introduce the concept of topics, let’s build a simple model based on word-occurrence frequencies to get used to Maximum Likelihood Estimation for multinomial distributions. Specifically, letting \(n_d\) be the number of words in document \(d\), we model each document \(d\) as \(n_d\) draws from a Multinomial distribution with parameters \(\theta_{1,d},\ldots,\theta_{W,d}\), where \(\theta_{w,d}\) is the probability of drawing word \(w\) in document \(d\). Note that \(\theta_{w,d} \geq 0\) for all \(w \in W\), and \(\sum_w \theta_{w,d} = 1\).
With this model in place, the probability of observing the set of words in document \(d\) is given by
\[ \mathrm{Pr}(d|\theta_d) \varpropto \prod_{w=1}^{W} \theta_{w,d}^{x_{w,d}} \]
where \(\theta_d\) collects parameters \(\{\theta_{1,d},\ldots,\theta_{W,d}\}\).
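For concreteness, here is a minimal Python sketch of drawing one document from this model; the dictionary size and parameter values below are made up for illustration:

    import numpy as np

    # Toy example: a 4-word dictionary and one document of n_d = 20 words.
    rng = np.random.default_rng(0)
    theta_d = np.array([0.5, 0.2, 0.2, 0.1])  # hypothetical theta_{w,d}; sums to 1
    n_d = 20
    x_d = rng.multinomial(n_d, theta_d)       # observed counts x_{w,d}
    assert x_d.sum() == n_d                   # counts always sum to n_d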
Problem 1: Prove that Maximum Likelihood Estimates (MLE) are given by \(\hat{\theta}_{w,d} = \frac{x_{w,d}}{n_d}\), that is, the number of times word \(w\) appears in document \(d\) divided by the total number of words in document \(d\).
Hints:
Write the MLE estimation problem as a constrained maximization problem.
Write out the Lagrangian \(L(\theta_d,\alpha)\) (see lecture slides) for this maximization problem (use \(\alpha\) as the Lagrange multiplier).
Use the optimality conditions from lecture to solve.
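The hints above outline the proof; as a complement (not a substitute), a quick numerical sanity check with made-up counts can build intuition: no point on the probability simplex should beat \(\hat{\theta}_{w,d} = x_{w,d}/n_d\) in log-likelihood.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([5, 3, 2])      # made-up word counts for a single document
    n = x.sum()
    theta_hat = x / n            # claimed MLE

    def loglik(theta):
        # multinomial log-likelihood up to the constant multinomial coefficient
        return np.sum(x * np.log(theta))

    for _ in range(1000):
        theta = rng.dirichlet(np.ones(len(x)))  # random point on the simplex
        assert loglik(theta) <= loglik(theta_hat) + 1e-12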
Let’s introduce topics now. Instead of modeling each document directly as a distribution \(d \sim \mathrm{Mult}(\{\theta_{1,d},\ldots,\theta_{W,d}\})\) over words, we model each document as a distribution over \(T\) topics, \(d \sim \mathrm{Mult}(\{p_{1,d},\ldots,p_{T,d}\})\). In turn, each topic \(t=1,\ldots,T\) is modeled as a distribution \(t \sim \mathrm{Mult}(\{\theta_{1,t},\ldots,\theta_{W,t}\})\) over words. Note that the topics are shared across all documents in the dataset.
In pLSA, we learn topic distributions from observations by softly assigning each word occurrence to topics using the EM algorithm. We will denote these latent word-topic assignments as \(\Delta_{w,d,t}\), the number of times word \(w\) was assigned to topic \(t\) in document \(d\).
Of course, we do not observe any of these latent word-topic assignments. However, it is helpful to think of the fully observed case to develop the EM algorithm.
Assuming we observe both the word counts \(x_{w,d}\) and the latent word-topic assignments \(\Delta_{w,d,t}\), with \(\sum_t \Delta_{w,d,t} = x_{w,d}\), the full-data probability is given by
\[ \mathrm{Pr}(D|\{p_d\},\{\theta_t\}) = \prod_{d=1}^D \prod_{w=1}^{W} \prod_{t=1}^T p_{t,d}^{\Delta_{w,d,t}}\theta_{w,t}^{\Delta_{w,d,t}} \]
Problem 2: Prove that MLEs are given by
\[ \hat{p}_{t,d} = \frac{\sum_{w=1}^W \Delta_{w,d,t}}{\sum_{t'=1}^T \sum_{w=1}^W \Delta_{w,d,t'}} \]
and
\[ \hat{\theta}_{w,t} = \frac{\sum_{d=1}^D \Delta_{w,d,t}}{\sum_{w'=1}^W \sum_{d=1}^D \Delta_{w',d,t}} \]
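If the \(\Delta_{w,d,t}\) were observed, both MLEs are just marginal sums of \(\Delta\) followed by normalization. A minimal numpy sketch with fabricated toy counts, assuming \(\Delta\) is stored as an array of shape \((W, D, T)\):

    import numpy as np

    rng = np.random.default_rng(0)
    delta = rng.integers(1, 5, size=(4, 3, 2)).astype(float)  # toy Delta_{w,d,t}

    # p_hat[t, d]: fraction of the words in document d assigned to topic t
    p_hat = delta.sum(axis=0).T / delta.sum(axis=(0, 2))       # shape (T, D)

    # theta_hat[w, t]: fraction of topic t's assignments that go to word w
    theta_hat = delta.sum(axis=1) / delta.sum(axis=(0, 1))     # shape (W, T)

    assert np.allclose(p_hat.sum(axis=0), 1.0)      # each document's topic mix sums to 1
    assert np.allclose(theta_hat.sum(axis=0), 1.0)  # each topic's word distribution sums to 1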
Denote the responsibility of topic \(t\) for the occurrences of word \(w\) in document \(d\) as \(\gamma_{w,d,t}=E[\Delta_{w,d,t}|\{p_d\},\{\theta_t\}]\).
Problem 3: Write out the M-step of the EM algorithm based on the result of Problem 2.
Problem 4: Show that the E-step of the EM algorithm, i.e., the update of \(\gamma_{w,d,t}\) given the current parameters \(\{p_d\}\) and \(\{\theta_t\}\), is given by
\[ \gamma_{w,d,t} = x_{w,d} \times \frac{p_{t,d}\theta_{w,t}}{\sum_{t'=1}^T p_{t',d}\theta_{w,t'}} \]
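For Problem 6 below, this update vectorizes naturally. A minimal sketch matching the in-place estep signature used in the skeleton, assuming doc_mat has shape (D, W), p has shape (T, D), theta has shape (W, T), and gamma has shape (W, D, T):

    import numpy as np

    def estep(doc_mat, gamma, p, theta):
        # unnormalized responsibilities p_{t,d} * theta_{w,t}, shape (W, D, T)
        unnorm = theta[:, None, :] * p.T[None, :, :]
        # normalize over topics t, then scale by the observed counts x_{w,d}
        gamma[:] = unnorm / unnorm.sum(axis=2, keepdims=True)
        gamma *= doc_mat.T[:, :, None]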
Problem 5: Write Python code to generate a test corpus. See lecture notes on how to do this; one possible sketch follows.
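One possible generator (the lecture notes may prescribe a different recipe; the function name and defaults here are made up): draw document-topic and topic-word distributions from symmetric Dirichlet priors, then sample each document's word counts from the resulting mixture.

    import numpy as np

    def make_corpus(num_docs=50, num_words=100, num_topics=5,
                    words_per_doc=200, seed=0):
        rng = np.random.default_rng(seed)
        p = rng.dirichlet(np.ones(num_topics), size=num_docs).T       # (T, D)
        theta = rng.dirichlet(np.ones(num_words), size=num_topics).T  # (W, T)
        mix = (theta @ p).T                  # (D, W): per-document word distribution
        doc_mat = np.vstack([rng.multinomial(words_per_doc, mix[d])
                             for d in range(num_docs)])
        return doc_mat, p, theta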
Problem 6: Implement pLSA using the EM updates from Problems 3 and 4.
Notes:
The observed-data likelihood (marginalizing over the topic assignments) is
\[ \mathrm{Pr}(D|\{p_d\},\{\theta_t\}) = \prod_{d=1}^D \prod_{w=1}^W \left( \sum_{t=1}^T p_{t,d} \theta_{w,t} \right)^{x_{w,d}} \]
You will need to determine how to initialize parameters \(\{p_d\}\) and \(\{\theta_t\}\) (see lecture notes on the Dirichlet distribution)
You will need to determine how to test for convergence (e.g., by monitoring the observed-data log-likelihood above; see the sketch after these notes)
You will need to determine how to deal with local optima (e.g., you can use multiple random initializations and choose the model with the largest likelihood)
We highly recommend that you test your function on a small simulated dataset, i.e., the data you generate in Problem 5
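For the convergence note above, a minimal sketch of the observed-data log-likelihood; the helper name log_likelihood is ours, not required, and shapes follow the skeleton below:

    import numpy as np

    def log_likelihood(doc_mat, p, theta):
        # doc_mat is (D, W), p is (T, D), theta is (W, T)
        mix = theta @ p                      # (W, D): sum_t p_{t,d} theta_{w,t}
        return np.sum(doc_mat.T * np.log(mix))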
The skeleton of the function should be something like this:
import numpy as np

def plsa_em(doc_mat, num_topics=5, max_iter=20, eps=1e-3):
    num_docs, num_words = doc_mat.shape
    p, theta = init_parameters(num_docs, num_words, num_topics)
    gamma = np.zeros((num_words, num_docs, num_topics))
    i = 0
    while True:
        # overwrite gamma in place since it can get rather large
        estep(doc_mat, gamma, p, theta)
        new_p, new_theta = mstep(doc_mat, gamma)
        converged = check_convergence(doc_mat, new_p, new_theta, p, theta, eps)
        # adopt the new parameters before possibly stopping,
        # so the final M-step is not discarded
        p, theta = new_p, new_theta
        if i >= max_iter or converged:
            break
        i += 1
    return p, theta
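Hypothetical usage on a simulated corpus (assuming the make_corpus sketch from Problem 5 above):

    doc_mat, true_p, true_theta = make_corpus(num_docs=50, num_words=100, num_topics=5)
    p_hat, theta_hat = plsa_em(doc_mat, num_topics=5, max_iter=100)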
Implement Latent Dirichlet Allocation (LDA) with Gibbs sampling. See lecture notes for details.
The overall structure of the function should be
def lda_gibbs(doc_mat, num_topics=5, alpha=1, beta=1, total_iter=200, burn_iter=100):
    num_docs, num_words = doc_mat.shape
    p, theta = init_parameters(num_docs, num_words, num_topics, alpha, beta)
    gamma = np.zeros((num_words, num_docs, num_topics))
    delta = np.zeros((num_words, num_docs, num_topics))
    num_estimates = total_iter - burn_iter
    delta_estimates = np.zeros((num_words, num_docs, num_topics, num_estimates))
    # run exactly total_iter sweeps; the last num_estimates are kept
    for i in range(total_iter):
        # overwrite gamma since it can get rather large
        compute_gamma(gamma, p, theta)
        # overwrite delta since it can get rather large
        sample_delta(doc_mat, gamma, delta)
        # keep the estimate if past the burn-in iterations
        if i >= burn_iter:
            j = i - burn_iter
            delta_estimates[:, :, :, j] = delta
        p = sample_p(delta, alpha)
        theta = sample_theta(delta, beta)
    final_delta = np.mean(delta_estimates, axis=3)
    p = get_final_p(final_delta, alpha)
    theta = get_final_theta(final_delta, beta)
    return p, theta
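A hedged sketch of the two conditional draws used above, assuming the standard Dirichlet-multinomial conjugacy from lecture: given current assignment counts delta of shape (W, D, T), each document's topic proportions and each topic's word distribution have Dirichlet full conditionals.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_p(delta, alpha):
        counts = delta.sum(axis=0)           # (D, T): per-document topic counts
        return np.vstack([rng.dirichlet(alpha + c) for c in counts]).T   # (T, D)

    def sample_theta(delta, beta):
        counts = delta.sum(axis=1)           # (W, T): per-topic word counts
        return np.vstack([rng.dirichlet(beta + c) for c in counts.T]).T  # (W, T)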
Use your pLSA and LDA implementations to learn topics from the Reuters dataset. Instructions on how to prepare the data will be posted separately.
Analysis 1: Compare the topics learned by pLSA and LDA with \(T=3\).
Analysis 2: Compare the LDA topics learned for a few different values of \(T\) (the number of topics to learn). Do your topic distributions reflect sensible topics?
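To judge whether the learned topics are sensible, one simple (hypothetical) helper prints each topic's highest-probability words, assuming vocab is a list mapping word indices to word strings:

    import numpy as np

    def top_words(theta, vocab, k=10):
        # theta is (W, T); print the k most probable words for each topic
        for t in range(theta.shape[1]):
            idx = np.argsort(theta[:, t])[::-1][:k]
            print(f"topic {t}: " + ", ".join(vocab[i] for i in idx))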