In this homework you will exercise your expertise on the EM algorithm and apply it to Probabilistic Latent Semantic Analysis (pLSA).

Part I: The Model

Recall that our data model for pLSA consists of a set of documents \(D\), where each document is modeled as a bag of words over dictionary \(W\). We denote by \(x_{w,d}\) the number of times word \(w \in W\) appears in document \(d \in D\).

Warmup: A simple multinomial model

Before we introduce the concept of topics, let’s build a simple model based on word occurrence frequencies to get used to Maximum Likelihood Estimation for multinomial distributions. Specifically, letting \(n_d\) be the number of words in document \(d\), we model each document \(d\) as \(n_d\) draws from a Multinomial distribution with parameters \(\theta_{1,d},\ldots,\theta_{W,d}\), where \(\theta_{w,d}\) is the probability of drawing word \(w\) in document \(d\). Note that \(\theta_{w,d} \geq 0\) for all \(w \in W\), and \(\sum_w \theta_{w,d} = 1\).

With this model in place, the probability of observing the set of words in document \(d\) is given by

\[ Pr(d|\theta_d) \varpropto \prod_{w=1}^{W} \theta_{w,d}^{x_{w,d}} \]

where \(\theta_d\) collects parameters \(\{\theta_{1,d},\ldots,\theta_{W,d}\}\).
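As a concrete reference, here is a minimal NumPy sketch of this (unnormalized) log-likelihood for a single document; the array names and the small constant guarding against \(\log 0\) are choices made here, not part of the assignment.

import numpy as np

def multinomial_log_likelihood(x_d, theta_d, eps=1e-12):
  """Log of Pr(d | theta_d), up to the multinomial coefficient.

  x_d     : length-W array of word counts for one document
  theta_d : length-W array of word probabilities (sums to 1)
  """
  # sum_w x_{w,d} * log(theta_{w,d}); eps guards against log(0)
  return np.sum(x_d * np.log(theta_d + eps))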

Problem 1: Prove that the Maximum Likelihood Estimates (MLEs) are given by \(\hat{\theta}_{w,d} = \frac{x_{w,d}}{n_d}\), that is, the number of times word \(w\) appears in document \(d\) divided by the total number of words in document \(d\). (A numerical sanity check of this result is sketched after the hints below.)

Hints:

  • Write the MLE estimation problem as a constrained maximization problem

  • Write out the Lagrangian \(L(\theta_d,\alpha)\) (see lecture slides) for this maximization problem (use \(\alpha\) as the Lagrange multiplier)

  • Use optimality conditions from lecture to solve
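The proof should follow the hints above. As a purely numerical sanity check (not a substitute for the proof), you can verify that the empirical frequencies score at least as well as random points on the probability simplex; the counts and sizes below are placeholders.

# Numerical sanity check for Problem 1: the empirical frequencies should
# score at least as high as any randomly drawn probability vector.
import numpy as np

rng = np.random.default_rng(0)
x_d = rng.integers(0, 20, size=50)      # fake word counts for one document
theta_hat = x_d / x_d.sum()             # candidate MLE: x_{w,d} / n_d

def loglik(theta):
  return np.sum(x_d * np.log(theta + 1e-12))

random_thetas = rng.dirichlet(np.ones(50), size=1000)  # random points on the simplex
assert loglik(theta_hat) >= max(loglik(t) for t in random_thetas)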

A fully observed topic model

Let’s introduce topics now. Instead of modeling each document as a distribution \(d \sim \mathrm{Mult}(\{\theta_{1,d},\ldots,\theta_{W,d}\})\) over words, we model each document as a distribution over \(T\) topics, \(d \sim \mathrm{Mult}(\{p_{1,d},\ldots,p_{T,d}\})\). In turn, each topic \(t=1,\ldots,T\) is modeled as a distribution \(t \sim \mathrm{Mult}(\{\theta_{1,t},\ldots,\theta_{W,t}\})\) over words. Note that the topics are shared across documents in the dataset.

In pLSA, we learn topic distributions from observations by softly assigning each word occurrence to topics using the EM algorithm. We denote the latent word-topic assignments by \(\Delta_{w,d,t}\), the number of times word \(w\) is assigned to topic \(t\) in document \(d\).

Of course, we do not observe any of these latent word-topic assignments. However, it is helpful to think of the fully observed case to develop the EM algorithm.

Assuming we observe word occurrences \(x_{w,d}\) and latent word-topic assignments \(\Delta_{w,d,t}\) such that \(\sum_t \Delta_{w,d,t} = x_{w,d}\), the full-data probability is given by

\[ \mathrm{Pr}(D|\{p_d\},\{\theta_t\}) = \prod_{d=1}^D \prod_{w=1}^{W} \prod_{t=1}^T p_{t,d}^{\Delta_{w,d,t}}\theta_{w,t}^{\Delta_{w,d,t}} \]
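It may help, as a starting point for Problem 2, to note that taking the log of this probability separates the two parameter sets:

\[ \log \mathrm{Pr}(D|\{p_d\},\{\theta_t\}) = \sum_{d=1}^D \sum_{w=1}^{W} \sum_{t=1}^T \Delta_{w,d,t}\left(\log p_{t,d} + \log \theta_{w,t}\right) \]

so the maximizations over \(\{p_d\}\) and \(\{\theta_t\}\) decouple, each subject to its own sum-to-one constraint, much as in Problem 1.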

Problem 2: Prove that the MLEs are given by

\[ \hat{p}_{t,d} = \frac{\sum_{w=1}^W \Delta_{w,d,t}}{\sum_{t'=1}^T \sum_{w=1}^W \Delta_{w,d,t'}} \]

and

\[ \hat{\theta}_{w,t} = \frac{\sum_{d=1}^D \Delta_{w,d,t}}{\sum_{w'=1}^W \sum_{d=1}^D \Delta_{w',d,t}} \]
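For later reference in Part III, here is a minimal NumPy sketch of these estimators, assuming \(\Delta\) is stored as an array delta of shape (num_words, num_docs, num_topics); the shape convention and function name are choices made here, not prescribed by the assignment.

import numpy as np

def fully_observed_mle(delta):
  """MLEs from observed word-topic assignment counts delta[w, d, t]."""
  # p_hat[t, d] = sum_w delta[w, d, t] / sum_{t', w} delta[w, d, t']
  p_hat = delta.sum(axis=0).T / delta.sum(axis=(0, 2))[None, :]
  # theta_hat[w, t] = sum_d delta[w, d, t] / sum_{w', d} delta[w', d, t]
  theta_hat = delta.sum(axis=1) / delta.sum(axis=(0, 1))[None, :]
  return p_hat, theta_hat

Replacing delta with the responsibilities \(\gamma\) turns this into one candidate M-step for Problem 3.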

Part II: pLSA with EM Algorithm

Denote the responsibility of topic \(t\) for the occurrences of word \(w\) in document \(d\) as \(\gamma_{w,d,t}=E[\Delta_{w,d,t} \mid x_{w,d}, \{p_d\},\{\theta_t\}]\), i.e., the expected word-topic count given the observed counts and the current parameters.

Problem 3: Write out the M-step for the EM algorithm based on the result of Problem 2.

Problem 4: Show that the E-step for the EM algorithm, i.e., the update of \(\gamma_{w,d,t}\) given the current parameters \(\{p_d\}\) and \(\{\theta_t\}\), is given by

\[ \gamma_{w,d,t} = x_{w,d} \times \frac{p_{t,d}\theta_{w,t}}{\sum_{t'=1}^T p_{t',d}\theta_{w,t'}} \]
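Below is a vectorized NumPy sketch of this update, assuming doc_mat has shape (num_docs, num_words), p has shape (num_topics, num_docs), theta has shape (num_words, num_topics), and gamma has shape (num_words, num_docs, num_topics), matching the skeleton in Part III; these shape conventions are assumptions, not requirements.

import numpy as np

def estep(doc_mat, gamma, p, theta, eps=1e-12):
  """Fill gamma[w, d, t] in place with the expected counts above."""
  # unnorm[w, d, t] = p[t, d] * theta[w, t]
  unnorm = theta[:, None, :] * p.T[None, :, :]
  # normalize over topics, then scale by the observed counts x_{w,d}
  gamma[:] = doc_mat.T[:, :, None] * unnorm / (unnorm.sum(axis=2, keepdims=True) + eps)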

Part III: pLSA with EM Implementation

Part IIIA: Simulating data

Problem 5: Write Python code to generate a test corpus. See the lecture notes on how to do this.
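One possible approach, sketched below, is to draw document-topic and topic-word distributions from Dirichlet distributions and then sample word counts from the resulting mixture; the sizes and concentration parameters are placeholders, and the procedure in the lecture notes should take precedence.

import numpy as np

def simulate_corpus(num_docs=20, num_words=100, num_topics=3, doc_len=200, seed=0):
  """Generate (doc_mat, p, theta) from the pLSA mixture model."""
  rng = np.random.default_rng(seed)
  # p[t, d]: topic proportions per document; theta[w, t]: word probabilities per topic
  p = rng.dirichlet(np.ones(num_topics), size=num_docs).T
  theta = rng.dirichlet(np.ones(num_words) * 0.1, size=num_topics).T
  doc_mat = np.zeros((num_docs, num_words), dtype=int)
  for d in range(num_docs):
    word_probs = theta @ p[:, d]          # mixture over topics for document d
    doc_mat[d, :] = rng.multinomial(doc_len, word_probs)
  return doc_mat, p, theta

Keeping the true p and theta around lets you compare recovered topics against the ground truth on this small corpus.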

Part IIIB: pLSA using EM

Problem 6: Implement pLSA using the EM updates from Problems 3 and 4.

Notes:

  • For the pLSA topic model we set out here, the probability of the observed word-document occurrences is given by the mixture distribution below (a code sketch of this quantity follows these notes)

\[ Pr(D|\{p_d\},\{\theta_t\}) = \prod_{d=1}^D \prod_{w=1}^W \left( \sum_{t=1}^T p_{t,d} \theta_{w,t} \right)^{x_{w,d}} \]

  • You will need to determine how to initialize the parameters \(\{p_d\}\) and \(\{\theta_t\}\) (see lecture notes on the Dirichlet distribution; one possible initialization is sketched after the skeleton below)

  • You will need to determine how to test for convergence (one option is sketched after the skeleton below)

  • You will need to determine how to deal with local optima (e.g., you can use multiple random initializations and choose the model with the largest likelihood).

  • We highly recommend that you test your function on a small simulated dataset, i.e., the data you generate above
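As referenced in the first note above, here is a minimal sketch of the observed-data log-likelihood implied by the mixture formula, under the same shape conventions as before (doc_mat is num_docs x num_words, p is num_topics x num_docs, theta is num_words x num_topics); the small constant guarding against \(\log 0\) is a choice made here.

import numpy as np

def log_likelihood(doc_mat, p, theta, eps=1e-12):
  """Observed-data log-likelihood of the pLSA mixture model."""
  # mix[w, d] = sum_t theta[w, t] * p[t, d]
  mix = theta @ p
  return np.sum(doc_mat.T * np.log(mix + eps))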

The skeleton of the function should be something like this:

import numpy as np

def plsa_em(doc_mat, num_topics=5, max_iter=20, eps=1e-3):
  num_docs, num_words = doc_mat.shape
  p, theta = init_parameters(num_docs, num_words, num_topics)
  # responsibilities gamma[w, d, t], allocated once up front
  gamma = np.zeros((num_words, num_docs, num_topics))

  i = 0
  while True:
    # estep overwrites gamma in place since it can get rather large
    estep(doc_mat, gamma, p, theta)
    new_p, new_theta = mstep(doc_mat, gamma)
    converged = check_convergence(doc_mat, new_p, new_theta, p, theta, eps)
    # keep the most recent parameters so they are the ones returned
    p, theta = new_p, new_theta
    if i >= max_iter or converged:
      break
    i += 1
  return p, theta
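The helper functions init_parameters, estep, mstep, and check_convergence are yours to write. As noted in the list above, one hedged possibility for init_parameters (symmetric Dirichlet draws) and check_convergence (relative change in log-likelihood, reusing the log_likelihood sketch above) is the following; the concentration value and tolerance semantics are choices made here.

import numpy as np

def init_parameters(num_docs, num_words, num_topics, conc=1.0):
  """Random starting point: each column is a draw from a symmetric Dirichlet."""
  p = np.random.dirichlet(np.ones(num_topics) * conc, size=num_docs).T       # (T, D)
  theta = np.random.dirichlet(np.ones(num_words) * conc, size=num_topics).T  # (W, T)
  return p, theta

def check_convergence(doc_mat, new_p, new_theta, p, theta, eps):
  """Stop when the relative improvement in log-likelihood drops below eps."""
  old_ll = log_likelihood(doc_mat, p, theta)
  new_ll = log_likelihood(doc_mat, new_p, new_theta)
  return abs(new_ll - old_ll) <= eps * abs(old_ll)

Monitoring the log-likelihood has the added benefit that it should be non-decreasing across EM iterations, which is a useful debugging check.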

Part IV: LDA with Gibbs Sampling

Implement Latent Dirichlet Allocation (LDA) with Gibbs sampling. See lecture notes for details.

The overall structure of the function should be

def lda_gibbs(doc_mat, num_topics=5, alpha=1, beta=1, total_iter=200, burn_iter=100):
  num_docs, num_words = doc_mat.shape
  p, theta = init_parameters(num_docs, num_words, num_topics, alpha, beta)
  gamma = np.zeros((num_words, num_docs, num_topics))
  delta = np.zeros((num_words, num_docs, num_topics))

  num_estimates = total_iter - burn_iter
  delta_estimates = np.zeros((num_words, num_docs, num_topics, num_estimates))

  i = 0
  while True:
    # overwrite gamma since it can get rather large
    compute_gamma(gamma, p, theta)

    # overwrite delta since it can get rather large
    sample_delta(doc_mat, gamma, delta)

    # keep estimate if past burn-in iterations
    if i >= burn_iter:
      j = i - burn_iter
      delta_estimates[:, :, :, j] = delta

    p = sample_p(delta, alpha)
    theta = sample_theta(delta, beta)

    # stop after total_iter iterations (i runs from 0 to total_iter - 1)
    if i >= total_iter - 1:
      break

    i += 1

  final_delta = np.mean(delta_estimates, axis=3)
  p = get_final_p(final_delta, alpha)
  theta = get_final_theta(final_delta, beta)
  return p, theta
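As with the pLSA skeleton, the helpers here (compute_gamma, sample_delta, sample_p, sample_theta, and the final-estimate functions) are for you to write. Below is a hedged sketch of two of them under the same shape conventions as before, assuming compute_gamma fills gamma[w, d, :] with normalized topic probabilities (without the \(x_{w,d}\) factor used in the EM E-step) and that alpha is a scalar symmetric prior; verify the conditional distributions against the lecture notes before relying on this.

import numpy as np

def sample_delta(doc_mat, gamma, delta):
  """Sample assignment counts delta[w, d, :] ~ Mult(x_{w,d}, gamma[w, d, :])."""
  num_docs, num_words = doc_mat.shape
  for d in range(num_docs):
    for w in range(num_words):
      delta[w, d, :] = np.random.multinomial(doc_mat[d, w], gamma[w, d, :])

def sample_p(delta, alpha):
  """Sample each document's topic proportions from its Dirichlet conditional."""
  num_words, num_docs, num_topics = delta.shape
  p = np.empty((num_topics, num_docs))
  for d in range(num_docs):
    # posterior counts: alpha plus words assigned to each topic in document d
    p[:, d] = np.random.dirichlet(alpha + delta[:, d, :].sum(axis=0))
  return p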

Part V: Applying Methods

Use your pLSA and LDA implementations to learn topics from the Reuters dataset. Instructions on how to prepare data will be posted separately.

Analysis 1: Compare topics learned from pLSA and LDA given \(T=3\)

Analysis 2: Compare LDA topics learned for different values of \(T\)

Run LDA with a few different values of \(T\) (the number of topics to learn). Do your topic distributions reflect sensible topics?
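For both analyses, it may help to inspect the highest-probability words in each learned topic. A small sketch follows, assuming vocab is a list of word strings aligned with the columns of doc_mat (something you will build while preparing the Reuters data).

import numpy as np

def top_words(theta, vocab, k=10):
  """Print the k most probable words per topic (theta is num_words x num_topics)."""
  num_topics = theta.shape[1]
  for t in range(num_topics):
    top = np.argsort(theta[:, t])[::-1][:k]
    print("topic %d: %s" % (t, ", ".join(vocab[w] for w in top)))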