We will use the 20 Newsgroups dataset for this analysis. The preprocessing approach is based on https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)
Documents in the dataset come from different newsgroups, so they are roughly divided into topics. We will see whether your pLSA and LDA implementations can recover these topics, and whether there are topics within each newsgroup or topics that cut across newsgroups.
newsgroups_train.target_names
We will only look at documents from three newsgroups.
targets_to_keep = [4, 9, 13]  # indices into newsgroups_train.target_names
keep_article = np.isin(newsgroups_train.target, targets_to_keep)  # boolean mask over documents
np.sum(keep_article)  # number of documents kept
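As a quick check, the kept target indices can be mapped back to their newsgroup names:
# names of the newsgroups we keep
[newsgroups_train.target_names[i] for i in targets_to_keep]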
Now preprocess the documents: remove stopwords and short tokens, then lemmatize and stem the remaining words.
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
nltk.download('wordnet')
stemmer = SnowballStemmer("english")
def lemmatize_stemming(text):
    # lemmatize (treating the token as a verb), then stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # tokenize and lowercase, drop stopwords and short tokens, then lemmatize and stem
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
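Before processing the full collection, it can help to sanity-check the pipeline on a short made-up sentence; the exact tokens returned depend on the stemmer and lemmatizer.
# stopwords and short tokens are dropped, the rest is lemmatized and stemmed
preprocess("The pitchers were throwing curveballs during the playoff games")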
docs_to_process = np.array(newsgroups_train.data)[keep_article]
processed_docs = []
for doc in docs_to_process:
    processed_docs.append(preprocess(doc))
Save the processed documents to disk for later use.
import pickle
with open("processed_docs.pkl", 'wb') as pkl_file:
    pickle.dump(processed_docs, pkl_file)
# load processed documents
# with open("processed_docs.pkl", 'rb') as pkl_file:
#     processed_docs = pickle.load(pkl_file)
Extract (at most) 10,000 words that occurred in the processed documents, removing words that are too frequent or too rare. Once that is done, construct the bag-of-words representation of each document.
dictionary = gensim.corpora.Dictionary(processed_docs)
# keep words that appear in at least 15 documents and in no more than 10% of documents,
# then keep at most the 10,000 most frequent of those
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=10000)
dictionary.save("newsgroup_dictionary.pkl")
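It is worth checking how many words survive the filtering (at most 10,000 by construction); the saved dictionary can be reloaded later in the same way as the other saved artifacts.
# number of unique words kept after filtering
len(dictionary)
# Use the following to load the dictionary
# dictionary = gensim.corpora.Dictionary.load("newsgroup_dictionary.pkl")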
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
doc = corpus[2]
for word_id, count in doc:
    print("Word {} (\"{}\") appears {} times.".format(word_id, dictionary[word_id], count))
Finally, create a document-word matrix that we can use for the analysis.
from gensim.matutils import corpus2csc
# corpus2csc builds a (words x documents) sparse matrix,
# so transpose it to get documents as rows and words as columns
doc_mat = corpus2csc(corpus)
doc_mat = doc_mat.T
doc_mat
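A quick consistency check: after the transpose there should be one row per document and one column per dictionary word.
# expect the shape to match (number of documents, number of words in the dictionary)
print(doc_mat.shape, len(corpus), len(dictionary))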
Save the resulting matrix.
import scipy.sparse
scipy.sparse.save_npz("newsgroup_docmat", doc_mat)
# Use the following to load matrix
# doc_mat = scipy.sparse.load_npz("newsgroup_docmat.npz")