Getting data for your homework

We will use the 20 newsgroups dataset for analysis. The preprocessing ideas below are based on https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)
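
If you want a quick sanity check that the download worked, the following optional cell (a small addition, not part of the required pipeline) prints how many raw training documents were fetched and peeks at the start of one post:

In [ ]:
# Optional check: number of raw training documents and a peek at the first post
print(len(newsgroups_train.data), "training documents")
print(newsgroups_train.data[0][:300])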

Documents in the dataset come from different newsgroups, so they are roughly divided into topics. We will see whether your pLSA and LDA implementations can recover these topics, and whether they find topics within each newsgroup or topics that cut across newsgroups.

In [2]:
newsgroups_train.target_names
Out[2]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We will only look at documents from three newsgroups: comp.sys.mac.hardware, rec.sport.baseball, and sci.med (target indices 4, 9, and 13).

In [3]:
targets_to_keep = [4, 9, 13]
keep_article = np.isin(newsgroups_train.target, targets_to_keep)
np.sum(keep_article)
Out[3]:
1769
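
To double-check which newsgroups those target indices correspond to, this optional cell maps them back to names (you should see comp.sys.mac.hardware, rec.sport.baseball, and sci.med):

In [ ]:
# Optional: map the kept target indices back to newsgroup names
[newsgroups_train.target_names[t] for t in targets_to_keep]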

Now preprocess the documents: remove stopwords, then lemmatize and stem the remaining words.

In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

import nltk
nltk.download('wordnet')

stemmer = SnowballStemmer("english")

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
    
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hcorrada/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
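
Before processing the whole corpus, you can try preprocess on a single raw document to see its effect. This optional cell shows the first few resulting tokens, which are lowercased, stopword-free, lemmatized, and stemmed:

In [ ]:
# Optional: preview what preprocessing does to one raw post
preprocess(newsgroups_train.data[0])[:10]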
In [5]:
docs_to_process = np.array(newsgroups_train.data)[keep_article]
processed_docs = []
for doc in docs_to_process:
    processed_docs.append(preprocess(doc))

Save the processed documents to disk for later use.

In [6]:
import pickle
with open("processed_docs.pkl", 'wb') as pkl_file:
    pickle.dump(processed_docs, pkl_file)

# load processed documents 
# with open("processed_docs.pkl", 'rb') as pkl_file:
#   processed_docs = pickle.load(pkl_file)

Extract (at most) 10,000 words that occurred in the processed documents, removing words that are too frequent or too rare. Once that is done, construct the bag-of-words representation for each document.

In [7]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=10000)
dictionary.save("newsgroup_dictionary.pkl")
In [8]:
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
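
As a quick optional check, the cell below reports the vocabulary size after filtering (1,590 words survive here, matching the number of columns in the document-word matrix built later) and the number of documents in the corpus:

In [ ]:
# Optional: vocabulary size after filtering and number of documents in the corpus
print(len(dictionary), "words kept;", len(corpus), "documents")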
In [9]:
doc = corpus[2]
for i in range(len(doc)):
    print ("Word {} (\"{}\") appears {} times.".format(doc[i][0],
                                                      dictionary[doc[i][0]],
                                                      doc[i][1]))
Word 27 ("request") appears 1 times.
Word 47 ("email") appears 1 times.
Word 55 ("info") appears 1 times.
Word 87 ("brain") appears 1 times.
Word 88 ("brian") appears 2 times.
Word 89 ("chicago") appears 1 times.
Word 90 ("couldn") appears 1 times.
Word 91 ("delet") appears 1 times.
Word 92 ("direct") appears 1 times.
Word 93 ("file") appears 1 times.
Word 94 ("hmmm") appears 1 times.
Word 95 ("instead") appears 1 times.
Word 96 ("midway") appears 1 times.
Word 97 ("public") appears 1 times.
Word 98 ("respond") appears 1 times.
Word 99 ("sean") appears 1 times.
Word 100 ("treatment") appears 2 times.
Word 101 ("tri") appears 1 times.
Word 102 ("uchicago") appears 2 times.

Finally, create a document-word matrix (documents in rows, words in columns) that we can use for analysis.

In [10]:
from gensim.matutils import corpus2csc
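# corpus2csc builds a (words x documents) sparse matrix, so transpose to put documents in rows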
doc_mat = corpus2csc(corpus)
doc_mat = doc_mat.T
doc_mat
Out[10]:
<1769x1590 sparse matrix of type '<class 'numpy.float64'>'
	with 66964 stored elements in Compressed Sparse Row format>

Save the resulting matrix.

In [11]:
import scipy.sparse
scipy.sparse.save_npz("newsgroup_docmat", doc_mat)
In [12]:
# Use the following to load the matrix
# doc_mat = scipy.sparse.load_npz("newsgroup_docmat.npz")
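
To start your analysis in a fresh session, an optional cell like the following (a sketch using the filenames saved above) reloads the matrix and the dictionary and checks that they line up; column j of the matrix corresponds to dictionary[j]:

In [ ]:
# Optional: reload the saved document-word matrix and dictionary and check they agree
import scipy.sparse
import gensim
doc_mat = scipy.sparse.load_npz("newsgroup_docmat.npz")
dictionary = gensim.corpora.Dictionary.load("newsgroup_dictionary.pkl")
print(doc_mat.shape)    # (documents, words)
print(len(dictionary))  # should equal the number of columns
print(dictionary[0])    # the word corresponding to column 0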