Project 1: Similar document searching via MinHash and Locality Sensitive Hashing
Due: Multiple due dates, see below
Posted: Sept. 1, 2018
Last Update: Sept. 19, 2018
In this first project we will implement the system described in the lecture notes for similar document searching. This project is inspired by http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/ (Note: you can look at code there for inspiration but implement your own).
The Task
We will use documents from this repository http://www.inf.ed.ac.uk/teaching/courses/tts/assessed/assessment3.html. This is a dataset of documents in which we want to find possible plagiarism. It consists of 10,000 documents, some pairs of which are tagged as instances of plagiarism. The goal of this exercise is to see how effectively and efficiently a minhash and LSH system can identify these instances.
Note that smaller subsets of data suitable for testing are available here: https://github.com/chrisjmccormick/MinHash/tree/master/data
Part I: Preliminaries
DUE: Monday Sept. 10
Part IA: Dataset parsing
Write a function `parse_data` that, given the path to a data file, reads in the article data and returns a list of tuples, with:
- one tuple per article (there is one article per line)
- each tuple of the form `(id, string)`, where `id` is the article id and `string` is the article text, processed as follows:
  - remove all punctuation
  - change all letters to lowercase
  - remove all whitespace so that all words are concatenated
The function should have skeleton:
def parse_data(filename):
    # read lines from filename
    # for each line, construct a tuple of id and text
    # process the text as described above
    # return a list of (id, processed string) tuples
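For reference, here is a minimal sketch of what such a function could look like. It assumes each line of the data file begins with the article id followed by the article text, separated by whitespace (check the actual files to confirm this format):

import string

def parse_data(filename):
    # assumes each line is "<id> <article text>"; verify against the actual data files
    articles = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            docid, _, text = line.strip().partition(" ")
            # lowercase, strip punctuation, then remove all whitespace
            text = text.lower()
            text = text.translate(str.maketrans("", "", string.punctuation))
            text = "".join(text.split())
            articles.append((docid, text))
    return articles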
Part IB: Document shingles
Write a function `shingle_document` that, given a processed article string and a parameter `k`, shingles the document as follows:
- each substring of length $k$ in the document is hashed to a 32-bit integer (see the `crc32` function in https://docs.python.org/3/library/binascii.html)
- returns the set of unique 32-bit integers obtained in the previous step (look at Python `set`s for this)
The function should have skeleton
def shingle_document(string, k):
    # initialize set data structure
    # for each position in string,
    #   extract substring of length k
    #   hash substring into 32-bit integer using crc32
    #   insert into set
    # return set
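A minimal sketch, using `binascii.crc32` as the 32-bit hash and assuming the input is UTF-8 text:

from binascii import crc32

def shingle_document(string, k):
    # hash every length-k substring to a 32-bit integer and keep the unique values
    shingles = set()
    for i in range(len(string) - k + 1):
        shingles.add(crc32(string[i:i + k].encode("utf-8")))
    return shingles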
Part IC: Jaccard Similarity
Write a function `jaccard` that, given two shingled documents, computes their Jaccard similarity.
Function should have skeleton
def jaccard(a, b):
    # compute intersection size
    # compute union size
    # return ratio of intersection size to union size
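A sketch using Python set operations (the guard for two empty sets is an added convention, not part of the spec):

def jaccard(a, b):
    # a and b are sets of 32-bit shingle hashes
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    # Jaccard similarity = |intersection| / |union|
    return len(a & b) / len(a | b)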
Part ID: Put these together
Write a function that uses the above to do the following:
- Parse a file with data
- Return a list of tuples, one per pair of documents, where each tuple has the form `(id1, id2, s)`, with `id1` and `id2` the document ids and `s` the computed Jaccard similarity
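One way the pieces could fit together; the function name `similarities` and the use of `itertools.combinations` to enumerate document pairs are choices made here, not part of the spec:

from itertools import combinations

def similarities(filename, k):
    # parse and shingle every article, then compute Jaccard similarity for each pair
    docs = parse_data(filename)
    shingled = [(docid, shingle_document(text, k)) for docid, text in docs]
    return [(id1, id2, jaccard(s1, s2))
            for (id1, s1), (id2, s2) in combinations(shingled, 2)]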
Part IE: Experiment 0
Use your function to answer the following question:
What is the effect of the shingle length `k` on the Jaccard similarity of plagiarism instances versus instances that are not plagiarized?
To answer this question, make a plot with $k$ on the x-axis and average Jaccard similarity on the y-axis. Plot two lines, one line for plagiarism instances, one line for instances that are not plagiarized. Use the 1000 document dataset for this.
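A sketch of the experiment loop, assuming the `similarities` helper sketched above, a set `truth_pairs` of plagiarized id pairs already loaded from the corresponding .truth file, and matplotlib for plotting (the range of `k` values shown is arbitrary):

import matplotlib.pyplot as plt

def experiment0(filename, truth_pairs, ks=(2, 4, 6, 8, 10)):
    plag_avgs, other_avgs = [], []
    for k in ks:
        plag, other = [], []
        for id1, id2, s in similarities(filename, k):
            if (id1, id2) in truth_pairs or (id2, id1) in truth_pairs:
                plag.append(s)
            else:
                other.append(s)
        plag_avgs.append(sum(plag) / len(plag))
        other_avgs.append(sum(other) / len(other))
    plt.plot(ks, plag_avgs, label="plagiarism instances")
    plt.plot(ks, other_avgs, label="non-plagiarism instances")
    plt.xlabel("k")
    plt.ylabel("average Jaccard similarity")
    plt.legend()
    plt.show()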
Part II: MinHash
DUE: Tuesday Sept. 18
In this section you will implement the MinHash algorithm and perform an experiment to see how well it estimates Jaccard similarity.
Part IIA: Prepare shingles for processing
Implement a function that takes the shingled documents and returns a list of item-document pairs sorted by item, which we'll use to compute the minhash signature of each document. Remember that because of the shingling logic we used above, we represent items as 32-bit integers. The function specs are as follows:
- Input is a list of tuples of the form `(docid, [items])`
- Output will be a tuple with two elements:
  - a list of tuples of the form `(item, docid)`, containing one entry for each item appearing in each document
  - a list of document ids found in the dataset
- The `(item, docid)` list should be sorted by `item`
This function should have skeleton
def invert_shingles(shingled_documents):
    # initialize list for tuples
    # initialize list for document ids
    # for each document in input
    #   append document id to list
    #   for each item in document
    #     append (item, docid) tuple
    # sort tuple list
    # return sorted tuple list, and document list
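A sketch of one possible implementation:

def invert_shingles(shingled_documents):
    # build the (item, docid) list and collect document ids
    inv_index = []
    docids = []
    for docid, items in shingled_documents:
        docids.append(docid)
        for item in items:
            inv_index.append((item, docid))
    # sort by item so all occurrences of an item are adjacent
    inv_index.sort()
    return inv_index, docids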
Part IIB: Generate hash functions
Use the `make_random_hash_fn` function below (see Helpers) to create a function `make_hashes`. Given input `num_hashes`, this function will return a list of hash functions that mimic the random permutation approach used in the MinHash calculation. The function specs are:
- Input is an integer `num_hashes`
- Output is a list of `num_hashes` hash functions created by the function `make_random_hash_fn`
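A sketch of `make_hashes` under this spec:

def make_hashes(num_hashes):
    # one random universal hash function per simulated row permutation
    return [make_random_hash_fn() for _ in range(num_hashes)]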
Part IIC: Construct the Minhash Signature Matrix
Implement a function that builds the Minhash signature matrix. You can use this code as a starting point. It refers to the functions you implemented above and sketches the construction algorithm.
import numpy as np

def make_minhash_signature(shingled_data, num_hashes):
    inv_index, docids = invert_shingles(shingled_data)
    num_docs = len(docids)

    # initialize the signature matrix with infinity in every entry
    sigmatrix = np.full([num_hashes, num_docs], np.inf)

    # generate hash functions
    hash_funcs = make_hashes(num_hashes)

    # iterate over each non-zero entry of the characteristic matrix
    for row, docid in inv_index:
        # update signature matrix if needed
        # THIS IS WHAT YOU NEED TO IMPLEMENT
        pass

    return sigmatrix, docids
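One way the update step inside the loop could look (a sketch; in practice you may want to precompute a docid-to-column dictionary rather than calling `docids.index` on every iteration):

# inside the loop over (row, docid) pairs:
col = docids.index(docid)          # column of the signature matrix for this document
for i, h in enumerate(hash_funcs):
    hashed_row = h(row)            # simulated permuted row index for hash function i
    if hashed_row < sigmatrix[i, col]:
        sigmatrix[i, col] = hashed_row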
Part IID: MinHash similarity estimate
Write a function that computes the similarity of two documents using the minhash matrix computed above. The function specs are:
- Input:
  - `id1`, `id2`: document ids
  - `minhash_sigmat`: minhash signature matrix
  - `docids`: list of document ids, used to index the columns of the minhash signature matrix
- Output: Jaccard similarity estimated using minhash
Here is the function skeleton:
def minhash_similarity(id1, id2, minhash_sigmat, docids):
    # get the columns of the signature matrix for the two documents
    # calculate the fraction of rows where the two columns match
    # return this fraction as the minhash similarity estimate
See the hint below on comparing numpy vectors.
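A sketch of this function:

import numpy as np

def minhash_similarity(id1, id2, minhash_sigmat, docids):
    # columns of the signature matrix for the two documents
    col1 = minhash_sigmat[:, docids.index(id1)]
    col2 = minhash_sigmat[:, docids.index(id2)]
    # fraction of hash functions on which the two signatures agree
    return np.mean(col1 == col2)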
Part IIE: Put these together
Write a function that given shingled documents computes the Minhash estimated similarities between each pair of documents. This will be similar to your function for Part ID.
Part IIF: Experiment 1
Use your function to carry out the following experiment:
What is the effect of the number of hash functions used to compute the MinHash signature on the accuracy of the MinHash estimate of Jaccard similarity? Carry out this experiment on the 1000 document dataset.
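A sketch of one way to set up this experiment; it assumes a dictionary `exact_sims` mapping `(id1, id2)` pairs to the exact Jaccard similarities from Part ID, and uses mean absolute error as the accuracy measure (one reasonable choice among several):

import numpy as np

def experiment1(shingled, exact_sims, hash_counts=(10, 25, 50, 100, 200)):
    errors = []
    for num_hashes in hash_counts:
        sigmatrix, docids = make_minhash_signature(shingled, num_hashes)
        errs = [abs(minhash_similarity(id1, id2, sigmatrix, docids) - s)
                for (id1, id2), s in exact_sims.items()]
        errors.append(np.mean(errs))
    # plot hash_counts vs. errors to visualize the effect
    return list(hash_counts), errors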
Part III: Locality-Sensitive Hashing
DUE: Tuesday Sept. 25
Implement LSH
Write a function that implements locality sensitive hashing. Function specifications:
- Input:
  - `minhash_sigmatrix`: a minhash signature matrix
  - `numhashes`: number of hash functions used to construct the minhash signature matrix
  - `docids`: list of document ids
  - `threshold`: a minimum Jaccard similarity threshold
- Output:
  - a list of hash tables, one per band
from collections import defaultdict

def do_lsh(minhash_sigmatrix, numhashes, docids, threshold):
    # choose the number of bands, and rows per band, to use in LSH
    b, _ = _choose_nbands(threshold, numhashes)
    r = int(numhashes / b)
    narticles = len(docids)

    # generate a random hash function that takes vectors of length r as input
    hash_func = _make_vector_hash(r)

    # set up the list of hash tables, to be populated with one hash table per band
    buckets = []

    # fill hash tables for each band
    for band in range(b):
        # figure out which rows of the minhash signature matrix to hash for this band
        start_index = int(band * r)
        end_index = min(start_index + r, numhashes)

        # initialize hash table for this band
        cur_buckets = defaultdict(list)
        for j in range(narticles):
            # THIS IS WHAT YOU NEED TO IMPLEMENT
            pass

        # add this hash table to the list of hash tables
        buckets.append(cur_buckets)

    return buckets
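One possible way to fill in the inner loop (a sketch that reuses the variables set up in the skeleton):

# inside the loop over articles:
band_col = minhash_sigmatrix[start_index:end_index, j]  # this document's signature rows for the band
bucket_key = hash_func(band_col)                        # hash the band vector to a bucket id
cur_buckets[bucket_key].append(docids[j])               # documents sharing a bucket become candidates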
Find candidate similar article pairs
Write a function that uses the result of your LSH function and returns a list of candidate article pairs. Spec:
- Input: the result of `do_lsh`
- Output: list of tuples `(docid1, docid2)`, each a candidate similar article pair according to LSH
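A sketch of such a function; the name `get_lsh_candidates` is a choice made here:

from itertools import combinations

def get_lsh_candidates(buckets):
    # any two documents that share a bucket in any band are candidate pairs
    candidates = set()
    for cur_buckets in buckets:
        for bucket_docs in cur_buckets.values():
            for pair in combinations(sorted(bucket_docs), 2):
                candidates.add(pair)
    return list(candidates)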
Experiment 2: LSH sensitivity
Use these functions to compute the sensitivity and specificity of LSH as a function of the `threshold` parameter. Use the 10,000 document dataset to perform this experiment.
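A sketch of the bookkeeping, assuming `candidates` is the candidate pair list from the previous step, `truth_pairs` is the set of plagiarized id pairs from the .truth file, and `all_ids` is the list of document ids; sensitivity here is the fraction of true plagiarism pairs reported as candidates, and specificity is the fraction of non-plagiarism pairs not reported:

def lsh_sensitivity_specificity(candidates, truth_pairs, all_ids):
    # normalize pairs so (a, b) and (b, a) compare equal
    cand = {tuple(sorted(p)) for p in candidates}
    truth = {tuple(sorted(p)) for p in truth_pairs}

    total_pairs = len(all_ids) * (len(all_ids) - 1) // 2
    tp = len(cand & truth)              # plagiarism pairs reported as candidates
    fn = len(truth - cand)              # plagiarism pairs missed
    fp = len(cand - truth)              # non-plagiarism pairs reported
    tn = total_pairs - len(truth) - fp  # non-plagiarism pairs not reported

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity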
Helpers
Obtaining data
You can use the following Python code to download data for the project:
import os
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/chrisjmccormick/MinHash/master/data"
PLAGIARISM_PATH = "datasets/plagiarism"
DATA_SIZES = [100, 1000, 2500, 10000]

def fetch_data(download_root=DOWNLOAD_ROOT,
               plagiarism_path=PLAGIARISM_PATH,
               data_sizes=DATA_SIZES,
               maxsize=1000):
    if not os.path.isdir(plagiarism_path):
        os.makedirs(plagiarism_path)
    for size in data_sizes:
        if size <= maxsize:
            train_file = "articles_" + str(size) + ".train"
            train_path = plagiarism_path + '/' + train_file
            if not os.path.exists(train_path):
                train_url = download_root + '/' + train_file
                urllib.request.urlretrieve(train_url, train_path)
            truth_file = "articles_" + str(size) + ".truth"
            truth_path = plagiarism_path + '/' + truth_file
            if not os.path.exists(truth_path):
                truth_url = download_root + "/" + truth_file
                urllib.request.urlretrieve(truth_url, truth_path)
Calling the function `fetch_data` will download data to a subdirectory pointed to by the variable `PLAGIARISM_PATH`. You can assign the path you want to use for your data to that variable. The `maxsize` argument of the `fetch_data` function is used to limit the size of data downloaded. For testing and development you should use the 1000 document dataset. You can get that with the function call `fetch_data(maxsize=1000)`.
Generating random hash functions
This function generates a random hash function suitable for mimicking permutations over 32-bit integers. Recall that since we are using `crc32` to represent items, we need random hash functions that generate 32-bit numbers.
import random

def make_random_hash_fn(p=2**33-355, m=4294967295):
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m
This is an example of how to use this function:
hash_fn = make_random_hash_fn()
hash_fn(12345)
Some notes: this implements a universal hash function for 32-bit integers, which will ensure the result corresponds to a permutation of rows of the characteristic matrix as required by Minhash (see https://en.wikipedia.org/wiki/Universal_hashing). Here `m` is the largest number returned by `crc32`, and `p` is a prime number larger than `m`.
Comparing numpy vectors
The following bit of code shows how to use numpy to compare two integer vectors as required to compute the minhash similarity estimate.
import numpy as np

# generate two vectors of length 50 with random integers from 0 to 100
a = np.random.randint(100, size=50)
b = np.random.randint(100, size=50)

# compute the fraction of entries in which the two vectors are equal
np.mean(a == b)
Choosing the number of bands for LSH
Given a similarity threshold, we need to choose the number of bands to use in LSH. Use this function to do this:
import math
import scipy.optimize as opt

def _choose_nbands(t, n):
    # find the number of bands b that makes the LSH threshold (1/b)^(1/r), with r = n/b,
    # as close as possible to the requested threshold t
    def _error_fun(x):
        cur_t = (1 / x[0]) ** (x[0] / n)
        return (t - cur_t) ** 2

    opt_res = opt.minimize(_error_fun, x0=(10), method='Nelder-Mead')
    b = int(math.ceil(opt_res['x'][0]))
    r = int(n / b)
    final_t = (1 / b) ** (1 / r)
    return b, final_t
Hashing a vector
In LSH, for each band, we hash the r signature values of each document. We can use the following function to generate a hash function for vectors:
def _make_vector_hash(num_hashes, m=4294967295):
    hash_fns = make_hashes(num_hashes)
    def _f(vec):
        acc = 0
        for i in range(len(vec)):
            h = hash_fns[i]
            acc += h(vec[i])
        return acc % m
    return _f