## Operating on all pairs of rows in two datasets¶

A quick note about operating on all pairs of rows in two dataframes.

In [9]:
# import libraries
import pandas as pd
import numpy as np
import datetime

np.set_printoptions(precision = 2)

In [10]:
# import data from CSV files
# these are the dataframes generated from the R version of this notebook
# (file paths below are placeholders; the original paths were lost in export)
df1 = pd.read_csv('df1.csv',
                  dtype={'cat': str},
                  parse_dates=['date'])
df2 = pd.read_csv('df2.csv',
                  dtype={'cat': str},
                  parse_dates=['date'])

print(df1)
print(df2)

   rowid       date cat       num         a
0      1 2018-01-28   a -4.275534 -3.946133
1      2 2018-01-16   e -4.663584 -6.819080
2      3 2018-01-22   d -6.265544 -9.200082
3      4 2018-02-06   b -5.355482 -5.624009
4      5 2018-02-13   b -3.667751  6.211971
   rowid       date cat       num         b
0      1 2018-01-21   c -0.120782 -3.838105
1      2 2018-02-09   d  5.024004  0.170951
2      3 2018-01-20   d -6.507004 -8.967068
3      4 2018-01-03   c  6.967848  1.291397
4      5 2018-02-05   a  7.296677 -7.570396
5      6 2018-01-31   c -9.162854  7.856728
6      7 2018-01-29   d -3.656357 -9.707455
7      8 2018-01-05   b -9.725001  5.662422
8      9 2018-01-02   c -5.219485 -8.200773
9     10 2018-02-09   b  4.129892  0.383800


## Similarity functions¶

We will define a similarity function between rows of the two dataframes. For dates we compute the squared difference in days; for numeric values we likewise compute the squared difference $d$. In both cases we turn the difference $d$ into a similarity with the transformation $e^{-d}$. For the categorical attribute, similarity is 10 if the values are equal and 0 otherwise.

In [11]:
# define attribute similarity functions

# convert a distance to a similarity
def diff_to_similarity(d):
    return np.exp(-d)

# numeric similarity
def num_similarity(v1, v2):
    d = (v1 - v2)**2
    return diff_to_similarity(d)

# date similarity: convert time difference to
# numeric in 'days' units, then compute
# numeric similarity
def date_similarity(v1, v2):
    d = ((v1 - v2) / datetime.timedelta(days=1))**2
    return diff_to_similarity(d)

# categorical similarity
def cat_similarity(v1, v2):
    return 10 if v1 == v2 else 0
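As a quick sanity check of these functions (the sample values below are made up for illustration): identical inputs give the maximal similarity, and a one-day gap gives $e^{-1} \approx 0.37$.

```python
import datetime
import numpy as np

# same definitions as above
def diff_to_similarity(d):
    return np.exp(-d)

def num_similarity(v1, v2):
    return diff_to_similarity((v1 - v2)**2)

def date_similarity(v1, v2):
    d = ((v1 - v2) / datetime.timedelta(days=1))**2
    return diff_to_similarity(d)

def cat_similarity(v1, v2):
    return 10 if v1 == v2 else 0

# identical values give maximal similarity
print(num_similarity(1.5, 1.5))    # 1.0
print(cat_similarity('a', 'a'))    # 10

# dates one day apart: exp(-1) ~ 0.37
d1 = datetime.datetime(2018, 1, 1)
d2 = datetime.datetime(2018, 1, 2)
print(round(date_similarity(d1, d2), 2))    # 0.37
```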


## Using a similarity matrix¶

We create a similarity matrix and iterate over rows of the two tables to fill in values of the matrix.

In [16]:
# allocate similarity matrix
sim_matrix = np.empty((len(df1), len(df2)))

# fill similarity matrix
for i in range(len(df1)):
    for j in range(len(df2)):
        s = date_similarity(df1.loc[i, 'date'], df2.loc[j, 'date'])
        s += cat_similarity(df1.loc[i, 'cat'], df2.loc[j, 'cat'])
        s += num_similarity(df1.loc[i, 'num'], df2.loc[j, 'num'])
        sim_matrix[i, j] = s

np.round(sim_matrix, 2)

Out[16]:
array([[0.00e+00, 0.00e+00, 1.00e-02, 0.00e+00, 1.00e+01, 0.00e+00,
        1.05e+00, 0.00e+00, 4.10e-01, 0.00e+00],
       [0.00e+00, 0.00e+00, 3.00e-02, 0.00e+00, 0.00e+00, 0.00e+00,
        3.60e-01, 0.00e+00, 7.30e-01, 0.00e+00],
       [3.70e-01, 1.00e+01, 1.10e+01, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+01, 0.00e+00, 3.30e-01, 0.00e+00],
       [0.00e+00, 0.00e+00, 2.70e-01, 0.00e+00, 3.70e-01, 0.00e+00,
        6.00e-02, 1.00e+01, 9.80e-01, 1.00e+01],
       [0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+00, 1.00e+01, 9.00e-02, 1.00e+01]])
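The double loop above can also be vectorized with NumPy broadcasting, avoiding the Python-level iteration. A minimal sketch on small made-up dataframes (column names match the ones used above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'date': pd.to_datetime(['2018-01-28', '2018-01-16']),
                    'cat': ['a', 'e'],
                    'num': rng.normal(size=2)})
df2 = pd.DataFrame({'date': pd.to_datetime(['2018-01-21', '2018-02-09', '2018-01-20']),
                    'cat': ['c', 'e', 'd'],
                    'num': rng.normal(size=3)})

# pairwise differences via broadcasting: each has shape (len(df1), len(df2))
days = (df1['date'].values[:, None] - df2['date'].values[None, :]) / np.timedelta64(1, 'D')
num_d = df1['num'].values[:, None] - df2['num'].values[None, :]
cat_eq = df1['cat'].values[:, None] == df2['cat'].values[None, :]

# same formula as the attribute similarity functions above
sim = np.exp(-days**2) + np.exp(-num_d**2) + 10 * cat_eq
print(sim.shape)    # (2, 3)
```

For tables of this size the loop is fine, but broadcasting scales much better as the row counts grow.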

We can then use this similarity matrix as needed. To turn it into a tidy data frame for subsequent analysis, we could use something like this:

In [37]:
similarity_df = (pd.DataFrame(data=sim_matrix)
                 .reset_index()
                 .rename(columns={'index': 'df1_id'})
                 .melt(id_vars='df1_id', var_name='df2_id', value_name='similarity'))

similarity_df.head(10)

Out[37]:
df1_id df2_id similarity
0 0 0 3.185822e-08
1 1 0 1.103921e-09
2 2 0 3.678794e-01
3 3 0 1.257260e-12
4 4 0 3.436725e-06
5 0 1 2.764395e-38
6 1 1 1.744767e-41
7 2 1 1.000000e+01
8 3 1 1.234098e-04
9 4 1 1.125352e-07
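An equivalent reshaping uses `stack()` instead of `reset_index` plus `melt`; a sketch on a small example matrix:

```python
import numpy as np
import pandas as pd

sim_matrix = np.array([[1.0, 0.5],
                       [0.2, 10.0]])

# name the row/column axes, then stack columns into rows
similarity_df = (pd.DataFrame(sim_matrix)
                 .rename_axis(index='df1_id', columns='df2_id')
                 .stack()
                 .reset_index(name='similarity'))
print(similarity_df)
```

`stack()` orders rows by `df1_id` first, whereas `melt` orders by `df2_id`; the contents are the same.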

To find which row in df2 has the highest similarity for each row of df1, we can use groupby as we have done previously. Since the way pandas handles scope and column names inside grouped operations can be unclear, here we simply iterate over the groups, extract the row with the maximum similarity from each, and build a new dataframe from the result.

In [71]:
def reduce_group(group):
    i = group.similarity.idxmax()
    return group.loc[i]

pd.DataFrame([reduce_group(group) for _, group in similarity_df.groupby('df1_id')])

Out[71]:
df1_id df2_id similarity
20 0 4 10.000000
41 1 8 0.734162
12 2 2 10.961680
48 3 9 10.000123
49 4 9 10.000000
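As an alternative to the explicit loop over groups, `idxmax` on the grouped similarity column selects the same rows in one expression; a sketch on made-up data:

```python
import pandas as pd

# hypothetical tidy similarity table in the same shape as similarity_df above
similarity_df = pd.DataFrame({
    'df1_id': [0, 0, 1, 1],
    'df2_id': [0, 1, 0, 1],
    'similarity': [0.3, 9.0, 7.5, 0.1],
})

# idxmax per group gives the row label of each group's maximum;
# .loc then pulls those full rows out of the original frame
best = similarity_df.loc[similarity_df.groupby('df1_id')['similarity'].idxmax()]
print(best)
```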