Operating on all pairs of rows in two datasets

A quick note about operating on all pairs of rows in two dataframes.

In [9]:
# import libraries
import pandas as pd
import numpy as np
import datetime

np.set_printoptions(precision = 2)
In [10]:
# import data from CSV files
# these are the dataframes generated from the R version of this notebook
df1 = pd.read_csv('data/two_tables_df1.csv',
                  dtype={'cat': str},   # np.str is deprecated; use builtin str
                  parse_dates=['date'])
df2 = pd.read_csv('data/two_tables_df2.csv',
                  dtype={'cat': str},
                  parse_dates=['date'])

print(df1)
print(df2)
   rowid       date cat       num         a
0      1 2018-01-28   a -4.275534 -3.946133
1      2 2018-01-16   e -4.663584 -6.819080
2      3 2018-01-22   d -6.265544 -9.200082
3      4 2018-02-06   b -5.355482 -5.624009
4      5 2018-02-13   b -3.667751  6.211971
   rowid       date cat       num         b
0      1 2018-01-21   c -0.120782 -3.838105
1      2 2018-02-09   d  5.024004  0.170951
2      3 2018-01-20   d -6.507004 -8.967068
3      4 2018-01-03   c  6.967848  1.291397
4      5 2018-02-05   a  7.296677 -7.570396
5      6 2018-01-31   c -9.162854  7.856728
6      7 2018-01-29   d -3.656357 -9.707455
7      8 2018-01-05   b -9.725001  5.662422
8      9 2018-01-02   c -5.219485 -8.200773
9     10 2018-02-09   b  4.129892  0.383800

Similarity functions

We will define a similarity function between rows of the two dataframes. For dates, we compute the squared difference in days between the two dates; for numeric values, we likewise compute the squared difference. In both cases we apply the transformation $\exp(-d)$ to turn a difference $d$ into a similarity. For the categorical attribute, we set the similarity to 10 if the values are equal and 0 otherwise.

In [11]:
# define attribute similarity functions

# convert a distance to a similarity
def diff_to_similarity(d):
    return np.exp(-d)

# numeric similarity
def num_similarity(v1, v2):
    d = (v1-v2)**2
    return diff_to_similarity(d)

# date similarity: convert time difference to
# numeric in 'days' units, then compute
# numeric similarity
def date_similarity(v1, v2):
    d = ((v1 - v2) / datetime.timedelta(days=1))**2
    return diff_to_similarity(d)

# categorical similarity
def cat_similarity(v1, v2):
    return 10 if v1 == v2 else 0
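
As a quick sanity check (my own addition, using made-up values rather than rows from the data), we can call these functions directly; a one-day date gap should give $\exp(-1) \approx 0.37$:

In [ ]:
# illustrative check of the attribute similarity functions
d1 = datetime.datetime(2018, 1, 28)
d2 = datetime.datetime(2018, 1, 29)
print(date_similarity(d1, d2))   # one-day gap: exp(-1), about 0.37
print(num_similarity(2.0, 2.0))  # identical numbers: exp(0) = 1
print(cat_similarity('a', 'a'))  # matching categories: 10
print(cat_similarity('a', 'b'))  # non-matching categories: 0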

Using a similarity matrix

We create a similarity matrix and iterate over rows of the two tables to fill in values of the matrix.

In [16]:
# allocate similarity matrix
sim_matrix = np.empty((len(df1), len(df2)))

# fill similarity matrix
for i in range(len(df1)):
    for j in range(len(df2)):
        s = date_similarity(df1.loc[i,'date'], df2.loc[j,'date'])
        s += cat_similarity(df1.loc[i,'cat'], df2.loc[j,'cat'])
        s += num_similarity(df1.loc[i,'num'], df2.loc[j,'num'])
        sim_matrix[i,j] = s
        
np.round(sim_matrix, 2)
Out[16]:
array([[0.00e+00, 0.00e+00, 1.00e-02, 0.00e+00, 1.00e+01, 0.00e+00,
        1.05e+00, 0.00e+00, 4.10e-01, 0.00e+00],
       [0.00e+00, 0.00e+00, 3.00e-02, 0.00e+00, 0.00e+00, 0.00e+00,
        3.60e-01, 0.00e+00, 7.30e-01, 0.00e+00],
       [3.70e-01, 1.00e+01, 1.10e+01, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+01, 0.00e+00, 3.30e-01, 0.00e+00],
       [0.00e+00, 0.00e+00, 2.70e-01, 0.00e+00, 3.70e-01, 0.00e+00,
        6.00e-02, 1.00e+01, 9.80e-01, 1.00e+01],
       [0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+00, 1.00e+01, 9.00e-02, 1.00e+01]])
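
As an aside, the double loop is fine for small tables but scales poorly. A vectorized sketch using NumPy broadcasting (my own variant, assuming the same dataframes and similarity definitions as above) builds the identical matrix without explicit loops:

In [ ]:
# vectorized alternative: broadcast each attribute across all pairs
# (dates here are whole days, so day-resolution conversion is exact)
days1 = df1['date'].values.astype('datetime64[D]').astype(float)
days2 = df2['date'].values.astype('datetime64[D]').astype(float)
date_part = np.exp(-(days1[:, None] - days2[None, :])**2)
num_part = np.exp(-(df1['num'].values[:, None] - df2['num'].values[None, :])**2)
cat_part = 10 * (df1['cat'].values[:, None] == df2['cat'].values[None, :])
sim_matrix_vec = date_part + num_part + cat_part

# should match the loop-based result
assert np.allclose(sim_matrix_vec, sim_matrix)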

We can then use this similarity matrix as needed. To turn it into a tidy data frame for use in subsequent analysis, we could do something like this:

In [37]:
similarity_df = (pd.DataFrame(data=sim_matrix) 
    .reset_index()
    .rename(columns={'index': 'df1_id'})
    .melt(id_vars = 'df1_id', var_name = 'df2_id', value_name = 'similarity'))
similarity_df.head(10)
Out[37]:
   df1_id df2_id    similarity
0       0      0  3.185822e-08
1       1      0  1.103921e-09
2       2      0  3.678794e-01
3       3      0  1.257260e-12
4       4      0  3.436725e-06
5       0      1  2.764395e-38
6       1      1  1.744767e-41
7       2      1  1.000000e+01
8       3      1  1.234098e-04
9       4      1  1.125352e-07
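
An equivalent route to the same long format (my own variant, not from the original) is stack(), which pivots the columns into a second index level directly; note the row order differs from melt (grouped by df1_id rather than df2_id):

In [ ]:
# alternative: stack() produces the same tidy frame in one chain
similarity_df_alt = (pd.DataFrame(sim_matrix)
    .rename_axis(index='df1_id', columns='df2_id')
    .stack()
    .reset_index(name='similarity'))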

To find out which row in df2 has the highest similarity to each row of df1, we can use a groupby construct as we have done previously. In this case, since the way pandas handles scope and column names inside grouped operations can be a little unclear, we iterate over the groups explicitly, pick the row with the maximum similarity from each, and build a new dataframe from the result.

In [71]:
def reduce_group(group):
    i = group.similarity.idxmax()
    return group.loc[i]

pd.DataFrame([reduce_group(group) for _, group in similarity_df.groupby('df1_id')])
Out[71]:
    df1_id df2_id  similarity
20       0      4   10.000000
41       1      8    0.734162
12       2      2   10.961680
48       3      9   10.000123
49       4      9   10.000000
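
A more compact equivalent (my own suggestion; like the loop above, ties are broken by first occurrence, which is what idxmax does) indexes the tidy frame with the per-group idxmax directly:

In [ ]:
# same result: for each df1_id, keep the row with the largest similarity
best_matches = similarity_df.loc[
    similarity_df.groupby('df1_id')['similarity'].idxmax()]
print(best_matches)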