A quick note about two ways of operating on all pairs of rows in two datasets.

## Data generation

We will simulate two dataframes containing a datetime attribute, a categorical and a numeric attribute. Let’s first define a function to generate dataframes.

library(lubridate)
library(tidyverse)

generate_df <- function(dummy_name, n=10,
min_date=ymd("2018/01/01"),
max_date=ymd("2018/02/15"),
cat_levels=letters[1:5],
min_num=-10,
max_num=10) {
date_range <- seq(min_date, max_date, by=1)
df <- tibble(date = sample(date_range, n, replace=TRUE),
cat = factor(sample(cat_levels, n, replace=TRUE), levels=cat_levels),
num = runif(n, min_num, max_num),
dummy=runif(n, min_num, max_num))
colnames(df)[ncol(df)] <- dummy_name
df
}

Now, generate the dataframes. We use the function rowid_to_column to keep track of row indices as we operate on the dataframes.

set.seed(1234)
df1 <- generate_df("a", n=5) %>% rowid_to_column()
df2 <- generate_df("b", n=10) %>% rowid_to_column()

df1
df2

## Similarity functions

We will define a similarity function between rows of the two dataframes. For dates we will compute the squared difference in days between dates. For numeric values we also calculate the squared difference $$d$$. In both cases we use transformation $$\exp{-d}$$ to turn difference $$d$$ into a similarity. For the categorical attribute, we set similarity equal to 10 if values are equal, and 0 otherwise.

diff_to_similarity <- function(d) {
exp(-d)
}

date_similarity <- function(d1, d2) {
d <- (as.integer(d1 - d2))^2
diff_to_similarity(d)
}

cat_similarity <- function(v1, v2) {
ifelse(v1 == v2, 10, 0)
}

num_similarity <- function(v1, v2) {
d <- (v1 - v2)^2
diff_to_similarity(d)
}

## Version 1: Using matrices

In the first version we create a similarity matrix and iterate over rows of the two tables to fill in values of the matrix.

sim_matrix <- matrix(NA, nrow(df1), nrow(df2))

for (i in seq(1, nrow(df1))) {
for (j in seq(1, nrow(df2))) {
s <- date_similarity(df1$date[i], df2$date[j])
s <- s + cat_similarity(df1$cat[i], df2$cat[j])
s <- s + num_similarity(df1$num[i], df2$num[j])
sim_matrix[i,j] <- s
}
}

round(sim_matrix,2)
##      [,1] [,2]  [,3] [,4]  [,5] [,6]  [,7] [,8] [,9] [,10]
## [1,] 0.00    0  0.01    0 10.00    0  1.05    0 0.41     0
## [2,] 0.00    0  0.03    0  0.00    0  0.36    0 0.73     0
## [3,] 0.37   10 10.96    0  0.00    0 10.00    0 0.33     0
## [4,] 0.00    0  0.27    0  0.37    0  0.06   10 0.98    10
## [5,] 0.00    0  0.00    0  0.00    0  1.00   10 0.09    10

We can then use this similarity matrix as needed. To turn this into a tidy data frame we can use in subsequent analysis we could use something like this:

similarity_df <- sim_matrix %>%
magrittr::set_colnames(seq(1,ncol(.))) %>%
as_tibble() %>%
rowid_to_column("df1_id") %>%
tidyr::gather(df2_id, similarity, -df1_id) %>%
mutate(df2_id = as.integer(df2_id))
similarity_df

To find out which row in df2 have highest similarity for each row of df1, we can use the group_by and summarize construct as we have done previously.

similarity_df %>%
group_by(df1_id) %>%
summarize(max_sim = max(similarity), df2_match_id=df2_id[which.max(similarity)])

## Version 2: Using data frames

We can also use operations that work within a pipeline for this task. However, it is a bit more convoluted, and I include it here for reference (and because it’s fun to figure out how to do it).

First we create a data frame with all pairwise combinations of row indices from each of the two data frames.

index_df <- df1 %>%
select(df1_id="rowid") %>%
mutate(df2_id=NA) %>%
bind_rows(df2 %>%
select(df2_id="rowid") %>%
mutate(df1_id = NA)) %>%
tidyr::expand(df1_id, df2_id) %>%
tidyr::drop_na()
index_df

Next, we populate that data frame with attributes from the two dataframes by using joins on the row indices.

similarity_df <- index_df %>%
inner_join(df1 %>%
select(rowid, date.df1=date,
cat.df1=cat, num.df1=num),
by=c(df1_id = "rowid")) %>%
inner_join(df2 %>%
select(rowid, date.df2=date,
cat.df2=cat,
num.df2=num),
by=c(df2_id= "rowid"),
suffix=c(".ind", ".df2"))

With all relevant attributes in one data frame we can compute similarity as before.

similarity_df <- similarity_df %>%
mutate(date_sim = date_similarity(date.df1, date.df2)) %>%
mutate(cat_sim = cat_similarity(cat.df1, cat.df2)) %>%
mutate(num_sim = num_similarity(num.df1, num.df2)) %>%
mutate(similarity = date_sim + cat_sim + num_sim) %>%
select(df1_id, df2_id, similarity)

As in the first option, we can use the group_by and summarize construct.

similarity_df %>%
group_by(df1_id) %>%
summarize(max_sim = max(similarity), df2_match_id=df2_id[which.max(similarity)])