A quick note about two ways of operating on all pairs of rows in two datasets.
We will simulate two dataframes containing a datetime attribute, a categorical and a numeric attribute. Let’s first define a function to generate dataframes.
library(lubridate)
library(tidyverse)
generate_df <- function(dummy_name, n=10,
min_date=ymd("2018/01/01"),
max_date=ymd("2018/02/15"),
cat_levels=letters[1:5],
min_num=-10,
max_num=10) {
date_range <- seq(min_date, max_date, by=1)
df <- tibble(date = sample(date_range, n, replace=TRUE),
cat = factor(sample(cat_levels, n, replace=TRUE), levels=cat_levels),
num = runif(n, min_num, max_num),
dummy=runif(n, min_num, max_num))
colnames(df)[ncol(df)] <- dummy_name
df
}
Now, generate the dataframes. We use the function rowid_to_column
to keep track of row indices as we operate on the dataframes.
set.seed(1234)
df1 <- generate_df("a", n=5) %>% rowid_to_column()
df2 <- generate_df("b", n=10) %>% rowid_to_column()
df1
df2
We will define a similarity function between rows of the two dataframes. For dates we will compute the squared difference in days between dates. For numeric values we also calculate the squared difference \(d\). In both cases we use transformation \(\exp{-d}\) to turn difference \(d\) into a similarity. For the categorical attribute, we set similarity equal to 10 if values are equal, and 0 otherwise.
diff_to_similarity <- function(d) {
exp(-d)
}
date_similarity <- function(d1, d2) {
d <- (as.integer(d1 - d2))^2
diff_to_similarity(d)
}
cat_similarity <- function(v1, v2) {
ifelse(v1 == v2, 10, 0)
}
num_similarity <- function(v1, v2) {
d <- (v1 - v2)^2
diff_to_similarity(d)
}
In the first version we create a similarity matrix and iterate over rows of the two tables to fill in values of the matrix.
sim_matrix <- matrix(NA, nrow(df1), nrow(df2))
for (i in seq(1, nrow(df1))) {
for (j in seq(1, nrow(df2))) {
s <- date_similarity(df1$date[i], df2$date[j])
s <- s + cat_similarity(df1$cat[i], df2$cat[j])
s <- s + num_similarity(df1$num[i], df2$num[j])
sim_matrix[i,j] <- s
}
}
round(sim_matrix,2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.00 0 0.01 0 10.00 0 1.05 0 0.41 0
## [2,] 0.00 0 0.03 0 0.00 0 0.36 0 0.73 0
## [3,] 0.37 10 10.96 0 0.00 0 10.00 0 0.33 0
## [4,] 0.00 0 0.27 0 0.37 0 0.06 10 0.98 10
## [5,] 0.00 0 0.00 0 0.00 0 1.00 10 0.09 10
We can then use this similarity matrix as needed. To turn this into a tidy data frame we can use in subsequent analysis we could use something like this:
similarity_df <- sim_matrix %>%
magrittr::set_colnames(seq(1,ncol(.))) %>%
as_tibble() %>%
rowid_to_column("df1_id") %>%
tidyr::gather(df2_id, similarity, -df1_id) %>%
mutate(df2_id = as.integer(df2_id))
similarity_df
To find out which row in df2
have highest similarity for each row of df1
, we can use the group_by and summarize construct as we have done previously.
similarity_df %>%
group_by(df1_id) %>%
summarize(max_sim = max(similarity), df2_match_id=df2_id[which.max(similarity)])
We can also use operations that work within a pipeline for this task. However, it is a bit more convoluted, and I include it here for reference (and because it’s fun to figure out how to do it).
First we create a data frame with all pairwise combinations of row indices from each of the two data frames.
index_df <- df1 %>%
select(df1_id="rowid") %>%
mutate(df2_id=NA) %>%
bind_rows(df2 %>%
select(df2_id="rowid") %>%
mutate(df1_id = NA)) %>%
tidyr::expand(df1_id, df2_id) %>%
tidyr::drop_na()
index_df
Next, we populate that data frame with attributes from the two dataframes by using joins on the row indices.
similarity_df <- index_df %>%
inner_join(df1 %>%
select(rowid, date.df1=date,
cat.df1=cat, num.df1=num),
by=c(df1_id = "rowid")) %>%
inner_join(df2 %>%
select(rowid, date.df2=date,
cat.df2=cat,
num.df2=num),
by=c(df2_id= "rowid"),
suffix=c(".ind", ".df2"))
With all relevant attributes in one data frame we can compute similarity as before.
similarity_df <- similarity_df %>%
mutate(date_sim = date_similarity(date.df1, date.df2)) %>%
mutate(cat_sim = cat_similarity(cat.df1, cat.df2)) %>%
mutate(num_sim = num_similarity(num.df1, num.df2)) %>%
mutate(similarity = date_sim + cat_sim + num_sim) %>%
select(df1_id, df2_id, similarity)
As in the first option, we can use the group_by and summarize construct.
similarity_df %>%
group_by(df1_id) %>%
summarize(max_sim = max(similarity), df2_match_id=df2_id[which.max(similarity)])