19 Entity Resolution and Record Linkage

We have discussed how we model datasets using entities and the attributes that characterize them. Very often, we will be faced with the problem of data integration where we want to combine two (or more) datasets from different sources, especially when they may contain information about the same entities. The challenge here is that the attributes in the two datasets may not be named the same, and more problematic, values for the same entity may be different in the two datasets.

Here are some examples:

  • Suppose we are combining data from one dataset with a Person table containing attributes FirstName and LastName with another dataset with People table containing attributes FirstName and Surname.

  • Suppose there is a row <John, Katz> in the first dataset and row <Johnathan, Katz> in the second. They may refer to the same person, should we combine or link these rows when we combine these datasets?

  • Even trickier, suppose there is a row <John, Katz> in the first, and row <Johnathan, Kats> in the second?

These are examples of a general problem referred to as Entity Resolution and Record Linkage. We can define the general problem as follows:

19.1 Problem Definition

Given: Entity sets \(E_1\) and \(E_2\),

Find: Linked entities \((e_1,e_2)\) with \(e_1 \in E_1\) and \(e_2 \in E_2\).

One general strategy to solve this problem is to define a similarity function between entities \(e_1\) and \(e_2\) and link entities with high similarity.

19.2 One approach: similarity function

A common way of defining this similarity function \(s(e_1,e_2)\) is to define it as an additive function over some set of shared attributes \(A\):

\[ s(e_1,e_2) = \sum_{j \in A} s_j(e_1[j], e_2[j]) \]

with \(s_j\) a similarity function defined for each attribute \(j\), itself depending on the type of attribute \(j\). Here are some examples:

19.2.1 Example attribute functions

Categorical attribute: Here we can specify \(s_j\) to state that pairs of entities with the same value are more similar to each other than pairs of entities with different values. E.g.,

\[ s_j(e_1[j],e_2[j]) = \begin{cases} 1 & \mathrm{ if } \; e_1[j] == e_2[j] \\ 0 & \mathrm { o.w. } \end{cases} \]

Continuous attribute: Here we can specify \(s_j\) to state that pairs of entities with values that are close to each other are more similar than pairs of entities with values that are farther to each other. Note that to specify close or far we need to introduce some notion of distance. We can use Euclidean distance for example,

\[ d_j(e_1[j],e_2[j]) = (e_1[j] - e_2[j])^2 \\ s_j(e_1[j],e_2[j]) = e^{-d_j(e_1[j],e_2[j])} \] Text attributes: Here we can use a similar idea but based on edit distance between strings rather than Euclidean distance. Note, however, that often we can use domain knowledge to specify similarity. For example, the fact that John and Johnathan are similar requires domain knowledge of common usage of English names.

19.3 Solving the resolution problem

Equipped with a similarity function \(s(e_1,e_2)\), we now need a rule to match entities we think are linked. This depends on assumptions we make about the dataset, similar to assumptions we made when performing joins. In general, we model the entity resolution problem as an optimization problem, where we have an objective function (based on similarity) that we want to maximize over possible sets \(V\) of valid pairs \((e_1,e_2)\), where set \(V\) constraints pairs based on problem-specific assumptions.

Thus, in general, we want to solve the problem

\[ $R$ = \arg \max_{V} \sum_{(e_1,e_2) \in V} s(e_1,e_2) \]

19.3.1 Many-to-one resolutions

Suppose we first constrain sets \(V\) to represent many-to-one resolutions. That is, suppose we assume that we want to link every entity \(e_1 \in E_1\) to some entity \(e_2 \in E_2\), allowing many-to-one linking. Thus, entities in \(e_1\) can only appear once in pairs in \(V\), but entities \(e_2\) may appear more than once. In this case, we can match \((e_1,e_2)\) where

\[ e_2 = \arg \max_{e \in E_2} s(e_1,e) \]

That is, the entity in \(E_2\) with highest similarity in \(E_1\).

19.3.2 One-to-one resolutions

Suppose we constrain sets \(V\) to those that represent one-to-one resolutions. That is, suppose we want to link every entity \(e_1 \in E_1\), but in a one-to-one matching with entities in \(E_2\). Thus if \((e_1,e_2) \in V\) then \(e_1\) and \(e_2\) appear in only one pair in \(V\). In this case, we have a harder computational problem. In fact, this is an instance of the maximum bipartite matching problem, and would look at network flow algorithms to solve.

19.3.3 Other constraints

We can add additional constraints to \(V\) to represent other information we have about the task. A common one would be to only allow pairs \((e_1,e_2) \in V\) to have similarity above some threshold \(t\). I.e., \((e_1, e_2) \in V\) only if \(s(e_1,e_2) \geq t\).

19.4 Discussion

The procedure outlined above is an excellent first attempt to solve the Entity Resolution problem. This is a classical problem in Data Science for which a variety of approaches and methods are in use.