# 19 Entity Resolution and Record Linkage

We have discussed how we model datasets using *entities* and the *attributes* that characterize them. Very often, we will be faced with the problem of *data integration* where we want to combine two (or more) datasets from different sources, especially when they may contain information about the same *entities*. The challenge here is that the *attributes* in the two datasets may not be named the same, and more problematic, values for the *same* entity may be different in the two datasets.

Here are some examples:

Suppose we are combining data from one dataset with a

`Person`

table containing attributes`FirstName`

and`LastName`

with another dataset with`People`

table containing attributes`FirstName`

and`Surname`

.Suppose there is a row

`<John, Katz>`

in the first dataset and row`<Johnathan, Katz>`

in the second. They may refer to the same person, should we combine or*link*these rows when we combine these datasets?Even trickier, suppose there is a row

`<John, Katz>`

in the first, and row`<Johnathan, Kats>`

in the second?

These are examples of a general problem referred to as **Entity Resolution** and **Record Linkage**. We can define the general problem as follows:

## 19.1 Problem Definition

**Given**: Entity sets \(E_1\) and \(E_2\),

**Find**: Linked entities \((e_1,e_2)\) with \(e_1 \in E_1\) and \(e_2 \in E_2\).

One general strategy to solve this problem is to define a *similarity* function between entities \(e_1\) and \(e_2\) and link entities with high similarity.

## 19.2 One approach: similarity function

A common way of defining this *similarity* function \(s(e_1,e_2)\) is to define it as an *additive* function over some set of shared attributes \(A\):

\[ s(e_1,e_2) = \sum_{j \in A} s_j(e_1[j], e_2[j]) \]

with \(s_j\) a similarity function defined for *each* attribute \(j\), itself depending on the *type* of attribute \(j\). Here are some examples:

### 19.2.1 Example attribute functions

**Categorical attribute**: Here we can specify \(s_j\) to state that pairs of entities with the same value are more similar to each other than pairs of entities with different values. E.g.,

\[ s_j(e_1[j],e_2[j]) = \begin{cases} 1 & \mathrm{ if } \; e_1[j] == e_2[j] \\ 0 & \mathrm { o.w. } \end{cases} \]

**Continuous attribute**: Here we can specify \(s_j\) to state that pairs of entities with values that are *close* to each other are more similar than pairs of entities with values that are *farther* to each other. Note that to specify *close* or *far* we need to introduce some notion of *distance*. We can use Euclidean distance for example,

\[
d_j(e_1[j],e_2[j]) = (e_1[j] - e_2[j])^2 \\
s_j(e_1[j],e_2[j]) = e^{-d_j(e_1[j],e_2[j])}
\]
**Text attributes**: Here we can use a similar idea but based on edit distance between strings rather than Euclidean distance. Note, however, that often we can use domain knowledge to specify similarity. For example, the fact that `John`

and `Johnathan`

are similar requires domain knowledge of common usage of English names.

## 19.3 Solving the resolution problem

Equipped with a similarity function \(s(e_1,e_2)\), we now need a rule to match entities we think are linked. This depends on assumptions we make about the dataset, similar to assumptions we made when performing joins. In general, we model the entity resolution problem as an *optimization* problem, where we have an *objective function* (based on similarity) that we want to maximize over possible sets \(V\) of *valid* pairs \((e_1,e_2)\), where set \(V\) constraints pairs based on problem-specific assumptions.

Thus, in general, we want to solve the problem

\[ $R$ = \arg \max_{V} \sum_{(e_1,e_2) \in V} s(e_1,e_2) \]

### 19.3.1 Many-to-one resolutions

Suppose we first constrain sets \(V\) to represent many-to-one resolutions. That is, suppose we assume that we want to link every entity \(e_1 \in E_1\) to some entity \(e_2 \in E_2\), allowing many-to-one linking. Thus, entities in \(e_1\) can only appear once in pairs in \(V\), but entities \(e_2\) may appear more than once. In this case, we can match \((e_1,e_2)\) where

\[ e_2 = \arg \max_{e \in E_2} s(e_1,e) \]

That is, the entity in \(E_2\) with highest similarity in \(E_1\).

### 19.3.2 One-to-one resolutions

Suppose we constrain sets \(V\) to those that represent one-to-one resolutions. That is, suppose we want to link every entity \(e_1 \in E_1\), but in a one-to-one matching with entities in \(E_2\). Thus if \((e_1,e_2) \in V\) then \(e_1\) and \(e_2\) appear in only one pair in \(V\). In this case, we have a harder computational problem. In fact, this is an instance of the *maximum bipartite matching problem*, and would look at network flow algorithms to solve.

### 19.3.3 Other constraints

We can add additional constraints to \(V\) to represent other information we have about the task. A common one would be to only allow pairs \((e_1,e_2) \in V\) to have similarity above some threshold \(t\). I.e., \((e_1, e_2) \in V\) only if \(s(e_1,e_2) \geq t\).

## 19.4 Discussion

The procedure outlined above is an excellent first attempt to solve the Entity Resolution problem. This is a classical problem in Data Science for which a variety of approaches and methods are in use.