Leveraging diverse data sources: An introduction to entity resolution

By Gauri Kamat

In a world where data is becoming increasingly abundant, how can we leverage multiple data sources to answer questions of scientific interest?

This is exactly where entity resolution comes in.

What is entity resolution? Why does it matter?

Entity Resolution (ER) is a technique that identifies duplicate entities within or across data sources. ER within the same database is commonly called “deduplication”, while ER across multiple databases is called “record linkage”. When a unique identifier (like a social security number) is available, ER is a fairly easy task. When such an identifier is not available, often for data-privacy reasons, ER becomes considerably more complex.

ER augments existing databases with data from additional sources. This allows users to perform new analyses without the added cost of conducting surveys or collecting more data. ER has found applications across multiple domains, including e-commerce, human rights research, and healthcare. A recent application involves correctly counting casualties in the Salvadoran civil war by applying ER to retrospective mortality surveys.

Another interesting application is deduplicating inventor names in a patents database maintained by the U.S. Patent and Trademark Office. ER has also been used in e-commerce by multi-seller platforms like Amazon and Walmart to identify duplicate product listings on their websites.

Deterministic and Probabilistic ER

Given two databases A and B of sizes n and m, the first step in ER is to consider every pair of records from the two databases, forming a comparison space of n × m record pairs. The next step is to compare attributes for each record pair.

Deterministic ER methods rely on exact agreement on all attributes of a record pair. For instance, say we are comparing record a from file A and record b from file B.

Further, suppose that the comparison is based on two attributes: gender and zipcode. A deterministic rule declares (a, b) to be a link if gender_a = gender_b and zipcode_a = zipcode_b. This works well if all attributes are categorical. If we instead compare a textual attribute like name, deterministic linking may produce errors. For example, if name_a = Andrew and name_b = Andy, then (a, b) will be declared a non-link, even though the two names have multiple letters in common, and one may simply be shorthand for the other. The diagram below shows a simple deterministic ER rule.
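To make this concrete in code, here is a minimal Python sketch of such a rule; the records and field names are invented for illustration:

```python
from itertools import product

# Toy databases; record layout and field names are invented for illustration.
file_a = [{"name": "Andrew", "gender": "M", "zipcode": "27708"}]
file_b = [{"name": "Andy",   "gender": "M", "zipcode": "27708"},
          {"name": "Maria",  "gender": "F", "zipcode": "10001"}]

def deterministic_link(a, b):
    # Exact agreement required on every compared attribute.
    return (a["gender"] == b["gender"]
            and a["zipcode"] == b["zipcode"]
            and a["name"] == b["name"])

# Step 1: form all n * m record pairs; step 2: apply the rule to each.
for a, b in product(file_a, file_b):
    status = "link" if deterministic_link(a, b) else "non-link"
    print(a["name"], "/", b["name"], "->", status)
# Andrew / Andy -> non-link, even though "Andy" is shorthand for "Andrew"
```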

Equating textual data as above is bound to produce incorrect comparisons. What we need instead is something that considers partial levels of agreement. This is where probabilistic ER comes into the picture. In probabilistic ER, every pair (a, b) is assigned a probability of being a link, based on (1) how many attributes agree; and (2) how well they agree. For example, if gender_a = gender_b and zipcode_a = zipcode_b, and name_a and name_b are fairly close, then (a, b) will be assigned a high probability of being a link. If gender_a = gender_b and zipcode_a = zipcode_b, but name_a and name_b are poles apart (e.g. “John” and “Francesca”), then this probability will be much lower. For assessing “closeness” on textual attributes, probabilistic ER relies on string distance metrics, such as the Jaro-Winkler or Levenshtein distance. The diagram below shows a simple probabilistic ER rule.
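A minimal sketch of a probabilistic-style score is shown below. It uses Python's standard-library difflib as a simple stand-in for string distance metrics like Jaro-Winkler, and the attribute weights are arbitrary illustrative choices, not quantities estimated from data:

```python
import difflib

def string_similarity(s1, s2):
    # Similarity in [0, 1]; difflib's ratio stands in for metrics
    # like Jaro-Winkler that are used in real ER systems.
    return difflib.SequenceMatcher(None, s1.lower(), s2.lower()).ratio()

def link_score(a, b):
    # Weighted combination of exact and partial agreement.
    # The weights are arbitrary choices for illustration.
    score = 0.6 * string_similarity(a["name"], b["name"])
    score += 0.2 * (a["gender"] == b["gender"])
    score += 0.2 * (a["zipcode"] == b["zipcode"])
    return score

a = {"name": "Andrew",    "gender": "M", "zipcode": "27708"}
b = {"name": "Andy",      "gender": "M", "zipcode": "27708"}
c = {"name": "Francesca", "gender": "M", "zipcode": "27708"}

print(round(link_score(a, b), 2))  # higher: names are fairly close
print(round(link_score(a, c), 2))  # lower: names are poles apart
```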

A popular method for probabilistic ER is the Fellegi-Sunter model. This model assumes that in any given database, the links and non-links are generated from a “mixture” distribution. The parameters of this mixture distribution are estimated using a statistical technique called Expectation-Maximization. The model can account for simple attributes (like gender) as well as attributes that require partial levels of agreement (like strings). The Fellegi-Sunter model is also readily implemented in standard statistical software, via the RecordLinkage package in R and the recordlinkage package in Python.
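As a sketch of what this looks like in practice, the example below uses the Python recordlinkage package, whose ECMClassifier fits a Fellegi-Sunter-style mixture model via Expectation-Maximization. The toy data frames are invented for illustration; a real application would need far more records for the EM estimates to be stable:

```python
import pandas as pd
import recordlinkage

# Toy versions of databases A and B.
df_a = pd.DataFrame({"name": ["Andrew", "Maria"],
                     "gender": ["M", "F"],
                     "zipcode": ["27708", "10001"]})
df_b = pd.DataFrame({"name": ["Andy", "Maria"],
                     "gender": ["M", "F"],
                     "zipcode": ["27708", "10001"]})

# Form the full n * m comparison space of candidate pairs.
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df_a, df_b)

# Compare attributes: exact agreement for gender and zipcode,
# thresholded Jaro-Winkler similarity for name.
compare = recordlinkage.Compare()
compare.exact("gender", "gender", label="gender")
compare.exact("zipcode", "zipcode", label="zipcode")
compare.string("name", "name", method="jarowinkler",
               threshold=0.85, label="name")
features = compare.compute(pairs, df_a, df_b)

# Unsupervised Fellegi-Sunter-style mixture model, fit via EM.
ecm = recordlinkage.ECMClassifier()
links = ecm.fit_predict(features)
print(links)
```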

Challenges in Entity Resolution

Entity resolution naturally comes with some challenges worth thinking about.

Scalability

A major issue with ER algorithms in practice is their scalability. The comparison space scales quadratically with the sizes of the individual databases: two databases with n and m records produce n × m candidate pairs. Direct implementations of methods like Fellegi-Sunter therefore struggle even with moderately large databases. Pre-processing steps like blocking and indexing, which restrict comparisons to records that agree on some inexpensive criterion, are possible solutions. However, there is still a lack of more efficient alternatives, and computational advances in ER are a major ongoing research area.
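The sketch below illustrates the idea behind blocking: candidate pairs are generated only within groups of records that share a blocking key (here, zipcode), which can shrink the comparison space dramatically. The records are invented for illustration:

```python
from collections import defaultdict
from itertools import product

file_a = [{"name": "Andrew", "zipcode": "27708"},
          {"name": "Maria",  "zipcode": "10001"}]
file_b = [{"name": "Andy",      "zipcode": "27708"},
          {"name": "Francesca", "zipcode": "94110"}]

def block_by(records, key):
    # Group records into blocks keyed on the blocking attribute.
    blocks = defaultdict(list)
    for r in records:
        blocks[r[key]].append(r)
    return blocks

blocks_a = block_by(file_a, "zipcode")
blocks_b = block_by(file_b, "zipcode")

# Candidate pairs are formed only within blocks shared by both files.
candidates = [(a, b)
              for z in blocks_a.keys() & blocks_b.keys()
              for a, b in product(blocks_a[z], blocks_b[z])]

print(len(candidates))  # 1 candidate pair instead of 2 * 2 = 4
```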

Data quality

ER relies on contrasting attributes between record pairs, so it is essential that these attributes are recorded on a comparable scale in all the databases. However, when dealing with real, messy data, we may face inconsistencies across databases due to abbreviations, cultural name variations, and so on. For instance, database A may record names in the form “John Doe” while database B may just record “JD”. Along similar lines, one database may record the city as “Washington” while the other may just record “DC”. Standardizing such attributes before linking is therefore an important pre-processing step.
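A minimal pre-processing sketch along these lines is shown below; the normalization steps and the alias table are illustrative assumptions, not a standard recipe:

```python
import re

# Toy alias table; real pipelines use much richer dictionaries and tools.
CITY_ALIASES = {"dc": "washington", "nyc": "new york"}

def normalize(value):
    value = value.lower().strip()
    value = re.sub(r"[^\w\s]", "", value)  # drop punctuation
    value = re.sub(r"\s+", " ", value)     # collapse whitespace
    return CITY_ALIASES.get(value, value)

print(normalize("Washington"))  # washington
print(normalize("D.C."))        # washington
```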

Lack of informative attributes

To perform ER, we need attributes that can help identify whether two records are the same. For example, if two records share the same gender, we cannot conclude that they refer to the same person. However, if two records share the same name, we may be more confident that they refer to the same person. We thus need informative attributes to perform reasonably accurate ER. Often, confidentiality restrictions bar access to the most informative attributes, like names and addresses, which makes ER difficult.

Summary

Access to publicly available datasets is becoming increasingly common. Linking multiple databases can provide enormous benefits for data analysis. The goal of ER is to create a single, consolidated database by linking records across databases or eliminating duplicates within them. This can be done using several techniques, such as deterministic or probabilistic matching. The choice of technique is determined by the kinds of attributes at hand and the desired level of linking accuracy. There are several software implementations of probabilistic ER in R and Python, which enable users to expand their data sources and perform more informed data analysis.
