Regression with linked data sets subject to linkage error

Data are often collected from multiple heterogeneous sources and combined afterwards. In combining data, record linkage is the essential task of linking records across data sets that refer to the same entity. Record linkage is generally not error-free: records belonging to different entities may be linked, and records belonging to the same entity may fail to be linked. Simply ignoring such errors is not advisable, because they can contaminate the data and introduce bias in sample selection or estimation, which in turn can lead to misleading statistical results and conclusions. For a long time this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis of linked data sets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so-called "Broken Sample" problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.

  • Keywords: Data Integration, Record Linkage, Linkage Error, Regression, Mixture Models, Bayesian Analysis
Zhen-bang Wang
PhD Student