Record linkage refers to a variety of algorithmic and statistical methods for finding entries related to the same entity in different, usually large, data sets. De-duplication generalizes this task to the case in which records belonging to the same entity appear in the same data set. In this talk I discuss a method for performing de-duplication via a latent entity model, in which the observed records, usually consisting of categorical variables, are perturbed versions of a set of key variables drawn from a finite population of N different entities.
The population size N is also treated as unknown. As a result, a salient feature of the proposed method is its ability to account for de-duplication uncertainty when estimating the population size. As by-products of the approach, I illustrate the relationship between de-duplication problems and capture-recapture models and obtain a more suitable prior distribution on the linkage structure. On the computational side, a novel algorithm is proposed to sample from the posterior distribution of the matching configuration, based on marginalizing the key variables at the population level.
The performance of the proposed method is illustrated using two synthetic data sets consisting of German names. Finally, a real-data application is presented, using records from two lists containing information on deaths in the recent Syrian conflict.
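
As a rough illustration of the latent entity model described above, the following Python sketch simulates records as perturbed copies of categorical key variables attached to a finite population of N entities. It is only a toy under stated assumptions, not the model or algorithm presented in the talk: the function name `simulate_records`, its parameters, the uniform choice of the entity behind each record, and the uniform redraw of a distorted field are all hypothetical choices made for the example.

```python
import random


def simulate_records(n_entities, n_records, n_fields=3, n_categories=5,
                     distortion_prob=0.1, seed=0):
    """Toy generator: records are noisy copies of latent entity key variables.

    All modelling choices here (uniform sampling, uniform distortion) are
    assumptions made for illustration only.
    """
    rng = random.Random(seed)

    # Latent key variables: one categorical vector per population entity.
    entities = [[rng.randrange(n_categories) for _ in range(n_fields)]
                for _ in range(n_entities)]

    records, labels = [], []
    for _ in range(n_records):
        j = rng.randrange(n_entities)  # latent entity behind this record
        # Each field is redrawn uniformly with probability distortion_prob
        # (the redraw may coincide with the true value); otherwise it is
        # copied from the entity's key variables.
        record = [
            rng.randrange(n_categories)
            if rng.random() < distortion_prob else value
            for value in entities[j]
        ]
        records.append(record)
        labels.append(j)
    return entities, records, labels


entities, records, labels = simulate_records(n_entities=20, n_records=30)
n_observed = len(set(labels))
print(f"{len(records)} records refer to {n_observed} distinct entities (N = 20)")
```

In this sketch the vector `labels` plays the role of the linkage structure: records sharing a label are duplicates of one another, and the number of distinct labels is the number of population entities actually observed, which is the quantity that ties de-duplication to capture-recapture estimation of N.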