---
title: "Example session for Weight-based deduplication"
author: "Andreas Borg, Murat Sariyar"
output: html_document
vignette: >
  %\VignetteIndexEntry{Weight-based deduplication}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
# options() returns the previous values, so the original width can be restored later
backup_options <- options(width = 60)
```

This document shows an example session using the package *RecordLinkage*. A single data set is deduplicated, with weights calculated by an EM algorithm. Linking two data sets differs only in how the record pairs are generated.

## Generating record pairs
```{r load-library, echo=FALSE, results='hide'}
library(RecordLinkage)
```

The data to be deduplicated is expected to reside in a data frame or matrix, with each row containing one record. Example data sets of 500 and 10000 records are included in the package as `RLdata500` and `RLdata10000`.
```{r load-data}
data(RLdata500)
RLdata500[1:5,]
```

For deduplication, use `compare.dedup`. In this example, blocking restricts the output to record pairs that agree in at least two of the three date-of-birth components (columns 5 to 7: year, month and day), resulting in 810 pairs. The argument `identity` supplies the true matching status for later evaluation.
```{r compare-dedup}
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(c(5,6), c(6,7), c(5,7)))
summary(pairs)
```
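
As noted in the introduction, linking two data sets differs only in how the record pairs are generated. The following is a minimal sketch using `compare.linkage`, assuming an artificial split of `RLdata500` into two halves; the split and the names `idx1`, `idx2` and `linkage_pairs` are hypothetical and serve only as illustration.

```{r compare-linkage-sketch}
library(RecordLinkage)
data(RLdata500)

# Hypothetical split: treat the two halves of RLdata500 as separate data sets
idx1 <- 1:250
idx2 <- 251:500

# Same blocking as above; the identity vectors carry the true matching status
linkage_pairs <- compare.linkage(
  RLdata500[idx1, ], RLdata500[idx2, ],
  identity1 = identity.RLdata500[idx1],
  identity2 = identity.RLdata500[idx2],
  blockfld = list(c(5, 6), c(6, 7), c(5, 7))
)
summary(linkage_pairs)
```

The resulting object can be passed through the same weighting and classification steps as the deduplication pairs.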

## Weight calculation

Weights are calculated by means of an EM algorithm. This step is computationally intensive and might take a while. Calling `hist` with `plot = FALSE` prints the bin counts of the resulting weight distribution instead of drawing the histogram.
```{r em-weights}
pairs <- emWeights(pairs)
hist(pairs$Wdata, plot = FALSE)
```
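
To give an intuition for what such a weight expresses, the following base R sketch computes a log-likelihood-ratio weight for a binary comparison pattern. It assumes conditional independence between attributes, and the m- and u-probabilities are made-up values; the function `pattern_weight` is hypothetical and is not how `emWeights` is implemented internally.

```{r weight-sketch}
# Hypothetical probabilities of agreement for three attributes:
# m = among true matches, u = among true non-matches (illustrative values only)
m <- c(0.9, 0.8, 0.95)
u <- c(0.1, 0.2, 0.05)

# Weight of a 0/1 comparison pattern, assuming conditional independence:
# agreeing attributes add log2(m / u), disagreeing ones add log2((1 - m) / (1 - u))
pattern_weight <- function(pattern, m, u) {
  sum(ifelse(pattern == 1, log2(m / u), log2((1 - m) / (1 - u))))
}

pattern_weight(c(1, 1, 1), m, u)  # full agreement
pattern_weight(c(1, 0, 1), m, u)  # partial agreement
pattern_weight(c(0, 0, 0), m, u)  # full disagreement
```

High weights indicate patterns that are far more likely among matches than among non-matches, which is why the thresholds in the next section are placed on the weight scale.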

## Classification

To determine thresholds, record pairs within a given range of weights can be printed using `getPairs`^[The output of `getPairs` is shortened in this document.]; here, pairs with weights between 20 and 30 are shown. In this case, 24 is set as the upper and -7 as the lower threshold, separating links, possible links and non-links. The summary shows the resulting contingency table and error measures.
```{r get-pairs-hidden, results='hide'}
getPairs(pairs, 30, 20)
```
```{r get-pairs-shown, echo=FALSE}
getPairs(pairs, 30, 20)[23:36,]
```
```{r em-classify}
pairs <- emClassify(pairs, threshold.upper = 24, threshold.lower = -7)
summary(pairs)
```
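
The two-threshold rule can be sketched in base R as follows. This is an illustration under the assumption that weights at or above the upper threshold count as links and weights at or above the lower threshold (but below the upper one) count as possible links; `classify_by_weight` and `example_weights` are hypothetical names, not part of the package.

```{r threshold-sketch}
# Assign "L" (link), "P" (possible link) or "N" (non-link) to each weight
classify_by_weight <- function(w, upper, lower) {
  cls <- ifelse(w >= upper, "L", ifelse(w >= lower, "P", "N"))
  factor(cls, levels = c("N", "P", "L"))
}

# Made-up weights, classified with the thresholds used above
example_weights <- c(-12.5, -7, 3.2, 24, 30.1)
classify_by_weight(example_weights, upper = 24, lower = -7)
```

Moving the thresholds closer together shrinks the set of possible links that need manual review, at the cost of more automatic decisions near the boundary.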

Reviewing the record pairs classified as possible links is facilitated by `getPairs`, which can be restricted to possible links via the argument `show`. A list of the ids of linked pairs can be extracted from the output of `getPairs` with the argument `single.rows` set to `TRUE`.
```{r final-pairs}
possibles <- getPairs(pairs, show = "possible")
possibles[1:6,]
links <- getPairs(pairs, show = "links", single.rows = TRUE)
link_ids <- links[, c("id1", "id2")]
link_ids
```
```{r cleanup, echo=FALSE, results='hide'}
options(backup_options)
```