Probabilistic Record Linkage: A Short Tutorial

Contents

Slide 2

Slide 3

Record linkage: definition

Record linkage: determine if pairs of data records describe the same entity
I.e., find record pairs that are co-referent
Entities: usually people (or organizations or…)
Data records: names, addresses, job titles, birth dates, …
Main applications:
Joining two heterogeneous relations
Removing duplicates from a single relation
Slide 4

Record linkage: terminology

The term “record linkage” is possibly co-referent with:
For DB people: data matching, merge/purge, duplicate detection, data cleansing, ETL (extraction, transformation, and loading), de-duping
For AI/ML people: reference matching, database hardening
In NLP: co-reference/anaphora resolution
Statistical matching, clustering, language modeling, …
Slide 5

Record linkage: approaches

Probabilistic linkage
This tutorial
Deterministic linkage
Test equality of normalized versions of records
Normalization loses information
Very fast when it works!
Hand-coded rules for an “acceptable match”
e.g. “same SSNs, or same zipcode, birthdate, and Soundex code for last name”
difficult to tune, can be expensive to test
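As a rough sketch of such a hand-coded rule; the record fields and the simplified Soundex implementation below are illustrative assumptions, not part of the tutorial:

```python
# Sketch of a hand-coded deterministic rule: "same SSNs, or same zipcode,
# birthdate, and Soundex code for last name". Field names are hypothetical.

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # skip vowels and repeated codes
            out += code
        prev = code
    return (out + "000")[:4]           # pad/truncate to 4 characters

def acceptable_match(a: dict, b: dict) -> bool:
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    return (a["zipcode"] == b["zipcode"]
            and a["birthdate"] == b["birthdate"]
            and soundex(a["last_name"]) == soundex(b["last_name"]))
```

Note the brittleness: a single typo in the zipcode or birthdate silently drops a true match, which is exactly why such rules are hard to tune.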
Slide 6

Record linkage: goals/directions

Toolboxes vs. black boxes:
To what extent is record linkage an interactive, exploratory, data-driven process? To what extent is it done by a hands-off, turn-key, autonomous system?
General-purpose vs. domain-specific:
To what extent is the method specific to a particular domain? (e.g., Australian mailing addresses, scientific bibliography entries, …)
Slide 7

Record linkage tutorial: outline

Introduction: definition and terms, etc.
Overview of the Fellegi-Sunter model
Classify pairs as link/non-link
Main issues in the Fellegi-Sunter model
Some design decisions
from the original Fellegi-Sunter paper
other possibilities
Slide 8

Fellegi-Sunter: notation

Two sets to link: A and B
A × B = {(a,b) : a ∈ A, b ∈ B} = M ∪ U
M = matched pairs, U = unmatched pairs
Record for a ∈ A is α(a), for b ∈ B is β(b)
Comparison vector, written γ(a,b), contains “comparison features” (e.g., “last names are same”, “birthdates are same year”, …)
γ(a,b) = ⟨γ1(α(a),β(b)), …, γK(α(a),β(b))⟩
Comparison space Γ = range of γ(a,b)
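A comparison vector like this can be sketched in a few lines of Python; the record fields and the three features below are hypothetical examples, not from the tutorial:

```python
# Sketch of a comparison vector γ(a, b): each component is a boolean
# "comparison feature" computed from the two records' fields.
# Field names are illustrative assumptions.

def gamma(a: dict, b: dict) -> tuple:
    return (
        a["last_name"] == b["last_name"],          # last names are same
        a["birthdate"][:4] == b["birthdate"][:4],  # birthdates are same year
        a["zipcode"] == b["zipcode"],              # zip codes are same
    )

a = {"last_name": "Cohen", "birthdate": "1970-03-15", "zipcode": "15213"}
b = {"last_name": "Cohen", "birthdate": "1970-11-02", "zipcode": "90210"}
# gamma(a, b) == (True, True, False)
```

Here the comparison space Γ is {True, False}³, i.e., eight possible vectors.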
Slide 9

Fellegi-Sunter: notation

Three actions on (a,b):
A1: treat (a,b) as a match
A2: treat (a,b) as uncertain
A3: treat (a,b) as a non-match
A linkage rule is a function
L: Γ → {A1, A2, A3}
Assume a distribution D over A × B:
m(γ) = PrD( γ(a,b) | (a,b) ∈ M )
u(γ) = PrD( γ(a,b) | (a,b) ∈ U )
Slide 10

Fellegi-Sunter: main result

Suppose we sort all γ’s by m(γ)/u(γ) in decreasing order, and pick n < n′ so that

Σ_{i=1}^{n} u(γi) = μ and Σ_{i=n′}^{N} m(γi) = λ

Then the best* linkage rule with Pr(A1|U) = μ and Pr(A3|M) = λ is: A1 for γ1,…,γn; A2 for γn+1,…,γn′−1; A3 for γn′,…,γN

*Best = minimal Pr(A2)
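A minimal sketch of this threshold rule, assuming m(γ) and u(γ) are already known as lookup tables. (Unlike the exact theorem, which may need randomization to hit μ and λ exactly, this version simply stops just below each error budget.)

```python
# Sketch of the Fellegi-Sunter threshold rule. gammas lists the possible
# comparison vectors; m and u map each vector to m(γ) and u(γ);
# mu and lam are the tolerated error rates Pr(A1|U) and Pr(A3|M).

def fs_rule(gammas, m, u, mu, lam):
    order = sorted(gammas, key=lambda g: m[g] / u[g], reverse=True)
    action = {}
    # From the top of the list: link (A1) while the false-link mass fits in mu.
    u_mass = 0.0
    for g in order:
        if u_mass + u[g] > mu:
            break
        u_mass += u[g]
        action[g] = "A1"
    # From the bottom: non-link (A3) while the missed-match mass fits in lam.
    m_mass = 0.0
    for g in reversed(order):
        if g in action or m_mass + m[g] > lam:
            break
        m_mass += m[g]
        action[g] = "A3"
    # Everything in between is uncertain (A2).
    for g in order:
        action.setdefault(g, "A2")
    return action
```

With three vectors whose m/u ratios are 7, 0.67, and 0.17 and budgets μ = λ = 0.15, the rule links the first, rejects the last, and leaves the middle one uncertain.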

Slide 11

Fellegi-Sunter: main result

Intuition: consider changing the action for some γi in the list, e.g. from A1 to A2.
To keep μ constant, swap some γj from A2 to A1.
…but if u(γj) = u(γi), then m(γj) ≤ m(γi), since γj comes later in the sorted order; so after the swap, Pr(A2) is increased by m(γi) − m(γj) ≥ 0

[Figure: the sorted list γ1,…,γi,…,γn, γn+1,…,γj,…,γn′−1, γn′,…,γN, with region A1 (m(γ)/u(γ) large) on the left, A2 in the middle, and A3 (m(γ)/u(γ) small) on the right; γi has ratio mi/ui, γj has ratio mj/uj]

Slide 12

Fellegi-Sunter: main result

Allowing ranking rules to be probabilistic means that one can achieve any Pareto-optimal combination of μ, λ with this sort of threshold rule
Essentially the same result is known as the probability ranking principle in information retrieval (Robertson ’77)
PRP is not always the “right thing” to do: e.g., suppose the user just wants a few relevant documents
Similar cases may occur in record linkage: e.g., we just want to find matches that lead to re-identification
Slide 13

Main issues in the F-S model

Modeling and training:
How do we estimate m(γ), u(γ)?
Making decisions with the model:
How do we set the thresholds μ and λ?
Feature engineering:
What should the comparison space Γ be?
Distance metrics for text fields
Normalizing/parsing text fields
Efficiency issues:
How do we avoid looking at |A| * |B| pairs?
Slide 14

Issues for F-S: modeling and training

How do we estimate m(γ), u(γ)?
Independence assumptions on γ = ⟨γ1,…,γK⟩
Specifically, assume γi, γj are independent given the class (M or U) - the naïve Bayes assumption
Don’t assume training data (!)
Instead look at chance of agreement on “random pairings”
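Under the naïve Bayes assumption, m(γ)/u(γ) factors into a product of per-feature ratios, so the F-S score is a sum of per-feature log-odds weights. A sketch with illustrative (not estimated) per-feature probabilities:

```python
import math

# Under the naive Bayes assumption, m(γ)/u(γ) = Π_i m_i(γ_i)/u_i(γ_i),
# so log m(γ)/u(γ) is a sum of per-feature agreement/disagreement weights.
# The m_i, u_i numbers below are made up for illustration.

m_i = {"last_name": 0.9, "birth_year": 0.8}    # Pr(feature agrees | M)
u_i = {"last_name": 0.01, "birth_year": 0.05}  # Pr(feature agrees | U)

def fs_score(agreement: dict) -> float:
    """log m(γ)/u(γ) for a vector of per-feature agree/disagree flags."""
    score = 0.0
    for f, agrees in agreement.items():
        if agrees:
            score += math.log(m_i[f] / u_i[f])
        else:
            score += math.log((1 - m_i[f]) / (1 - u_i[f]))
    return score
```

Agreement on both features pushes the score strongly positive; disagreement on both pushes it negative, which is what the threshold rule then acts on.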
Slide 15

Issues for F-S: modeling and training

Notation for “Method 1”:
pS(j) = empirical probability estimate for name j in set S (where S = A, B, A∩B)
eS = error rate for names in S
Consider drawing (a,b) from A × B and measuring γj = “names in a and b are both name j” and γneq = “names in a and b don’t match”
Slide 16

Issues for F-S: modeling and training

Notation:
pS(j) = empirical probability estimate for name j in set S (where S = A, B, A∩B)
eS = error rate for names in S
m(γjoe) = Pr( γjoe | M ) = pA∩B(joe)(1−eA)(1−eB)
m(γneq) = Pr( γneq | M ) = 1 − (1−eA)(1−eB)
Slide 17

Issues for F-S: modeling and training

Notation:
pS(j) = empirical probability estimate for name j in set S (where S = A, B, A∩B)
eS = error rate for names in S
u(γjoe) = Pr( γjoe | U ) = pA(joe) pB(joe)(1−eA)(1−eB)
u(γneq) = Pr( γneq | U ) = 1 − Σj pA(j) pB(j)(1−eA)(1−eB)
Slide 18

Issues for F-S: modeling and training

Proposal: assume pA(j) = pB(j) = pA∩B(j) = p(j) and estimate p(j) from A∪B (since we don’t have A∩B)
Note: the agreement weight then reduces to m(γj)/u(γj) = 1/p(j), which gives more weight to agreement on rare names and less weight to common names.
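Under these assumptions the error terms cancel and the agreement weight for name j is simply 1/p(j). A sketch with made-up name lists:

```python
from collections import Counter

# Sketch: estimate p(j) from the union of both files and weight agreement
# on name j by 1/p(j), so rare names count more. The name lists are
# illustrative, not real data.

names_A = ["smith", "smith", "smith", "jones", "zelenka"]
names_B = ["smith", "jones", "jones", "zelenka"]

counts = Counter(names_A + names_B)
total = sum(counts.values())
p = {j: c / total for j, c in counts.items()}

def agreement_weight(j: str) -> float:
    # m(γ_j)/u(γ_j) = p(j)(1-eA)(1-eB) / [p(j)^2 (1-eA)(1-eB)] = 1/p(j)
    return 1.0 / p[j]
```

Agreement on the rare name “zelenka” gets twice the weight of agreement on the common name “smith” here, as the note above predicts.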
Slide 19

Issues for F-S: modeling and training

Aside: the log of this weight, log 1/p(j) = −log p(j), has the same form as the inverse document frequency measure widely used in IR:

IDF(t) = log( N / df(t) ), where N = number of documents and df(t) = number of documents containing term t

Lots of recent/current work on similar IR weighting schemes that are statistically motivated…

Slide 20

Issues for F-S: modeling and training

Alternative approach (Method 2):
Basic idea is to use estimates for some γi’s to estimate others
Broadly similar to EM training (but less experimental evidence that it works)
To estimate m(γh), use counts of
Agreement of all components γi
Agreement of γh
Agreement of all components but γh, i.e. γ1,…,γh−1,γh+1,…,γK
Slide 21

Main issues in F-S: modeling

Modeling and training: How do we estimate m(γ), u(γ)?
F-S: Assume independence, and a simple relationship between pA(j), pB(j) and pA∩B(j)
Connections to language modeling/IR approach?
Or: use training data (of M and U)
Use active learning to collect labels M and U
Or: use semi- or un-supervised clustering to find M and U clusters (Winkler)
Or: assume a generative model of records a or pairs (a,b) and find a distance metric based on this
Do you model the non-matches U ?
Slide 22

Main issues in the F-S model

Modeling and training:
How do we estimate m(γ), u(γ)?
Making decisions with the model:
How do we set the thresholds μ and λ?
Feature engineering:
What should the comparison space Γ be?
Distance metrics for text fields
Normalizing/parsing text fields
Efficiency issues:
How do we avoid looking at |A| * |B| pairs?
Slide 23

Main issues in F-S: efficiency

Efficiency issues: how do we avoid looking at |A| * |B| pairs?
Blocking: choose a smaller set of pairs that will contain all or most matches.
Simple blocking: compare all pairs that “hash” to the same value (e.g., same Soundex code for last name, same birth year)
Extensions (to increase recall of set of pairs):
Block on multiple attributes (soundex, zip code) and take union of all pairs found.
Windowing: Pick (numerically or lexically) ordered attributes and sort (e.g., sort on last name). Then pick all pairs that appear “near” each other in the sorted order.
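Both ideas can be sketched briefly; the records, the blocking key (first letter of last name plus birth year standing in for a Soundex code), and the window size are toy assumptions:

```python
from itertools import combinations

# Sketch of simple blocking and of windowing (sorted-neighborhood).
# Records are (last_name, birth_year) tuples; keys are illustrative.

def block_pairs(records, key):
    """All pairs of records sharing the same blocking-key value."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

def window_pairs(records, sort_key, w):
    """Sorted-neighborhood: pairs within a window of size w in sorted order."""
    srt = sorted(records, key=sort_key)
    pairs = []
    for i, r in enumerate(srt):
        for s in srt[i + 1 : i + w]:
            pairs.append((r, s))
    return pairs
```

Blocking on several different keys and taking the union of the resulting pair sets raises recall at modest extra cost, as the bullet above suggests.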
Slide 24

Main issues in F-S: efficiency

Efficiency issues: how do we avoid looking at |A| * |B| pairs?
Use a sublinear-time distance metric like TF-IDF.
The trick: similarity between sets S and T is

SIM(S,T) = Σt wS(t) · wT(t)

So, to find things like S you only need to look at sets T with overlapping terms, which can be found with an index mapping each set S to {terms t in S}
Further trick: to get the most similar sets T, you need only look at terms t with large weight wS(t) or wT(t)
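The inverted-index trick can be sketched as follows, with hand-assigned weights standing in for real TF-IDF values:

```python
from collections import defaultdict

# Sketch: SIM(S, T) = Σ_t wS(t) * wT(t), computed via an inverted index
# so only sets sharing a term with the query are ever touched.
# The documents and weights below are illustrative.

docs = {
    "t1": {"william": 0.9, "cohen": 0.4},
    "t2": {"cohen": 0.7, "kahn": 0.7},
    "t3": {"smith": 1.0},
}

index = defaultdict(list)            # term -> [(doc id, weight)]
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

def similar(query: dict) -> dict:
    """Accumulate Σ_t w_query(t) * w_T(t) over candidates only."""
    scores = defaultdict(float)
    for term, wq in query.items():
        for doc_id, wt in index[term]:
            scores[doc_id] += wq * wt
    return dict(scores)
```

A query on “cohen” scores only t1 and t2; t3 is never examined, which is where the sublinear behavior comes from.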

Slide 25

The “canopy” algorithm (NMU, KDD2000)

Input: set S, thresholds BIG, SMALL
Let PAIRS be the empty set.
Let CENTERS = S
While (CENTERS is not empty)
Pick some a in CENTERS (at random)
Add to PAIRS all pairs (a,b) such that SIM(a,b) > SMALL
Remove from CENTERS all points b′ such that SIM(a,b′) > BIG
Output: the set PAIRS
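The steps above can be sketched directly, using a toy token-overlap similarity and assuming SMALL is the loose (canopy-membership) threshold and BIG the tight (center-removal) one:

```python
import random

# Sketch of the canopy algorithm. SIM here is a toy shared-token
# (Jaccard) similarity; the strings and thresholds are illustrative.

def sim(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def canopy_pairs(S, small, big):
    pairs, centers = set(), set(S)
    while centers:
        a = random.choice(sorted(centers))      # pick a center at random
        members = {b for b in S if b != a and sim(a, b) > small}
        pairs.update((a, b) for b in members)   # loose threshold: canopy
        centers -= {b for b in members if sim(a, b) > big}  # tight: remove
        centers.discard(a)
    return pairs
```

Each iteration removes at least the chosen center, so the loop terminates; only pairs falling in some common canopy are ever emitted for detailed comparison.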
Slide 26

The “canopy” algorithm (NMU, KDD2000)

Slide 27

Main issues in F-S model

Making decisions with the model -?
Feature engineering:
What should the comparison space Γ be?
F-S: Up to the user (toolbox approach)
Or: Generic distance metrics for text fields
Cohen, IDF based distances
Elkan/Monge, affine string edit distance
Ristad/Yianilos, Bilenko/Mooney, learned edit distances
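As a concrete baseline for such generic distances, here is a unit-cost string edit (Levenshtein) distance; the affine and learned variants cited above replace its fixed +1 costs with gap-dependent or trained costs:

```python
# Sketch of unit-cost string edit distance via dynamic programming,
# keeping only one previous row of the DP table.

def edit_distance(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]
```

For record linkage such a distance is typically thresholded or normalized by string length to produce a comparison feature γi.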
Slide 28

Main issues in F-S: comparison space

Feature engineering: What should the comparison space Γ be?
Or: Generic distance metrics for text fields
Cohen, Elkan/Monge, Ristad/Yianilos, Bilenko/Mooney
HMM methods for normalizing text fields
Example: replacing “St.” with “Street” in addresses, without screwing up “St. James Ave”
Seymore, McCallum, Rosenfeld
Christen, Churches, Zhu
Charniak