Contents
- 3. Record linkage: definition Record linkage: determine if pairs of data records describe the same entity I.e.,
- 4. Record linkage: terminology The term “record linkage” is possibly co-referent with: For DB people: data matching,
- 5. Record linkage: approaches Probabilistic linkage This tutorial Deterministic linkage Test equality of normalized version of record
- 6. Record linkage: goals/directions Toolboxes vs. black boxes: To what extent is record linkage an interactive, exploratory,
- 7. Record linkage tutorial: outline Introduction: definition and terms, etc Overview of the Fellegi-Sunter model Classify pairs
- 8. Fellegi-Sunter: notation Two sets to link: A and B A × B = {(a,b) : a ∈ A,
- 9. Fellegi-Sunter: notation Three actions on (a,b): A1: treat (a,b) as a match A2: treat (a,b) as
- 10. Fellegi-Sunter: main result Suppose we sort all γ’s by m(γ)/u(γ), and pick n Then the best*
- 11. Fellegi-Sunter: main result Intuition: consider changing the action for some γi in the list, e.g. from
- 12. Fellegi-Sunter: main result Allowing ranking rules to be probabilistic means that one can achieve any Pareto-optimal
- 13. Main issues in F-S model Modeling and training: How do we estimate m(γ), u(γ) ? Making
- 14. Issues for F-S: modeling and training How do we estimate m(γ), u(γ) ? Independence assumptions on
- 15. Issues for F-S: modeling and training Notation for “Method 1”: pS(j) = empirical probability estimate for
- 16. Issues for F-S: modeling and training Notation: pS(j) = empirical probability estimate for name j in
- 17. Issues for F-S: modeling and training Notation: pS(j) = empirical probability estimate for name j in
- 18. Issues for F-S: modeling and training Proposal: assume pA(j)=pB(j)=pA∩B(j) and estimate from A∪B (since we
- 19. Issues for F-S: modeling and training Aside: log of this weight is same as the inverse
- 20. Issues for F-S: modeling and training Alternative approach (Method 2): Basic idea is to use estimates
- 21. Main issues in F-S: modeling Modeling and training: How do we estimate m(γ), u(γ) ? F-S:
- 22. Main issues in F-S model Modeling and training: How do we estimate m(γ), u(γ) ? Making
- 23. Main issues in F-S: efficiency Efficiency issues: how do we avoid looking at |A| * |B|
- 24. Main issues in F-S : efficiency Efficiency issues: how do we avoid looking at |A| *
- 25. The “canopy” algorithm (NMU, KDD2000) Input: set S, thresholds BIG, SMALL Let PAIRS be the empty
- 26. The “canopy” algorithm (NMU, KDD2000)
- 27. Main issues in F-S model Making decisions with the model -? Feature engineering: What should the
- 28. Main issues in F-S: comparison space Feature engineering: What should the comparison space Γ be? Or:
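The Fellegi-Sunter entries above describe ranking comparison vectors γ by the ratio m(γ)/u(γ) and choosing an action for each pair. A minimal Python sketch of that idea follows. It assumes binary agree/disagree features and independence across fields, so that m(γ) and u(γ) factor per field. The function names, the probability values, the two thresholds, and the action labels are all hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from itertools import product
from math import log


def match_weight(gamma, m_probs, u_probs):
    """Return log(m(gamma) / u(gamma)) for a binary comparison vector.

    Assumes field independence: m_probs[i] is the (hypothetical) probability
    that field i agrees for a true match, u_probs[i] for a true non-match.
    """
    weight = 0.0
    for agrees, m_i, u_i in zip(gamma, m_probs, u_probs):
        if agrees:
            weight += log(m_i / u_i)
        else:
            weight += log((1.0 - m_i) / (1.0 - u_i))
    return weight


def classify(gamma, m_probs, u_probs, upper, lower):
    """Pick one of three actions by comparing the weight to two thresholds."""
    w = match_weight(gamma, m_probs, u_probs)
    if w >= upper:
        return "match"
    if w <= lower:
        return "non-match"
    return "possible"


# Illustrative (made-up) per-field agreement probabilities for two fields.
M_PROBS = [0.9, 0.8]
U_PROBS = [0.1, 0.2]

# Rank every possible comparison vector from most to least match-like.
ranked = sorted(
    product([1, 0], repeat=2),
    key=lambda g: match_weight(g, M_PROBS, U_PROBS),
    reverse=True,
)

for g in ranked:
    print(g, round(match_weight(g, M_PROBS, U_PROBS), 2),
          classify(g, M_PROBS, U_PROBS, upper=3.0, lower=-3.0))
```

Summing log ratios is equivalent to sorting by the product m(γ)/u(γ) itself, because the logarithm is monotonic; the log form simply avoids underflow when there are many fields.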