CENSUS 2011: The use of registers in the German census model

Matching registers without direct identifiers and confidentiality issues
Stephanie Hirner
ESTP Administrative data and censuses
Wiesbaden, 22-24 May 2018
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat

Slide 2: Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues

Federal Statistical Office of Germany | Census, 02/22/2020

Slides 4-5: Matching via identifiers, identical items or similar items
Addresses
- Identifiers, e.g. address ID
- Identical items, e.g. postal code, street name, street number
- Similar items, e.g. street name: original and standardised
Personal data
- Identifiers, e.g. personal ID
- Identical items, e.g. name, sex, date of birth, place of birth
- Similar items, e.g. birth name versus family name

Slide 6: Matching process
- Preprocessing
  - Parsing
  - Standardisation
- Deterministic process
  - Including all items
  - Omit items step by step
- Probabilistic process
  - Similarity of items
  - Fuzzy merge
  - Probability of matching

Slide 7: Probabilistic methods - examples
- SPEDIS: determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words (see SAS documentation, SPEDIS function)
- Jaro-Winkler similarity: measure of similarity between two strings; uses the number of matching characters and the number of transpositions
- Sources of error
  - False match
  - Missing match

Slide 8: SPEDIS
- Method
  - Comparison of items (e.g. names)
  - Identification of the costs to transform one value into the target word
  - Weighting by the length of the string
  - Transformation in both directions
- Results
  - Probability of correct matching

Slide 9: Jaro-Winkler
- Method
  - Comparison of items (e.g. names)
  - Weighting of identical characters in the compared words
  - Higher weight for agreement at the beginning of the word
- Results
  - Probability of correct matching
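The Jaro-Winkler measure described above can be sketched in a few lines of Python. This is a minimal illustration written from the commonly published definition, not the software used in the census; the function names and the parameters prefix_scale and max_prefix are assumptions made for this example. The prefix bonus in jaro_winkler is what gives agreement at the beginning of the word a higher weight. SPEDIS is a SAS function and is not reproduced here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a common prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Two spellings of a family name: the shared prefix "ME" raises the score.
print(round(jaro("MEYER", "MEIER"), 4), round(jaro_winkler("MEYER", "MEIER"), 4))
```

In a matching run, a score like this would be compared with a threshold chosen by the analyst; a low threshold risks false matches, a high one risks missing matches.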

Slide 10: Matching of data sources: addition of items

Data source 1 (with Item A added)
ID   Item 1  Item 2  Item 3  Item 4  Item A
111  A       xx      14      mLx     C34
222  B       yy      12      pQn     F76
333  C       xx      00      sFc     A94

Data source 2
ID   Item A
111  C34
222  F76
333  A94

Slide 11: Matching of data sources: outer join

Data source 1
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C       xx

Data source 2
ID   Item 1  Item 2
999  X       yy
888  K       dd

Outer join
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C       xx
999  X       yy
888  K       dd

Slide 12: Matching of reference dates: identical registers over time

Reference date 1
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C       xx

Reference date 2
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C       yy

Result
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C       yy
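The three cases above (addition of items, outer join, identical registers over time) can be expressed with one small helper. The following is a minimal Python sketch using the example values from the tables; the function name outer_join and the dictionary layout are assumptions made for illustration.

```python
def outer_join(left, right):
    """Combine two sources keyed by ID, keeping every ID found in either one."""
    joined = {}
    for key in list(left) + [k for k in right if k not in left]:
        # Items from `right` are added to, or overwrite, those from `left`.
        joined[key] = {**left.get(key, {}), **right.get(key, {})}
    return joined


source_1 = {
    111: {"item_1": "A", "item_2": "xx"},
    222: {"item_1": "B", "item_2": "yy"},
    333: {"item_1": "C", "item_2": "xx"},
}

# Addition of items: the second source contributes a new item per known ID.
item_a = {111: {"item_a": "C34"}, 222: {"item_a": "F76"}, 333: {"item_a": "A94"}}
with_item_a = outer_join(source_1, item_a)

# Outer join: the second source contributes IDs missing from the first.
source_2 = {
    999: {"item_1": "X", "item_2": "yy"},
    888: {"item_1": "K", "item_2": "dd"},
}
all_ids = outer_join(source_1, source_2)

# Identical registers over time: the later reference date updates changed items.
reference_date_2 = {
    111: {"item_1": "A", "item_2": "xx"},
    222: {"item_1": "B", "item_2": "yy"},
    333: {"item_1": "C", "item_2": "yy"},
}
updated = outer_join(source_1, reference_date_2)

print(list(all_ids), updated[333])
```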

Slide 14: Register of addresses
- Matching registers
- Setup of the register
- Quality aspects
- Support of the register
- Validation
- Quality aspect: up-to-dateness
- Quality aspect: completeness

Slide 15: Register of addresses in the German Census
- Covered all addresses with housing space and occupied living quarters
- 2 administrative data sources -> outer join
  - Federal Mapping Agency
  - Population registers
- Checking of addresses if only included in one data source
- Classification of addresses as "addresses with housing space"

Slide 16: Data acquisition: using registers in place
- Geo-referenced address data: 21 million records, including geo-coordinates
- Data of residents' registration offices: 86 million records, containing demographic and geographical information

Slide 17: Problems
- No identification characteristics
- Address characteristic as ID
- Local register data
- Low standardisation of register entries
- Low harmonisation between registers
- Redundant/false/obsolete data entries
- Complex data processing

Slide 18: Setup of the register of addresses
- Data checks
- Preprocessing
  - Decomposing the address data into address components
  - Standardisation of the address information
  - Aggregation of individual data sets
- Harmonisation
  - Referencing the street names at street level
  - Adjustment of changing address identifiers
- Merging/record linkage

Slide 19: Challenges in using the address as a key variable
- Decentralised administrative data, different registers -> no harmonised address format
  (street name: J.-F.-K.-Straße versus John-F.-Ken.-Straße)
- Address unstable, changes not notified simultaneously in all registers

Slide 20: Standardisation of key variables
- Necessary condition for completion and updating: standardisation
- Standardisation of street names
  - Automated standardisation: capital letters, uniform abbreviations (street -> str, place -> pl), eliminating blanks
  - Manual checks by the statistical offices of the Länder
  - Thesaurus of street names
  - Aggregation on street level

Slide 21: Thesaurus of street names: harmonisation of spellings

External source
Postal code  Street name          Standardised street name
38471        J.-F.-K.-Straße      JOHNFKENNEDYSTR

Thesaurus of street names
Postal code  Street name          Standardised street name
38471        J.-F.-K.-Straße      JOHNFKENNEDYSTR
38471        John-F.-Ken.-Straße  JOHNFKENNEDYSTR
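The automated standardisation and the thesaurus lookup can be sketched as follows. This is a minimal Python illustration under stated assumptions: the abbreviation rules use the German words for "street" and "place", the function names are hypothetical, and the thesaurus holds only the two entries from the example above. It is not the procedure actually implemented by the statistical offices.

```python
import re

# Assumed abbreviation rules; the slides only name "street -> str, place -> pl".
ABBREVIATIONS = {"STRASSE": "STR", "PLATZ": "PL"}

# Thesaurus entries from the example: (postal code, spelling) -> standardised name.
THESAURUS = {
    ("38471", "J.-F.-K.-Straße"): "JOHNFKENNEDYSTR",
    ("38471", "John-F.-Ken.-Straße"): "JOHNFKENNEDYSTR",
}


def standardise_automatically(street_name: str) -> str:
    """Capital letters, uniform abbreviations, no blanks or punctuation."""
    text = street_name.upper()  # Python turns "ß" into "SS" here
    for long_form, short_form in ABBREVIATIONS.items():
        text = text.replace(long_form, short_form)
    return re.sub(r"[^A-ZÄÖÜ0-9]", "", text)


def standardise(postal_code: str, street_name: str) -> str:
    """Prefer the thesaurus spelling; fall back to the automated rules."""
    known = THESAURUS.get((postal_code, street_name))
    return known if known is not None else standardise_automatically(street_name)


# The automated rules alone leave the two spellings different ...
print(standardise_automatically("J.-F.-K.-Straße"),
      standardise_automatically("John-F.-Ken.-Straße"))
# ... while the thesaurus maps both to the same standardised street name.
print(standardise("38471", "J.-F.-K.-Straße"),
      standardise("38471", "John-F.-Ken.-Straße"))
```

The example shows why a manually maintained thesaurus is needed in addition to the automated step: abbreviated and partly written-out spellings of the same street cannot be reconciled by character rules alone.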

Slide 22: Preparation and integration of register data
- GA, MR -> preprocessing -> deterministic 1:1 matching method
- Matching data -> register
- Non-matching data -> correction (regional authorities) -> corrected data
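The flow above can be illustrated with a short Python sketch. It assumes that GA and MR stand for the two address sources named earlier, that each address is reduced to a key of postal code, standardised street name and house number, and that the regional authorities return corrections as a mapping from a faulty key to a corrected key. All names and the example keys are hypothetical.

```python
def split_matches(ga_keys, mr_keys):
    """Deterministic 1:1 matching: keys are equal in both sources or not."""
    matching = ga_keys & mr_keys
    non_matching = (ga_keys | mr_keys) - matching
    return matching, non_matching


def integrate(ga_keys, mr_keys, corrections):
    """Match, apply the corrections to the data, then match again."""
    first_pass, to_correct = split_matches(ga_keys, mr_keys)
    ga_fixed = {corrections.get(key, key) for key in ga_keys}
    mr_fixed = {corrections.get(key, key) for key in mr_keys}
    register, still_open = split_matches(ga_fixed, mr_fixed)
    return first_pass, to_correct, register, still_open


ga = {("38471", "JOHNFKENNEDYSTR", "1"), ("38471", "KOCHSTR", "2")}
mr = {("38471", "JOHNFKENNEDYSTR", "1"), ("38471", "KOCHSTRASSE", "2")}
# Hypothetical correction delivered after checking the non-matching data.
corrections = {("38471", "KOCHSTRASSE", "2"): ("38471", "KOCHSTR", "2")}

first_pass, to_correct, register, still_open = integrate(ga, mr, corrections)
print(len(first_pass), len(to_correct), len(register), len(still_open))
```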

Slide 23: Two-stage correction model (within a municipal code)
- I. Street level (Street A, Street B): check criterion existence, correctness
- II. Address level (No. 1, No. 2 of each street): check criterion existence, correctness, housing space

Slide 24: Validation of addresses
- Quality aspect
- Mass: addresses in two data sources (GA and MR) -> validated
- Address in only one data source -> check for housing space

Slide 25: Results: addresses to be checked for housing space (2011 Census)

Federal state            Total number    Addresses to be    Addresses to be
                         of addresses    checked            checked (percent
                         (thousand)      (thousand)         of total)
Schleswig-Holstein                917                117                 13
Hamburg                           282                 28                 10
Niedersachsen                   2,469                252                 10
Bremen                            174                 33                 19
Nordrhein-Westfalen             4,283                411                 10
Hessen                          1,586                186                 12
Rheinland-Pfalz                 1,309                168                 13
Baden-Württemberg               2,998                451                 15
Bayern                          3,323                347                 10
Saarland                          337                 31                  9
Berlin                            338                 37                 11
Brandenburg                       837                174                 21
Mecklenburg-Vorpommern            476                 84                 18
Sachsen                           981                145                 15
Sachsen-Anhalt                    689                104                 15
Thüringen                         618                 87                 14
Germany                        21,615              2,657                 12

Slide 26: Quality aspect: up-to-dateness
- Coordination function -> keeping the register up to date
- Address up-to-dateness = How unstable are the addresses? How often will they be updated?
- Changes to address variables at municipal level -> address is unstable; when and how often it changes is not predictable

Slide 27: Instability of the address (2010-2011): change of at least one variable, in percent
[Bar chart: percent (0 to 100) for each federal state and for Germany]

Slide 28: Keeping the register up to date
- Integration of 5 different registers (e.g. population register) -> identical registers over time
- Mismatches: the statistical offices of the Länder checked
  -> existence
  -> correctness
  -> renamings (old street name: Kochstraße; new street name: John-F.-Ken.-Straße)

Slide 29: Quality aspect: completeness
- Register of addresses = reference for population
- New buildings, demolition of residential buildings, incorrect data in registers
- Completion by:
  - Registers -> outer join
  - Other survey components, information from other sources

Slide 30: New addresses added to the register by data origin over time (2011 Census)
[Chart: number of new addresses (0 to 250,000) from Aug 10 to Nov 11; series: total, administrative registers, other findings]
-> most of the new addresses based on register integration

Slide 31: Conclusion
- Decentralised administrative data, differing quality of register data and missing ID = core problem
- Updating and completing an unstable key variable is the major focus in the context of the register of addresses -> precondition: harmonisation/standardisation
- Updating and completion of the register can mainly be achieved through register integration

Slide 33: Data acquisition and integration in Germany
- Data acquisition
  - Decentralised via the statistical offices of the Länder
  - Two supplies around the census reference date
- Integration
  - Linking of the information on addresses
  - Adding personal data records via the address ID
  - Build-up of a temporary centralised population register for Germany

Slide 34: Matching of different deliveries over time
- Merging information
  - Address
  - Family name at birth and first name(s), sex, date of birth, place of birth
- Results
  - Confirm data sets
  - Update data sets
  - Add data sets

Slide 35: Reference data stock
- Merging datasets from different sources without existing personal identification numbers (registers, surveys)
- Merging information: family name at birth and first name(s), sex, date of birth, municipal code, post code, street name, house number
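The three results of matching two deliveries (confirm, update, add) can be sketched as below. This is a minimal Python illustration, assuming that the personal items identify a person and that the address decides between "confirmed" and "updated"; the field names, the normalisation and the fictitious example records are assumptions, not the rules applied in the census.

```python
PERSON_ITEMS = ("birth_name", "first_names", "sex", "date_of_birth", "place_of_birth")


def person_key(record):
    """Key built from the personal merging information (assumed normalisation)."""
    return tuple(record[item].strip().upper() for item in PERSON_ITEMS)


def compare_deliveries(first, second):
    """Classify each record of the second delivery against the first one."""
    known = {person_key(record): record for record in first}
    result = {"confirmed": [], "updated": [], "added": []}
    for record in second:
        earlier = known.get(person_key(record))
        if earlier is None:
            result["added"].append(record)        # person not seen before
        elif earlier["address"] == record["address"]:
            result["confirmed"].append(record)    # same person, same address
        else:
            result["updated"].append(record)      # same person, new address
    return result


def person(birth_name, first_names, address):
    # Fictitious example records; only name and address vary here.
    return {"birth_name": birth_name, "first_names": first_names, "sex": "f",
            "date_of_birth": "1980-01-01", "place_of_birth": "Wiesbaden",
            "address": address}


delivery_1 = [person("Muster", "Anna", "A1"), person("Beispiel", "Eva", "A2")]
delivery_2 = [person("Muster", "Anna", "A1"), person("Beispiel", "Eva", "A7"),
              person("Neu", "Ida", "A3")]

result = compare_deliveries(delivery_1, delivery_2)
print({status: len(records) for status, records in result.items()})
```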

Slide 36: Matching procedures
- Deterministic process
  - Including all items
  - Omit items step by step
- Probabilistic process
  - Similarity of items
  - Probability of matching
- Challenges? Risks? Limitations? Chances?

Slide 37: Challenges
- Matching process step by step
- Create subsets
- Avoid false matches
- Quality checks
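A deterministic process that first includes all items and then omits items step by step could look like the following Python sketch. The order in which items are dropped, the rule of accepting only unique one-to-one candidates (to avoid false matches) and all names are assumptions for illustration; the records are fictitious.

```python
from collections import defaultdict

ALL_ITEMS = ("birth_name", "first_names", "sex", "date_of_birth",
             "postcode", "street", "house_number")
# Assumed relaxation order: omit one address item more at each step.
STEPS = [ALL_ITEMS, ALL_ITEMS[:-1], ALL_ITEMS[:-2], ALL_ITEMS[:-3]]


def group_by_key(records, items):
    """Group record IDs by their values on the given items."""
    groups = defaultdict(list)
    for record_id, record in records.items():
        groups[tuple(record[item] for item in items)].append(record_id)
    return groups


def stepwise_match(left, right, steps=STEPS):
    """Return {left ID: (right ID, step number)} for unique matches only."""
    left, right = dict(left), dict(right)  # unmatched records still open
    matches = {}
    for step_no, items in enumerate(steps, start=1):
        left_groups = group_by_key(left, items)
        right_groups = group_by_key(right, items)
        for key, left_ids in left_groups.items():
            right_ids = right_groups.get(key, [])
            # Accept only unambiguous pairs; ambiguous keys stay unmatched.
            if len(left_ids) == 1 and len(right_ids) == 1:
                matches[left_ids[0]] = (right_ids[0], step_no)
                del left[left_ids[0]], right[right_ids[0]]
    return matches


def record(birth_name, house_number):
    return {"birth_name": birth_name, "first_names": "ANNA", "sex": "f",
            "date_of_birth": "1980-01-01", "postcode": "38471",
            "street": "KOCHSTR", "house_number": house_number}


left = {"L1": record("MUSTER", "1"), "L2": record("BEISPIEL", "2")}
right = {"R1": record("MUSTER", "1"), "R2": record("BEISPIEL", "4")}

matches = stepwise_match(left, right)
print(matches)
```

Recording the step at which a pair was found keeps the weaker matches identifiable for later quality checks.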

Slide 39: Data protection and confidentiality
- Collection of personal data: names, date of birth
- Additional data only for matching process
- Create internal IDs
- Limitations for quality checks
- Prohibition to transmit the data back to the administration
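One possible way to read "create internal IDs" is sketched below in Python: after matching, each record receives an internal ID, and the items that were collected only for the matching process are moved out of the statistical record. The choice of auxiliary items, the ID format and the function name are assumptions for illustration and do not describe the procedure of the Federal Statistical Office.

```python
from itertools import count

# Assumed list of items that are needed only for the matching process.
AUXILIARY_ITEMS = ("birth_name", "first_names", "date_of_birth")


def assign_internal_ids(records):
    """Split records into statistical data and separately held auxiliary data."""
    next_number = count(1)
    statistical, auxiliary = [], {}
    for record in records:
        internal_id = f"P{next(next_number):09d}"  # hypothetical ID format
        auxiliary[internal_id] = {item: record[item] for item in AUXILIARY_ITEMS}
        remaining = {k: v for k, v in record.items() if k not in AUXILIARY_ITEMS}
        statistical.append({"internal_id": internal_id, **remaining})
    return statistical, auxiliary


# Fictitious example record.
records = [{"birth_name": "MUSTER", "first_names": "ANNA",
            "date_of_birth": "1980-01-01", "sex": "f", "postcode": "38471"}]
statistical, auxiliary = assign_internal_ids(records)
print(statistical[0])
```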

Thank you for your attention!
Stephanie Hirner
[email protected]
