Matching addresses, it's surprisingly difficult.

Evan Richards

Audience level: Intermediate
Topic area: Misc

Description

In this talk, you'll learn the weirdest edge cases in the United States addressing system: the hierarchy between city and state, the sublime beauty behind the ZIP code, and the constituent parts of an address.

We'll cover how to compare addresses in a way that gives you an F-score you'll be proud of.
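For reference, here is a minimal sketch of how an F-score can be computed from match counts. The function name and arguments are illustrative, not taken from the talk; it assumes you have already counted true positives, false positives, and false negatives against a labeled set of address pairs.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-beta score from match counts; beta=1 gives the usual F1."""
    if true_pos == 0:
        # No correct matches: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # share of reported matches that are right
    recall = true_pos / (true_pos + false_neg)     # share of real matches that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts from a hand-labeled sample of address pairs.
    print(round(f_score(true_pos=90, false_pos=10, false_neg=30), 3))
```

A higher beta weights recall more heavily, which may matter if a missed match costs more than a wrong one.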

Abstract:

When comparing addresses in a database, you run into issues verifying whether two addresses are the same.

Direct string comparisons fail on '123 Main St' and '123 S. Main Street', even when they refer to the same address.

Applied across an entire address, edit distance gives false positives: two addresses that differ by a single character can be all the way across town.
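A small sketch of the problem, using a plain Levenshtein distance written out by hand so nothing needs installing. The addresses are made up for illustration: changing one digit of the house number costs a single edit, while the abbreviated and spelled-out forms of the same address are seven edits apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A different house: one substitution.
    print(levenshtein("123 Main St", "923 Main St"))         # 1
    # The pair the abstract treats as the same address: seven insertions.
    print(levenshtein("123 Main St", "123 S. Main Street"))  # 7
```

Any threshold loose enough to accept the second pair also accepts the first, which is why whole-string edit distance is a poor fit here.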

Using regular expressions to break an address into its parts often fails on diacritics and punctuation (Kalākaua St), or hyphenated city names (Wilkes-Barre).
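To make that concrete, here is a deliberately naive pattern of the kind people tend to write first. The pattern, the `parse` helper, and the sample addresses (including the house numbers) are all hypothetical; the point is only that ASCII-letter character classes quietly reject real place names.

```python
import re

# Naive, hypothetical pattern: "<number> <street>, <city>, <ST>" with ASCII letters only.
NAIVE = re.compile(
    r"^(?P<number>\d+) (?P<street>[A-Za-z ]+), (?P<city>[A-Za-z ]+), (?P<state>[A-Z]{2})$"
)


def parse(address: str):
    """Return the named groups as a dict, or None if the pattern does not match."""
    m = NAIVE.match(address)
    return m.groupdict() if m else None


if __name__ == "__main__":
    print(parse("123 Main St, Springfield, IL"))     # parses into four parts
    print(parse("100 Kalākaua St, Honolulu, HI"))    # None: 'ā' is not in [A-Za-z]
    print(parse("40 Market St, Wilkes-Barre, PA"))   # None: '-' is not in the city class
```

You can keep widening the character classes, but each fix tends to uncover another exception.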

Geocoding addresses requires making HTTP requests and doesn't scale if you have millions of addresses.

In this talk, we'll cover how to compare addresses using libpostal, which parts of an address can be fuzzy-matched and which can't, and how to measure the quality of our matches.
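As a rough sketch of what component-wise comparison might look like: libpostal's Python bindings expose `parse_address` (in `postal.parser`), which returns a list of `(value, label)` pairs with labels such as `house_number`, `road`, and `postcode`. The code below takes pairs of that shape as plain data so it runs without libpostal installed. Which fields are exact and which are fuzzy, the 0.8 threshold, and the use of `difflib` are assumptions made for illustration, not the talk's actual method.

```python
from difflib import SequenceMatcher

# Assumptions for this sketch, not rules from the talk:
EXACT_FIELDS = ("house_number", "postcode")  # must match character for character
FUZZY_FIELDS = ("road",)                     # allowed to differ slightly
FUZZY_THRESHOLD = 0.8                        # arbitrary similarity cutoff


def to_components(pairs):
    """Turn [(value, label), ...] (the shape parse_address returns) into {label: value}."""
    return {label: value for value, label in pairs}


def same_address(a: dict, b: dict) -> bool:
    """Compare two parsed addresses field by field."""
    for field in EXACT_FIELDS:
        # A missing exact field counts as a mismatch.
        if a.get(field) is None or a.get(field) != b.get(field):
            return False
    for field in FUZZY_FIELDS:
        ratio = SequenceMatcher(None, a.get(field, ""), b.get(field, "")).ratio()
        if ratio < FUZZY_THRESHOLD:
            return False
    return True


if __name__ == "__main__":
    # Hand-written pairs standing in for parser output; the addresses are made up.
    clean = to_components([("123", "house_number"), ("main street", "road"), ("62701", "postcode")])
    typo = to_components([("123", "house_number"), ("main stret", "road"), ("62701", "postcode")])
    other = to_components([("128", "house_number"), ("main street", "road"), ("62701", "postcode")])
    print(same_address(clean, typo))   # True: the road differs by one letter
    print(same_address(clean, other))  # False: the house number must match exactly
```

Running a comparison like this over a labeled sample gives the true and false positive counts needed to score the matcher.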