I’ve become interested in BigQuery recently. I have been using BQ for a few years, but did not enjoy it much: I did not like the BigQuery UI, and I did not like that data has to be duplicated in a regular database (e.g. MySQL, for production usage) and in BigQuery (for analysis purposes).
However, a few weeks ago I needed to compare two lists of geographical locations in BigQuery and find matches. The first list contained about 100k geographical locations, and the second one about 30k. Both lists looked like this:
id | name | state_code | country_code |
---|---|---|---|
1 | Moscow | null | Russia |
2 | Budapest | null | Hungary |
3 | Calgary | AB | Canada |
Step 1: getting matches using SELECT
So I wrote a simple query to get locations that are present in both lists:
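Something along these lines (a minimal sketch; the table names `dataset.locations1` and `dataset.locations2` are placeholders, not the originals):

```sql
-- Match locations that have the same name and country in both lists
SELECT
  l1.id   AS id1,
  l2.id   AS id2,
  l1.name AS name
FROM dataset.locations1 AS l1
JOIN dataset.locations2 AS l2
  ON  LOWER(l1.name) = LOWER(l2.name)
  AND l1.country_code = l2.country_code;
```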
However, how to deal with the cases when locations are named slightly differently? E.g. in locations1 I have “St Petersburg” and in locations2 I have “Saint Petersburg” or “St. Petersburg”. Very quickly I found the major cases that were causing differences:
- Thai islands: “Koh Samet” vs “Ko Samet”
- Missing spaces: “Chiang Mai” vs “Chiangmai”
- Missing dashes: “Ulan-Ude” vs “Ulan Ude”
- Strokes: “Noril’sk” vs “Norilsk”
Step 2: removing special characters
I created a temporary function in BigQuery to normalise the string: convert it to lowercase, remove special characters, replace “saint” with “st” and “koh” with “ko”.
The function is:
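A minimal sketch of it (accented letters are deliberately left untouched here, since they are handled in step 3):

```sql
CREATE TEMP FUNCTION normalise_name(name STRING) AS ((
  -- lowercase, unify "saint"/"st" and "koh"/"ko",
  -- then drop spaces, dots, apostrophes and dashes
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(LOWER(name), r'\bsaint\b', 'st'),
      r'\bkoh\b', 'ko'),
    r"[\s.'’-]", '')
));

-- normalise_name('St. Petersburg')   -> 'stpetersburg'
-- normalise_name('Saint Petersburg') -> 'stpetersburg'
-- normalise_name('Ulan-Ude')         -> 'ulanude'
```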
As you might have noticed, I do not handle cases like “Calangute Beach” vs “Calangute” here, because sometimes they can be two different locations (e.g. “Miami” and “Miami Beach” are two different cities).
Step 3: replacing accented characters with Latin characters
Unlike MySQL (I am using 5.6.34), in BigQuery “á” != “a”. So this query returns true in Sequel Pro, but it returns false in BQ:
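Something along these lines (note that the MySQL result depends on the collation; an accent-insensitive one such as utf8_general_ci treats the two characters as equal):

```sql
-- BigQuery (Standard SQL): comparison is by Unicode code points, so this is false
SELECT 'á' = 'a';

-- MySQL 5.6 with an accent-insensitive collation (e.g. utf8_general_ci):
-- SELECT 'á' = 'a';   -- returns 1 (true)
```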
As a result, BigQuery treats “Düsseldorf” and “Dusseldorf” as two different strings. So I created one more temp function to replace accented characters with Latin ones:
(The original code was copied from here; I enhanced it by adding Turkish, Czech and some other accented characters.)
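The idea is a character-by-character lookup. A minimal sketch (covering only a subset of the characters, and assuming the input has already been lowercased by the normalisation step) could look like this:

```sql
CREATE TEMP FUNCTION unaccent(s STRING) AS ((
  -- Map each character through a pair of lookup strings;
  -- characters that are not found pass through unchanged.
  SELECT STRING_AGG(
           IF(STRPOS('àáâãäåçèéêëìíîïñòóôõöøùúûüýčşž', c) = 0,
              c,
              SUBSTR('aaaaaaceeeeiiiinoooooouuuuycsz',
                     STRPOS('àáâãäåçèéêëìíîïñòóôõöøùúûüýčşž', c), 1)),
           '' ORDER BY off)
  FROM UNNEST(SPLIT(s, '')) AS c WITH OFFSET AS off
));

-- unaccent('düsseldorf') -> 'dusseldorf'
```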
Step 4: calculating string similarity using Levenshtein algorithm
After I “normalised” location names by removing special characters, replacing dashes and parts of the strings, and replacing accented characters, I was able to match the two lists quite accurately. However, it still was not perfect, as I was not able to match similar names such as “Naberezhnye Chelny” and “Naberezhnie Chelny”.
To measure string similarity I decided to use the Levenshtein distance algorithm. So I created a temporary BigQuery function that calculates the similarity of two given strings:
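A sketch of it as a JavaScript UDF (not necessarily the exact original): it computes the classic Levenshtein distance and divides by the length of the longer string, so the result lands between 0 and 1:

```sql
CREATE TEMP FUNCTION similarity(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  if (a === null || b === null) return null;
  var m = a.length, n = b.length;
  if (m === 0 || n === 0) return 0.0;

  // Classic dynamic-programming Levenshtein distance
  var d = [];
  for (var i = 0; i <= m; i++) d[i] = [i];
  for (var j = 0; j <= n; j++) d[0][j] = j;
  for (var i = 1; i <= m; i++) {
    for (var j = 1; j <= n; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,         // deletion
                         d[i][j - 1] + 1,         // insertion
                         d[i - 1][j - 1] + cost); // substitution
    }
  }

  // Turn the distance into a similarity score between 0 and 1
  return 1 - d[m][n] / Math.max(m, n);
""";

SELECT similarity('naberezhnye chelny', 'naberezhnie chelny');  -- ≈ 0.94
```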
Similarity is a number between 0 and 1. Per my observations, if two location names have a similarity of 0.8 or above, they are most likely the same. However, this is not always the case: for example, the similarity of “Bangalore” and “Bengaluru” is only 0.66.
What’s next
Using BigQuery temporary functions, I was able to match two lists of geo locations quite quickly and accurately. If there were no BigQuery, I could perhaps have written some code in Ruby, but I might have had to spend more time.
I am currently thinking about how to reduce false-positive matches of the Levenshtein algorithm. For example, “South Palmetto Point” and “North Palmetto Point” have a similarity score of 0.89, however if the Levenshtein algorithm knew English, it would know that “South” and “North” are two opposite things :) Or another example: “Pano Akourdalia” and “Kato Akourdalia” have a similarity score of 0.86, but “Pano” and “Kato” in Greek mean “Over” and “Below” respectively.
I’m wondering if I could use machine learning for that…