Fixing String Conflicts with difflib
In this article, we will show you step by step how we come across a merging problem while working with messy real world data sets that come from different databases. You can download the raw code on Jupyter Notebook and the data from my Github Repositoty.
Let Get Started!
We have 2 tables from 2 movie databases: IMDB and Box Office Mojo. The purpose is that we want to find genres of the top 100 movies having the highest total revenues. To achieve this, we have to merge these tables together:
gross
contains titles, domestic and foreign gross. we will calculate total gross from this table;basics
contains primary title and genres columns which we want to join with thegross
table.
import pandas as pd
import numpy as npgross = pd.read_csv('data/bom.movie_gross.csv.gz', compression='gzip', error_bad_lines=False)basics = pd.read_csv('data/imdb.title.basics.csv.gz', compression='gzip', error_bad_lines=False)
Here is what the tables look like:
display(gross.head(3))
display(basics.head(3))
Getting Top 100 Highest Total Revenue Movies:
gross['foreign_gross'] = pd.to_numeric(gross['foreign_gross'],errors='coerce')
gross['total_gross'] = gross['domestic_gross'] + gross['foreign_gross']top_100 = gross.sort_values('total_gross', ascending=False)[:100][['title', 'total_gross']]
Here is what the table look like:
top_100.head(10)
Now Let Get Genres!
We merge top_100
with basics
in order to get the genres column. We notice that we get multiple duplicated rows we need to get rid of. We will do it within this step.
top_100_w_genres = top_100.join(basics.set_index('primary_title'), how='left', on='title' )[['title', 'total_gross', 'genres']]
top_100_w_genres = top_100_w_genres[top_100_w_genres['title'].duplicated() == False]top_100_w_genres.head(10)
Here is what the table looks like:
The Problem!
As you can see, we get the genres column but some values are missing. What happens here? It appears that some of titles in IMDB database are listed in different names. For example, “Marvel’s The Avengers” in Box Office Mojo is listed as “The Avengers” in IMDB. When I merge 2 tables together, the program could not find the exact match between them. As a result, it shows NaN
as a value in the column.
So, how do we fix this problem?
This is when difflib is coming to play!
What is difflib?
‘difflib’ is a python standard library that contains simple classes and functions that allow us to compare sets of data, and sequences such as lists or strings. In this article, we will mainly use get_close_matches
.
The get_close_matches
helps us find the closest words from a list that is similar to a string we assign in the argument. Here is an example:
import difflibword_list = ['Star Wasr', 'Straw Wall','Star Trek','Stupid Wars']
str_we_want = 'Star Wars'
matches = difflib.get_close_matches(str_we_want, word_list, n=3, cutoff=0.6)
In the get_close_matches()
argument, str_we_want
is a string that we want to find a similarity; word_list
is a list of words we want to look; n
is a number of top similar words based on difflib’s ratio, which is a score of similarity; and cutoff
is the minimum difflib’s ratio for words to be considered as similar. From the code above, the result we get is:
matches = ['Star Wasr', 'Stupid Wars', 'Star Trek']
We can see that even though the word ‘Star Wars’ is misspelled in the word_list
, the program still manages to find ‘Star Wasr’ as the best match.
How to Apply This to Our Work?
Let’s test it on our work. We want to see if difflib can find the movie that matches “Marvel’s The Avengers” in basics
table.
difflib.get_close_matches("Marvel's The Avengers", basics['primary_title'], n=1)
We get ['The Avengers']
as the result. Awesome!!
Get Back to Work!
From our top_100_w_genres
table, we distract all rows that can’t find a value in genres column and assign it to a new variable called name_prob
.
name_prob = top_100_w_genres[top_100_w_genres['genres'].isna()]
name_prob.shape
We find that there is 13 rows that have NaN
as genres. Now we want to clean the title names before we marge. We can do this by using indexes from name_prob
to find the specific title names in top_100
table, then we change its format to match one in basics
.
for i in name_prob.index:
try:
best_match = difflib.get_close_matches(top_100.loc[i, 'title'], basics['primary_title'], n=1)
top_100.loc[i, 'title'] = best_match[0]
except:
top_100.loc[i, 'title'] = top_100.loc[i, 'title']top_100.head(10)
Here is what the table looks like:
Now let merge it again! We also need to clean up the duplicated rows again as well.
finished = top_100.join(basics.set_index('primary_title'), how='left', on='title' )[['title', 'total_gross', 'genres']]finished = finished[finished['title'].duplicated() == False]display(finished.head(10))
Yay! Here is what the final table looks like:
IT’s NOT DONE YET!
We still need to verify if our table has all the genres.
finished['genres'].isna().sum()
finished[finished['genres'].isna() == True]
We find out that there are still 2 rows that could not find the match.
We choose to assign missing to these two because we do not want to directly edit the raw data.
finished['genres'].fillna('Missing', inplace=True)
It is not PERFECT!
In real world, it is difficult to find an algorithm that fits every scenario. It turns out that for some reasons the program cannot find the match for these two. One is Disney’s animation called ‘Coco,’ which by further research we can obtain its genres from the internet. The other one is the movie ‘Men in Black 3,’ which is written in abbreviation ‘MIB 3.’
There is much more complex stuff in this module!
difflib has so many interesting and complicating functions. You can check out and learn more at difflib’s official documentation.