Of course! fuzzywuzzy is a fantastic and popular Python library for string matching. It uses the Levenshtein distance to calculate the difference between two sequences. In simple terms, it measures how many "edits" (insertions, deletions, or substitutions) are needed to change one string into another.

This is incredibly useful for tasks where you have messy, inconsistent text data.
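To make the "edits" idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance itself. This is a hypothetical helper for illustration, not part of fuzzywuzzy's API (the library relies on an optimized C implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Three edits turn "kitten" into "sitting" (k→s, e→i, insert g); fuzzywuzzy's scores are derived from this kind of count, normalized to a 0–100 scale.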
Installation
First, you need to install the library. It's highly recommended to also install python-Levenshtein, as it significantly speeds up fuzzywuzzy's matching calculations.
pip install fuzzywuzzy
pip install python-Levenshtein
Basic Usage and Core Functions
Let's start with the most common functions.
fuzz.ratio()
This is the simplest function. It returns a similarity score between 0 and 100, where 100 is an exact match.

from fuzzywuzzy import fuzz
# Perfect match
print(fuzz.ratio("New York Yankees", "New York Yankees")) # Output: 100
# Slight difference
print(fuzz.ratio("New York Yankees", "New York Yankee")) # Output: 96
# More significant difference
print(fuzz.ratio("New York Yankees", "Yankees New York")) # Output: 86
# Completely different
print(fuzz.ratio("New York Yankees", "Boston Red Sox")) # Output: 39
fuzz.partial_ratio()
This is more useful when you're looking for a substring. It finds the best matching substring and calculates the ratio based on that.
from fuzzywuzzy import fuzz
# "York" is an exact substring of "New York Yankees", so it matches perfectly
print(fuzz.partial_ratio("New York Yankees", "York")) # Output: 100
# The same applies to any other exact substring
print(fuzz.partial_ratio("New York Yankees", "New York")) # Output: 100
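Conceptually, partial_ratio slides the shorter string along the longer one and keeps the best window score. Here is a simplified sketch of that idea using the standard library's difflib (the real implementation uses matching blocks to pick candidate windows, so scores can differ slightly):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.partial_ratio: best ratio of the shorter string
    against every same-length window of the longer string."""
    shorter, longer = sorted((s1.lower(), s2.lower()), key=len)
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        score = SequenceMatcher(None, shorter, longer[start:start + window]).ratio()
        best = max(best, score)
    return round(best * 100)

print(partial_ratio_sketch("New York Yankees", "York"))  # → 100
```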
Process: Matching Against a List
The real power of fuzzywuzzy comes into play when you want to find the best match for a string from a list of possible choices. The process module is designed for this.
process.extract()
This function takes a query string and a list of choices. It returns a list of tuples, each containing a choice and its similarity score, sorted from best to worst.
from fuzzywuzzy import process
choices = ["New York Mets", "New York Yankees", "Boston Red Sox", "Atlanta Braves"]
# Find the best match for "Yankees"
query = "Yankees"
result = process.extract(query, choices)
print(result)
# Output: [('New York Yankees', 90), ('New York Mets', 78), ('Boston Red Sox', 27), ('Atlanta Braves', 27)]
# Get the top 2 matches
top_2 = process.extract(query, choices, limit=2)
print(top_2)
# Output: [('New York Yankees', 90), ('New York Mets', 78)]
process.extractOne()
If you only need the single best match, this function is more efficient. It returns a tuple of (best_match, score).

from fuzzywuzzy import process
choices = ["New York Mets", "New York Yankees", "Boston Red Sox", "Atlanta Braves"]
query = "Yankees"
best_match = process.extractOne(query, choices)
print(best_match)
# Output: ('New York Yankees', 90)
Advanced Features: Token Sorting
A major limitation of ratio and partial_ratio is that they are sensitive to word order. For example, "New York Yankees" and "Yankees of New York" have a low score.
To solve this, fuzzywuzzy offers token-based sorting.
fuzz.token_sort_ratio()
This function preprocesses the strings by splitting them into words (tokens), sorting the tokens alphabetically, and then re-joining them. It then calculates the ratio on these "normalized" strings.
from fuzzywuzzy import fuzz
# Without token sort, the score is low
print(fuzz.ratio("New York Yankees", "Yankees of New York")) # Output: 61
# With token sort, the score is much higher
print(fuzz.token_sort_ratio("New York Yankees", "Yankees of New York")) # Output: 90
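The token-sort preprocessing is easy to sketch in pure Python with the standard library's difflib; this is an illustrative approximation, not fuzzywuzzy's exact code (the library also strips punctuation before tokenizing):

```python
from difflib import SequenceMatcher

def token_sort_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.token_sort_ratio: lowercase, split into tokens,
    sort them alphabetically, re-join, then take a plain similarity ratio."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return round(SequenceMatcher(None, norm(s1), norm(s2)).ratio() * 100)

# Both strings normalize to "new york yankees", so word order no longer matters
print(token_sort_sketch("New York Yankees", "Yankees New York"))  # → 100
```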
fuzz.token_set_ratio()
This is even more powerful. It works by finding the common "tokens" (words) between the two strings and then calculating a ratio based on those common tokens. It's excellent for strings that have a lot of shared words but also different words.
Let's break it down:
- String 1: "new york mets vs new york yankees" → tokens {new, york, mets, vs, yankees}
- String 2: "new york yankees vs mets" → tokens {new, york, yankees, vs, mets}
Here the two token sets are identical (new, york, mets, vs, yankees), so token_set_ratio returns 100 even though the word order and duplicate counts differ. In general, the function compares the shared tokens against each full string and takes the best score, so extra, non-shared words hurt the score far less than they would with plain ratio.
from fuzzywuzzy import fuzz
# A common use case: comparing full sentences with extra words
s1 = "The New York Yankees baseball team"
s2 = "New York Yankees"
# Standard ratio is okay
print(fuzz.ratio(s1, s2)) # Output: 86
# Token sort is good
print(fuzz.token_sort_ratio(s1, s2)) # Output: 89
# Token set is excellent because it ignores the extra words
print(fuzz.token_set_ratio(s1, s2)) # Output: 100
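The token-set comparison can also be sketched in pure Python. This is a simplified approximation of the algorithm (the real implementation additionally strips punctuation and handles edge cases): build the sorted intersection of the token sets, append each string's leftover tokens to it, and keep the best of the three pairwise scores.

```python
from difflib import SequenceMatcher

def token_set_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.token_set_ratio: compare the shared tokens
    against each full token set and keep the best score."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    inter = " ".join(sorted(t1 & t2))                            # shared core
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()  # core + s1 extras
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()  # core + s2 extras
    ratio = lambda a, b: SequenceMatcher(None, a, b).ratio()
    return round(max(ratio(inter, combined1),
                     ratio(inter, combined2),
                     ratio(combined1, combined2)) * 100)

# s2's tokens are a subset of s1's, so one comparison is a perfect match
print(token_set_sketch("The New York Yankees baseball team", "New York Yankees"))  # → 100
```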
A Practical Example: Cleaning a Messy List
Imagine you have a list of customer feedback with inconsistent company names, and you want to standardize them.
from fuzzywuzzy import process, fuzz
# A messy list of company names from feedback
messy_feedback = [
"Google is great",
"I love google",
"GOOGLE products are awesome",
"I have a problem with Google",
"Apple makes good phones",
"I like Apple",
"APPLE service is bad",
"What about Amazon?",
"amazon.com is huge",
"AMAZON delivery was slow"
]
# Our list of "clean" standard company names
standard_companies = ["Google", "Apple", "Amazon"]
# Function to clean a single string
def clean_company_name(name, choices, score_cutoff=80):
    """
    Finds the best match for a name from a list of choices.
    Returns the match if the score is above the cutoff, otherwise returns the original.
    """
    best_match = process.extractOne(name, choices, scorer=fuzz.token_set_ratio)
    if best_match and best_match[1] >= score_cutoff:
        return best_match[0]
    return name
# Clean the entire list
cleaned_feedback = [clean_company_name(item, standard_companies) for item in messy_feedback]
print("--- Original List ---")
for item in messy_feedback:
    print(item)

print("\n--- Cleaned List ---")
for item in cleaned_feedback:
    print(item)
Output of the example:
--- Original List ---
Google is great
I love google
GOOGLE products are awesome
I have a problem with Google
Apple makes good phones
I like Apple
APPLE service is bad
What about Amazon?
amazon.com is huge
AMAZON delivery was slow
--- Cleaned List ---
Google
Google
Google
Google
Apple
Apple
Apple
Amazon
Amazon
Amazon
Important Considerations & Best Practices
- Performance: fuzzywuzzy can be slow on very large lists (tens of thousands of items or more). If performance is critical, consider alternatives like RapidFuzz (a drop-in replacement that is much faster) or dedicated search engines like Elasticsearch with their "fuzzy" query type.
- Scorer Choice: Don't just use fuzz.ratio. For most real-world data, fuzz.token_set_ratio or fuzz.token_sort_ratio are far more robust because they handle word order and extra words better.
- Score Cutoff: Always use a score_cutoff (e.g., 80 or 90) when matching against lists. This prevents the library from returning a poor "best match" when there is no good option. The right cutoff depends on your data and tolerance for errors.
- The Future of fuzzywuzzy: The original author of fuzzywuzzy is no longer actively maintaining it. The community has created RapidFuzz, a high-performance drop-in replacement. For new projects, it's highly recommended to use RapidFuzz instead:
pip install rapidfuzz
The API is almost identical, but it's significantly faster.
