Of course! fuzzywuzzy is a fantastic and popular Python library for string matching. It uses the Levenshtein distance to calculate the difference between two sequences. In simple terms, it measures how many "edits" (insertions, deletions, or substitutions) are needed to change one string into another.

This is incredibly useful for tasks where you have messy, inconsistent text data.
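To make the "edits" idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance itself. This is a hypothetical helper for illustration, not part of fuzzywuzzy's API (the library relies on an optimized C implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Three edits turn "kitten" into "sitting" (k→s, e→i, insert g); fuzzywuzzy's scores are derived from this kind of count, normalized to a 0–100 scale.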
Installation
First, you need to install the library. It's highly recommended to also install python-Levenshtein, as it significantly speeds up fuzzywuzzy's matching calculations.
pip install fuzzywuzzy
pip install python-Levenshtein
Basic Usage and Core Functions
Let's start with the most common functions.
fuzz.ratio()
This is the simplest function. It returns a similarity score between 0 and 100, where 100 is an exact match.

from fuzzywuzzy import fuzz
# Perfect match
print(fuzz.ratio("New York Yankees", "New York Yankees")) # Output: 100
# Slight difference
print(fuzz.ratio("New York Yankees", "New York Yankee")) # Output: 96
# More significant difference
print(fuzz.ratio("New York Yankees", "Yankees New York")) # Output: 86
# Completely different
print(fuzz.ratio("New York Yankees", "Boston Red Sox")) # Output: 39
fuzz.partial_ratio()
This is more useful when you're looking for a substring. It finds the best matching substring and calculates the ratio based on that.
from fuzzywuzzy import fuzz
# "York" is an exact substring of "New York Yankees", so it matches perfectly
print(fuzz.partial_ratio("New York Yankees", "York")) # Output: 100
# The same applies to any other exact substring
print(fuzz.partial_ratio("New York Yankees", "New York")) # Output: 100
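Conceptually, partial_ratio slides the shorter string along the longer one and keeps the best window score. Here is a simplified sketch of that idea using the standard library's difflib (the real implementation uses matching blocks to pick candidate windows, so scores can differ slightly):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.partial_ratio: best ratio of the shorter string
    against every same-length window of the longer string."""
    shorter, longer = sorted((s1.lower(), s2.lower()), key=len)
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        score = SequenceMatcher(None, shorter, longer[start:start + window]).ratio()
        best = max(best, score)
    return round(best * 100)

print(partial_ratio_sketch("New York Yankees", "York"))  # → 100
```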
Process: Matching Against a List
The real power of fuzzywuzzy comes into play when you want to find the best match for a string from a list of possible choices. The process module is designed for this.
process.extract()
This function takes a query string and a list of choices. It returns a list of tuples, each containing a choice and its similarity score, sorted from best to worst.
from fuzzywuzzy import process
choices = ["New York Mets", "New York Yankees", "Boston Red Sox", "Atlanta Braves"]
# Find the best match for "Yankees"
query = "Yankees"
result = process.extract(query, choices)
print(result)
# Output: [('New York Yankees', 90), ('New York Mets', 78), ('Boston Red Sox', 27), ('Atlanta Braves', 27)]
# Get the top 2 matches
top_2 = process.extract(query, choices, limit=2)
print(top_2)
# Output: [('New York Yankees', 90), ('New York Mets', 78)]
process.extractOne()
If you only need the single best match, this function is more efficient. It returns a tuple of (best_match, score).

from fuzzywuzzy import process
choices = ["New York Mets", "New York Yankees", "Boston Red Sox", "Atlanta Braves"]
query = "Yankees"
best_match = process.extractOne(query, choices)
print(best_match)
# Output: ('New York Yankees', 90)
Advanced Features: Token Sorting
A major limitation of ratio and partial_ratio is that they are sensitive to word order. For example, "New York Yankees" and "Yankees of New York" have a low score.
To solve this, fuzzywuzzy offers token-based sorting.
fuzz.token_sort_ratio()
This function preprocesses the strings by splitting them into words (tokens), sorting the tokens alphabetically, and then re-joining them. It then calculates the ratio on these "normalized" strings.
from fuzzywuzzy import fuzz
# Without token sort, the score is low
print(fuzz.ratio("New York Yankees", "Yankees of New York")) # Output: 61
# With token sort, the score is much higher
print(fuzz.token_sort_ratio("New York Yankees", "Yankees of New York")) # Output: 90
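The token-sort preprocessing is easy to sketch in pure Python with the standard library's difflib; this is an illustrative approximation, not fuzzywuzzy's exact code (the library also strips punctuation before tokenizing):

```python
from difflib import SequenceMatcher

def token_sort_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.token_sort_ratio: lowercase, split into tokens,
    sort them alphabetically, re-join, then take a plain similarity ratio."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return round(SequenceMatcher(None, norm(s1), norm(s2)).ratio() * 100)

# Both strings normalize to "new york yankees", so word order no longer matters
print(token_sort_sketch("New York Yankees", "Yankees New York"))  # → 100
```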
fuzz.token_set_ratio()
This is even more powerful. It works by finding the common "tokens" (words) between the two strings and then calculating a ratio based on those common tokens. It's excellent for strings that have a lot of shared words but also different words.
Let's break it down:
- String 1: "new york mets vs new york yankees" → tokens {new, york, mets, vs, yankees}
- String 2: "new york yankees vs mets" → tokens {new, york, yankees, vs, mets}
Here the two token sets are identical (new, york, mets, vs, yankees), so token_set_ratio returns 100 even though the word order and duplicate counts differ. In general, the function compares the shared tokens against each full string and takes the best score, so extra, non-shared words hurt the score far less than they would with plain ratio.
from fuzzywuzzy import fuzz
# A common use case: comparing full sentences with extra words
s1 = "The New York Yankees baseball team"
s2 = "New York Yankees"
# Standard ratio is okay
print(fuzz.ratio(s1, s2)) # Output: 86
# Token sort is good
print(fuzz.token_sort_ratio(s1, s2)) # Output: 89
# Token set is excellent because it ignores the extra words
print(fuzz.token_set_ratio(s1, s2)) # Output: 100
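The token-set comparison can also be sketched in pure Python. This is a simplified approximation of the algorithm (the real implementation additionally strips punctuation and handles edge cases): build the sorted intersection of the token sets, append each string's leftover tokens to it, and keep the best of the three pairwise scores.

```python
from difflib import SequenceMatcher

def token_set_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.token_set_ratio: compare the shared tokens
    against each full token set and keep the best score."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    inter = " ".join(sorted(t1 & t2))                            # shared core
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()  # core + s1 extras
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()  # core + s2 extras
    ratio = lambda a, b: SequenceMatcher(None, a, b).ratio()
    return round(max(ratio(inter, combined1),
                     ratio(inter, combined2),
                     ratio(combined1, combined2)) * 100)

# s2's tokens are a subset of s1's, so one comparison is a perfect match
print(token_set_sketch("The New York Yankees baseball team", "New York Yankees"))  # → 100
```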
A Practical Example: Cleaning a Messy List
Imagine you have a list of customer feedback with inconsistent company names, and you want to standardize them.
from fuzzywuzzy import process, fuzz
# A messy list of company names from feedback
messy_feedback = [
"Google is great",
"I love google",
"GOOGLE products are awesome",
"I have a problem with Google",
"Apple makes good phones",
"I like Apple",
"APPLE service is bad",
"What about Amazon?",
"amazon.com is huge",
"AMAZON delivery was slow"
]
# Our list of "clean" standard company names
standard_companies = ["Google", "Apple", "Amazon"]
# Function to clean a single string
def clean_company_name(name, choices, score_cutoff=80):
    """
    Finds the best match for a name from a list of choices.
    Returns the match if the score is above the cutoff, otherwise returns the original.
    """
    best_match = process.extractOne(name, choices, scorer=fuzz.token_set_ratio)
    if best_match and best_match[1] >= score_cutoff:
        return best_match[0]
    return name
# Clean the entire list
cleaned_feedback = [clean_company_name(item, standard_companies) for item in messy_feedback]
print("--- Original List ---")
for item in messy_feedback:
    print(item)

print("\n--- Cleaned List ---")
for item in cleaned_feedback:
    print(item)
Output of the example:
--- Original List ---
Google is great
I love google
GOOGLE products are awesome
I have a problem with Google
Apple makes good phones
I like Apple
APPLE service is bad
What about Amazon?
amazon.com is huge
AMAZON delivery was slow
--- Cleaned List ---
Google
Google
Google
Google
Apple
Apple
Apple
Amazon
Amazon
Amazon
Important Considerations & Best Practices
- Performance: fuzzywuzzy can be slow on very large lists (tens of thousands of items or more). If performance is critical, consider alternatives like RapidFuzz (a drop-in replacement that is much faster) or dedicated search engines like Elasticsearch with their "fuzzy" query type.
- Scorer Choice: Don't just use fuzz.ratio. For most real-world data, fuzz.token_set_ratio or fuzz.token_sort_ratio are far more robust because they handle word order and extra words better.
- Score Cutoff: Always use a score_cutoff (e.g., 80 or 90) when matching against lists. This prevents the library from returning a poor "best match" when there is no good option. The right cutoff depends on your data and tolerance for errors.
- The Future of fuzzywuzzy: The original author of fuzzywuzzy is no longer actively maintaining it. The community has created RapidFuzz, a high-performance drop-in replacement. For new projects, it's highly recommended to use RapidFuzz instead:
pip install rapidfuzz
The API is almost identical, but it's significantly faster.
