Cleaning up a column based on spelling? Pandas

Question

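Two columns in my dataframe hold very important user-entered information. They are mostly cleaned up except for one problem: spellings and the way names are written vary. For example, one name has five entries, including "red rocks canyon", "redrcks", "redrock canyon", and "red rocks canyons". The dataset is too large for me to clean by hand (2 million entries). Are there strategies for cleaning up these features with code?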

Answer 1

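I would consider phonetic string matching here. The basic idea is to compute a phonetic encoding for each input string and then group spelling variants by that encoding. You can then pick the most frequent variant in each group as the "correct" spelling.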

There are several phonetic encodings to choose from, and jellyfish is a great Python package for trying some of them. Here is an example of how to use it with the Soundex encoding:

import jellyfish
import pandas as pd

data = pd.DataFrame({
    "name": [
        "red rocks canyon",
        "redrcks",
        "redrock canyon",
        "red rocks canyons",
        "bosque",
        "bosque escoces",
        "bosque escocs",
        "borland",
        "borlange"
    ]
})
# Compute a Soundex code for every name.
data["soundex"] = data["name"].apply(jellyfish.soundex)
# List the spelling variants that share each code.
print(data.groupby("soundex").agg({"name": lambda x: ", ".join(x)}))

This prints:

                                                      name
soundex                                                   
B200                                                bosque
B222                         bosque escoces, bosque escocs
B645                                     borland, borlange
R362     red rocks canyon, redrcks, redrock canyon, red...
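
The example stops at grouping. A minimal sketch of the remaining step, picking the most frequent spelling in each group and writing it back, might look like this. The repeated rows are invented for illustration, `name_clean` is a hypothetical column name, and the codes are copied from the output above rather than recomputed:

```python
import pandas as pd

# Hypothetical sample: repeated rows stand in for how often each
# variant occurs in the real data. The codes are the Soundex values
# taken from the grouped output above.
data = pd.DataFrame({
    "name": [
        "red rocks canyon", "red rocks canyon", "red rocks canyon",
        "redrcks", "redrock canyon", "red rocks canyons",
        "borland", "borland", "borlange",
    ],
    "soundex": ["R362"] * 6 + ["B645"] * 3,
})

# For each code, take the spelling that occurs most often.
canonical = data.groupby("soundex")["name"].agg(
    lambda s: s.value_counts().idxmax()
)

# Map every row's code back to its group's most frequent spelling.
data["name_clean"] = data["soundex"].map(canonical)
print(data[["name", "name_clean"]].drop_duplicates())
```

Note that "borlange" ends up rewritten to "borland" here, so it is worth reviewing the groups before overwriting the original column.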

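This is definitely not perfect, and you will have to be careful because it can group too aggressively, but I hope it gives you something to try!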
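
One possible guard against over-grouping is to merge a variant only when it is also close to the chosen spelling as a plain string. The sketch below uses `difflib.SequenceMatcher` from the standard library, which is a different technique from phonetic encoding. The helper name `safe_to_merge` and the 0.85 threshold are assumptions that would need tuning on real data:

```python
from difflib import SequenceMatcher


def safe_to_merge(variant, canonical, threshold=0.85):
    """Return True if the two spellings are similar enough to merge.

    Hypothetical helper: the threshold is an arbitrary starting point.
    """
    ratio = SequenceMatcher(None, variant, canonical).ratio()
    return ratio >= threshold


# Pairs that share a Soundex code in the example above.
pairs = [
    ("red rocks canyons", "red rocks canyon"),
    ("redrock canyon", "red rocks canyon"),
    ("borlange", "borland"),
]
for variant, canonical in pairs:
    print(variant, "->", canonical, safe_to_merge(variant, canonical))
```

With this threshold the two canyon variants pass and "borlange" is kept apart from "borland". A heavily abbreviated entry such as "redrcks" would also be rejected, so a stricter threshold trades missed fixes for fewer wrong merges.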

python-3.x

pandas

sklearn-pandas

pandas-groupby