How to correct spelling in a Pandas DataFrame

The TextBlob library can correct the spelling of a string: first wrap the string in a TextBlob object, then call its correct method.

Example:

from textblob import TextBlob
data = TextBlob('Two raods diverrged in a yullow waod and surry I culd not travl bouth')
print(data.correct())
Two roads diverged in a yellow wood and sorry I could not travel both

Is it possible to do this for the strings in a column (Series) of a Pandas DataFrame, such as:

import pandas as pd

data = [{'one': '3', 'two': 'Two raods'},
        {'one': '7', 'two': 'diverrged in a yullow'},
        {'one': '8', 'two': 'waod and surry I'},
        {'one': '9', 'two': 'culd not travl bouth'}]
df = pd.DataFrame(data)
df

    one   two
0   3     Two raods
1   7     diverrged in a yullow
2   8     waod and surry I
3   9     culd not travl bouth

to return this:

    one   two
0   3     Two roads
1   7     diverged in a yellow
2   8     wood and sorry I
3   9     could not travel both

using TextBlob or some other method?

You can do this:

import textblob
df.two.apply(lambda txt: ''.join(textblob.TextBlob(txt).correct()))

using pandas.Series.apply.
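Note that apply returns a new Series, so the result has to be assigned back to get the corrected frame. Here is a minimal sketch of that step, assuming pandas and textblob are installed. The names `textblob_correct` and `correct_column` are made up for illustration and are not part of either library.

```python
import pandas as pd


def textblob_correct(text):
    """Correct one string with TextBlob.

    The import is done lazily so a different corrector can be passed in
    without textblob being installed.
    """
    from textblob import TextBlob
    return str(TextBlob(text).correct())


def correct_column(df, column, correct=textblob_correct):
    """Return a copy of df with `correct` applied to every value in `column`."""
    out = df.copy()
    out[column] = out[column].apply(correct)
    return out


data = [{'one': '3', 'two': 'Two raods'},
        {'one': '7', 'two': 'diverrged in a yullow'},
        {'one': '8', 'two': 'waod and surry I'},
        {'one': '9', 'two': 'culd not travl bouth'}]
df = pd.DataFrame(data)
```

Calling `correct_column(df, 'two')` should then give a frame whose column two holds the corrected strings, leaving `df` itself unchanged. Any function that maps a string to a string can be passed as `correct`.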

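I am still looking for a faster approach. However, I believe there is a Python library called autocorrect that can help with spelling correction. I timed both libraries (autocorrect and textblob) on some demo data, and these are the results I got.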

%%timeit
spell_correct_tb(['haave', 'naame'])
The slowest run took 4.36 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 505 µs per loop

%%timeit
spell_correct_autocorrect(['haave', 'naame'])
The slowest run took 4.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 303 µs per loop

This suggests that autocorrect is faster (or is my assumption wrong?). However, I am not sure how the two libraries compare in accuracy.
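The two helper functions timed above are not shown, so here is a rough sketch of how such a comparison might be set up outside a notebook. The factory functions are hypothetical stand-ins for `spell_correct_tb` and `spell_correct_autocorrect`, and the autocorrect one assumes a release of the library that provides the `Speller` class. `best_time` is an illustrative name built on the standard timeit module.

```python
import timeit


def make_textblob_corrector():
    """Hypothetical stand-in for spell_correct_tb: corrects each word with TextBlob."""
    from textblob import TextBlob
    return lambda words: [str(TextBlob(word).correct()) for word in words]


def make_autocorrect_corrector():
    """Hypothetical stand-in for spell_correct_autocorrect.

    Assumes an autocorrect release that provides the Speller class.
    """
    from autocorrect import Speller
    spell = Speller()  # built once, outside the timed call
    return lambda words: [spell(word) for word in words]


def best_time(correct_words, words, repeat=3, number=100):
    """Return the best observed time per call, in seconds."""
    totals = timeit.repeat(lambda: correct_words(words), repeat=repeat, number=number)
    return min(totals) / number
```

For example, `best_time(make_textblob_corrector(), sample)` and `best_time(make_autocorrect_corrector(), sample)` could be compared, where `sample` is a list of words. A two-word list is a very small benchmark, so a sample of a few hundred words taken from the real data would probably give a more reliable picture.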

Note: you can install autocorrect with pip by running pip install autocorrect
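Whichever library is used, one possible way to speed things up on a large column is to correct each distinct word only once and reuse the result. This is only a sketch of that caching idea, not something either library provides. `make_cached_corrector` is an illustrative name, and `correct_word` is whatever word-level function you choose to pass in.

```python
from functools import lru_cache


def make_cached_corrector(correct_word):
    """Wrap a word-level corrector so that each distinct word is corrected only once."""
    cached = lru_cache(maxsize=None)(correct_word)

    def correct_text(text):
        # Split on whitespace, correct word by word, and rejoin with single spaces.
        return ' '.join(cached(word) for word in text.split())

    # Expose the cache statistics so hits and misses can be inspected.
    correct_text.cache_info = cached.cache_info
    return correct_text
```

With TextBlob this might be used as `correct = make_cached_corrector(lambda w: str(Word(w).correct()))` after `from textblob import Word`, followed by `df['two'] = df['two'].apply(correct)`. One caveat: splitting on whitespace leaves punctuation attached to words and collapses repeated spaces, so the output can differ from calling correct on the whole string.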