How to correct spelling in a Pandas DataFrame
The TextBlob library can improve the spelling of strings: first wrap them in a TextBlob object, then call its correct method.
Example:
from textblob import TextBlob
data = TextBlob('Two raods diverrged in a yullow waod and surry I culd not travl bouth')
print(data.correct())
Two roads diverged in a yellow wood and sorry I could not travel both
Is it possible to do this for the strings in a Pandas DataFrame series, for example:
import pandas as pd
data = [{'one': '3', 'two': 'Two raods'},
        {'one': '7', 'two': 'diverrged in a yullow'},
        {'one': '8', 'two': 'waod and surry I'},
        {'one': '9', 'two': 'culd not travl bouth'}]
df = pd.DataFrame(data)
df
   one                    two
0    3              Two raods
1    7  diverrged in a yullow
2    8       waod and surry I
3    9   culd not travl bouth
so that it returns this:
   one                   two
0    3             Two roads
1    7  diverged in a yellow
2    8      wood and sorry I
3    9  could not travel both
using TextBlob or some other method?
You can do this:
# TextBlob is imported as in the question: from textblob import TextBlob
df['two'].apply(lambda txt: str(TextBlob(txt).correct()))
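Since the correction step is the slow part, one possible way to reduce the work is to correct each distinct string only once and map the results back onto the column. The following is a minimal sketch of that idea, not something from the original answer: the names `correct_column` and `corrector` are hypothetical, and it only helps when the column contains repeated values.

```python
import pandas as pd


def correct_column(series, corrector):
    """Apply `corrector` once per distinct string, then map results back.

    `corrector` is any callable that takes a string and returns a string
    (hypothetical helper; it is an assumption, not part of TextBlob).
    """
    # Missing values are skipped so the corrector only ever sees strings.
    unique_values = series.dropna().unique()
    corrections = {value: corrector(value) for value in unique_values}
    # Values absent from the mapping (the missing ones) come back as NaN.
    return series.map(corrections)
```

With TextBlob installed, this could be used as `df['two'] = correct_column(df['two'], lambda txt: str(TextBlob(txt).correct()))`.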
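I am still looking for a faster approach. However, I believe there is a Python library called autocorrect that can help with spelling correction. I timed both libraries (autocorrect and TextBlob) on demo data, and these are the results I got.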
%%timeit
spell_correct_tb(['haave', 'naame'])
The slowest run took 4.36 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 505 µs per loop
%%timeit
spell_correct_autocorrect(['haave', 'naame'])
The slowest run took 4.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 303 µs per loop
This suggests that autocorrect runs faster (or is my assumption wrong?). However, I am not sure how the two libraries compare on accuracy.
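The answer does not show how spell_correct_tb and spell_correct_autocorrect were defined, so the sketch below is only a guess at what such wrappers might look like, together with a small timing helper built on the standard timeit module. It assumes a version of autocorrect that exposes the `Speller` class; the helper names `_get_speller` and `best_time_per_call` are hypothetical.

```python
import timeit
from functools import lru_cache


@lru_cache(maxsize=None)
def _get_speller():
    # Assumption: autocorrect provides Speller (recent releases do).
    from autocorrect import Speller
    # Built once and cached so dictionary loading is not timed on every call.
    return Speller(lang="en")


def spell_correct_tb(words):
    # Guess at the TextBlob wrapper: correct each word independently.
    from textblob import TextBlob
    return [str(TextBlob(word).correct()) for word in words]


def spell_correct_autocorrect(words):
    # Guess at the autocorrect wrapper, reusing the cached speller.
    spell = _get_speller()
    return [spell(word) for word in words]


def best_time_per_call(func, words, number=100, repeat=3):
    """Return the best average time in seconds for one func(words) call."""
    timings = timeit.repeat(lambda: func(words), number=number, repeat=repeat)
    return min(timings) / number
```

Calling `best_time_per_call(spell_correct_tb, ['haave', 'naame'])` and the same for `spell_correct_autocorrect` gives numbers comparable in spirit to the %%timeit output above. A two-word sample measures speed only and says nothing about accuracy.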
Note: you can install autocorrect with pip by running the command pip install autocorrect