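How to match whole words only, not substrings, using pandas str.extractall?
I am working with a column of strings in a data frame and trying to extract every word that matches any entry in a given word list. My code extracts the matching words but also matching substrings. How can I get only whole words? Many thanks!
My code: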
import pandas as pd
cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
df['findWord'] = df['Text'].str.extractall(f"({'|'.join(cl)})").groupby(level=0).agg(', '.join)
print(df)
Current output (how can I extract only the word dust, and not the substring inside 'industr'?):
   ID                  Text             findWord
0   1  dust industr yes inr  dust, dust, yes inr
1   2           state inner                inner
2   3                dustry                 dust
Expected output:
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN
Maybe something like this:
import pandas as pd
import numpy as np
cl =['dust', 'inner']
data = [[1, 'dust industry inner'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
df['findWord'] = [', '.join(set(d.split(' ')).intersection(set(cl))) for d in df['Text'].to_numpy()]
df = df.replace('', np.nan)  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
   ID                 Text     findWord
0   1  dust industry inner  dust, inner
1   2          state inner        inner
2   3               dustry          NaN
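Two caveats about the set-based approach above: a set has no guaranteed order, so the joined words may not come out in the order they appear in the text, and splitting on spaces cannot match a multi-word entry such as 'yes inr'. A minimal sketch that keeps the order of appearance for single-word entries follows; the helper name find_words is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

cl = ['dust', 'inner']
data = [[1, 'dust industry inner'], [2, 'state inner'], [3, 'dustry']]
df = pd.DataFrame(data, columns=['ID', 'Text'])

wanted = set(cl)  # set lookup only; the order comes from the text itself

def find_words(text):
    # Hypothetical helper: keep whole tokens that are in `wanted`,
    # in the order they appear in the text.
    hits = [tok for tok in text.split() if tok in wanted]
    return ', '.join(hits) if hits else np.nan

df['findWord'] = df['Text'].apply(find_words)
print(df)
```

This still only handles single-word entries; for phrases containing spaces, a regex with word boundaries is the more natural fit.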
Update 1:
Try using a regex pattern:
import pandas as pd
cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
regex = '({})'.format('|'.join(r'\b{}\b'.format(c) for c in cl))  # raw string: a plain '\b' is a backspace, not a word boundary
df['findWord'] = df['Text'].str.extractall(regex).groupby(level=0).agg(', '.join)
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN
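One detail worth checking in a pattern like this: in a regular Python string literal, '\b' is the backspace character, whereas the regex word boundary needs the two characters backslash and b, which is what the raw string r'\b' gives. A small sketch of the difference (the variable names are only for illustration):

```python
import re

plain = '\b'   # regular string literal: a single backspace character
raw = r'\b'    # raw string literal: backslash followed by 'b'

print(len(plain), len(raw))  # 1 2

# Without the raw prefix the pattern looks for literal backspaces
no_match = re.findall('\bdust\b', 'dust dustry')
print(no_match)  # []

# With the raw prefix, \b is a word boundary, so only the whole word matches
match = re.findall(r'\bdust\b', 'dust dustry')
print(match)  # ['dust']
```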
Fix your regex pattern by adding word boundaries \b so that it matches only whole words, then use str.findall to find all occurrences of the pattern:
df['findWord'] = df['Text'].str.findall(r'\b(%s)\b' % '|'.join(cl)).str.join(', ')
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry
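Note that with str.findall a row without matches ends up as an empty string rather than NaN. A possible variant, sketched under the assumption that the entries in cl should be treated as literal text: it escapes each entry with re.escape, uses a non-capturing group, and turns empty results into NaN to match the expected output in the question.

```python
import re

import numpy as np
import pandas as pd

cl = ['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'], [3, 'dustry']]
df = pd.DataFrame(data, columns=['ID', 'Text'])

# Escape each entry so regex metacharacters are matched literally;
# (?:...) is non-capturing, so findall returns the full matched text.
pattern = r'\b(?:%s)\b' % '|'.join(re.escape(c) for c in cl)

df['findWord'] = (
    df['Text']
    .str.findall(pattern)   # list of whole-word matches per row
    .str.join(', ')         # empty list becomes ''
    .replace('', np.nan)    # report "no match" as NaN
)
print(df)
```

If one entry is a prefix of another (for example 'yes' and 'yes inr'), listing the longer entry first should make the alternation prefer the longer match.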