How to extract all string matches from a column using an input corpus/list in pandas?
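For example, I have the following list of strings as the input corpus (in reality it is a large list of 100 values).
action=['jump','fly','run','swim']
The data contains a column named action_description. How can I extract all string matches in action_description using the action list as the input corpus?
Note: I have already lemmatized action_description, so if the column had words like jumping or jumped, they have already been converted to jump.
Sample input and output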
"I love to run and while my friend prefer to swim" --> "run swim"
"Allan excels at high jump but he is not a good at running" --> "jump run"
Note: I found the pandas function below, but it is not well documented, so I could not figure out how to use it.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html
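Please suggest an optimal solution, since the input DataFrame has 200,000 rows.
Edit
Words like jumper & runway should be ignored by the algorithm, i.e. they should not be classified as jump & run.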
action = ['jump', 'fly', 'run', 'swim']
str1 = "I love to run and while my friend prefer to swim"  # --> "run swim"
str2 = "Allan excels at high jump but he is not a good at running"  # --> "jump run"
actionDtl = ""
for word in str1.split():
    if word in action:
        # exact match: append the word itself
        if actionDtl != "":
            actionDtl = actionDtl + " " + word
        else:
            actionDtl = actionDtl + word
    else:
        # otherwise fall back to a substring match against each action
        for act in action:
            if word.find(act) >= 0:
                if actionDtl != "":
                    actionDtl = actionDtl + " " + act
                else:
                    actionDtl = actionDtl + act
                break
print(actionDtl)
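Steps:
- Lemmatize only the verbs by passing pos='v', iterating over each word in the list produced by str.split and leaving nouns as they are.
- Then use set to take the intersection of the lookup list and the lemmatized list, which gives all matching words.
- Finally, join them into a string to return as the output.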
from nltk.stem.wordnet import WordNetLemmatizer
action = ['jump','fly','run','swim'] # lookup list
lem = WordNetLemmatizer()
fcn = lambda x: " ".join(set([lem.lemmatize(w, 'v') for w in x]).intersection(set(action)))
df['action_description'] = df['action_description'].str.split().apply(fcn)
df
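One caveat with the set intersection is that a set carries no ordering guarantee, so "run swim" may come back as "swim run". Below is a minimal sketch that keeps the order of first appearance. It assumes the text is already lemmatized, as the question states, so the second sample sentence is written with "run" instead of "running". The names `action_set`, `ordered_matches` and the `matches` column are made up for illustration.

```python
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']  # lookup list from the question
action_set = set(action)                 # set membership test is cheap per word

def ordered_matches(text):
    """Return lookup words found in text, in order of first appearance."""
    seen = []
    for word in text.split():
        # whole-token comparison, so "runway" and "jumper" are ignored
        if word in action_set and word not in seen:
            seen.append(word)
    return " ".join(seen)

df = pd.DataFrame({'action_description': [
    "I love to run and while my friend prefer to swim",
    "Allan excels at high jump but he is not a good at run",  # assumed already lemmatized
]})
df['matches'] = df['action_description'].apply(ordered_matches)
print(df['matches'].tolist())  # ['run swim', 'jump run']
```

Because tokens are compared whole, this also satisfies the requirement from the edit that jumper and runway are not counted.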
Starting DataFrame used:
import pandas as pd
df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim",
                                           "Allan excels at high jump but he is not a good at running"]))
To generate binary flags (0/1), we can use the str.get_dummies method, which splits each string on whitespace and computes its indicator variables, as shown:
bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
pd.concat([df['action_description'], bin_flag], axis=1)
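To make the shape of the result concrete, here is a small self-contained sketch of the same two calls. The `matches` Series is hypothetical sample data standing in for the already-extracted action_description column.

```python
import pandas as pd

# Hypothetical already-extracted matches, one space-separated string per row
matches = pd.Series(["run swim", "jump run", "fly"], name="action_description")

# One indicator column per distinct token, sorted alphabetically, suffixed with _flag
flags = matches.str.get_dummies(sep=" ").add_suffix("_flag")
result = pd.concat([matches, flags], axis=1)
print(result)
```

Each row gets a 1 in the columns for the words it contains and a 0 elsewhere, so "run swim" sets only run_flag and swim_flag.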
This is really a regular-expression problem: use re.findall to match the strings and operator.add to combine the matches.
import pandas as pd
import re
import operator as op
from functools import reduce  # reduce is no longer a builtin in Python 3
action = ['jump', 'fly', 'run', 'swim']
str1 = "I love to run and while my friend prefer to swim"  # --> "run swim"
str2 = "Allan excels at high jump but he is not a good at running"  # --> "jump run"
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['I love to run and while my friend prefer to swim',
                         'Allan excels at high jump but he is not a good at running',
                         'Ostrich can run very fast but cannot fly',
                         'The runway was wet hence the Jumper flew over it']})
# substring match: "running" and "runway" both count as "run"
df['ApproxMatch'] = df['B'].apply(lambda x: [reduce(op.add, re.findall(act, x)) for act in action if re.findall(act, x) != []])
# using r'\b'+act+r'\b' to match each action exactly, where \b stands for a word boundary
df['ExactMatch'] = df['B'].apply(lambda x: [reduce(op.add, re.findall(r"\b" + act + r"\b", x)) for act in action if re.findall(r"\b" + act + r"\b", x) != []])
Output:
df
# A B ApproxMatch \
#0 1 I love to run and while my friend prefer to... [run, swim]
#1 2 Allan excels at high jump but he is not a good... [jump, run]
#2 3 Ostrich can run very fast but cannot fly [fly, run]
#3 4 The runway was wet hence the Jumper flew over it [run]
#
# ExactMatch
#0 [run, swim]
#1 [jump]
#2 [fly, run]
#3 []
Note that for the exact match on the second row, "running" does not match "run".
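Since the question mentions 200,000 rows, it may be worth avoiding the per-row Python loop over every action. A possible sketch, not benchmarked here, builds one alternation pattern and lets `Series.str.findall` do the matching. The `pattern` variable is introduced for illustration, and `re.escape` is only a precaution in case a lookup word contains regex metacharacters.

```python
import re
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']

# Single pattern: \b(?:jump|fly|run|swim)\b, so only whole words match
pattern = r"\b(?:" + "|".join(map(re.escape, action)) + r")\b"

df = pd.DataFrame({'B': [
    'I love to run and while my friend prefer to swim',
    'Allan excels at high jump but he is not a good at running',
    'Ostrich can run very fast but cannot fly',
    'The runway was wet hence the Jumper flew over it',
]})

# findall returns a list of matches per row; join turns it into one string
df['ExactMatch'] = df['B'].str.findall(pattern).str.join(' ')
print(df['ExactMatch'].tolist())  # ['run swim', 'jump', 'run fly', '']
```

Unlike the list comprehension above, matches come back in the order they appear in the text rather than in the order of the action list, and a word that occurs twice is listed twice.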