How to extract all string matches from a column using an input corpus/list in pandas?

Question

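For example, I have the following list of strings as the input corpus (in practice it is a large list of 100 values):

action = ['jump','fly','run','swim']

The data contains a column named action_description. How can I extract all string matches in action_description, using the action list as the input corpus?

Note: I have already performed lemmatization on action_description, so if the column contains words like jumping or jumped, they have already been converted to jump.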

Sample input and output

"I love to run and while my friend prefer to swim" --> "run swim"
"Allan excels at high jump but he is not a good at running" --> "jump run"

Note: I found the pandas function below, but it is not well documented, so I could not figure out how to use it.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html

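Please suggest an optimal solution, since the input DataFrame has 200,000 rows.

Edit: Words like jumper and runway should be ignored by the algorithm, i.e. they should not be classified as jump and run.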

Answer 1

action=['jump','fly','run','swim']


str1="I    love to run and while my friend prefer to swim" ##--> "run swim"
str2="Allan excels at high jump but he is not a good at running" ##--> "jump run"

matches = []
for word in str1.split():
    if word in action:
        # The word is exactly one of the actions.
        matches.append(word)
    else:
        # Otherwise fall back to a substring match, e.g. "running" contains "run".
        # Note: this also matches words such as "runway".
        for act in action:
            if act in word:
                matches.append(act)
                break
actionDtl = " ".join(matches)
print(actionDtl)
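
Since the question's edit asks for words like runway to be ignored, a whole-word variant of the loop above may be closer to what is wanted. The following is only a sketch: the helper name extract_actions is hypothetical, and it assumes the text is already lemmatized and that words are separated by whitespace with no attached punctuation.

```python
def extract_actions(text, actions):
    """Return the whole-word matches from `actions` found in `text`,
    joined by spaces, in order of first appearance."""
    lookup = set(actions)      # set membership is O(1) per word
    seen = set()               # used to drop repeated matches
    matches = []
    for word in text.split():
        if word in lookup and word not in seen:
            seen.add(word)
            matches.append(word)
    return " ".join(matches)


action = ['jump', 'fly', 'run', 'swim']
print(extract_actions("I love to run and while my friend prefer to swim", action))
print(extract_actions("The runway was wet hence the jumper flew over it", action))
```

On a DataFrame this could be applied with something like df['action_description'].apply(lambda s: extract_actions(s, action)).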

Answer 2

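Steps:

We lemmatize only the verbs by passing pos='v', leaving nouns as they are, by iterating over each word in the list produced by the str.split operation.
Then, use set to get the words that occur in both the lookup list and the lemmatized list.
Finally, join them to return a string as output.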

import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer

action = ['jump','fly','run','swim']     # lookup list
lem = WordNetLemmatizer() 
fcn = lambda x: " ".join(set([lem.lemmatize(w, 'v') for w in x]).intersection(set(action)))
df['action_description'] = df['action_description'].str.split().apply(fcn)
df

Starting DataFrame used:

df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim", 
                                           "Allan excels at high jump but he is not a good at running"]))

To generate binary flags (0/1), we can use the str.get_dummies method, which splits the strings on whitespace and computes indicator variables, as shown:

bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
pd.concat([df['action_description'], bin_flag], axis=1)
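
One thing to be aware of is that str.get_dummies only creates columns for words that actually occur in the data. If a flag column is wanted for every entry in the lookup list, a reindex over the list may help. This is a sketch under that assumption; the sample Series below stands in for the already-extracted action_description column.

```python
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']     # lookup list

# Stand-in for the extracted column (space-separated matches per row)
extracted = pd.Series(["run swim", "jump run"], name='action_description')

bin_flag = (
    extracted.str.get_dummies(sep=' ')
    .reindex(columns=action, fill_value=0)  # add all-zero columns for unseen actions
    .add_suffix('_flag')
)
result = pd.concat([extracted, bin_flag], axis=1)
print(result)
```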

Answer 3

This is really a regex problem: use re.findall to match the strings and operator.add to combine the matches.

import pandas as pd
import re
import operator as op
from functools import reduce  # reduce is not a builtin in Python 3


action=['jump','fly','run','swim']

str1="I    love to run and while my friend prefer to swim" ##--> "run swim"
str2="Allan excels at high jump but he is not a good at running" ##--> "jump run"



df=pd.DataFrame({'A':[1,2,3,4],
                  'B':['I    love to run and while my friend prefer to swim',
                  'Allan excels at high jump but he is not a good at running',
                  'Ostrich can run very fast but cannot fly',
                  'The runway was wet hence the Jumper flew over it'] })


# substring match: "running" and "runway" both count as "run"
df['ApproxMatch']=df['B'].apply(lambda x: [reduce(op.add, re.findall(act,x)) for act in action if re.findall(act,x)] )

#using r'\b'+act+r'\b' to match each action exactly, where \b stands for a word boundary

df['ExactMatch']=df['B'].apply(lambda x: [reduce(op.add, re.findall(r"\b"+act+r"\b",x)) for act in action if re.findall(r"\b"+act+r"\b",x)] )

Output:

df

#   A                                                  B  ApproxMatch  \
#0  1  I    love to run and while my friend prefer to...  [run, swim]   
#1  2  Allan excels at high jump but he is not a good...  [jump, run]   
#2  3           Ostrich can run very fast but cannot fly   [fly, run]   
#3  4   The runway was wet hence the Jumper flew over it        [run]   
#
#    ExactMatch  
#0  [run, swim]  
#1       [jump]  
#2   [fly, run]  
#3           []

Note that for the exact match in row 2, "running" is not matched with "run".
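
With 200,000 rows and around 100 lookup words, running one regex per word per row may become slow. A possible alternative, sketched here as an assumption rather than a benchmarked recommendation, is to build a single alternation pattern and let Series.str.findall do the matching. Note that the matches then come back in the order they appear in the text, not in the order of the action list.

```python
import re
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']

# One pattern for the whole list: \b(?:jump|fly|run|swim)\b
pattern = r"\b(?:" + "|".join(re.escape(a) for a in action) + r")\b"

df = pd.DataFrame({'B': ['I    love to run and while my friend prefer to swim',
                         'Allan excels at high jump but he is not a good at running',
                         'Ostrich can run very fast but cannot fly',
                         'The runway was wet hence the Jumper flew over it']})

# str.findall returns a list of matches per row; join each list into one string
df['ExactMatch'] = df['B'].str.findall(pattern).apply(" ".join)
print(df['ExactMatch'].tolist())
```

The word boundaries keep "runway" and "running" from matching "run", and the match is case-sensitive, so "Jumper" is ignored as well.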

python

regex

text-mining

nltk

pandas