Keyword matching gives repeated words in pandas column?

Question

I have a pandas DataFrame with two columns:

ID           text_data                               

1         companies are mainly working on two 
          technologies that is ai and health care.
          Company need to improve on health care.

2         Current trend are mainly depends on block chain
          and IOT where IOT is
          highly used.

3         ............
.         ...........
.         ...........
.         so on.

Now I have another list, Techlist=["block chain","health care","ai","IOT"]

I need to match the list Techlist against the text_data column of the DataFrame, so I used the following code:

df['tech_match']=df['text_data'].apply(lambda x: [reduce(op.add, re.findall(act,x)) for act in Techlist if re.findall(act,x) != []] )

But the result is not what I expected:

ID         text_data                                           tech_match
1     companies are mainly working on two          [ai,healthcarehealthcare]             
      technologies that is ai and health care.
      Company need to improve on health care.

2     current trend are mainly                     [block chain,IOTIOT]
      depends on block chain and 
      IOT where IOT is highly used.

3    .................
.    ................             
.    ...............
.    so on.

The list and the text data are matched correctly, but the matched words are repeated in the tech_match column.

What I need is:

ID            text_data                             tech_match
1     companies are mainly working on two           [health care,ai]
      technologies that is ai and health care.
      Company need to improve on health care.

2     Current trend are mainly depends on          [block chain,IOT]
      blockchain and IOT where IOT is
      highly used. 

3     ..................
.     ..................
.     .................
.     so on.

How can I remove these repeated words from the tech_match column?

Answer 1

Use str.findall with a word boundary (\b) around each look-up word. Thanks to Anton vBR for the simpler pattern:

pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)
print (pat)
\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b

Create the new column:

df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: list(set(x)))

print (df)
                                           text_data         tech_match
0  companies are mainly working on two technologi...   [healthcare, ai]
1  Current trend are mainly depends on blockchain...  [blockchain, IOT]
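For reference, here is a self-contained sketch of the same idea applied to the multi-word keywords from the question. Two things in it are my own assumptions rather than part of the answer above: re.escape is added in case a keyword contains regex metacharacters, and dict.fromkeys replaces set so that the matches keep their first-seen order (a set gives no ordering guarantee). The helper name unique_in_order is made up for illustration.

```python
import re
import pandas as pd

# Sample rows adapted from the question; keywords may contain spaces.
df = pd.DataFrame({
    "text_data": [
        "companies are mainly working on two technologies that is ai and "
        "health care. Company need to improve on health care.",
        "Current trend are mainly depends on block chain and IOT where "
        "IOT is highly used.",
    ]
})
Techlist = ["block chain", "health care", "ai", "IOT"]

# Assumption: escape each keyword so regex metacharacters are matched literally.
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in Techlist)

def unique_in_order(matches):
    # Hypothetical helper: dict.fromkeys drops duplicates and keeps
    # first-seen order (dicts are ordered in Python 3.7+).
    return list(dict.fromkeys(matches))

df["tech_match"] = df["text_data"].str.findall(pat).apply(unique_in_order)
print(df["tech_match"].tolist())
# [['ai', 'health care'], ['block chain', 'IOT']]
```

The word boundaries are what stop "ai" from matching inside "mainly".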

You can also count the occurrences of each word with Counter; thanks again to Anton vBR for the suggestion:

from collections import Counter

df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: Counter(x))

print(df)

    text_data                                           tech_match
0   companies are mainly working on two technologi...   {'ai': 1, 'healthcare': 2}
1   Current trend are mainly depends on blockchain...   {'blockchain': 1, 'IOT': 2}

You can also join the counts back to the original DataFrame:

data = (df['text_data'].str.findall(pat).apply(lambda x: Counter(x))).tolist()
df = df.join(pd.DataFrame(data)).fillna(0) # join dfs
df['Total'] =df[Techlist].sum(axis=1) # create Total column

   text_data          IOT   ai  blockchain  healthcare  Total 
0  companies are ...  0.0  1.0         0.0        2.0    3.0
1  Current trend ...  2.0  0.0         1.0        0.0    3.0
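One thing to watch with the join: if a keyword never appears in any row, pd.DataFrame(data) has no column for it and df[Techlist] raises a KeyError. A minimal sketch of one way around that, assuming reindex over the keyword list is acceptable; the sample rows are shortened from the ones above, and the names count_df and out are made up for illustration.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "text_data": [
        "companies are mainly working on two technologies that is ai and "
        "healthcare. Company need to improve on healthcare.",
        # "blockchain" is deliberately absent from every row here
        "Current trend are mainly depends on IOT where IOT is highly used.",
    ]
})
Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = "|".join(r"\b{}\b".format(x) for x in Techlist)

counts = df["text_data"].str.findall(pat).apply(lambda x: Counter(x)).tolist()

# reindex adds a column for every keyword, including ones that never matched,
# so selecting out[Techlist] below cannot fail.
count_df = (
    pd.DataFrame(counts, index=df.index)
    .reindex(columns=Techlist)
    .fillna(0)
    .astype(int)
)
out = df.join(count_df)
out["Total"] = out[Techlist].sum(axis=1)
print(out[Techlist + ["Total"]])
```

Casting to int is optional; it only avoids the 0.0 / 2.0 float display that fillna leaves behind.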

Timings:

text_data = "companies are mainly working on two technologies that is ai and healthcare. Company need to improve on healthcare. Current trend are mainly depends on blockchain and IOT where IOT is highly used.".split()

np.random.seed(75)
#20000 random rows with all words from text_data
N = 20000
df = pd.DataFrame({'text_data': [np.random.choice(text_data, size=np.random.randint(3,10)) for x in range(N)]})
df['text_data'] = df['text_data'].str.join(' ')


Techlist=["blockchain","healthcare","ai","IOT"]
s = set(["blockchain", "healthcare", "ai", "IOT"])

#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [401]: %timeit df['matches'] = df.text_data.str.split(r'[^\w]+').apply(lambda x: list(s.intersection(x)))
10 loops, best of 3: 165 ms per loop

#jezrael's solution
In [402]: %timeit df['tech_match'] = df['text_data'].str.findall('|'.join([r"\b{word}\b".format(word=word) for word in Techlist])).apply(lambda x: list(set(x)))
10 loops, best of 3: 74.7 ms per loop

#Bharath's solution
In [403]: %timeit df['new'] = df['text_data'].apply(lambda x :  list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
1 loop, best of 3: 3.73 s per loop
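If you want to repeat the comparison outside IPython, the standard timeit module can stand in for %timeit. This is only a sketch under my own assumptions: a smaller frame (2000 rows), a slightly different word pool, and the hypothetical function names with_split and with_findall. The nltk variant is left out so nothing needs to be downloaded, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import numpy as np
import pandas as pd

words = ("companies are mainly working on two technologies that is ai and "
         "healthcare blockchain IOT").split()
rng = np.random.default_rng(75)
N = 2000  # smaller than the 20000 rows above, to keep the sketch quick
df = pd.DataFrame({
    "text_data": [" ".join(rng.choice(words, size=rng.integers(3, 10)))
                  for _ in range(N)]
})

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
s = set(Techlist)
pat = "|".join(r"\b{}\b".format(w) for w in Techlist)

def with_split():
    # split on non-word characters, then intersect with the keyword set
    return df["text_data"].str.split(r"[^\w]+").apply(
        lambda x: list(s.intersection(x)))

def with_findall():
    # regex alternation with word boundaries, then de-duplicate
    return df["text_data"].str.findall(pat).apply(lambda x: list(set(x)))

for name, func in [("split", with_split), ("findall", with_findall)]:
    # best of 3 repeats, 3 calls per repeat
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name}: {best * 1000:.1f} ms per loop")
```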

Answer 2

As an alternative to regex, we can use nltk.word_tokenize and then apply a set, i.e.

text_data = ["companies are mainly working on two data itegration technologies that is and healthcare. Company need to improve on healthcare.", "Current trend are mainly depends on blockchain and IOT where IOT is highly used."]

df = pd.DataFrame({'text_data':text_data})

Techlist=["blockchain","healthcare","ai","IOT"]
import nltk

df['new'] = df['text_data'].apply(lambda x :  list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))


                                      text_data                new
0  companies are mainly working on two data itegr...       [healthcare]
1  Current trend are mainly depends on blockchain...  [IOT, blockchain]

For a faster way to apply the same thing, you can look at

Answer 3

Use str.split and then call set.intersection:

s = set(["blockchain", "healthcare", "ai", "IOT"])

df['matches'] = df.text_data.str.split(r'[^\w]+')\
                   .apply(lambda x: list(s.intersection(x)))
df

                                           text_data            matches
0  companies are mainly working on two technologi...   [healthcare, ai]
1  Current trend are mainly depends on blockchain...  [IOT, blockchain]

Thanks for providing the setup data.
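A caveat on the split approach, as far as I can tell: it compares single tokens, so it works for one-word keywords like the ones used in this answer but cannot find a phrase with a space in it, such as "health care" from the original question. A small sketch with made-up sample rows showing that behaviour:

```python
import pandas as pd

df = pd.DataFrame({"text_data": [
    "companies are mainly working on two technologies that is ai and health care.",
    "Current trend are mainly depends on blockchain and IOT where IOT is highly used.",
]})

# "health care" contains a space, so it can never equal a single token.
s = {"blockchain", "health care", "ai", "IOT"}

tokens = df["text_data"].str.split(r"[^\w]+")
df["matches"] = tokens.apply(lambda x: sorted(s.intersection(x)))
print(df["matches"].tolist())
# [['ai'], ['IOT', 'blockchain']]
```

For multi-word keywords, a regex alternation over the whole string (as in the str.findall answer) is probably the easier route.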
