Keyword matching gives repeated words in a pandas column?
I have a pandas DataFrame with two columns:
ID text_data
1 companies are mainly working on two
technologies that is ai and health care.
Company need to improve on health care.
2 Current trend are mainly depends on block chain
and IOT where IOT is
highly used.
3 ............
. ...........
. ...........
. so on.
Now I have another list, Techlist=["block chain","health care","ai","IOT"].
I need to match the list Techlist against the text_data column of the DataFrame, so I used the following code:
df['tech_match']=df['text_data'].apply(lambda x: [reduce(op.add, re.findall(act,x)) for act in Techlist if re.findall(act,x) <> []] )
But the result is not what I expected:
ID text_data tech_match
1 companies are mainly working on two [ai,healthcarehealthcare]
technologies that is ai and health care.
Company need to improve on health care.
2 current trend are mainly [block chain,IOTIOT]
depends on block chain and
IOT where IOT is highly used.
3 .................
. ................
. ...............
. so on.
The list and the text data are matched correctly, but the matched words are repeated in the tech_match column.
What I need is:
ID text_data tech_match
1 companies are mainly working on two [health care,ai]
technologies that is ai and health care.
Company need to improve on health care.
2 Current trend are mainly depends on [block chain,IOT]
blockchain and IOT where IOT is
highly used.
3 ..................
. ..................
. .................
. so on.
How can I remove these repeated words from the tech_match column?
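As a short diagnosis of the repeated words: re.findall returns every occurrence of a keyword, and reduce(op.add, ...) then concatenates those strings. The sketch below reproduces this on one sample sentence from the question, using plain Python 3 (the variable names hits and joined are only illustrative). Note also that the <> operator in the question's code exists only in Python 2; Python 3 uses != instead.

```python
import operator as op
import re
from functools import reduce

text = "Current trend are mainly depends on block chain and IOT where IOT is highly used."

# re.findall returns one list entry per occurrence of the keyword.
hits = re.findall("IOT", text)

# reduce(op.add, ...) adds the strings together, i.e. concatenates them.
joined = reduce(op.add, hits)

print(hits)    # ['IOT', 'IOT']
print(joined)  # IOTIOT
```

So the fix is to deduplicate the matches rather than add them together.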
Use str.findall with word boundaries (\b) around the look-up words; thanks for the suggestion of a simpler pattern:
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)
print (pat)
\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b
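If the keywords contain spaces, as in the question's original list ("block chain", "health care"), the same pattern still works. A minimal sketch, assuming you also want to guard against regex metacharacters in a keyword by passing each one through re.escape (that step is an addition here, not part of the answer above):

```python
import re

# Keywords as written in the question, including the two-word phrases.
Techlist = ["block chain", "health care", "ai", "IOT"]

# re.escape keeps any special characters in a keyword literal;
# \b on both sides prevents matching "ai" inside "mainly" or "chain".
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in Techlist)

text = "Current trend are mainly depends on block chain and IOT where IOT is highly used."
matches = re.findall(pat, text)
print(matches)  # ['block chain', 'IOT', 'IOT']
```

The repeated 'IOT' is still present at this point; deduplication is a separate step.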
Create the new column:
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: list(set(x)))
print (df)
text_data tech_match
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [blockchain, IOT]
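One thing to be aware of: list(set(x)) does not guarantee any particular order of the result. If the order of first appearance matters, a possible variant is dict.fromkeys, which keeps insertion order in Python 3.7+. This is only a sketch; unique_matches is a hypothetical helper name, and it could be passed to df['text_data'].apply in place of the findall/set chain.

```python
import re

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = "|".join(r"\b{}\b".format(x) for x in Techlist)

def unique_matches(text):
    """Return matched keywords without repeats, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(re.findall(pat, text)))

row = ("companies are mainly working on two technologies that is ai and healthcare. "
       "Company need to improve on healthcare.")
print(unique_matches(row))  # ['ai', 'healthcare']
```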
You can also use Counter to count how often each word occurs; thanks again to Anton vBR for the suggestion:
from collections import Counter
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: Counter(x))
print(df)
text_data tech_match
0 companies are mainly working on two technologi... {'ai': 1, 'healthcare': 2}
1 Current trend are mainly depends on blockchain... {'blockchain': 1, 'IOT': 2}
Alternatively, you can join the counts back onto the original DataFrame as separate columns:
data = (df['text_data'].str.findall(pat).apply(lambda x: Counter(x))).tolist()
df = df.join(pd.DataFrame(data)).fillna(0) # join dfs
df['Total'] =df[Techlist].sum(axis=1) # create Total column
text_data IOT ai blockchain healthcare Total
0 companies are ... 0.0 2.0 0.0 2.0 4.0
1 Current trend ... 2.0 0.0 1.0 0.0 3.0
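The counts come out as floats because fillna(0) fills the NaN values created by the join. A rough alternative, sketched here in plain Python, builds one complete integer row per text so that no NaN appears in the first place; count_row is a hypothetical helper, and a list of its results could be handed to pd.DataFrame.

```python
import re
from collections import Counter

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = "|".join(r"\b{}\b".format(x) for x in Techlist)

def count_row(text):
    """Count every keyword in text, with 0 for keywords that do not occur."""
    counts = Counter(re.findall(pat, text))
    # A Counter returns 0 for missing keys, so every keyword gets a column.
    row = {word: counts[word] for word in Techlist}
    row["Total"] = sum(row.values())
    return row

text = "Current trend are mainly depends on blockchain and IOT where IOT is highly used."
print(count_row(text))
# {'blockchain': 1, 'healthcare': 0, 'ai': 0, 'IOT': 2, 'Total': 3}
```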
Timings:
text_data = "companies are mainly working on two technologies that is ai and healthcare. Company need to improve on healthcare. Current trend are mainly depends on blockchain and IOT where IOT is highly used.".split()
np.random.seed(75)
#20000 random rows with all words from text_data
N = 20000
df = pd.DataFrame({'text_data': [np.random.choice(text_data, size=np.random.randint(3,10)) for x in range(N)]})
df['text_data'] = df['text_data'].str.join(' ')
Techlist=["blockchain","healthcare","ai","IOT"]
s = set(["blockchain", "healthcare", "ai", "IOT"])
#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [401]: %timeit df['matches'] = df.text_data.str.split(r'[^\w]+').apply(lambda x: list(s.intersection(x)))
10 loops, best of 3: 165 ms per loop
#jezrael's solution
In [402]: %timeit df['tech_match'] = df['text_data'].str.findall('|'.join([r"\b{word}\b".format(word=word) for word in Techlist])).apply(lambda x: list(set(x)))
10 loops, best of 3: 74.7 ms per loop
#Bharath's solution
In [403]: %timeit df['new'] = df['text_data'].apply(lambda x : list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
1 loop, best of 3: 3.73 s per loop
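The numbers above come from an IPython session and will differ between machines. For a rough comparison outside IPython, the standard timeit module can be used on the per-row work alone. The sketch below is an assumption-laden simplification: it times a single sentence rather than a 20000-row column, precompiles both patterns, and uses the hypothetical helper names by_findall and by_split.

```python
import re
import timeit

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
keywords = set(Techlist)

# Compile once so the timing measures matching, not pattern construction.
pattern = re.compile("|".join(r"\b{}\b".format(w) for w in Techlist))
splitter = re.compile(r"[^\w]+")

text = "Current trend are mainly depends on blockchain and IOT where IOT is highly used."

def by_findall(t):
    # Regex approach: find all keyword hits, then deduplicate.
    return set(pattern.findall(t))

def by_split(t):
    # Tokenize on non-word characters, then intersect with the keyword set.
    return keywords.intersection(splitter.split(t))

t_findall = timeit.timeit(lambda: by_findall(text), number=2000)
t_split = timeit.timeit(lambda: by_split(text), number=2000)
print("findall: {:.4f}s  split: {:.4f}s".format(t_findall, t_split))
```

Both functions return the same set for this sentence, so the comparison is like for like.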
As an alternative to regex, we can use nltk.word_tokenize and then apply set, i.e.
text_data = ["companies are mainly working on two data itegration technologies that is and healthcare. Company need to improve on healthcare.", "Current trend are mainly depends on blockchain and IOT where IOT is highly used."]
df = pd.DataFrame({'text_data':text_data})
Techlist=["blockchain","healthcare","ai","IOT"]
import nltk
df['new'] = df['text_data'].apply(lambda x : list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
text_data new
0 companies are mainly working on two data itegr... [healthcare]
1 Current trend are mainly depends on blockchain... [IOT, blockchain]
For a faster way to do the same thing, consider using str.split and then calling set.intersection:
s = set(["blockchain", "healthcare", "ai", "IOT"])
df['matches'] = df.text_data.str.split(r'[^\w]+')\
.apply(lambda x: list(s.intersection(x)))
df
text_data matches
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [IOT, blockchain]
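One caveat with the split approach: it compares single tokens, so it only finds single-word keywords. A two-word phrase such as "block chain" from the question's original list is split into two tokens and never matches. A small illustration in plain Python (re.split is used here in place of the pandas str.split call):

```python
import re

# One two-word phrase and one single-word keyword.
keywords = {"block chain", "IOT"}

text = "Current trend are mainly depends on block chain and IOT where IOT is highly used."

# Splitting on non-word characters yields 'block' and 'chain' as separate tokens.
tokens = re.split(r"[^\w]+", text)
found = keywords.intersection(tokens)
print(found)  # {'IOT'}
```

For multi-word keywords, the word-boundary regex is therefore the safer choice.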
Thanks for providing the setup data.