Find occurrences of strings from a list in a dataset

Question

I have a list:

top = ['GME', 'MVIS', 'TSLA', 'AMC']

I have a dataset:

dt | text
2021-03-19 20:59:49+06 | I only need GME to hit 20 eod to make up
2021-03-19 20:59:51+06 | lads why is my account covered in more red
2021-05-21 15:54:27+06 | Oh my god, we might have 2 green days in a row
2021-05-21 15:56:06+06 | Why are people so hype about a 4% TSLA move

I want to get every row of the dataset whose text contains any of the words in the list. My output needs to look like this:

dt | text
2021-03-19 20:59:49+06 | I only need GME to hit 20 eod to make up
2021-05-21 15:56:06+06 | Why are people so hype about a 4% TSLA move

Any help is appreciated.

Answer 1

You can slice the DataFrame by searching each row's text for each tag. A custom function solves this:

df[df['text'].map(lambda txt: any(tag in txt for tag in top))]
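For anyone who wants to run the one-liner above end to end, here is a minimal sketch. It assumes the question's data is loaded into a DataFrame with `dt` and `text` columns; the helper name `mentions_any` is made up for illustration and is not part of pandas.

```python
import pandas as pd

top = ['GME', 'MVIS', 'TSLA', 'AMC']

# Sample rows copied from the question; the column names are an assumption.
df = pd.DataFrame({
    "dt": [
        "2021-03-19 20:59:49+06",
        "2021-03-19 20:59:51+06",
        "2021-05-21 15:54:27+06",
        "2021-05-21 15:56:06+06",
    ],
    "text": [
        "I only need GME to hit 20 eod to make up",
        "lads why is my account covered in more red",
        "Oh my god, we might have 2 green days in a row",
        "Why are people so hype about a 4% TSLA move",
    ],
})

def mentions_any(txt, tags=top):
    """Return True if any tag occurs as a substring of txt (hypothetical helper)."""
    return any(tag in txt for tag in tags)

# Build a boolean mask row by row, then keep only the matching rows.
result = df[df["text"].map(mentions_any)]
print(result)
```

Note that `in` does a plain, case-sensitive substring test, so `AMC` would also match inside a longer token such as `AMCX`, and a lowercase `gme` would not match.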

Answer 2

I would do it the following way:

import re
import pandas as pd
top = ['GME', 'MVIS', 'TSLA', 'AMC']
df = pd.DataFrame({"text":["I only need GME to hit 20 eod to make up","lads why is my account covered in more red","Oh my god, we might have 2 green days in a row","Why are people so hype about a 4% TSLA move"]})
out_df = df[df["text"].str.contains("|".join(re.escape(i) for i in top))]
print(out_df)

Output

                                          text
0     I only need GME to hit 20 eod to make up
3  Why are people so hype about a 4% TSLA move

Explanation: by default, pandas.Series.str.contains treats its argument as a regular expression. I used | to build a regular expression that means GME or MVIS or TSLA or AMC (GME|MVIS|TSLA|AMC). I used re.escape, which is not required in this particular case, but it prevents unexpected behavior if any word in the list contains a special character (these are enumerated in the Regular Expression Syntax chapter of the re docs).
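To see why re.escape matters, here is a small sketch. The tag list and the sample texts are invented for illustration (they are not from the question); `BRK.B` is used only because it contains a dot, which is a regex metacharacter.

```python
import re
import pandas as pd

# Hypothetical tags: one plain, one containing the regex metacharacter "."
tags = ["GME", "BRK.B"]

# Invented sample texts for illustration only.
texts = pd.Series([
    "holding BRK.B long",
    "BRKXB is not a ticker",
    "GME to the moon",
    "nothing here",
])

unescaped = "|".join(tags)                     # "." matches any character
escaped = "|".join(re.escape(t) for t in tags)  # "." matches a literal dot only

loose = texts[texts.str.contains(unescaped)]
strict = texts[texts.str.contains(escaped)]

print(escaped)
print(loose)
print(strict)
```

Without escaping, the second text matches because `.` accepts the `X`; with escaping, only the texts containing the literal tags are kept. If none of the tags need regex features, passing `regex=False` to str.contains for a single literal is another option, but it does not accept an alternation of several tags.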
