Find the occurrences of strings from a list in the dataset
I have a list
top = ['GME', 'MVIS', 'TSLA', 'AMC']
I have a dataset
dt | text
2021-03-19 20:59:49+06 | I only need GME to hit 20 eod to make up
2021-03-19 20:59:51+06 | lads why is my account covered in more red
2021-05-21 15:54:27+06 | Oh my god, we might have 2 green days in a row
2021-05-21 15:56:06+06 | Why are people so hype about a 4% TSLA move
So I want to get every row of the dataset whose text contains a word from the list.
My output needs to look like this
dt | text
2021-03-19 20:59:49+06 | I only need GME to hit 20 eod to make up
2021-05-21 15:56:06+06 | Why are people so hype about a 4% TSLA move
Any help is appreciated.
You can slice the dataframe by searching each row's text for every tag.
A custom function does the trick:
df[df['text'].map(lambda txt: any(tag in txt for tag in top))]
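For anyone who wants to run the one-liner above end to end, here is a minimal sketch. It assumes the question's data is loaded into a DataFrame named df with dt and text columns (dt is kept as plain strings here for simplicity; the names mask and out_df are just illustrative). Note that `tag in txt` is a plain, case-sensitive substring test, so it would also match a tag that appears inside a longer word.

```python
import pandas as pd

top = ['GME', 'MVIS', 'TSLA', 'AMC']

# Sample rows copied from the question; dt is left as strings (an assumption).
df = pd.DataFrame({
    "dt": [
        "2021-03-19 20:59:49+06",
        "2021-03-19 20:59:51+06",
        "2021-05-21 15:54:27+06",
        "2021-05-21 15:56:06+06",
    ],
    "text": [
        "I only need GME to hit 20 eod to make up",
        "lads why is my account covered in more red",
        "Oh my god, we might have 2 green days in a row",
        "Why are people so hype about a 4% TSLA move",
    ],
})

# True for rows whose text contains at least one tag as a substring.
mask = df['text'].map(lambda txt: any(tag in txt for tag in top))

# Boolean indexing keeps only the matching rows (and all their columns).
out_df = df[mask]
print(out_df)
```

With this sample data, only the GME and TSLA rows should remain, and the dt column is carried along with them.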
I would do it the following way
import re
import pandas as pd
top = ['GME', 'MVIS', 'TSLA', 'AMC']
df = pd.DataFrame({"text":["I only need GME to hit 20 eod to make up","lads why is my account covered in more red","Oh my god, we might have 2 green days in a row","Why are people so hype about a 4% TSLA move"]})
out_df = df[df["text"].str.contains("|".join(re.escape(i) for i in top))]
print(out_df)
Output
text
0 I only need GME to hit 20 eod to make up
3 Why are people so hype about a 4% TSLA move
Explanation: pandas.Series.str.contains treats its argument as a regular expression by default. I used | to build a regular expression that says GME or MVIS or TSLA or AMC (GME|MVIS|TSLA|AMC). I used re.escape, which is not required in this particular case, but is useful to prevent unexpected behavior in case any word in the list contains a special character (these are enumerated in the Regular Expression Syntax chapter of the re docs).
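To illustrate why re.escape can matter, here is a small sketch using a made-up tag that contains a dot. The tag 'BRK.B' and both sample sentences are hypothetical and are not part of the question's data. In a regular expression an unescaped dot matches any character, so the unescaped pattern also matches text the tag never appears in.

```python
import re
import pandas as pd

# Hypothetical tag containing a regex special character (the dot).
tags = ['BRK.B']

# Hypothetical sample text: only the first row really contains the tag.
s = pd.Series(["Bought BRK.B today", "BRKXB is not a real ticker"])

# Without escaping, "." matches any character, so "BRKXB" matches too.
unescaped = s.str.contains("|".join(tags))

# With re.escape, the dot is matched literally.
escaped = s.str.contains("|".join(re.escape(t) for t in tags))

print(unescaped.tolist())  # [True, True]
print(escaped.tolist())    # [True, False]
```

If the tags are meant to be literal strings only, passing regex=False to str.contains is another option, though it accepts a single substring rather than a list of alternatives.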