How do you count regex substring matches in a dataframe, and apply it as a new feature?

Question

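I want to add a feature to an existing dataframe that counts the occurrences of a substring. For example, to count the occurrences of https in a string str, I can do: str.count("https")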

But how do I apply this to every row of a DataFrame?

Label    Text
0        Lorem ipsum dolor sit amet 
1        Quis https://url.com/a nunc https://g.co/b elit 
0        Donec https://url.com/c interdum libero,
0        Consectetur convallis inbox.gmail.com/d auctor.
1        Praesent  semper magna lorem

Desired output:

Label    Text                                              count_https
0        Lorem ipsum dolor sit amet                        0
1        Quis https://url.com/a nunc https://g.co/b elit   2
0        Donec https://url.com/c interdum libero,          1
0        Consectetur convallis inbox.gmail.com/d auctor.   0
1        Praesent  semper magna lorem                      0

Here is my latest attempt at applying the new feature, using .find("https"):

df.apply(lambda x: len([w for w in str(x).split() if w.find("https") != -1()]))

But this raises a TypeError:

TypeError: 'int' object is not callable

Answer 1

Not sure whether it is a typo, but -1() makes no sense, because an integer cannot be called.
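As a minimal sketch of why the error appears, the snippet below calls an integer the same way -1() does. The names word and not_found are illustrative and not part of the question's code.

```python
word = "https://url.com/a"
not_found = -1  # the value str.find returns when the substring is absent

try:
    # Same mistake as writing -1(): the parentheses try to call an int
    word.find("https") != not_found()
except TypeError as exc:
    message = str(exc)

print(message)  # 'int' object is not callable

# Without the parentheses the comparison works as intended
contains_https = word.find("https") != not_found
print(contains_https)  # True
```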

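Either way, there is a better way to get what you want: use the vectorized Series.str.count, which interprets its argument as a regular expression. Vectorized operations are almost always faster than apply with a lambda.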

df['count_https'] = df['Text'].str.count('https')
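A self-contained sketch of the vectorized approach, rebuilding the sample data from the question. The count_urls column and its pattern https?:// are assumptions added only to show that the argument is treated as a regex; they are not part of the original question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Label": [0, 1, 0, 0, 1],
    "Text": [
        "Lorem ipsum dolor sit amet",
        "Quis https://url.com/a nunc https://g.co/b elit",
        "Donec https://url.com/c interdum libero,",
        "Consectetur convallis inbox.gmail.com/d auctor.",
        "Praesent  semper magna lorem",
    ],
})

# Count matches of the pattern in each row of the Text column
df["count_https"] = df["Text"].str.count("https")

# Illustrative regex: count both http:// and https:// prefixes
df["count_urls"] = df["Text"].str.count(r"https?://")

print(df[["Label", "count_https", "count_urls"]])
```

Because the pattern is a regex, characters such as . or ? need escaping (for example with re.escape) when they should be matched literally.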

Answer 2

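You can use count, but if you still want to use a lambda, you can use re. Apply it to the Text column so that the function runs once per row (a bare df.apply passes whole columns to the function instead):

import re
df['count_https'] = df['Text'].apply(lambda x: len(re.findall('https', str(x))))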

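To fix your own solution: the error comes from -1(), which tries to call the integer -1. The check should compare against plain -1, the value find returns when the substring is absent. Note that this version counts the whitespace-separated words that contain https, not every occurrence:

df['count_https'] = df['Text'].apply(lambda x: len([w for w in str(x).split() if w.find("https") != -1]))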
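The two lambda variants can be compared side by side with Series.apply on a single column. This is a sketch under assumptions: the third row is a hypothetical input (two URLs with no whitespace between them) added to show where the counts diverge, and by_regex and by_words are illustrative names.

```python
import re
import pandas as pd

df = pd.DataFrame({"Text": [
    "Lorem ipsum dolor sit amet",
    "Quis https://url.com/a nunc https://g.co/b elit",
    "https://url.com/a,https://g.co/b",  # hypothetical row: two URLs, one "word"
]})

pattern = re.compile("https")

# Counts every regex match in the row's text
by_regex = df["Text"].apply(lambda text: len(pattern.findall(str(text))))

# Counts whitespace-separated words that contain the substring at least once
by_words = df["Text"].apply(
    lambda text: sum("https" in word for word in str(text).split())
)

print(by_regex.tolist())  # [0, 2, 2]
print(by_words.tolist())  # [0, 2, 1]
```

The regex version matches the desired output in the question for any spacing, while the word-based version undercounts when one word holds several matches.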

python

lambda

dataframe

feature-selection