pandas: How to limit the results of str.contains?

I have a DataFrame with more than 1 million rows. I want to select all rows where a certain column contains a certain substring:

matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()

But this selection is slow and I want to speed it up. Suppose I only need the first n results. Is there a way to stop matching once n results have been found? I tried:

matching = df['col2'].str.contains('substr', case=True, regex=False).head(n)

and:

matching = df['col2'].str.contains('substr', case=True, regex=False).sample(n)

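But neither of them is any faster. The second statement (the boolean indexing) is very fast. How can I speed up the first statement (the str.contains call)?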

You can speed it up like this:

matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]

However, this solution returns the matches found within the first n rows, not the first n matches.
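To make the difference concrete, here is a small sketch on toy data (the DataFrame contents are made up for illustration). It compares "matches within the first n rows" with "the first n matches":

```python
import pandas as pd

# Toy data (hypothetical): matches sit at positions 2, 3 and 5.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["x", "y", "has substr", "substr too", "z", "substr again"],
})
n = 3

# Matches found among the first n rows only (cheap, but may return fewer than n).
head_rows = df.head(n)
in_first_n_rows = head_rows.loc[
    head_rows["col2"].str.contains("substr", regex=False), "col1"
]

# The first n matches anywhere in the column (scans every row).
first_n_matches = df.loc[
    df["col2"].str.contains("substr", regex=False), "col1"
].head(n)

print(in_first_n_rows.tolist())   # ['c']
print(first_n_matches.tolist())   # ['c', 'd', 'f']
```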

If you really want the first n matches, you should use:

rows =  df['col1'][df['col2'].str.contains("substr")==True].head(n)

But this option is, of course, much slower.

Inspired by @ScottBoston's answer, you can use the following approach for a complete and faster solution:

rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)

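This is faster than the previous option, though still not as fast as limiting the search to the first n rows. With this solution you get the first n matches.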

With the test code below we can see how fast each solution is and what it returns:

import pandas as pd
import time

n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]

col1 = a*1000000
col2 = b*1000000

df = pd.DataFrame({"col1":col1,"col2":col2})

# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))

# Faster option
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast==True]
print("--- %s seconds for fast solution ---" % (time.time() - start_time))


# Other option
start_time = time.time()
rows_other =  df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))

# Complete option
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))

This outputs:

>>> 
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---

The resulting Series are:

>>> rows
4     for
5    this
Name: col1, dtype: object
>>> rows_fast
4     for
5    this
Name: col1, dtype: object
>>> rows_other
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
>>> rows_complete
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
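None of the options above actually stops scanning once n matches have been found. One possible way to do that while keeping the vectorized str.contains is to scan the column in chunks and break out early. This is only a sketch: the helper name first_n_matches_chunked and the chunk_size parameter are made up here, and whether it is faster depends on how early the matches appear in your data.

```python
import pandas as pd

def first_n_matches_chunked(df, substr, n, chunk_size=100_000):
    """Return col1 values for the first n rows whose col2 contains substr.

    Hypothetical helper: scans col2 in fixed-size chunks and stops as soon
    as at least n matches have been collected.
    """
    found = []
    total = 0
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # na=False treats missing values as non-matches.
        mask = chunk["col2"].str.contains(substr, case=True, regex=False, na=False)
        hits = chunk.loc[mask, "col1"]
        found.append(hits)
        total += len(hits)
        if total >= n:
            break  # enough matches; skip the remaining chunks
    if not found:
        return df["col1"].iloc[:0]  # empty input: return an empty Series
    return pd.concat(found).head(n)

# Same repeating pattern as the test code above, at a smaller scale.
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
df = pd.DataFrame({"col1": a * 1000, "col2": b * 1000})

rows = first_n_matches_chunked(df, "substr", 10, chunk_size=50)
print(rows.index.tolist())  # [4, 5, 13, 14, 22, 23, 31, 32, 40, 41]
```

In the worst case (fewer than n matches in the whole column) this still scans everything, so it only helps when matches show up early.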

Believe it or not, the .str accessor is slow. You can use a list comprehension, which performs better.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col2':np.random.choice(['substring','midstring','nostring','substrate'],100000)})

Equality test:

all(df['col2'].str.contains('substr', case=True, regex=False) ==
    pd.Series(['substr' in i for i in df['col2']]))

Output:

True

Timing:

%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop

versus

%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
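The list comprehension still visits every row. If only the first n matches are needed, the same `in` test can be written as a generator and cut off with itertools.islice, so that iteration stops after the n-th hit. This is a rough sketch, not a benchmarked solution: the helper name first_n_match_positions is made up, and it assumes col2 holds only strings (no NaN).

```python
from itertools import islice

import pandas as pd

def first_n_match_positions(values, substr, n):
    """Return the positions of the first n values that contain substr."""
    hits = (pos for pos, value in enumerate(values) if substr in value)
    return list(islice(hits, n))  # islice stops pulling after n hits

# Toy data (hypothetical), using the same kinds of strings as above.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["substring", "nostring", "substrate", "midstring", "substr", "substring"],
})

positions = first_n_match_positions(df["col2"].to_numpy(), "substr", 3)
rows = df["col1"].iloc[positions]
print(rows.tolist())  # ['a', 'c', 'e']
```

Because the generator is lazy, the last row is never inspected here; on a large column the saving depends on how early the n-th match occurs.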