Speed up millions of regex replacements in Dataframe
I have:
A list of roughly 40k bigram/trigram location strings:
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows:

| string_column | string_column_location_removed |
|---|---|
| Burger King Oakland CA | Burger King |
| Walmart Walnut Creek CA | Walmart |
I'm currently looping over the location list and, if the location is found in string_column, creating a new column string_column_location_removed with that location removed.
Here is my attempt. It works, but it is very slow. Any ideas on how to speed it up?
I tried to take inspiration from this and this, but I'm not sure how to really extrapolate it for use with a Pandas DataFrame.
from random import choice
from string import ascii_lowercase, digits
import re
import pandas as pd

# making a random lookup list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=['string_column'])

def location_remove(txnString):
    # try every location until one matches; note the raw strings so that \b is
    # a word boundary rather than a backspace character
    for locationString in locations_lookup_list:
        if re.search(rf'\b{locationString}\b', txnString):
            return re.sub(rf'\b{locationString}\b', '', txnString)
    # no location found: keep the original string
    return txnString

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
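For reference, here is a rough sketch of how the idea from the linked answers might carry over to a DataFrame using nothing beyond re and pandas: escape every location, join them into one alternation, compile that single pattern once, and let str.replace apply it to the whole column. It reuses locations_lookup_list and df from the snippet above; with ~40k alternatives the flat pattern can still be slow, which is the problem the trrex-based answer below tackles.

```python
import re

# One combined pattern instead of up to 40k separate searches per row.
# Longest locations first so that a longer match wins over a shorter one it contains.
escaped = sorted((re.escape(loc) for loc in locations_lookup_list), key=len, reverse=True)
combined = re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')

df['string_column_location_removed'] = df['string_column'].str.replace(combined, '', regex=True)
```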
Using trrex, which builds a pattern equivalent to the one found in this (it was actually inspired by that answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)
Output:
string_column string_column_location_removed
0 Burger King Oakland CA Burger King
1 Walmart Walnut Creek CA Walmart
2 Random Other Thing Here Random Other Thing Here
3 Another random other thing here Another random other thing here
4 Really Appreciate the help on this Really Appreciate the help on this
... ... ...
1499995 Walmart Walnut Creek CA Walmart
1499996 Random Other Thing Here Random Other Thing Here
1499997 Another random other thing here Another random other thing here
1499998 Really Appreciate the help on this Really Appreciate the help on this
1499999 Thank you so Much! Thank you so Much!
[1500000 rows x 2 columns]
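One small follow-up, independent of the answer itself: removing the location leaves a trailing space behind ("Burger King " rather than "Burger King"), so a final str.strip() may be wanted:

```python
# optional cleanup of the whitespace left where the location used to be
df["string_column_location_removed"] = df["string_column_location_removed"].str.strip()
```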
Timing (of the str.replace run):
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed to build the pattern.
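As a rough illustration of why one trrex pattern beats looping over 40k separate regexes: on a tiny list it builds a single pattern in which shared prefixes are factored out (a trie laid out as a regex), so the engine does far less work per position than with a flat 40k-way alternation. The exact pattern string trrex emits may differ between versions; the snippet below is only meant to show the idea.

```python
import re
import trrex as tx

tiny = ["San Francisco CA", "San Diego CA", "Oakland CA"]

# One pattern covering all three locations, wrapped in \b word boundaries here.
# Conceptually it is something like \b(?:San (?:Francisco|Diego) CA|Oakland CA)\b
# rather than a flat alternation (the exact string may differ).
pattern = tx.make(tiny, prefix=r"\b", suffix=r"\b")
print(pattern)

print(re.sub(pattern, "", "Burger King Oakland CA").strip())  # -> Burger King
```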
Disclaimer: I am the author of trrex.