Speed up millions of regex replacements in Dataframe
I have:
A list of roughly 40k bigram/trigram location strings:
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows:

| string_column | string_column_location_removed |
|---|---|
| Burger King Oakland CA | Burger King |
| Walmart Walnut Creek CA | Walmart |
I'm currently looping over the location list and, if the location is found in string_column, creating a new column string_column_location_removed with that location removed.
Here is my attempt. It works, but it is very slow. Any ideas on how to speed it up?
I tried to take inspiration from this and this, but I'm not sure how to really extrapolate it for use with a Pandas DataFrame.
from random import choice
from string import ascii_lowercase, digits
import re
import pandas as pd

# making a random lookup list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=['string_column'])

def location_remove(txnString):
    # try every location until one matches; note the raw strings so that \b is
    # a word boundary rather than a backspace character
    for locationString in locations_lookup_list:
        if re.search(rf'\b{locationString}\b', txnString):
            return re.sub(rf'\b{locationString}\b', '', txnString)
    # no location found: keep the original string
    return txnString

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
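For reference, here is a rough sketch of how the idea from the linked answers might carry over to a DataFrame using nothing beyond re and pandas: escape every location, join them into one alternation, compile that single pattern once, and let str.replace apply it to the whole column. It reuses locations_lookup_list and df from the snippet above; with ~40k alternatives the flat pattern can still be slow, which is the problem the trrex-based answer below tackles.

```python
import re

# One combined pattern instead of up to 40k separate searches per row.
# Longest locations first so that a longer match wins over a shorter one it contains.
escaped = sorted((re.escape(loc) for loc in locations_lookup_list), key=len, reverse=True)
combined = re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')

df['string_column_location_removed'] = df['string_column'].str.replace(combined, '', regex=True)
```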
Using trrex, which builds a pattern equivalent to the one found in this (it was actually inspired by that answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)
Output:
string_column string_column_location_removed
0 Burger King Oakland CA Burger King
1 Walmart Walnut Creek CA Walmart
2 Random Other Thing Here Random Other Thing Here
3 Another random other thing here Another random other thing here
4 Really Appreciate the help on this Really Appreciate the help on this
... ... ...
1499995 Walmart Walnut Creek CA Walmart
1499996 Random Other Thing Here Random Other Thing Here
1499997 Another random other thing here Another random other thing here
1499998 Really Appreciate the help on this Really Appreciate the help on this
1499999 Thank you so Much! Thank you so Much!
[1500000 rows x 2 columns]
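One small follow-up, independent of the answer itself: removing the location leaves a trailing space behind ("Burger King " rather than "Burger King"), so a final str.strip() may be wanted:

```python
# optional cleanup of the whitespace left where the location used to be
df["string_column_location_removed"] = df["string_column_location_removed"].str.strip()
```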
Timing (of the str.replace run):
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed to build the pattern.
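As a rough illustration of why one trrex pattern beats looping over 40k separate regexes: on a tiny list it builds a single pattern in which shared prefixes are factored out (a trie laid out as a regex), so the engine does far less work per position than with a flat 40k-way alternation. The exact pattern string trrex emits may differ between versions; the snippet below is only meant to show the idea.

```python
import re
import trrex as tx

tiny = ["San Francisco CA", "San Diego CA", "Oakland CA"]

# One pattern covering all three locations, wrapped in \b word boundaries here.
# Conceptually it is something like \b(?:San (?:Francisco|Diego) CA|Oakland CA)\b
# rather than a flat alternation (the exact string may differ).
pattern = tx.make(tiny, prefix=r"\b", suffix=r"\b")
print(pattern)

print(re.sub(pattern, "", "Burger King Oakland CA").strip())  # -> Burger King
```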
Disclaimer: I am the author of trrex.