Most efficient way of testing a value is in a list in pandas
I have a dataframe that comes from a csv and I'm testing various aspects of it. These checks all seem to boil down to either "does this column match this regex" or "is this column's value in this list".
So my dataframe is a bit like this:
import pandas as pd
df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey RudeWord Mouse'], 'nationality': ['Mouseland', 'United States', 'Canada']})
I'm generating new columns from that content like so:
def full_name_metrics(full_name):
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    # metric of whether full name has less than two distinct elements
    full_name_less_than_2_parts = len(full_name.split(' ')) < 2
    # metric of whether full_name contains an initial
    full_name_with_initial = 1 in [len(x) for x in full_name.split(' ')]
    # metric of whether name matches an offensive word
    full_name_with_offensive_word = any(item in full_name.upper().split(' ') for item in lst_rude_words)
    return pd.Series([full_name_less_than_2_parts, full_name_with_initial, full_name_with_offensive_word])
df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
               full_name    nationality  full_name_less_than_2_parts  full_name_with_initial  full_name_with_offensive_word
0           Mickey Mouse      Mouseland                        False                   False                          False
1                M Mouse  United States                        False                    True                          False
2  Mickey RudeWord Mouse         Canada                        False                   False                           True
It works, but for 25k records and more checks of this kind it takes longer than I'd like.
So is there a better way? Am I better off keeping the list of rude words as another dataframe, or am I barking up the wrong tree?
I'll answer these one at a time...
All of your operations rely on splitting the full-name column on whitespace, so only do that once:
>>> stuff = df.full_name.str.split()
The name has fewer than two parts:
>>> df['full_name_less_than_2_parts'] = stuff.agg(len) < 2
>>> df
full_name nationality full_name_less_than_2_parts
0 Mickey Mouse Mouseland False
1 M Mouse United States False
2 Mickey RudeWord Mouse Canada False
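(As an aside - my note, not part of the original answer - Series.str.len() also counts the elements of a list, so the same check can be spelled as an equivalent one-liner, reusing stuff from above:)
>>> df['full_name_less_than_2_parts'] = stuff.str.len() < 2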
The name contains just an initial.
Explode the split Series; find the items whose length is 1; group by the index to collapse the exploded Series back to one row each, and aggregate with any.
>>> q = (stuff.explode().agg(len) == 1)
>>> df['full_name_with_initial'] = q.groupby(q.index).agg('any')
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial
0 Mickey Mouse Mouseland False False
1 M Mouse United States False True
2 Mickey RudeWord Mouse Canada False False
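(An equivalent spelling of the same explode/group-by-index step - my addition, not the answer's - uses str.len() and groups by the index level, so no intermediate variable is needed; it relies only on stuff defined above:)
>>> df['full_name_with_initial'] = stuff.explode().str.len().eq(1).groupby(level=0).any()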
Checking for unwanted words.
Build a regex pattern from the list of unwanted words and pass it to the .str.contains method.
>>> rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
>>> df['rude'] = df.full_name.str.upper().str.contains(rude_words, regex=True)
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial rude
0 Mickey Mouse Mouseland False False False
1 M Mouse United States False True False
2 Mickey RudeWord Mouse Canada False False True
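(One caveat worth flagging - my note, not the answer's: .str.contains matches substrings, so unlike the original split-based token check a name like 'RUDEWORDY' would also be flagged. If exact whole words are required, word boundaries can be added to the pattern; a sketch:)
>>> rude_words = r'\b(?:' + '|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']) + r')\b'
>>> df['rude'] = df.full_name.str.upper().str.contains(rude_words, regex=True)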
Putting it all together in one function that returns three Series (mostly so I could run a timing test):
import pandas as pd
from timeit import Timer
df = pd.DataFrame(
    {
        "full_name": ["Mickey Mouse", "M Mouse", "Mickey RudeWord Mouse"] * 8000,
        "nationality": ["Mouseland", "United States", "Canada"] * 8000,
    }
)
rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
def f(df):
    rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
    stuff = df.full_name.str.split()
    s1 = stuff.agg(len) < 2
    stuff = stuff.explode().agg(len) == 1
    s2 = stuff.groupby(stuff.index).agg('any')
    s3 = df.full_name.str.upper().str.contains(rude_words, regex=True)
    return s1, s2, s3
t = Timer('f(df)','from __main__ import pd,df,f')
print(t.timeit(1)) # <--- 0.12 seconds on my computer
x,y,z = f(df)
df.loc[:,'full_name_less_than_2_parts'] = x
df.loc[:,'full_name_with_initial'] = y
df.loc[:,'rude'] = z
# print(df.head(100))
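(A stylistic alternative, not from the original answer: the three Series can be attached in a single step with DataFrame.assign, which returns a new frame with the extra columns:)
df = df.assign(full_name_less_than_2_parts=x,
               full_name_with_initial=y,
               rude=z)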
If you want to speed up the list check, then the Series.str.contains method may help:
lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
lst_rude_words_as_str = '|'.join(lst_rude_words)
df['full_name_with_offensive_word'] = df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
Here's how %timeit looks for my approach:
def func_in_list(full_name):
    '''Your function - just removed the other two columns.'''
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    full_name_with_offensive_word = any(item in full_name.upper().split(' ') for item in lst_rude_words)
    return full_name_with_offensive_word
%timeit df.apply(lambda x: func_in_list(x['full_name']), axis=1) #3.15 ms
%timeit df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True) #505 µs
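(A small robustness note of mine, not from the answer: if any of the words could contain regex metacharacters, it is safer to escape them before joining - a sketch using the standard library's re.escape:)
import re
lst_rude_words_as_str = '|'.join(map(re.escape, lst_rude_words))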
EDIT
I've added the other two columns that I missed before - here's the complete code:
import pandas as pd
df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey Rudeword Mouse']})
def df_metrics(input_df):
    input_df['full_name_less_than_2_parts'] = input_df['full_name'].str.split().map(len) < 2
    # note: this only looks at the first token, so an initial later in the name is not flagged
    input_df['full_name_with_initial'] = input_df['full_name'].str.split(expand=True)[0].map(len) == 1
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    lst_rude_words_as_str = '|'.join(lst_rude_words)
    input_df['full_name_with_offensive_word'] = input_df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
    return input_df
RESULTS
For the 3-row dataset there is no difference between the two functions:
%timeit df_metrics(df)
#3.5 ms ± 67.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#3.7 ms ± 59.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But when I increase the size of the dataframe, there's a real speed-up:
df_big = pd.concat([df] * 10000)
%timeit df_metrics(df_big)
#135 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_big[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df_big.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#11.5 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
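(To close the loop on the question's idea of keeping the rude words as a separate collection - this is my sketch, not part of either answer: the split names can be exploded and tested with Series.isin, which keeps the exact whole-word semantics of the original apply version and avoids regex entirely. It assumes the index of df is unique, as it is in the 3-row example:)
rude = {'RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'}
tokens = df['full_name'].str.upper().str.split().explode()
df['full_name_with_offensive_word'] = tokens.isin(rude).groupby(level=0).any()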