Vectorize a function for a GroupBy Pandas Dataframe
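I have a Pandas DataFrame sorted by a datetime column. Several rows can share the same datetime but differ in the "report type" column. I need to select just one of those rows based on a list of preferred report types. The list is in order of preference, so if one of the rows has the first element of the list, that row is the one chosen to be appended to a new DataFrame.
I tried a GroupBy with a very slow Python for loop that processes each group, finds the preferred report type, and appends that row to a new DataFrame. I thought about numpy vectorize(), but I don't know how to incorporate the group by into it. I really don't know much about DataFrames, but I am learning. Any ideas on how to make this faster? Can I incorporate the group by?
Example DataFrame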
OBSERVATIONTIME REPTYPE CIGFT
2000-01-01 00:00:00 AUTO 73300
2000-01-01 00:00:00 FM-15 25000
2000-01-01 00:00:00 FM-12 3000
2000-01-01 01:00:00 SAO 9000
2000-01-01 01:00:00 FM-16 600
2000-01-01 01:00:00 FM-15 5000
2000-01-01 01:00:00 AUTO 5000
2000-01-01 02:00:00 FM-12 12000
2000-01-01 02:00:00 FM-15 15000
2000-01-01 02:00:00 FM-16 8000
2000-01-01 03:00:00 SAO 700
2000-01-01 04:00:00 SAO 3000
2000-01-01 05:00:00 FM-16 5000
2000-01-01 06:00:00 AUTO 15000
2000-01-01 06:00:00 FM-12 12500
2000-01-01 06:00:00 FM-16 12000
2000-01-01 07:00:00 FM-15 20000
#################################################
# The function to loop through and find the row
#################################################
def select_the_one_ob(df):
    ''' select the preferred observation '''
    tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12',
                       'SY-MT', 'SY-SA']
    grouped = df.groupby("OBSERVATIONTIME", as_index=False)
    for name, group in grouped:
        a_group_df = pd.DataFrame(grouped.get_group(name))
        for reptype in preferred_order:
            preferred_found = False
            for i in a_group_df.index.values:
                if a_group_df.loc[i, 'REPTYPE'] == reptype:
                    # Note: DataFrame.append was removed in pandas 2.0
                    tophour_df = tophour_df.append(a_group_df.loc[i].transpose())
                    preferred_found = True
                    break
            if preferred_found:
                break
        del a_group_df
    return tophour_df

################################################
### The function which calls the above function
################################################
def process_ceiling(plat, network):
    platformcig.data_pull(CONNECT_SRC, PULL_CEILING)
    data_df = platformcig.df
    data_df = select_the_one_ob(data_df)
On the full dataset of 300,000 rows, the function takes over 4 hours.
I need it to be faster. Can I incorporate the group by into numpy vectorize()?
You can avoid using groupby. One way is to convert your column 'REPTYPE' to a pd.Categorical ordered by preference, then call sort_values and drop_duplicates, for example:
def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    df.REPTYPE = pd.Categorical(df.REPTYPE, categories=preferred_order, ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME','REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))
For your example, you get:
OBSERVATIONTIME REPTYPE CIGFT
1 2000-01-01 00:00:00 FM-15 25000
5 2000-01-01 01:00:00 FM-15 5000
8 2000-01-01 02:00:00 FM-15 15000
10 2000-01-01 03:00:00 SAO 700
11 2000-01-01 04:00:00 SAO 3000
12 2000-01-01 05:00:00 FM-16 5000
13 2000-01-01 06:00:00 AUTO 15000
16 2000-01-01 07:00:00 FM-15 20000
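If you would rather keep a groupby in the solution, something along these lines might also work. This is only a sketch and not the method from the answer above: it maps each report type to its position in the preference list and uses groupby(...).idxmin() to pick the best-ranked row for each timestamp. The function name select_preferred and the decision to rank report types missing from the list last are assumptions made for illustration, and it assumes the DataFrame index is unique.

```python
import pandas as pd

PREFERRED_ORDER = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']

def select_preferred(df, order=PREFERRED_ORDER):
    """Return one row per OBSERVATIONTIME, chosen by report-type preference."""
    # Position in the preference list; a lower number means more preferred.
    rank = {reptype: pos for pos, reptype in enumerate(order)}
    # Assumption: report types that are not in the list are ranked last.
    ranks = df['REPTYPE'].map(rank).fillna(len(order))
    # idxmin gives the index label of the best-ranked row in each group.
    best = ranks.groupby(df['OBSERVATIONTIME']).idxmin()
    return df.loc[best]

# A small slice of the example data from the question.
sample = pd.DataFrame({
    'OBSERVATIONTIME': pd.to_datetime([
        '2000-01-01 00:00', '2000-01-01 00:00', '2000-01-01 00:00',
        '2000-01-01 01:00', '2000-01-01 01:00',
    ]),
    'REPTYPE': ['AUTO', 'FM-15', 'FM-12', 'SAO', 'FM-16'],
    'CIGFT': [73300, 25000, 3000, 9000, 600],
})
result = select_preferred(sample)
print(result)
```

Unlike assigning a pd.Categorical back to the column, this leaves the input DataFrame untouched and keeps the original label of any report type that is not in the list.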
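I found that by creating a separate DataFrame of the same shape, filled with the hourly observation times, I could use pandas DataFrame merge() on the first pass and pandas DataFrame combine_first() after that. This took only a few minutes instead of hours.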
def select_the_one_ob(df):
    ''' select the preferred observation
    Parameters:
        df (Pandas Object), a Pandas dataframe
    Returns Pandas Dataframe
    '''
    # BEG_POR and END_POR are defined elsewhere in the module.
    # Note: the 'H' alias is deprecated in favor of 'h' since pandas 2.2.
    dshelldict = {'DateTime': pd.date_range(BEG_POR, END_POR, freq='H')}
    dshell = pd.DataFrame(data=dshelldict)
    dshell['YEAR'] = dshell['DateTime'].dt.year
    dshell['MONTH'] = dshell['DateTime'].dt.month
    dshell['DAY'] = dshell['DateTime'].dt.day
    dshell['HOUR'] = dshell['DateTime'].dt.hour
    dshell = dshell.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])
    df = df.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    reptype_list = list(df.REPTYPE.unique())
    # remove the preferred report types from the unique ones
    for rep in preferred_order:
        if rep in reptype_list:
            reptype_list.remove(rep)
    # If there are any unique report types left, append them to the preferred list
    if len(reptype_list) > 0:
        preferred_order = preferred_order + reptype_list
    ## first_pass is a flag to make sure a report type is used to transfer columns
    ## to the new DataFrame (merge has to happen before combine_first)
    first_pass = True
    for reptype in preferred_order:
        # Top-of-the-hour rows for this report type, sorted by original obstime,
        # dropping any dups and keeping the first report chronologically
        subset = (df[(df['MINUTE'] == 0) & (df['REPTYPE'] == reptype)]
                  .sort_values(['OBSERVATIONTIME'], ascending=True)
                  .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first'))
        if first_pass:
            ## if there is data in dataframe
            if subset.shape[0] > 0:
                first_pass = False
                # Merge the shell with the first subset that has data
                tophour_df = dshell.merge(subset, how='left', left_index=True,
                                          right_index=True).drop('DateTime', axis=1)
        else:
            # combine_first keeps the original values and fills any NaN values
            # with data from the other dataframe at the same index and column
            # ex. if df.loc[2, col1] is NaN, df2.loc[2, col1] fills it if not NaN
            tophour_df = tophour_df.combine_first(subset)
    tophour_df = tophour_df.reset_index()
    return tophour_df
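To make the fill step easier to follow, here is a tiny standalone illustration of how combine_first behaves. The two frames and their values are made up for the demo and are not taken from the real dataset. The first frame stands in for the most preferred report type and has a gap at 01:00; the second fills that gap but does not overwrite values that are already present. Note that combine_first aligns on index and column labels, so the two frames do not strictly need the same shape.

```python
import numpy as np
import pandas as pd

hours = pd.to_datetime(['2000-01-01 00:00', '2000-01-01 01:00', '2000-01-01 02:00'])

# Most preferred report type: has data for 00:00 and 02:00 only.
first_choice = pd.DataFrame({'REPTYPE': ['FM-15', np.nan, 'FM-15'],
                             'CIGFT': [25000, np.nan, 15000]}, index=hours)
# Next report type in the preference list: has data for 00:00 and 01:00.
second_choice = pd.DataFrame({'REPTYPE': ['AUTO', 'AUTO'],
                              'CIGFT': [73300, 5000]}, index=hours[:2])

# Values from first_choice win; only its NaN cells are filled from second_choice.
combined = first_choice.combine_first(second_choice)
print(combined)
```

The 00:00 row keeps the FM-15 values even though an AUTO report exists for that hour, which is why the loop has to go through the report types in order of preference.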