How to vectorize a for loop with conditions instead of iterating over a Pandas DataFrame
I have some code that pulls in two .csv files: employees.csv and schedule.csv. employees.csv has the attributes 'ID' and 'Building', which I use together as a 'key' to gather the entries in the schedule file that have the same ID/Building pair, subject to some conditions.
At the end I am left with a list of lists, which I use to create the output dataframe.
employees.csv
Name,Date,Building,ID,Start Time,Stop Time,Duration,Years,EmployeeType,Status
1,3/1/2021,1,1,22:04:05,0:00:00,1:55:55,21,EmployeeType1,Status
1,3/1/2021,2,2,17:04:05,0:00:00,5:55:55,21,EmployeeType1,Status
schedule.csv
Name,Rev,Building,ID,Op Date,Start Time,Dur,WorkType
1,1,1,1,3/1/2021,23:04:12,1,WorkType1
1,1,1,1,3/1/2021,23:44:00,1,WorkType1
Pseudocode (the data logic may not make complete sense, but it reflects what I am trying to do):
import pandas as pd
import datetime

pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# takes sequence and converts
def convert_sequence(seq):
    return ''.join(seq)

def create_output(employee_file, schedule_file, output_file):
    output_columns = ['Name', 'Date', 'Building', 'ID', 'Years', 'Type', 'Start Time', 'Stop Time', 'Duration',
                      'SumDuration', '%Time', 'Gap', 'Sequence', 'MinDuration', 'MaxDuration', 'Status']
    employee_df = pd.read_csv(employee_file)
    schedule = pd.read_csv(schedule_file)
    output_data_list = []
    # loop through rows in employees, get pairs as keys to search through schedule
    for iw_index, iw_row in employee_df.iterrows():
        employee_name = iw_row['Name']
        date = iw_row['Date']
        building = iw_row['Building']
        id = iw_row['ID']
        start = iw_row['Start Time']
        end = iw_row['Stop Time']
        duration = iw_row['Duration']
        num_years = iw_row['Years']
        employee_type = iw_row['EmployeeType']
        status = iw_row['Status']
        # if we don't find any rows that match on the id/building, still write out a row for the employee
        # we are on, with the current data we have
        if len(list(schedule.loc[(schedule['Building'] == building) & (schedule['ID'] == id)].iterrows())) == 0:
            data_retrieved = [employee_name, date, building, id, num_years, employee_type, start, end, duration,
                              'NA', 'NA', 'NA', 'NA', 'NA', 'NA', status]
            output_data_list.append(data_retrieved)
            # skip gathering rest of data because we won't find any matches, move to next pair
            print('skipping')
            continue
        # holds list of contact types for this particular building/id pair
        work_sequence = schedule.loc[(schedule['Building'] == building) & (schedule['ID'] == id)]['WorkType'].tolist()
        work_sequence_converted = convert_sequence(work_sequence)
        # get all durations for this pair
        durations = schedule.loc[(schedule['Building'] == building) & (schedule['ID'] == id)]['Dur'].values
        min_duration = min(durations)
        max_duration = max(durations)
        sum_duration = sum(durations)
        # convert duration in datetime format to seconds
        date_time = datetime.datetime.strptime(str(duration), "%H:%M:%S")
        a_timedelta = date_time - datetime.datetime(1900, 1, 1)
        duration_in_seconds = a_timedelta.total_seconds()
        percent_time = 1.0 / duration_in_seconds
        data_retrieved = [employee_name, date, building, id, num_years, employee_type, start, end, duration,
                          sum_duration, percent_time, 'NA', work_sequence_converted, min_duration, max_duration, status]
        output_data_list.append(data_retrieved)
        print('ree')
    output_df = pd.DataFrame(output_data_list, columns=output_columns)
    # further computations on created df....
    output_df.to_csv(output_file, index=False)

def main():
    create_output('employees.csv', 'schedule.csv', 'out.csv')

if __name__ == '__main__':
    main()
I ran this on a dataset with 80,000 rows and it took a few hours. How can I vectorize/optimize the loop with the conditions above so that I am no longer iterating over the entire dataframe?
I am completely new to pandas optimization, so any help would go a long way.
Given dataframes like these:
>>df
Name Date Building ID ... Duration Years EmployeeType Status
0 1 3/1/2021 1 1 ... 1:55:55 21 EmployeeType1 Status
1 1 3/1/2021 2 2 ... 5:55:55 21 EmployeeType1 Status
>>df2 # Schedule Data frame
Name Rev Building ID Op Date Start Time Dur WorkType
0 1 1 1 1 3/1/2021 23:04:12 1 WorkType1
1 1 1 1 1 3/1/2021 23:44:00 1 WorkType1
I have just modified one of your functions to implement it using pandas' apply method.
def create_output(row):
    # no schedule rows match this Building/ID pair: return the row with 'NA' placeholders
    if len(list(df2.loc[(df2['Building'] == row['Building']) & (df2['ID'] == row['ID'])].iterrows())) == 0:
        data_retrieved = [row['Name'], row['Date'], row['Building'], row['ID'], row['Years'], row['EmployeeType'],
                          row['Start Time'], row['Stop Time'], row['Duration'],
                          'NA', 'NA', 'NA', 'NA', 'NA', 'NA', row['Status']]
        return data_retrieved
    work_sequence = df2.loc[(df2['Building'] == row['Building']) & (df2['ID'] == row['ID'])]['WorkType'].tolist()
    work_sequence_converted = ''.join(work_sequence)
    # get all durations for this pair
    durations = df2.loc[(df2['Building'] == row['Building']) & (df2['ID'] == row['ID'])]['Dur'].astype(int).values
    min_duration = min(durations)
    max_duration = max(durations)
    sum_duration = sum(durations)
    # convert duration in datetime format to seconds
    date_time = datetime.datetime.strptime(str(row['Duration']), "%H:%M:%S")
    a_timedelta = date_time - datetime.datetime(1900, 1, 1)
    duration_in_seconds = a_timedelta.total_seconds()
    percent_time = 1.0 / duration_in_seconds
    data_retrieved = [row['Name'], row['Date'], row['Building'], row['ID'], row['Years'], row['EmployeeType'],
                      row['Start Time'], row['Stop Time'], row['Duration'],
                      sum_duration, percent_time, 'NA', work_sequence_converted, min_duration, max_duration, row['Status']]
    return data_retrieved
Now you can call this function for each row without iterating manually, and it should run faster than the explicit loop:
df.apply(create_output, axis=1)
0 [1, 3/1/2021, 1, 1, 21, EmployeeType1, 22:04:0...
1 [1, 3/1/2021, 2, 2, 21, EmployeeType1, 17:04:0...
dtype: object
And since the result is a pandas object, it can easily be converted to a list:
df.apply(create_output, axis=1).tolist()
[['1', '3/1/2021', '1', '1', '21', 'EmployeeType1', '22:04:05', '0:00:00', '1:55:55', 2, 0.00014378145219266715, 'NA', 'WorkType1WorkType1', 1, 1, 'Status'], ['1', '3/1/2021', '2', '2', '21', 'EmployeeType1', '17:04:05', '0:00:00', '5:55:55', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'Status']]
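To get back to the dataframe the question builds at the end, the list of lists can be passed straight to the pd.DataFrame constructor. A minimal sketch, reusing the output_columns list from the question's code:

rows = df.apply(create_output, axis=1).tolist()
output_df = pd.DataFrame(rows, columns=output_columns)  # 16 values per row, matching output_columns
output_df.to_csv('out.csv', index=False)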
What you are doing is performing a "merge" by hand:
key_cols = ['Building', 'ID']
output_df = employee_df.merge(
    schedule.drop(columns=['Name', 'Op Date', 'Rev', 'Start Time']),
    on=key_cols, how='outer'
)
You can .drop() any columns from schedule that you do not need in the final result. how='outer' will include the rows that have no "match".
>>> output_df
Name Date Building ID Start Time Stop Time Duration Years EmployeeType Status Dur WorkType
0 1 3/1/2021 1 1 22:04:05 0:00:00 1:55:55 21 EmployeeType1 Status 1.0 WorkType1
1 1 3/1/2021 1 1 22:04:05 0:00:00 1:55:55 21 EmployeeType1 Status 1.0 WorkType1
2 1 3/1/2021 2 2 17:04:05 0:00:00 5:55:55 21 EmployeeType1 Status NaN NaN
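Note that how='outer' also keeps schedule rows that match no employee. If every output row should correspond to an employee row, as in the original loop, how='left' may be the closer fit; a sketch under that assumption:

output_df = employee_df.merge(
    schedule.drop(columns=['Name', 'Op Date', 'Rev', 'Start Time']),
    on=key_cols, how='left'  # keeps unmatched employees (Dur/WorkType become NaN), drops unmatched schedule rows
)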
Now that you have a single dataframe, you can groupby on the key_cols and use aggregation to produce a summary for each group.
summary = { column: (column, 'first') for column in employee_df.columns }
summary['%Time'] = (
    'Duration',
    lambda dur: 1 / (pd.Timestamp(dur.iat[0])
                       .replace(year=1900, day=1, month=1)
                     - pd.Timestamp(1900, 1, 1)).total_seconds()
)
summary.update({
    'SumDuration': ('Dur', 'sum'),
    'MinDuration': ('Dur', 'min'),
    'MaxDuration': ('Dur', 'max'),
    'WorkType': ('WorkType', ','.join),
})
output_df = output_df.fillna('').groupby(key_cols).agg(**summary)
>>> output_df
Name Date Building ID Start Time Stop Time Duration ... EmployeeType Status %Time SumDuration MinDuration MaxDuration WorkType
Building ID ...
1 1 1 3/1/2021 1 1 22:04:05 0:00:00 1:55:55 ... EmployeeType1 Status 0.000144 2.0 1.0 1.0 WorkType1,WorkType1
2 2 1 3/1/2021 2 2 17:04:05 0:00:00 5:55:55 ... EmployeeType1 Status 0.000047
You can then clean this up by dropping the added index, adding your NA strings, and removing the %Time for rows that have no Dur values.
output_df.reset_index(drop=True, inplace=True)
output_df.replace({'': 'NA'}, inplace=True)
output_df.loc[ output_df.SumDuration == 'NA', '%Time' ] = 'NA'
Which produces:
>>> output_df.to_csv()
Name,Date,Building,ID,Start Time,Stop Time,Duration,Years,EmployeeType,Status,%Time,SumDuration,MinDuration,MaxDuration,WorkType
1,3/1/2021,1,1,22:04:05,0:00:00,1:55:55,21,EmployeeType1,Status,0.00014378145219266715,2.0,1.0,1.0,"WorkType1,WorkType1"
1,3/1/2021,2,2,17:04:05,0:00:00,5:55:55,21,EmployeeType1,Status,NA,NA,NA,NA,NA
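If you also need the question's column names ('Type', 'Sequence') and its 'Gap' placeholder, a hedged final step (the mapping mirrors the edit below, where Type comes from EmployeeType and Sequence from WorkType):

# Rename to the question's output_columns and add the constant Gap column.
output_df = output_df.rename(columns={'EmployeeType': 'Type', 'WorkType': 'Sequence'})
output_df['Gap'] = 'NA'  # the original loop always writes 'NA' for Gap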
EDIT
Here is your create_output function written using groupby().apply() instead of .agg(), which should be easier for you to follow.
def create_output(employee_file, schedule_file, output_file):
    output_columns = ['Name', 'Date', 'Building', 'ID', 'Years', 'Type', 'Start Time', 'Stop Time', 'Duration',
                      'SumDuration', '%Time', 'Gap', 'Sequence', 'MinDuration', 'MaxDuration', 'Status']
    employee_df = pd.read_csv(employee_file)
    schedule = pd.read_csv(schedule_file)

    key_cols = ['Building', 'ID']
    output_df = employee_df.merge(
        schedule.drop(columns=['Name', 'Op Date', 'Rev', 'Start Time']),
        on=key_cols, how='outer'
    )

    def summary(df):
        row = df.iloc[0]
        min_duration = df['Dur'].min()
        max_duration = df['Dur'].max()
        sum_duration = df['Dur'].sum()
        work_sequence = ','.join(df['WorkType'])

        row['Type'] = row['EmployeeType']
        row['SumDuration'] = sum_duration
        row['%Time'] = ''
        if sum_duration:  # only add %Time if there is a duration
            duration = row['Duration']
            date_time = datetime.datetime.strptime(duration, "%H:%M:%S")
            a_timedelta = date_time - datetime.datetime(1900, 1, 1)
            duration_in_seconds = a_timedelta.total_seconds()
            percent_time = 1.0 / duration_in_seconds
            row['%Time'] = percent_time
        row['Gap'] = 'NA'
        row['MinDuration'] = min_duration
        row['MaxDuration'] = max_duration
        row['Sequence'] = work_sequence
        return row.loc[output_columns]  # reorder the columns

    output_df = output_df.fillna('').groupby(key_cols).apply(summary)
    output_df.replace({'': 'NA'}).to_csv(output_file, index=False)
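One further optimization, my suggestion rather than part of the answer above: the Duration-to-seconds conversion can be done once for the whole column with pd.to_timedelta instead of calling strptime inside every group, assuming Duration always holds an 'H:MM:SS' string:

# Vectorized alternative to the per-group strptime conversion.
seconds = pd.to_timedelta(output_df['Duration']).dt.total_seconds()
output_df['%Time'] = 1.0 / seconds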