"Dynamic" column selection
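Question:
Suppose the input table is a merged table of calls and bills with the following columns: the call time, plus one column per billing month. The goal is a table that contains the last 3 bills the person paid as of the call time, so that the bills can be seen in the context of the call.
Example input and output: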
# INPUT:
# df
# TIME        ID  2019-08-01  2019-09-01  2019-10-01  2019-11-01  2019-12-01
# 2019-12-01  1   1           2           3           4           5
# 2019-11-01  2   6           7           8           9           10
# 2019-10-01  3   11          12          13          14          15
# EXPECTED OUTPUT:
# df_context
# TIME        ID  0   1   2
# 2019-12-01  1   3   4   5
# 2019-11-01  2   7   8   9
# 2019-10-01  3   11  12  13
Creating the example input:
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],
})
The code I have so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
My code's output:
# OUTPUTS:
#    TIME        0   1   2
# 0  2019-12-01  2   3   4   should be 3 4 5
# 1  2019-11-01  7   8   9   all good
# 2  2019-10-01  12  13  14  should be 11 12 13
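For the first two lines of code, my code seems to be missing a for loop or two to do what I want, but I can't believe there isn't a better solution than the one I'm cobbling together right now.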
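I suggest the following steps, which let you avoid dynamic column selection entirely.
- Convert the wide table (reference dates as columns) into a long table (reference dates as rows)
- Compute the difference in months between the call time TIME and the reference date
- Select only the rows where difference >= 0 and difference < 3
- Format the output table as required (add a running number, then pivot it)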
import pandas as pd

# Initialize dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
# Find the difference between TIME and REF_TIME in days, then
# approximate it in months by dividing by 30 and rounding
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the rows and keep only the columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)
# Add a running number, let's call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO         1   2   3
ID TIME
1  2019-12-01   3   4   5
2  2019-11-01   7   8   9
3  2019-10-01  11  12  13
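Dividing the day difference by 30 and rounding is an approximation of the month difference. As a possible variation (not part of the answer above), the difference can be computed exactly from the calendar year and month. The sketch below assumes every column other than TIME and ID is a month column; `month_index`, `long_df` and `MONTH_DIFF` are names made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Wide to long; assumption: all columns except TIME and ID are month columns
long_df = df.melt(id_vars=['TIME', 'ID'], var_name='REF_TIME', value_name='BILL')
long_df['TIME'] = pd.to_datetime(long_df['TIME'])
long_df['REF_TIME'] = pd.to_datetime(long_df['REF_TIME'])


def month_index(dates):
    """Hypothetical helper: number each calendar month consecutively."""
    return dates.dt.year * 12 + dates.dt.month


# Exact number of calendar months between the call and the bill
long_df['MONTH_DIFF'] = month_index(long_df['TIME']) - month_index(long_df['REF_TIME'])

# Keep the call month and the two months before it (between is inclusive)
kept = long_df[long_df['MONTH_DIFF'].between(0, 2)]
```

From here, `kept` can go through the same sort, running number and pivot steps as before.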
Here is my (beginner's) solution; it only works if the dates in the column names are in ascending order:
import pandas as pd

# Initializing Dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df
Output:
   TIME        ID  0   1   2
0  2019-12-01  1   3   4   5
1  2019-11-01  2   7   8   9
2  2019-10-01  3   11  12  13
In any case, I think @Toukenize's solution is better, since it doesn't require iteration.
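For completeness, the positional approach can probably also be written without a loop by letting NumPy index all rows at once. This is only a sketch, not taken from either answer, and it relies on the same assumption that the month columns are in ascending order, plus the assumptions that they start at the third column and that every TIME has at least two earlier month columns. The names `end`, `col_idx` and `row_idx` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Assumption: everything after TIME and ID is a month column, in ascending order
date_cols = df.columns[2:]
values = df[date_cols].to_numpy()

# Position of each row's TIME among the month columns (-1 if not found)
end = date_cols.get_indexer(df['TIME'])
if (end < 2).any():
    # Guard against missing months and against negative indices wrapping around
    raise ValueError("every TIME needs a matching column and two earlier months")

# For each row, the three column positions ending at its own TIME column
col_idx = end[:, None] + np.arange(-2, 1)   # shape (n_rows, 3)
row_idx = np.arange(len(df))[:, None]       # shape (n_rows, 1), broadcasts against col_idx

context = pd.DataFrame(values[row_idx, col_idx], columns=[0, 1, 2], index=df.index)
df_context = pd.concat([df[['TIME', 'ID']], context], axis=1)
```

Unlike the melt-based answer, this depends on column order rather than on the dates themselves, so it is only worth considering when that order is guaranteed.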