"Dynamic" column selection

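Problem: Say the input table is a merged table of calls and bills with the following columns: the time of the call, plus one column per billing month. The idea is to build a table that holds the last 3 bills the person paid up to and including the month of the call, so that the bills can be read in the context of the call.

Example input and output: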

# INPUT:
# df
# TIME        ID   2019-08-01   2019-09-01   2019-10-01   2019-11-01   2019-12-01
# 2019-12-01  1    1            2            3            4            5
# 2019-11-01  2    6            7            8            9            10
# 2019-10-01  3    11           12           13           14           15

# EXPECTED OUTPUT:
# df_context
# TIME        ID   0     1     2
# 2019-12-01  1    3     4     5
# 2019-11-01  2    7     8     9
# 2019-10-01  3    11    12    13

Creating the example input:

import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

The code I have so far:

# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3

df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()

Output of my code:

# OUTPUTS:
#   TIME        0   1   2
# 0 2019-12-01  2   3   4    should be  3     4     5
# 1 2019-11-01  7   8   9    all good
# 2 2019-10-01  12  13  14   should be  11    12    13

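It feels like my code is missing a for loop or two around the first two lines to do what I want, but I can't believe there isn't a better solution than the one I am concocting right now.

I suggest the following steps, which avoid dynamic column selection altogether.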

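  1. Convert the wide table (reference dates as columns) into a long table (reference dates as rows)
  2. Compute the difference in months between the call time TIME and the reference date
  3. Select only the rows where difference >= 0 and difference < 3
  4. Format the output table as required (add a running number, then pivot it)

import pandas as pd

# Initialize dataframe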
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL

date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')

# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])

# Find the difference between TIME and REF_TIME in days, then convert it
# to an approximate number of months (assumes roughly 30 days per month)
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()

# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)

# Apply selection, sort the columns and keep only columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)

# Add a running number per (TIME, ID) group; let's call it BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)

# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')

Output:

BILL_NO         1   2   3
ID  TIME            
1   2019-12-01  3   4   5
2   2019-11-01  7   8   9
3   2019-10-01  11  12  13
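
Dividing the day difference by 30 and rounding works for this data, but it is only an approximation of a month count. As a hedged alternative for step 2, the sketch below derives the difference from the year and month components instead. The names `long_df` and `recent` are illustrative only and do not come from the answer above, and the sketch assumes that every column other than TIME and ID is a billing month.

```python
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Melt every column except TIME and ID (assumed to be the billing months)
long_df = df.melt(id_vars=['TIME', 'ID'], var_name='REF_TIME', value_name='BILL')
long_df['TIME'] = pd.to_datetime(long_df['TIME'])
long_df['REF_TIME'] = pd.to_datetime(long_df['REF_TIME'])

# Whole calendar months from REF_TIME to TIME, without the 30-day approximation
long_df['TIME_DIFF'] = (
    (long_df['TIME'].dt.year - long_df['REF_TIME'].dt.year) * 12
    + (long_df['TIME'].dt.month - long_df['REF_TIME'].dt.month)
)

# Keep the call month and the two months before it (between is inclusive)
recent = long_df[long_df['TIME_DIFF'].between(0, 2)]
```

The remaining steps (sorting, adding the running number, pivoting) would stay the same as in the code above.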

Here is my (newbie) solution. It only works if the dates in the column names are in ascending order:

import pandas as pd

# Initialize the DataFrame
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterate over the rows, select the desired slice from each row,
# and append it to a new DataFrame
for i in range(len(df)):
    searched_date = df.iloc[i, 0] 
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df

Output:

     TIME      ID   0   1   2
0  2019-12-01   1   3   4   5
1  2019-11-01   2   7   8   9
2  2019-10-01   3  11  12  13

In any case, I think @Toukenize's solution is better because it doesn't require iteration.
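
For what it's worth, the same slicing idea can probably be written without a Python loop by using NumPy integer indexing. This is only a sketch under the same assumption as above (date columns in ascending order), and it additionally assumes that every TIME value appears among the column names with at least two date columns before it; otherwise the negative positions would silently wrap around. The names `date_cols`, `bills`, `end`, `col_idx` and `row_idx` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Date columns only, as a plain 2-D array of bills
date_cols = [c for c in df.columns if c not in ('TIME', 'ID')]
bills = df[date_cols].to_numpy()

# Position of each row's TIME among the date columns (-1 if it is missing)
end = pd.Index(date_cols).get_indexer(df['TIME'])

# For each row, the three column positions end-2, end-1, end
col_idx = end[:, None] + np.arange(-2, 1)
row_idx = np.arange(len(df))[:, None]

# Pick one 3-wide window per row and put TIME and ID back in front
df_context = pd.concat(
    [df[['TIME', 'ID']], pd.DataFrame(bills[row_idx, col_idx], index=df.index)],
    axis=1,
)
```

Here `get_indexer` answers the "col_to for every row" part of the question in one call, and the broadcasted `row_idx`/`col_idx` pair selects a different window of columns for each row.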