"Dynamic" 列选择

Question

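Say the input table is a merged table of calls and bills with the following columns: the TIME of the call and the months of all the bills. The idea is to build a table that holds, for each call, the last 3 bills the person paid counting back from the time of the call. That puts the bills in the context of the call.

Example input and output: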

# INPUT:
# df
# TIME        ID   2019-08-01   2019-09-01   2019-10-01   2019-11-01   2019-12-01
# 2019-12-01  1    1            2            3            4            5
# 2019-11-01  2    6            7            8            9            10
# 2019-10-01  3    11           12           13           14           15

# EXPECTED OUTPUT:
# df_context
# TIME        ID   0     1     2
# 2019-12-01  1    3     4     5
# 2019-11-01  2    7     8     9
# 2019-10-01  3    11    12    13

Creating the example input:

import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

The code I have so far:

# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3

df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()

What my code outputs:

# OUTPUTS:
#   TIME        0   1   2
# 0 2019-12-01  2   3   4    should be  3     4     5
# 1 2019-11-01  7   8   9    all good
# 2 2019-10-01  12  13  14   should be  11    12    13

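My code seems to be missing a for loop or two around its first two lines to do what I want, but I can't believe there isn't a better solution than the one I am cobbling together right now.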

Answer 1

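I suggest the following steps, which let you avoid dynamic column selection altogether.

1. Convert the wide table (reference dates as columns) to a long table (reference dates as rows)
2. Calculate the difference between the call time TIME and the reference date
3. Select only the rows where difference >= 0 and difference < 3
4. Format the output table as required (add a running number, then pivot it)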

import pandas as pd

# Initialize dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL

date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')

# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])

# Find the difference between TIME and REF_TIME in months
# (approximated as days / 30, rounded to the nearest whole number)
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()

# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)

# Apply selection, sort the columns and keep only columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)

# Add a running number, let's call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)

# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')

Output:

BILL_NO         1   2   3
ID  TIME            
1   2019-12-01  3   4   5
2   2019-11-01  7   8   9
3   2019-10-01  11  12  13
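The `days / 30` step above only approximates the distance in months, and the rounding can drift once the date columns span several years. As a minimal sketch of the same melt, filter, pivot idea with an exact calendar-month difference (assuming every TIME value and every date column falls on the first of a month, and a pandas version whose `pivot` accepts a list for `index`), the names `long_df`, `MONTH_DIFF`, `recent` and `POS` below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Wide to long: every column except TIME and ID is treated as a bill month
long_df = df.melt(id_vars=['TIME', 'ID'], var_name='REF_TIME', value_name='BILL')
long_df['TIME'] = pd.to_datetime(long_df['TIME'])
long_df['REF_TIME'] = pd.to_datetime(long_df['REF_TIME'])

# Whole calendar months between the call and the bill (no 30-day approximation)
long_df['MONTH_DIFF'] = (
    (long_df['TIME'].dt.year - long_df['REF_TIME'].dt.year) * 12
    + (long_df['TIME'].dt.month - long_df['REF_TIME'].dt.month)
)

# Keep the call month and the two months before it
recent = long_df[long_df['MONTH_DIFF'].between(0, 2)].copy()

# The oldest of the three bills gets position 0, the call month gets position 2
recent['POS'] = 2 - recent['MONTH_DIFF']

df_context = (
    recent.pivot(index=['TIME', 'ID'], columns='POS', values='BILL')
    .reset_index()
    .sort_values('ID', ignore_index=True)
)
df_context.columns.name = None  # drop the leftover 'POS' label on the column axis
```

Deriving the position from the month difference, instead of from a running count, means a call with fewer than three earlier bill columns ends up with NaN in the missing positions rather than shifted values.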

Answer 2

Here is my (novice) solution. It only works if the dates in the column names are in ascending order:

import pandas as pd

# Initializing Dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])

# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    # Call time of this row, e.g. '2019-12-01'
    searched_date = df.iloc[i, 0]
    # Position of the date column with the same name as the call time
    searched_column_index = cols.index(searched_date)
    # The matching column plus the two columns before it
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)

# Re-attach TIME and ID by row index
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df

Output:

     TIME      ID   0   1   2
0  2019-12-01   1   3   4   5
1  2019-11-01   2   7   8   9
2  2019-10-01   3  11  12  13

Anyway, I think @Toukenize's solution is better because it doesn't require iteration.
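For what it's worth, the positional approach can also be written without a Python-level loop. The following is only a sketch under the same assumption (date columns in ascending order, one column per month): it looks up each row's column position with `Index.get_indexer` and then uses NumPy integer-array indexing to pull a different three-column window for every row. The names `n_bills`, `bills`, `end_pos`, `col_pos`, `row_pos` and `window` are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

n_bills = 3  # assumed window size: the call month plus the two months before it

# Only the bill columns, so positions are relative to the date columns
bills = df.drop(columns=['TIME', 'ID'])

# Position of each row's call month among the bill columns (-1 if there is no such column)
end_pos = bills.columns.get_indexer(df['TIME'])

# Guard against missing labels and windows that would run off the left edge;
# without it, negative positions would silently wrap around to the last columns
if (end_pos < n_bills - 1).any():
    raise ValueError("some rows do not have enough bill columns before TIME")

# One row of column positions per input row, e.g. [2, 3, 4] for a call in the last month
col_pos = end_pos[:, None] + np.arange(-(n_bills - 1), 1)
row_pos = np.arange(len(df))[:, None]

# Integer-array indexing selects a different column window for every row
window = bills.to_numpy()[row_pos, col_pos]

df_context = pd.concat(
    [df[['TIME', 'ID']], pd.DataFrame(window, index=df.index)],
    axis=1,
)
```

Because the selection is purely positional, it relies on the column order just as the loop does. The melt-based answer does not.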

python

pandas

data-science