Summarize data from one DataFrame to another

Question

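I would like your help with the following.

In my work I have two DataFrames. The first, df_card_features, holds card features, and its card_id column contains a unique ID for each card. The second, df_cart_historic, holds the transaction history of the cards in the first DataFrame; its card_id column does not contain unique values, but the IDs are the same ones that appear in the card_id column of the first DataFrame.

As a solution, I considered building a dictionary and then adding the columns to the DataFrame, but that looks very expensive in terms of performance, because the history CSV file is about 5 GB.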

import pandas as pd

# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = date_activation
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()


# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = purchase_date
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag

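I need to create the following columns in the df_card_features DataFrame:

'denied_purchase?': 1 if at least one Y value appears in the denied_purchase column of df_cart_historic for that card_id, and 0 if no Y value appears.
'oldest_Date': the oldest value in the purchase_date column of df_cart_historic for that card_id.
'max_installments': the maximum value in the installments column of df_cart_historic for that card_id.
'max_month_lag': the maximum value in the month_lag column of df_cart_historic for that card_id.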

Answer 1

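You need to use groupby on the 'card_id' column of df_cart_historic, so that each new column is built only from the rows that share the same 'card_id' value.
By calling groupby('card_id').apply(func), you can do this work with a custom function func.

Here is a working example: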

import pandas as pd

# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = pd.to_datetime(date_activation) #converting to datetime
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()


# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = pd.to_datetime(purchase_date) #converting to datetime
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag

df_card_features.set_index('card_id', inplace=True) #using card_id column as index

def getnewcols(x):
    res = pd.DataFrame()
    res['denied_purchase?'] = pd.Series(['Y' if 'Y' in x['denied_purchase'].unique() else 'N'])
    res['oldest_Date'] = x['purchase_date'].min()
    res['max_installments'] = x['installments'].max()
    res['max_month_lag'] = x['month_lag'].max()
    return res

newcols = df_cart_historic.groupby('card_id').apply(getnewcols)
newcols = newcols.reset_index().drop('level_1', axis=1).set_index('card_id')
df_card_features_final = pd.concat([df_card_features, newcols], axis=1)

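Note that the columns containing dates are parsed with pandas.to_datetime, so that they hold datetime objects rather than plain strings (which are much harder to work with when handling dates). Note also that this example stores 'Y'/'N' in 'denied_purchase?' rather than the 1/0 requested in the question.
newcols is the DataFrame containing the new columns, and df_card_features_final is the final DataFrame containing all the columns: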

        date_activation  feature_1_1  feature_1_2 denied_purchase? oldest_Date  max_installments  max_month_lag
card_id                                                                                                        
card_a       2019-02-01            0            1                N  2019-02-01                 0              0
card_b       2019-05-02            1            0                Y  2019-02-01                 0              0
card_c       2018-01-20            1            0                N  2018-04-01                 8              0
card_d       2015-07-23            1            0                Y  2016-02-01                 4              0
card_e       2013-07-23            0            1                Y  2013-12-01                 5              5
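
Since the question mentions a history file of about 5 GB, a vectorized variant may be worth trying. The sketch below is not part of the original answer: it swaps the custom apply function for pandas' built-in named aggregation (groupby(...).agg) and writes 1/0 into 'denied_purchase?' as the question asked. The names denied_flag, summary and result are illustrative only, and whether this is actually faster on the real data is an assumption that should be measured.

```python
import pandas as pd

# Same sample data as in the question.
df_card_features = pd.DataFrame({
    'card_id': ['card_a', 'card_b', 'card_c', 'card_d', 'card_e'],
    'date_activation': pd.to_datetime(
        ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']),
    'feature_1_1': [0, 1, 1, 1, 0],
    'feature_1_2': [1, 0, 0, 0, 1],
})

df_cart_historic = pd.DataFrame({
    'card_id': ['card_a', 'card_b', 'card_c', 'card_d', 'card_e'] * 3,
    'denied_purchase': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y',
                        'N', 'Y', 'N', 'N', 'Y'],
    'purchase_date': pd.to_datetime(
        ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01',
         '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01',
         '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']),
    'installments': [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0],
    'month_lag': [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5],
})

# Helper column (hypothetical name): 1 where the purchase was denied, else 0.
# Taking its max per card gives 1 if any row for that card has 'Y'.
historic = df_cart_historic.assign(
    denied_flag=df_cart_historic['denied_purchase'].eq('Y').astype(int))

# Named aggregation: output column -> (input column, aggregation).
# A dict is unpacked because 'denied_purchase?' is not a valid keyword name.
summary = historic.groupby('card_id').agg(**{
    'denied_purchase?': ('denied_flag', 'max'),
    'oldest_Date': ('purchase_date', 'min'),
    'max_installments': ('installments', 'max'),
    'max_month_lag': ('month_lag', 'max'),
})

# Left join: match df_card_features['card_id'] against the index of summary.
result = df_card_features.join(summary, on='card_id')
print(result)
```

Here result keeps card_id as an ordinary column. A card with no rows in df_cart_historic would get missing values in the four new columns, because the join is a left join.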
