Concatenate pandas DataFrames on columns, similar to outer merge
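I have 3 DataFrames, each with dates in its first column. I want to concatenate these DataFrames, but with rows aligned on the date values: where a date matches, the values should land in the same row; otherwise I want a NaN.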
import numpy as np
import pandas as pd
# Create the pandas DataFrame
df1 = pd.DataFrame(['2018-12-31','2019-09-30','2022-01-31'], columns = ['Date1'])
df2 = pd.DataFrame(['2019-09-30','2022-02-28'], columns = ['Date2'])
df3 = pd.DataFrame(['2019-09-30','2021-06-30','2021-11-30','2022-03-31'], columns = ['Date3'])
display(df1)
display(df2)
display(df3)
data = {'Date1': ['2018-12-31','2019-09-30',np.nan,np.nan,'2022-01-31',np.nan,np.nan],
        'Date2': [np.nan,'2019-09-30',np.nan,np.nan,np.nan,'2022-02-28',np.nan],
        'Date3': [np.nan,'2019-09-30','2021-06-30','2021-11-30',np.nan,np.nan,'2022-03-31']}
desired_df = pd.DataFrame(data)
desired_df
This is what I want to achieve.
| | Date1 | Date2 | Date3 |
|---|---|---|---|
| 0 | 2018-12-31 | NaN | NaN |
| 1 | 2019-09-30 | 2019-09-30 | 2019-09-30 |
| 2 | NaN | NaN | 2021-06-30 |
| 3 | NaN | NaN | 2021-11-30 |
| 4 | 2022-01-31 | NaN | NaN |
| 5 | NaN | 2022-02-28 | NaN |
| 6 | NaN | NaN | 2022-03-31 |
My initial idea was to use something like:
pd.concat([df1,df2,df3], axis=1, join="outer")
However, the above produces something like:
| Date1 | Date2 | Date3 |
|---|---|---|
| 2018-12-31 | 2019-09-30 | 2019-09-30 |
| 2019-09-30 | 2022-02-28 | 2021-06-30 |
| 2022-01-31 | NaN | 2021-11-30 |
| NaN | NaN | 2022-03-31 |
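A minimal sketch of why the rows end up paired by position, assuming the frames keep their default integer index: `pd.concat(..., axis=1)` aligns on index labels, not on the values stored in the columns, so label 0 of one frame is matched with label 0 of the others regardless of the dates they hold.

```python
import pandas as pd

df1 = pd.DataFrame(['2018-12-31', '2019-09-30', '2022-01-31'], columns=['Date1'])
df2 = pd.DataFrame(['2019-09-30', '2022-02-28'], columns=['Date2'])

# Both frames carry the default RangeIndex (0, 1, 2, ...), so the outer join
# is performed on those integer labels rather than on the dates.
naive = pd.concat([df1, df2], axis=1, join="outer")

print(naive.index.tolist())    # the union of the integer labels
print(naive.loc[0].tolist())   # two different dates share label 0
```

To align on the dates instead, the dates themselves have to become the index before concatenating.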
We can set_index with the dates (setting the drop parameter to False so that the column is not lost), then concat horizontally:
out = (pd.concat([df.set_index(f'Date{i+1}', drop=False)
                  for i, df in enumerate([df1, df2, df3])], axis=1)
         .sort_index().reset_index(drop=True))
Output:
Date1 Date2 Date3
0 2018-12-31 NaN NaN
1 2019-09-30 2019-09-30 2019-09-30
2 NaN NaN 2021-06-30
3 NaN NaN 2021-11-30
4 2022-01-31 NaN NaN
5 NaN 2022-02-28 NaN
6 NaN NaN 2022-03-31
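The same idea can be wrapped in a small helper that does not hardcode the `Date{i+1}` column names. This is only a sketch: `align_on_first_column` is a hypothetical name, and it assumes the key column is the first column of every frame and that its values are unique within each frame.

```python
import pandas as pd

def align_on_first_column(frames):
    """Outer-align DataFrames on the values of each frame's first column.

    Hypothetical helper; assumes key values are unique within each frame.
    """
    # Use the first column as the index, keeping it as a regular column too.
    indexed = [df.set_index(df.columns[0], drop=False) for df in frames]
    # With axis=1, concat aligns on the (date) index and fills gaps with NaN.
    out = pd.concat(indexed, axis=1)
    # Order rows by date, then discard the date index in favor of 0..n-1.
    return out.sort_index().reset_index(drop=True)

df1 = pd.DataFrame(['2018-12-31', '2019-09-30', '2022-01-31'], columns=['Date1'])
df2 = pd.DataFrame(['2019-09-30', '2022-02-28'], columns=['Date2'])
df3 = pd.DataFrame(['2019-09-30', '2021-06-30', '2021-11-30', '2022-03-31'],
                   columns=['Date3'])

result = align_on_first_column([df1, df2, df3])
print(result)
```

The sort here is a plain string sort, which happens to be chronological because the dates are ISO-formatted (`YYYY-MM-DD`). For other date formats, converting the columns with `pd.to_datetime` first would be the safer choice.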