How to concat or merge three tables with different numbers of columns in pandas?
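My trouble starts with a JSON file that holds some "device" information, along with parameters that differ from device to device.
I was able to capture each device's JSON as a single-row DataFrame. Each one has 40-60 columns, including some common columns.
Sample data, as reproducible code:
import pandas as pd

df1 = pd.DataFrame({'id': {0: 1122},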
'c1': {0: 'uid'},
'c2': {0: 'iopw'},
'c3': {0: 'uywy'},
'c4': {0: '7uyw'},
'c5': {0: 'iwoq'},
'c6': {0: 'owoe'}}
)
df2 = pd.DataFrame({'id': {0: 9910},
'c1': {0: 'mnjjj'},
'c3': {0: 'mhji'},
'c6': {0: 'mb '},
'c8': {0: 'bly'},
'c14': {0: 'bnhg'},
'c15': {0: 'kkkl'},
'c20': {0: 'llug'},
'c25': {0: '87jo'}})
df3 = pd.DataFrame({'id': {0: 2020},
'c4': {0: 'kvkh'},
'c5': {0: 'kjhjkh'},
'c10': {0: 'cvcvc'},
'c15': {0: 'ququ'}})
I tried merging, but the problem with the code below is that it creates duplicate columns.
dfs = [df1, df2, df3]
from functools import reduce
df_final = reduce(lambda left,right: pd.merge(left,right,on='id',how="outer"), dfs)
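How can I avoid the duplicates, or is there a cleaner way to concat or merge the tables so that no column is duplicated?
The expected output is shown below. It should have 3 rows and the correct number of columns.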
{'id': {0: 1122, 1: 9910, 2: 2020},
'c1': {0: 'uid', 1: 'mnjjj', 2: nan},
'c2': {0: 'iopw', 1: nan, 2: nan},
'c3': {0: 'uywy', 1: 'mhji', 2: nan},
'c4': {0: '7uyw', 1: nan, 2: 'kvkh'},
'c5': {0: 'iwoq', 1: nan, 2: 'kjhjkh'},
'c6': {0: 'owoe', 1: 'mb', 2: nan},
'c7': {0: nan, 1: nan, 2: nan},
'c8': {0: nan, 1: 'bly', 2: nan},
'c9': {0: nan, 1: nan, 2: nan},
'c10': {0: nan, 1: nan, 2: 'cvcvc'},
'c11': {0: nan, 1: nan, 2: nan},
'c12': {0: nan, 1: nan, 2: nan},
'c13': {0: nan, 1: nan, 2: nan},
'c14': {0: nan, 1: 'bnhg', 2: nan},
'c15': {0: nan, 1: 'kkkl', 2: 'ququ'},
'c16': {0: nan, 1: nan, 2: nan},
'c17': {0: nan, 1: nan, 2: nan},
'c18': {0: nan, 1: nan, 2: nan},
'c19': {0: nan, 1: nan, 2: nan},
'c20': {0: nan, 1: 'llug', 2: nan},
'c21': {0: nan, 1: nan, 2: nan},
'c22': {0: nan, 1: nan, 2: nan},
'c23': {0: nan, 1: nan, 2: nan},
'c24': {0: nan, 1: nan, 2: nan},
'c25': {0: nan, 1: '87jo', 2: nan}}
Use concat, with the index created from id by DataFrame.set_index:
dfs = [df1, df2, df3]
df = pd.concat([x.set_index('id') for x in dfs], sort=True)
print (df)
c1 c10 c14 c15 c2 c20 c25 c3 c4 c5 c6 \
id
1122 uid NaN NaN NaN iopw NaN NaN uywy 7uyw iwoq owoe
9910 mnjjj NaN bnhg kkkl NaN llug 87jo mhji NaN NaN mb
2020 NaN cvcvc NaN ququ NaN NaN NaN NaN kvkh kjhjkh NaN
c8
id
1122 NaN
9910 bly
2020 NaN
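For contrast with the merge attempt in the question, here is a minimal sketch of where the duplicates come from. It assumes two trimmed-down frames (`left` and `right` are hypothetical stand-ins, not the full data above): merge joins only on id, so any other column present in both frames is kept twice with the default _x/_y suffixes, while concat aligns rows on column names.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-ins for two of the device frames.
left = pd.DataFrame({"id": [1122], "c1": ["uid"], "c2": ["iopw"]})
right = pd.DataFrame({"id": [9910], "c1": ["mnjjj"], "c8": ["bly"]})

# merge joins on 'id' only; the shared non-key column c1 is kept from
# both sides and renamed with the default suffixes.
merged = pd.merge(left, right, on="id", how="outer")
print(sorted(merged.columns))

# concat stacks the rows and aligns on column names, so c1 stays one column
# and missing cells become NaN.
stacked = pd.concat([left, right], ignore_index=True)
print(sorted(stacked.columns))
```

This is why stacking the single-row frames fits this data better than joining them: the ids never match, so there is nothing to join on.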
Then, to add every possible c column, use Series.str.extract with DataFrame.reindex:
maxim = df.columns.str.extract(r'(\d+)', expand=False).astype(int).max()
cols = [f'c{x}' for x in range(1, maxim+1)]
df = df.reindex(columns = cols).reset_index()
print (df)
id c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c16 c17 \
0 1122 uid iopw uywy 7uyw iwoq owoe NaN NaN NaN ... NaN NaN
1 9910 mnjjj NaN mhji NaN NaN mb NaN bly NaN ... NaN NaN
2 2020 NaN NaN NaN kvkh kjhjkh NaN NaN NaN NaN ... NaN NaN
c18 c19 c20 c21 c22 c23 c24 c25
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN llug NaN NaN NaN NaN 87jo
2 NaN NaN NaN NaN NaN NaN NaN NaN
[3 rows x 26 columns]
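The two steps can also be wrapped into one helper. This is only a sketch under stated assumptions: the function name `combine_devices` and its `key` and `prefix` parameters are hypothetical, and it uses Python's `re` module instead of `str.extract` to find the largest column number. Columns that do not match the prefix-plus-number pattern are dropped by the reindex, just as in the code above.

```python
import re
import pandas as pd

def combine_devices(frames, key="id", prefix="c"):
    """Stack single-row frames on `key` and add every `prefix<N>` column up to the largest N seen."""
    # Align the frames on column names, one row per frame.
    stacked = pd.concat([f.set_index(key) for f in frames])
    # Collect the numeric suffix of every column named like `prefix<N>`.
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)")
    matches = (pattern.fullmatch(str(col)) for col in stacked.columns)
    numbers = [int(m.group(1)) for m in matches if m]
    # Build prefix1..prefixMax in numeric order; absent columns become all-NaN.
    cols = [f"{prefix}{n}" for n in range(1, max(numbers) + 1)]
    return stacked.reindex(columns=cols).reset_index()

# Small made-up frames to exercise the helper.
a = pd.DataFrame({"id": [1], "c1": ["x"], "c3": ["y"]})
b = pd.DataFrame({"id": [2], "c2": ["z"], "c5": ["w"]})
out = combine_devices([a, b])
print(out)
```

Called as `combine_devices([df1, df2, df3])`, it should give the same 3 rows and 26 columns as the step-by-step version.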