How to concat or merge three tables with different number of columns in pandas?

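My trouble starts with a JSON file that holds certain "device" information, along with parameters that differ from device to device.

I was able to capture each device's JSON as a single-row dataframe. Each one has 40-60 columns, including some common columns.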

The sample data looks like this:

Reproducible code:

import pandas as pd

df1 = pd.DataFrame({'id': {0: 1122},
 'c1': {0: 'uid'},
 'c2': {0: 'iopw'},
 'c3': {0: 'uywy'},
 'c4': {0: '7uyw'},
 'c5': {0: 'iwoq'},
 'c6': {0: 'owoe'}}
)

df2 = pd.DataFrame({'id': {0: 9910},
 'c1': {0: 'mnjjj'},
 'c3': {0: 'mhji'},
 'c6': {0: 'mb '},
 'c8': {0: 'bly'},
 'c14': {0: 'bnhg'},
 'c15': {0: 'kkkl'},
 'c20': {0: 'llug'},
 'c25': {0: '87jo'}})


df3 = pd.DataFrame({'id': {0: 2020},
 'c4': {0: 'kvkh'},
 'c5': {0: 'kjhjkh'},
 'c10': {0: 'cvcvc'},
 'c15': {0: 'ququ'}})

I tried merging, but the problem with the code below is that it creates duplicate columns.

from functools import reduce

dfs = [df1, df2, df3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id', how='outer'), dfs)
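To make the problem concrete, here is a minimal sketch using two cut-down frames (the names `left`, `right` and `merged` are made up for this illustration). When both sides of a merge have a non-key column with the same name, pandas keeps both and tells them apart with its default `_x` and `_y` suffixes, which is where the duplicates come from.

```python
import pandas as pd

# Two single-row frames that share the key 'id' and the non-key column 'c1'.
left = pd.DataFrame({"id": [1122], "c1": ["uid"], "c2": ["iopw"]})
right = pd.DataFrame({"id": [9910], "c1": ["mnjjj"], "c3": ["mhji"]})

# Only 'id' is a join key, so the shared 'c1' column is kept twice,
# once per side, with the default suffixes appended.
merged = pd.merge(left, right, on="id", how="outer")
print(merged.columns.tolist())
```

With more frames and more shared columns, every further merge in the `reduce` chain can add another pair of suffixed columns.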

How can I avoid the duplicates, or is there a cleaner way to concat or merge the tables so that I don't end up with any duplicate columns?


The expected output is shown below. It should have 3 rows and the correct number of columns.

{'id': {0: 1122, 1: 9910, 2: 2020},
 'c1': {0: 'uid', 1: 'mnjjj', 2: nan},
 'c2': {0: 'iopw', 1: nan, 2: nan},
 'c3': {0: 'uywy', 1: 'mhji', 2: nan},
 'c4': {0: '7uyw', 1: nan, 2: 'kvkh'},
 'c5': {0: 'iwoq', 1: nan, 2: 'kjhjkh'},
 'c6': {0: 'owoe', 1: 'mb', 2: nan},
 'c7': {0: nan, 1: nan, 2: nan},
 'c8': {0: nan, 1: 'bly', 2: nan},
 'c9': {0: nan, 1: nan, 2: nan},
 'c10': {0: nan, 1: nan, 2: 'cvcvc'},
 'c11': {0: nan, 1: nan, 2: nan},
 'c12': {0: nan, 1: nan, 2: nan},
 'c13': {0: nan, 1: nan, 2: nan},
 'c14': {0: nan, 1: 'bnhg', 2: nan},
 'c15': {0: nan, 1: 'kkkl', 2: 'ququ'},
 'c16': {0: nan, 1: nan, 2: nan},
 'c17': {0: nan, 1: nan, 2: nan},
 'c18': {0: nan, 1: nan, 2: nan},
 'c19': {0: nan, 1: nan, 2: nan},
 'c20': {0: nan, 1: 'llug', 2: nan},
 'c21': {0: nan, 1: nan, 2: nan},
 'c22': {0: nan, 1: nan, 2: nan},
 'c23': {0: nan, 1: nan, 2: nan},
 'c24': {0: nan, 1: nan, 2: nan},
 'c25': {0: nan, 1: '87jo', 2: nan}}

Use concat, after setting id as the index with DataFrame.set_index:

dfs = [df1, df2, df3]

df = pd.concat([x.set_index('id') for x in dfs], sort=True)
print (df)
         c1    c10   c14   c15    c2   c20   c25    c3    c4      c5    c6  \
id                                                                           
1122    uid    NaN   NaN   NaN  iopw   NaN   NaN  uywy  7uyw    iwoq  owoe   
9910  mnjjj    NaN  bnhg  kkkl   NaN  llug  87jo  mhji   NaN     NaN   mb    
2020    NaN  cvcvc   NaN  ququ   NaN   NaN   NaN   NaN  kvkh  kjhjkh   NaN   

       c8  
id         
1122  NaN  
9910  bly  
2020  NaN  
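As a small illustration of why this works: `concat` stacks rows and aligns the frames on column names (an outer join on columns by default), so a column shared by several frames stays a single column and missing values become NaN. The sketch below uses two cut-down frames and `ignore_index=True` instead of `set_index`; the names `a`, `b` and `stacked` are made up for the example.

```python
import pandas as pd

# Two single-row frames with partly overlapping columns.
a = pd.DataFrame({"id": [1122], "c1": ["uid"], "c2": ["iopw"]})
b = pd.DataFrame({"id": [9910], "c1": ["mnjjj"], "c3": ["mhji"]})

# Rows are stacked; columns are matched by name, so 'c1' appears once.
# sort=False keeps the columns in order of first appearance.
stacked = pd.concat([a, b], ignore_index=True, sort=False)
print(stacked)
```

Whether to keep `id` as a regular column (as here) or as the index (as above) is a matter of taste; the index version makes the later `reindex(columns=...)` step touch only the c columns.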

Then, to add every c column up to the highest number seen (including the ones missing from all inputs), use Series.str.extract with DataFrame.reindex:

maxim = df.columns.str.extract(r'(\d+)', expand=False).astype(int).max()
cols = [f'c{x}' for x in range(1, maxim+1)]
df = df.reindex(columns = cols).reset_index()
print (df)
     id     c1    c2    c3    c4      c5    c6  c7   c8  c9  ... c16  c17  \
0  1122    uid  iopw  uywy  7uyw    iwoq  owoe NaN  NaN NaN  ... NaN  NaN   
1  9910  mnjjj   NaN  mhji   NaN     NaN   mb  NaN  bly NaN  ... NaN  NaN   
2  2020    NaN   NaN   NaN  kvkh  kjhjkh   NaN NaN  NaN NaN  ... NaN  NaN   

   c18  c19   c20 c21  c22  c23  c24   c25  
0  NaN  NaN   NaN NaN  NaN  NaN  NaN   NaN  
1  NaN  NaN  llug NaN  NaN  NaN  NaN  87jo  
2  NaN  NaN   NaN NaN  NaN  NaN  NaN   NaN  

[3 rows x 26 columns]
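If this has to be repeated for many device files, the two steps can be wrapped in one function. This is only a sketch: `combine_on_id` and its `key` and `prefix` parameters are hypothetical names, and it assumes every non-key column is the prefix followed by a number (a column without digits would make the `astype(int)` step fail).

```python
import pandas as pd

def combine_on_id(frames, key="id", prefix="c"):
    """Stack frames on `key` and add every prefix+number column up to the max seen."""
    # Step 1: stack the rows, aligning columns by name.
    stacked = pd.concat([f.set_index(key) for f in frames], sort=False)
    # Step 2: find the highest column number and build the full column list.
    numbers = stacked.columns.str.extract(r"(\d+)", expand=False).astype(int)
    cols = [f"{prefix}{n}" for n in range(1, numbers.max() + 1)]
    # Columns absent from every input are created and filled with NaN.
    return stacked.reindex(columns=cols).reset_index()

# Tiny usage example with made-up data.
a = pd.DataFrame({"id": [1], "c1": ["x"], "c4": ["y"]})
b = pd.DataFrame({"id": [2], "c2": ["z"]})
out = combine_on_id([a, b])
print(out)
```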