Expand embedded list of dictionaries in a DataFrame as new columns of the DataFrame

I have a Pandas DataFrame that looks like this:

import pandas as pd
print(pd.__version__)

df0 = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

Each list in column 2 holds dicts with exactly the same keys (dst, object, admin).

Each row of df0 can hold anywhere from 0 (an empty []) to 100 entries in its list.

I would like to expand the df0 DataFrame so that it looks like this:

columns = ['id', 'name', 'dst', 'object', 'admin']

df_wanted
Out[416]: 
     id name  dst  object                admin
    12  None  925 'Lok. Certification'   'valid'
    12  None  935 'Lok. Administration'  'true'
    12  None  944 'Lok. Customer'        'false'
    12  None  945 'Lok. Customer'        'false'
    12  None  954 'Lok. Certification-C' 'invalid'
    12  None  956 'Lok. Certification'   'valid'
    13   wXB  986 'Fral_heater'          'valid'
    13   wXB  987 'Fral_cond.'           'valid'
    ...

Note that the first two columns, id and name, are replicated down the rows to match the number of elements in each list.

(The dst column must be cast to int at the end, using .astype(int).)

How can I achieve this?

Info:

Python 3.10.4
pd.__version__
'1.4.2'

You can explode the column first, then convert the dicts into columns:

df0 = df0.explode(2, ignore_index=True)    
df0 = pd.concat([df0, df0[2].apply(pd.Series)], axis=1).drop(columns=2)
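The snippet above leaves the first two columns unnamed and dst as strings, while the question asks for id/name headers and an integer dst. A minimal sketch of those last steps, assuming a shortened copy of df0 and the column names from the question's columns list:

```python
import pandas as pd

# Shortened copy of the question's data (assumption: same shape, fewer dicts)
df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])

# One row per dict; the other columns are repeated automatically
exploded = df0.explode(2, ignore_index=True)

# Turn each dict into a row of columns, then drop the original dict column
expanded = pd.concat([exploded, exploded[2].apply(pd.Series)], axis=1).drop(columns=2)

# Name the two leading columns and cast dst from str to int
expanded = expanded.rename(columns={0: 'id', 1: 'name'})
expanded['dst'] = expanded['dst'].astype(int)

print(expanded)
```

The cast is done last because dst only exists as its own column once the dicts have been expanded.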

I suggest expanding the column into a new df, then joining it to the main df:

# Copied from above
df = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

# Set the names of the columns
df.columns = ['id', 'name', 'object']

# Create a new df from the column
df_tmp = df['object'].explode().apply(pd.Series)

# Join to original
df = pd.concat([df[['id', 'name']], df_tmp], axis=1).reset_index(drop=True)

# Result:
|    |   id | name   |   dst | object               | admin   |
|---:|-----:|:-------|------:|:---------------------|:--------|
|  0 |   12 |        |   925 | Lok. Certification   | valid   |
|  1 |   12 |        |   935 | Lok. Administration  | true    |
|  2 |   12 |        |   944 | Lok. Customer        | false   |
|  3 |   12 |        |   945 | Lok. Customer        | false   |
|  4 |   12 |        |   954 | Lok. Certification-C | invalid |
|  5 |   12 |        |   956 | Lok. Certification   | valid   |
|  6 |   13 | wXB    |   986 | Fral_heater          | valid   |
|  7 |   13 | wXB    |   987 | Fral_cond.           | valid   |
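The approach above relies on pd.concat aligning a frame with a unique index against one whose index labels repeat after explode(). As a possible alternative, here is a sketch of the same idea using DataFrame.join, which aligns on the index labels explicitly. It assumes a shortened copy of the data; the name joined is only illustrative.

```python
import pandas as pd

# Shortened copy of the data (assumption), with the same column names as above
df = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
], columns=['id', 'name', 'object'])

# explode() keeps the source row's index label on every dict it produces
df_tmp = df['object'].explode().apply(pd.Series)

# join() matches on those labels, so each id/name pair repeats once per dict
joined = df[['id', 'name']].join(df_tmp).reset_index(drop=True)

print(joined)
```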

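The other answers are good. Instead of building a Series object row by row, you can construct a single DataFrame and join it back to the original. This should be a bit faster.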

df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))

Output:

    0     1  dst                object    admin
0  12  None  925    Lok. Certification    valid
1  12  None  935   Lok. Administration     true
2  12  None  944         Lok. Customer    false
3  12  None  945         Lok. Customer    false
4  12  None  954  Lok. Certification-C  invalid
5  12  None  956    Lok. Certification    valid
6  13   wXB  986           Fral_heater    valid
7  13   wXB  987            Fral_cond.    valid
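The question mentions that a row may hold an empty list. explode() turns an empty list into NaN, which is not a dict, so building a DataFrame straight from the exploded values may not behave as expected. A hedged sketch of one way to handle this, assuming a hypothetical second row with an empty list; the names records, with_blanks and without_blanks are illustrative only:

```python
import pandas as pd

df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', []],  # hypothetical row with an empty list
])

exploded = df0.explode(2, ignore_index=True)

# Replace the NaN left by an empty list with an empty dict, so every item is a dict
records = [d if isinstance(d, dict) else {} for d in exploded.pop(2)]

# Option 1: keep such rows, with NaN in the expanded columns
with_blanks = exploded.join(pd.DataFrame(records, index=exploded.index))

# Option 2: drop rows that contributed no dict at all
without_blanks = with_blanks.dropna(subset=['dst', 'object', 'admin'], how='all')

print(with_blanks)
print(without_blanks)
```

Which option fits depends on whether ids with no entries should still appear in the result.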

Benchmark:

>>> %timeit df1 = df0.explode(2, ignore_index=True); df1 = pd.concat([df1, df1[2].apply(pd.Series)], axis=1).drop(columns=2)
8.4 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))
4.8 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
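%timeit is only available in IPython/Jupyter. To repeat the comparison in a plain Python script, something like the following sketch with the standard timeit module could be used. The function names via_apply and via_join and the run count are arbitrary choices, the data is shortened, and timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])


def via_apply(df):
    # explode, then build one Series per dict
    out = df.explode(2, ignore_index=True)
    return pd.concat([out, out[2].apply(pd.Series)], axis=1).drop(columns=2)


def via_join(df):
    # explode, then build a single DataFrame from the list of dicts
    return df.explode(2, ignore_index=True).pipe(
        lambda x: x.join(pd.DataFrame(x.pop(2).tolist()))
    )


runs = 20  # arbitrary; raise it for steadier numbers
t_apply = timeit.timeit(lambda: via_apply(df0), number=runs) / runs
t_join = timeit.timeit(lambda: via_join(df0), number=runs) / runs

print(f"apply(pd.Series): {t_apply * 1e3:.2f} ms per call")
print(f"join(DataFrame):  {t_join * 1e3:.2f} ms per call")
```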

Pull out column 2 and create a DataFrame from it:

from itertools import chain

mix = df0.pop(2)
lengths = mix.str.len()  # number of dicts in each row's list
mix = pd.DataFrame(chain.from_iterable(mix))  # one row per dict

Repeat each row of df0 according to those lengths:

df0 = df0.loc[df0.index.repeat(lengths)]
df0.index = range(len(df0))

Concatenate the two DataFrames:

pd.concat([df0, mix], axis = 1)

    0     1  dst                object    admin
0  12  None  925    Lok. Certification    valid
1  12  None  935   Lok. Administration     true
2  12  None  944         Lok. Customer    false
3  12  None  945         Lok. Customer    false
4  12  None  954  Lok. Certification-C  invalid
5  12  None  956    Lok. Certification    valid
6  13   wXB  986           Fral_heater    valid
7  13   wXB  987            Fral_cond.    valid

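If you want more speed, you can drop down to numpy, build everything up as a dictionary, and then create a new DataFrame (useful if you have a lot of data and care about memory usage/performance).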
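One way this numpy route might look is sketched below. It is not necessarily the exact code the answer had in mind: the data is shortened, the column names id and name are taken from the question, and the variable names are illustrative. Rows with an empty list get a repeat count of 0 and so disappear from the result.

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])

lists = df0[2].to_numpy()                          # object array of lists
lengths = np.array([len(lst) for lst in lists])    # dicts per row

# Repeat the scalar columns once per dict in the matching list
data = {
    'id': np.repeat(df0[0].to_numpy(), lengths),
    'name': np.repeat(df0[1].to_numpy(), lengths),
}

# Flatten the lists and collect each key's values in order
flat = [d for lst in lists for d in lst]
for key in ('dst', 'object', 'admin'):
    data[key] = [d[key] for d in flat]

result = pd.DataFrame(data)
result['dst'] = result['dst'].astype(int)

print(result)
```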