Expand an embedded list of dictionaries in a DataFrame into new columns of the DataFrame
I have a Pandas DataFrame that looks like this:
import pandas as pd
print(pd.__version__)
df0 = pd.DataFrame([
[12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
{'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
{'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
{'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
{'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
{'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
[13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
{'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
])
Every list in column 2 has exactly the same keys (dst, object and admin).
Each row of df0 can hold anywhere from 0 (an empty []) to 100 dictionaries in its list.
I would like to expand df0 into a DataFrame that looks like this:
columns = ['id', 'name', 'dst', 'object', 'admin']
df_wanted
Out[416]:
id name dst object admin
12 None 925 'Lok. Certification' 'valid'
12 None 935 'Lok. Administration' 'true'
12 None 944 'Lok. Customer' 'false'
12 None 945 'Lok. Customer' 'false'
12 None 955 'Lok. Certification-C' 'invalid'
12 None 956 'Lok. Certification' 'valid'
13 wXB 987 'Lok. Fral_heater' 'valid'
13 wXB 986 'Lok. Fral_cond.' 'valid'
...
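Note that the first two columns, id and name, are replicated along the rows to match the number of elements in their list.
(The dst column must be converted to int at the end with .astype(int).)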
How can I do this?
Info:
Python 3.10.4
pd.__version__
'1.4.2'
You can first explode the column, then convert the dictionaries into columns:
df0 = df0.explode(2, ignore_index=True)
df0 = pd.concat([df0, df0[2].apply(pd.Series)], axis=1).drop(columns=2)
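As a rough end-to-end sketch of the same idea, the snippet below runs on a shortened copy of the sample data and additionally names the first two columns and converts dst to int, as the question asks. The variable name out and the trimmed data are assumptions made for illustration only.

```python
import pandas as pd

# Shortened copy of the sample data from the question (assumption: 3 dicts only).
df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])

# One row per dictionary; columns 0 and 1 are repeated automatically.
out = df0.explode(2, ignore_index=True)

# Turn each dictionary into columns, then drop the original column 2.
out = pd.concat([out, out[2].apply(pd.Series)], axis=1).drop(columns=2)

# Name the first two columns and convert dst from str to int.
out = out.rename(columns={0: 'id', 1: 'name'})
out['dst'] = out['dst'].astype(int)

print(out)
```

The rename and astype steps are independent of how the dictionaries are expanded, so they can be appended to any of the approaches in this thread.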
I would suggest expanding the column into a new df and then joining it to the main df:
# Copied from above
df = pd.DataFrame([
[12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
{'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
{'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
{'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
{'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
{'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
[13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
{'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
])
# Set the names of the columns
df.columns = ['id', 'name', 'object']
# Create a new df from the column
df_tmp = df['object'].explode().apply(pd.Series)
# Join to original
df = pd.concat([df[['id', 'name']], df_tmp], axis=1).reset_index(drop=True)
# Result:
| | id | name | dst | object | admin |
|---:|-----:|:-------|------:|:---------------------|:--------|
| 0 | 12 | | 925 | Lok. Certification | valid |
| 1 | 12 | | 935 | Lok. Administration | true |
| 2 | 12 | | 944 | Lok. Customer | false |
| 3 | 12 | | 945 | Lok. Customer | false |
| 4 | 12 | | 954 | Lok. Certification-C | invalid |
| 5 | 12 | | 956 | Lok. Certification | valid |
| 6 | 13 | wXB | 986 | Fral_heater | valid |
| 7 | 13 | wXB | 987 | Fral_cond. | valid |
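The other answers are good. Instead of building Series objects row by row, you can build a single DataFrame object and then join it back to the original. This should be a bit faster.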
df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))
Output:
0 1 dst object admin
0 12 None 925 Lok. Certification valid
1 12 None 935 Lok. Administration true
2 12 None 944 Lok. Customer false
3 12 None 945 Lok. Customer false
4 12 None 954 Lok. Certification-C invalid
5 12 None 956 Lok. Certification valid
6 13 wXB 986 Fral_heater valid
7 13 wXB 987 Fral_cond. valid
Benchmark:
>>> %timeit df1 = df0.explode(2, ignore_index=True); df1 = pd.concat([df1, df1[2].apply(pd.Series)], axis=1).drop(columns=2)
8.4 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))
4.8 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
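The %timeit lines above only run inside IPython. A rough plain-Python equivalent might look like the sketch below, which also checks that both approaches produce the same values. The function names with_apply and with_tolist, the shortened data and the repetition count are assumptions; actual timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Shortened copy of the sample data (assumption, for illustration only).
df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])


def with_apply(df):
    # Builds one Series per exploded row, then concatenates the result.
    out = df.explode(2, ignore_index=True)
    return pd.concat([out, out[2].apply(pd.Series)], axis=1).drop(columns=2)


def with_tolist(df):
    # Builds a single DataFrame from the list of dicts, then joins it back.
    return df.explode(2, ignore_index=True).pipe(
        lambda x: x.join(pd.DataFrame(x.pop(2).tolist()))
    )


res_apply = with_apply(df0)
res_tolist = with_tolist(df0)

n = 20  # number of repetitions per measurement
t_apply = timeit.timeit(lambda: with_apply(df0), number=n)
t_tolist = timeit.timeit(lambda: with_tolist(df0), number=n)
print(f"apply(pd.Series): {t_apply / n * 1e3:.2f} ms per call")
print(f"DataFrame(tolist): {t_tolist / n * 1e3:.2f} ms per call")
```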
Pull out column 2 and create a DataFrame from it:
from itertools import chain

mix = df0.pop(2)
lengths = mix.str.len()
mix = pd.DataFrame(chain.from_iterable(mix))
Repeat the rows of df0 according to the lengths:
df0 = df0.loc[df0.index.repeat(lengths)]
df0.index = range(len(df0))
Combine the two DataFrames:
pd.concat([df0, mix], axis = 1)
0 1 dst object admin
0 12 None 925 Lok. Certification valid
1 12 None 935 Lok. Administration true
2 12 None 944 Lok. Customer false
3 12 None 945 Lok. Customer false
4 12 None 954 Lok. Certification-C invalid
5 12 None 956 Lok. Certification valid
6 13 wXB 986 Fral_heater valid
7 13 wXB 987 Fral_cond. valid
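Since the question mentions that a row may hold an empty list, it may be worth seeing how that case behaves. The sketch below (with made-up, shortened data and the hypothetical names base, flat, exploded and repeated) suggests that explode keeps such a row with NaN in the exploded column, whereas the repeat-based approach above repeats it zero times and so drops it.

```python
from itertools import chain

import pandas as pd

# Illustrative data (assumption): the second row has an empty list.
df = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', []],
])

# explode keeps the row with the empty list and fills column 2 with NaN.
exploded = df.explode(2, ignore_index=True)

# Repeat-based approach: a length of 0 repeats the row zero times.
base = df.copy()
mix = base.pop(2)
lengths = mix.str.len()
flat = pd.DataFrame(list(chain.from_iterable(mix)))
base = base.loc[base.index.repeat(lengths)].reset_index(drop=True)
repeated = pd.concat([base, flat], axis=1)

print(exploded)
print(repeated)
```

Which behavior is preferable depends on whether rows without any dictionaries should survive in the result.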
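If you want even more speed, you can drop down to NumPy, build everything as a dictionary, and then create a new DataFrame from it (useful if you have a lot of data and care about memory usage/performance).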
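The answer does not include code for this, so the following is only one possible reading of the idea, not the answerer's implementation: pull the columns out as NumPy arrays, repeat id and name with np.repeat, collect each key into its own sequence, and build the final DataFrame from a dictionary. The shortened data and the names ids, names, lists, flat, data and out are assumptions.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data (assumption, for illustration only).
df0 = pd.DataFrame([
    [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
                {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'}]],
    [13, 'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'}]],
])

# Pull the raw columns out of pandas.
ids = df0[0].to_numpy()
names = df0[1].to_numpy()
lists = df0[2].to_numpy()

# Number of dictionaries per row, used to repeat id and name.
lengths = np.fromiter((len(lst) for lst in lists), dtype=np.intp, count=len(lists))

# Flatten all dictionaries into one list, preserving row order.
flat = [d for lst in lists for d in lst]

# Build every output column, converting dst to int along the way.
data = {
    'id': np.repeat(ids, lengths),
    'name': np.repeat(names, lengths),
    'dst': np.fromiter((int(d['dst']) for d in flat), dtype=np.int64, count=len(flat)),
    'object': [d['object'] for d in flat],
    'admin': [d['admin'] for d in flat],
}
out = pd.DataFrame(data)

print(out)
```

Whether this is actually faster than the explode-based approaches would need to be measured on data of realistic size.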