JSON Parsing Trouble in Python
I am trying to extract one element from this JSON data and format it as another column in my pandas DataFrame.
Here is my code so far:
#Import libraries
import json
import requests
from IPython.display import JSON
import pandas as pd
#Load data
astronaut_db_url = 'https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json'
astronauts_db = requests.get(astronaut_db_url).json()
#Format data
df = pd.json_normalize(astronauts_db['astronauts'])
df_astro = df[['_id','astroNumber','awards','name','gender','inSpace','overallNumber','spacewalkCount','species','speciesGroup',
'totalMinutesInSpace','totalSecondsSpacewalking','lastLaunchDate.utc']]
#Get row per award
df_awards = df_astro.explode(['awards']).reset_index(drop=True)
df_awards.head()
df_awards['awards'][0]['title']
I want to get the award titles for each astronaut in my DataFrame and create a new column that holds the list of awards in a single cell, like this:
Astronaut_ID Awards
dh3405kdmnd [First Person In Space, First Person to Cross Karman Line]
ert549fkfl3 [Crossed Karman Line, First Person on Moon]
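My idea for solving this is:
- Get one row per award for each astronaut
- Strip the JSON from each cell so that only the title remains
- Regroup into a single cell per astronaut
I am not sure how to complete step 2 of this process. Can someone point me in the right direction?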
I would treat awards as a list of dictionaries and apply a function to each of its elements.
import json
import requests
from IPython.display import JSON
import pandas as pd
#Load data
astronaut_db_url = 'https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json'
astronauts_db = requests.get(astronaut_db_url).json()
#Format data
df = pd.json_normalize(astronauts_db['astronauts'])
df_astro = df[['_id','astroNumber','awards','name','gender','inSpace','overallNumber','spacewalkCount','species','speciesGroup',
'totalMinutesInSpace','totalSecondsSpacewalking','lastLaunchDate.utc']]
#Keep only the award titles for each astronaut
df_awards = df_astro[['_id', 'awards']].copy()
df_awards['awards'] = df_awards['awards'].apply(lambda awards: [award['title'] for award in awards])
df_awards.columns = ['Astronaut_ID', 'Awards']
print(df_awards.head())
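To see what the lambda is doing without downloading the full dataset, here is a minimal sketch on made-up records. The sample data and the helper name award_titles are hypothetical. The isinstance guard is an added assumption on my part: a record with no awards key comes out of json_normalize as NaN rather than a list, and the plain list comprehension would fail on it.

```python
import pandas as pd

# Hypothetical sample records shaped like the 'astronauts' entries
sample = [
    {"_id": "a1", "awards": [{"title": "Crossed Kármán Line"}, {"title": "ISS Visitor"}]},
    {"_id": "a2", "awards": [{"title": "Crossed Kármán Line"}]},
    {"_id": "a3"},  # no 'awards' key at all, becomes NaN after json_normalize
]

def award_titles(awards):
    # Return the titles when given a list of dicts; for anything else (e.g. NaN) return an empty list
    if isinstance(awards, list):
        return [award["title"] for award in awards]
    return []

df_sample = pd.json_normalize(sample)[["_id", "awards"]].copy()
df_sample["awards"] = df_sample["awards"].apply(award_titles)
df_sample.columns = ["Astronaut_ID", "Awards"]
print(df_sample)
```

Whether the real data needs the guard depends on whether every astronaut record carries an awards list, which I have not checked.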
Instead of performing steps 1-2, you can pass record_path and meta directly to json_normalize. You can then use groupby + agg(list) to complete step 3:
df_awards = pd.json_normalize(astronauts_db['astronauts'], 'awards', '_id').groupby('_id', as_index=False)['title'].agg(list)
print(df_awards.head(5))
Output:
_id title
0 0554c903-e8a6-43c5-8da8-76fb3495e93f [First Steppe Tortoise (Agrionemys horsfieldii)]
1 0729eec8-ae2f-44a5-900f-08b2f491c8fe [Crossed Kármán Line, ISS Visitor]
2 0ff02f81-a865-465d-97b8-cd6be84c56aa [Crossed Kármán Line, ISS Visitor, Space Resid...
3 157edd2d-58a0-4f47-b85d-4c6ade14a973 [Crossed Kármán Line]
4 15c82ce2-10d5-45e7-848e-6df388307e1f [Crossed Kármán Line]
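If it helps to see the intermediate frame, here is a small sketch of the same two calls split apart and run on made-up records. The sample data and the variable names flat and grouped are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical sample records shaped like the 'astronauts' entries
sample = [
    {"_id": "a1", "awards": [{"title": "Crossed Kármán Line"}, {"title": "ISS Visitor"}]},
    {"_id": "a2", "awards": [{"title": "Crossed Kármán Line"}]},
]

# record_path='awards' gives one row per award; meta='_id' carries the parent id onto each row
flat = pd.json_normalize(sample, record_path="awards", meta="_id")
print(flat)

# Collapse back to one row per astronaut, collecting the titles into a list
grouped = flat.groupby("_id", as_index=False)["title"].agg(list)
print(grouped)
```

One thing to keep in mind with this approach: an astronaut whose awards list is empty contributes no rows to the flattened frame, so they will not appear in the grouped result.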