Converting complex nested JSON to CSV via pandas

I have the following JSON file:

{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}

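Note: the JSON above is a snippet of the original. The actual file contains more attributes after 'margins', some of them nested and others not. I have only included a few to keep it brief and to give an idea of what to expect.

My goal is to flatten the data and load it into a CSV. Here is the code I have written so far: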

import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"

with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')

    df = pd.DataFrame.from_dict(t_data, orient='index')
    print(df)            

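I know the data is being overwritten and the loop is not structured properly. I am fairly new to handling JSON objects in Python, and I can't figure out how to concatenate the results.

My goal is that, once all the results are appended, I use to_csv to write them out as rows. For each margin, the entire record should be duplicated as a separate row. Below is the output I expect. Can someone help me translate the data into this form?

From everything I found online, the approach is to collect the dictionary items first, but I can't work out how to transpose them into rows. Also, is there a better way to parse the JSON than looping twice for a single attribute (i.e. margins)?

I can't use json_normalize because it is not supported in our environment.

[output data]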

You can create a DataFrame with pd.DataFrame and explode the margins column:

import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)

                  team overallResult  totalMatches       margins
0  Sunrisers Hyderabad           Won             3  {'bar': 290}
1  Sunrisers Hyderabad           Won             3   {'bar': 90}
2        Pune Warriors          None             0          None

Then replace the None values in the margins column with a placeholder dictionary and expand the dictionaries into columns:

bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)

    bar
0   290
1    90
2  <NA>

Finally, join the result back onto the original DataFrame:

df = df.join(bar).drop(columns='margins')
print(df)

                  team overallResult  totalMatches   bar
0  Sunrisers Hyderabad           Won             3   290
1  Sunrisers Hyderabad           Won             3    90
2        Pune Warriors          None             0  <NA>
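
Since the question ultimately wants a CSV, the flattened frame still has to be written out. Here is a minimal sketch of the whole pipeline, assuming pandas 1.1 or later (needed for ignore_index in explode). The inline data dict stands in for the loaded file. The column name margin, the isinstance check (used here instead of the apply(pd.Series) step) and the Int64 cast are my own choices for illustration, not something the question requires.

```python
import pandas as pd

# Stand-in for json.load(f); same shape as the sample in the question.
data = {
    "matches": [
        {"team": "Sunrisers Hyderabad", "overallResult": "Won",
         "totalMatches": 3, "margins": [{"bar": 290}, {"bar": 90}]},
        {"team": "Pune Warriors", "overallResult": "None",
         "totalMatches": 0, "margins": None},
    ],
    "totalMatches": 70,
}

# One row per margin; a team whose margins is null keeps a single row.
df = pd.DataFrame(data["matches"]).explode("margins", ignore_index=True)

# Pull 'bar' out of each dict; anything that is not a dict becomes missing.
df["margin"] = (
    df["margins"]
    .map(lambda m: m["bar"] if isinstance(m, dict) else None)
    .astype("Int64")  # nullable integer, so 290 is not written as 290.0
)
df = df.drop(columns="margins")

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a path, for example df.to_csv('out.csv', index=False) with a file name of your choosing, writes the file instead of returning the text.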

Using the json and csv modules: create a dictionary for each team, one per margin if it has any.

import json, csv

s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''

j = json.loads(s)

matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k,thing[k]) for k in ('team','overallResult','totalMatches'))
            d['margins'] = bar
            rows.append(d)

# for row in rows: print(row)            

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')

fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())

team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
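
The question says the real file has more attributes after 'margins' and asks whether the two branches can be avoided. By default csv.DictWriter raises ValueError when a row contains a key that is not in fieldnames, so rows built from whole match dictionaries would need some care there. Below is one possible variation, offered as a sketch rather than a drop-in replacement. The "venue" key is a made-up extra attribute and the column name margin is my own choice.

```python
import csv
import io

# Sample data; "venue" is a hypothetical extra attribute used for illustration.
matches = [
    {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
     "margins": [{"bar": 290}, {"bar": 90}], "venue": "somewhere"},
    {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
     "margins": None},
]

fieldnames = ("team", "overallResult", "totalMatches", "margin")

rows = []
for match in matches:
    # `or [{}]` swaps in one empty margin for null or [], so a single
    # loop handles teams with and without margins.
    for margin in match["margins"] or [{}]:
        row = {k: v for k, v in match.items() if k != "margins"}
        row["margin"] = margin.get("bar")  # None when there is no margin
        rows.append(row)

buffer = io.StringIO(newline="")
# extrasaction="ignore" drops keys that are not in fieldnames ("venue" here)
# instead of raising ValueError.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the extra attributes should end up in the CSV as well, add their names to fieldnames instead of ignoring them.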

operator.itemgetter() can help fetch multiple values from a dictionary:
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
...   'overallResult': 'Won',
...   'team': 'Sunrisers Hyderabad',
...   'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>

I like using it and giving the callable a descriptive name, but I don't see it used much around here.
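
For what it's worth, here is a rough sketch of how itemgetter could be applied to this particular problem, using csv.writer with plain tuples instead of DictWriter. The names match_fields and get_match_fields, and the column name margin, are just ones I picked for the example.

```python
import csv
import io
from operator import itemgetter

matches = [
    {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
     "margins": [{"bar": 290}, {"bar": 90}]},
    {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
     "margins": None},
]

match_fields = ("team", "overallResult", "totalMatches")
# Callable that returns the three per-team values as a tuple.
get_match_fields = itemgetter(*match_fields)

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(match_fields + ("margin",))

for match in matches:
    # All 'bar' values for this team, or a single None if there are none.
    bars = [m["bar"] for m in (match["margins"] or [])] or [None]
    for bar in bars:
        # csv.writer writes None as an empty field.
        writer.writerow(get_match_fields(match) + (bar,))

csv_text = buffer.getvalue()
print(csv_text)
```

Because the rows are tuples built from a fixed list of keys, any additional attributes in the real file are simply left out unless they are added to match_fields.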