Reading a big JSON file with multiple objects in Python

I have a large gzip-compressed JSON file in which every line is a JSON object (i.e. a Python dict).

Here are the first two lines as an example:

  {"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}

So I use the following code to read each line into a pandas DataFrame:

import json
import gzip
import pandas as pd
import random

with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = pd.DataFrame()
    for line in fin:
        data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)

But this takes forever. Any suggestions for reading the data faster?

EDIT: this is what finally solved the problem:

import json
import gzip
import pandas as pd

with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = []
    for line in fin:
        data_lan.append(json.loads(line.decode('utf-8')))

data = pd.DataFrame(data_lan)
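
For newline-delimited JSON like this, pandas can also decompress and parse the file in a single call. A minimal sketch using read_json (the path is the one from the question; lines=True parses one JSON object per line):

import pandas as pd

# lines=True treats every line as a separate JSON object; gzip is decompressed on the fly
data = pd.read_json('data/000000000000.json.gz', lines=True, compression='gzip')
print(data.shape)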

I've dealt with a similar problem myself, and append() is rather slow. I generally load the JSON file into a list of dicts and then create the DataFrame in one go. That way you keep the flexibility of working with a list, and you only convert it to a DataFrame once you are sure about the data in it. Below is an implementation of that idea:

import json
import gzip
import pandas as pd


def get_contents_from_json(file_path: str) -> list:
    """
    Reads the contents of the gzipped JSON file (expected to contain a JSON array).
    :param file_path: path to the .json.gz file
    :return: the parsed contents, a list of dicts
    """
    try:
        with gzip.open(file_path) as file:
            contents = file.read()
        return json.loads(contents.decode('UTF-8'))
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')


def main(file_path: str):
    file_contents = get_contents_from_json(file_path)
    if not isinstance(file_contents,list):
        # I've considered you have a JSON Array in your file
        # if not let me know in the comments
        raise TypeError("The file doesn't have a JSON Array!!!")
    all_columns = file_contents[0].keys()
    data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
    print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')


if __name__ == '__main__':
    main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
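
Note that get_contents_from_json above assumes the whole file is one JSON array. The file in the question actually holds one object per line, so a variant for that layout would parse line by line instead (a sketch, assuming the gzipped newline-delimited format shown above; get_records_from_ndjson is a name introduced here for illustration):

import gzip
import json


def get_records_from_ndjson(file_path: str) -> list:
    """Parse a gzipped file with one JSON object per line into a list of dicts."""
    records = []
    with gzip.open(file_path, 'rt', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

The resulting list of dicts can then be handed to pd.DataFrame exactly as main() does above.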

A pandas DataFrame lives in a single contiguous block of memory, which means pandas needs to know the size of the data set when it creates the frame. Since append changes that size, new memory must be allocated and both the original and the appended data copied into it. As the data set grows, each copy gets larger, so building the frame row by row gets slower and slower.
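
A small self-contained sketch of that effect on toy data (not the question's file). DataFrame.append has since been removed from pandas, so the row-by-row loop below uses pd.concat, which copies in the same way:

import time
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5_000)]

# grow the frame one row at a time: every concat reallocates and copies everything so far
start = time.perf_counter()
grown = pd.DataFrame()
for row in rows:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)
print(f"row by row: {time.perf_counter() - start:.2f}s")

# build once from the full list: a single allocation
start = time.perf_counter()
built = pd.DataFrame(rows)
print(f"all at once: {time.perf_counter() - start:.2f}s")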

You can avoid this with from_records. First you need to know the row count, which means scanning the file. You could cache that number if you do this often, but it is a relatively quick operation either way. Once pandas knows the size, it can allocate the memory efficiently.

import gzip
import json

import pandas as pd

file_to_test = 'data/000000000000.json.gz'

# count rows so pandas knows how much memory to allocate
with gzip.GzipFile(file_to_test, 'r') as fin:
    row_count = sum(1 for _ in fin)

# build the dataframe from records, parsing each line into a dict
with gzip.GzipFile(file_to_test, 'r') as fin:
    data_lan = pd.DataFrame.from_records((json.loads(line) for line in fin), nrows=row_count)