Pandas read CSV with embedded JSON into dataframe

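I need to read a CSV file with pandas, and one of the columns in the CSV contains JSON data. However, once I load the file, the JSON seems to be corrupted (?) and I can't use json_normalize() on it.

I can't attach the file, but here is some sample code that demonstrates the problem: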

import pandas as pd

df = pd.DataFrame({'location_id':[1,2,3], 'visits':[{"ABCD":9,"DEFG":8,"ASDF":6},
                                                    {"XYZR":4,"ABCD":4},
                                                    {"ASDF":4}]})
pd.json_normalize(df.visits)
# OUTPUTS THE NORMALIZED JSON JUST FINE

df.to_csv('test_visits.csv')
df2 = pd.read_csv('test_visits.csv')
pd.json_normalize(df2.visits)

# RESULTS IN ERROR:
# AttributeError: 'str' object has no attribute 'values'

Am I missing something in read_csv() that would make the JSON usable?

Thanks in advance.

In [77]: df = pd.DataFrame({'location_id':[1,2,3], 'visits':[{"ABCD":9,"DEFG":8,"ASDF":6},
    ...:                                                     {"XYZR":4,"ABCD":4},
    ...:                                                     {"ASDF":4}]})

In [78]: df
Out[78]:
   location_id                             visits
0            1  {'ABCD': 9, 'DEFG': 8, 'ASDF': 6}
1            2             {'XYZR': 4, 'ABCD': 4}
2            3                        {'ASDF': 4}

In [79]: pd.json_normalize(df["visits"])
Out[79]:
   ABCD  DEFG  ASDF  XYZR
0   9.0   8.0   6.0   NaN
1   4.0   NaN   NaN   4.0
2   NaN   NaN   4.0   NaN

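This happens because once you write the data to a csv and read it back, pandas reads the 'visits' column as strings. So when you try to normalize it, it raises an error saying 'str' object has no attribute 'values', because each value is a plain string rather than a dict.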
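A quick way to confirm this is to compare the element type before and after the round trip. The sketch below reuses the question's sample data and writes to an in-memory io.StringIO buffer instead of a file on disk; the buffer and the names type_before and type_after are illustrative choices, not part of the original code.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'location_id': [1, 2, 3],
    'visits': [{"ABCD": 9, "DEFG": 8, "ASDF": 6},
               {"XYZR": 4, "ABCD": 4},
               {"ASDF": 4}],
})

# Write to an in-memory buffer rather than a file (illustrative choice)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the csv back with default settings
df2 = pd.read_csv(buffer)

# Compare the element type before and after the csv round trip
type_before = type(df['visits'].iloc[0])
type_after = type(df2['visits'].iloc[0])
print(type_before.__name__, type_after.__name__)

# The stored text is the Python repr of the dict, with single quotes
print(df2['visits'].iloc[0])
```

The single quotes in the stored text are why ast.literal_eval is a better fit here than json.loads: the column holds Python dict literals, not strict JSON.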

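  • The problem is that the 'visits' column is of type str (e.g. "{'ABCD': 9, 'DEFG': 8, 'ASDF': 6}").
  • When loading the csv with .read_csv, use the converters parameter to apply ast.literal_eval to the 'visits' column, which converts each str to a dict.
    • converters: Dict of functions for converting values in certain columns. Keys can either be integers or column labels.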
from ast import literal_eval
import pandas as pd

# load the csv using the converters parameter with literal_eval
df2 = pd.read_csv('test_visits.csv', converters={'visits': literal_eval})

# normalize the visits, join it to location_id and drop the visits column
df2 = df2.join(pd.json_normalize(df2.visits)).drop(columns=['visits'])

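# display(df2)
# (output shown for a csv saved without the index, e.g. to_csv(..., index=False);
#  otherwise an 'Unnamed: 0' column also appears)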
   location_id  ABCD  DEFG  ASDF  XYZR
0            1   9.0   8.0   6.0   NaN
1            2   4.0   NaN   NaN   4.0
2            3   NaN   NaN   4.0   NaN
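If you also control how the csv is written, a variation on this is to serialize the column as real JSON with json.dumps before saving and parse it with json.loads when reading. This is only a sketch of that alternative under the assumption that you can change the writing step; the in-memory buffer and the names out and result are illustrative.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({
    'location_id': [1, 2, 3],
    'visits': [{"ABCD": 9, "DEFG": 8, "ASDF": 6},
               {"XYZR": 4, "ABCD": 4},
               {"ASDF": 4}],
})

# Serialize each dict to a JSON string before writing
out = df.copy()
out['visits'] = out['visits'].apply(json.dumps)

# Write to an in-memory buffer rather than a file (illustrative choice)
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the JSON strings back into dicts while reading
df2 = pd.read_csv(buffer, converters={'visits': json.loads})

# Normalize the dicts, join them to location_id and drop the raw column
result = df2.join(pd.json_normalize(df2['visits'].tolist())).drop(columns=['visits'])
print(result)
```

Passing a list of dicts to pd.json_normalize avoids depending on how a particular pandas version handles a Series argument.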