How can I normalize a dict inside a CSV?

I have a CSV file that contains dictionary values.

Test.CSV

id,name,contact,Location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK

I want to normalize this CSV. Expected output:

id,name,contact_name,contact_relation,contact_number,location
1,Julie,Jenny Brown,mother,2113131313,US
1,Julie,Jorge,brother,121313131,US
2,Jim,Sana,sister,83279131,UK

I have loaded the data with a CSV reader, but I can't normalize the contact values. How can I do this?

csvfile = csv.reader(open(filename, encoding="utf8"))

This is what I have tried so far:

df=pd.read_csv(filename, converters={'contact':json.loads}, header=0)
contact_df = pd.io.json.json_normalize(df['contact'])

But I get the following error:

AttributeError: 'list' object has no attribute 'values'

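The function pd.io.json.json_normalize (exposed as pd.json_normalize since pandas 1.0) is meant to be used on JSON objects directly, that is, a dict or a list of dicts. You are passing it a pd.Series whose cells are lists, which is why it fails. A nice trick is to map each cell to its own DataFrame with apply(pd.DataFrame). You can then use concat with keys=df.index to build a contacts DataFrame with a MultiIndex, and finally merge it back onto the original DataFrame.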

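import json
import pandas as pd

# Parse the JSON text in the "contact" column into Python lists of dicts
df = pd.read_csv(filename, converters={'contact': json.loads}, header=0)
df.index.name = 'row_id'
# Build one DataFrame per row, stack them under the row index, then merge back
concat_df = df.merge(
    pd.concat(df["contact"].apply(pd.DataFrame).tolist(), keys=df.index),
    left_index=True, right_index=True
).drop(columns="contact")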

Output:

          id name_x Location       name_y relation      number
row_id                                                        
0      0   1  Julie       US  Jenny Brown   mother  2113131313
       1   1  Julie       US        Jorge  brother   121313131
1      0   2    Jim       UK         Sana   sister    83279131
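
If the goal is the flat layout from the question (contact_-prefixed columns and no MultiIndex), another pandas route is DataFrame.explode followed by pd.json_normalize. This is a different technique from the merge/concat trick above, offered only as a minimal sketch. It assumes pandas 1.0 or later, and the inline `data` string stands in for the real file so the example is self-contained.

```python
import json
from io import StringIO

import pandas as pd

# Inline copy of Test.CSV (assumption: the real file has the same layout)
data = """id,name,contact,Location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK
"""

# Decode the JSON text in "contact" into Python lists of dicts while reading
df = pd.read_csv(StringIO(data), converters={"contact": json.loads})

# One output row per contact; reset the index so it lines up with `contacts`
exploded = df.explode("contact").reset_index(drop=True)

# Expand each contact dict into its own columns, prefixed with "contact_"
contacts = pd.json_normalize(exploded["contact"].tolist()).add_prefix("contact_")

# Put the scalar columns and the contact columns side by side
result = pd.concat([exploded.drop(columns="contact"), contacts], axis=1)
result = result[["id", "name", "contact_name", "contact_relation",
                 "contact_number", "Location"]]

print(result.to_csv(index=False))
```

Note that rows with an empty contact cell would make json.loads raise here, so a real file with missing contacts would need a converter that returns an empty list for empty strings.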

If you want to do this using only the csv module, you could use something like the following:

import csv
from io import StringIO
import json

data = """id,name,contact,location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK
3,Alice,,UK"""

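fieldnames = ["id", "name", "contact_name", "contact_relation", "contact_number", "location"]
reader = csv.DictReader(StringIO(data))
# newline="" lets the csv module control line endings itself
with open("processed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Copy every column except the JSON-encoded "contact" one
        base = {k: v for k, v in row.items() if k != "contact"}
        # An empty contact cell still produces one row, with blank contact_* fields
        contacts = json.loads(row["contact"]) if row["contact"] else [{}]
        for contact in contacts:
            # Start from a fresh copy so values never leak between contacts
            new_row = dict(base)
            new_row.update({"contact_" + key: value for key, value in contact.items()})
            writer.writerow(new_row)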

This results in:

$ cat processed.csv
id,name,contact_name,contact_relation,contact_number,location
1,Julie,Jenny Brown,mother,2113131313,US
1,Julie,Jorge,brother,121313131,US
2,Jim,Sana,sister,83279131,UK
3,Alice,,,,UK

Edit: updated the code to account for entries without contact information.
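
Contacts in a real file may not all carry the same keys. As a hedged sketch of one way to cope with that using only the standard library, the generator below (`flatten_contacts` is a hypothetical helper name, not part of the csv module) yields one flat dict per contact, and csv.DictWriter's `restval` and `extrasaction` parameters take care of missing and unexpected keys. The second contact in the sample data deliberately uses a made-up "role" key instead of "relation" to show the effect.

```python
import csv
import json
from io import StringIO


def flatten_contacts(rows, contact_field="contact", prefix="contact_"):
    """Yield one flat dict per contact; rows without contacts are yielded once."""
    for row in rows:
        base = {k: v for k, v in row.items() if k != contact_field}
        contacts = json.loads(row[contact_field]) if row[contact_field] else [{}]
        for contact in contacts:
            yield {**base, **{prefix + k: v for k, v in contact.items()}}


# Illustrative data: Jorge has a hypothetical "role" key and no "relation"
data = """id,name,contact,location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""role"":""brother"",""number"":121313131}]",US
3,Alice,,UK
"""

fieldnames = ["id", "name", "contact_name", "contact_relation",
              "contact_number", "location"]

out = StringIO()  # stands in for an open file
# restval fills fields a contact lacks; extrasaction="ignore" drops unknown keys
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(flatten_contacts(csv.DictReader(StringIO(data))))

print(out.getvalue())
```

With the default extrasaction="raise", the "contact_role" key would instead trigger a ValueError, which may be preferable when silently dropping data is not acceptable.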