How can I normalize a dict inside a CSV?
I have a CSV file in which one column holds dictionary values (a JSON list of dicts).
Test.CSV
id,name,contact,Location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK
I want to normalize this CSV. Expected output:
id,name,contact_name,contact_relation,contact_number,location
1,Julie,Jenny Brown,mother,2113131313,US
1,Julie,Jorge,brother,121313131,US
2,Jim,Sana,sister,83279131,UK
I have loaded the data with the CSV reader, but I cannot normalize the contact values. How should I do this?
csvfile = csv.reader(open(filename, encoding="utf8"))
So far I have tried this:
df=pd.read_csv(filename, converters={'contact':json.loads}, header=0)
contact_df = pd.io.json.json_normalize(df['contact'])
But I get the following error:
AttributeError: 'list' object has no attribute 'values'
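The function pd.io.json.json_normalize is meant to work directly on JSON objects, but it seems you want to apply it directly to a pd.Series. A nice trick is to map each value to a Series or DataFrame. You can then use concat to build a contacts DataFrame with a multi-level index, and finally merge it back into the original DataFrame.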
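import json
import pandas as pd

# Parse the JSON string in "contact" into a list of dicts while reading
df = pd.read_csv(filename, converters={'contact': json.loads}, header=0)
df.index.name = 'row_id'
# One small DataFrame per row, stacked under the original row index
concat_df = df.merge(
    pd.concat(df["contact"].apply(pd.DataFrame).tolist(), keys=df.index),
    left_index=True, right_index=True
).drop(columns="contact")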
Output:
          id name_x Location       name_y relation      number     role
row_id
0      0   1  Julie       US  Jenny Brown   mother  2113131313      NaN
       1   1  Julie       US        Jorge      NaN   121313131  brother
1      0   2    Jim       UK         Sana   sister    83279131      NaN
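As an alternative to the concat/merge trick above, json_normalize can flatten the nested lists itself if it is handed a list of row dicts rather than a Series. The following is only a sketch under a few assumptions: a reasonably recent pandas where the function is exposed as pd.json_normalize, the sample data from the question inlined as a string, and every row having a non-empty contact value (json.loads would fail on an empty cell). The final column order and the lowercase location name are choices made here to match the expected output.

```python
import json
from io import StringIO

import pandas as pd

# Sample data copied from the question (inlined so the sketch is self-contained)
data = '''id,name,contact,Location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK
'''

# Parse the JSON string in "contact" into a list of dicts while reading
df = pd.read_csv(StringIO(data), converters={"contact": json.loads})

# json_normalize expects dicts (or a list of dicts), not a Series of lists
records = df.to_dict(orient="records")

flat = pd.json_normalize(
    records,
    record_path="contact",           # the nested list to expand into rows
    meta=["id", "name", "Location"],  # parent fields repeated for each contact
    record_prefix="contact_",         # avoids the clash between the two "name" keys
)

# Reorder and rename to match the layout asked for in the question
flat = flat.rename(columns={"Location": "location"})[
    ["id", "name", "contact_name", "contact_relation", "contact_number", "location"]
]
print(flat.to_csv(index=False))
```

Because record_prefix renames the contact keys before the parent fields are attached, no name_x / name_y suffixes appear in the result.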
If you want to do this with the csv module only, you can use something like the following:
import csv
import json
from io import StringIO

data = """id,name,contact,location
1,Julie,"[{""name"":""Jenny Brown"",""relation"":""mother"",""number"":2113131313},{""name"":""Jorge"",""relation"":""brother"",""number"":121313131}]",US
2,Jim,"[{""name"":""Sana"",""relation"":""sister"",""number"":83279131}]",UK
3,Alice,,UK"""

fieldnames = ["id", "name", "contact_name", "contact_relation", "contact_number", "location"]
reader = csv.DictReader(StringIO(data))
# newline="" lets the csv module control line endings itself
with open("processed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Copy every column except the raw JSON string
        base = {k: v for k, v in row.items() if k != "contact"}
        if row["contact"]:
            # One output row per contact, keys prefixed with "contact_"
            for contact in json.loads(row["contact"]):
                writer.writerow({**base, **{"contact_" + k: v for k, v in contact.items()}})
        else:
            # No contact info: the contact_* columns are left empty
            writer.writerow(base)
This produces:
$ cat processed.csv
id,name,contact_name,contact_relation,contact_number,location
1,Julie,Jenny Brown,mother,2113131313,US
1,Julie,Jorge,brother,121313131,US
2,Jim,Sana,sister,83279131,UK
3,Alice,,,,UK
Edit: updated the code to account for entries without contact information.
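A related edge case is a contact dict whose keys do not all appear in fieldnames (for example role instead of relation); with its default settings DictWriter raises a ValueError for such a row. The sketch below is one possible way to tolerate that, using the restval and extrasaction parameters of csv.DictWriter. The helper name flatten_contacts and the sample rows are hypothetical and only serve the illustration; it writes to an in-memory buffer instead of processed.csv so it can be run as-is.

```python
import csv
import json
from io import StringIO

FIELDNAMES = ["id", "name", "contact_name", "contact_relation",
              "contact_number", "location"]


def flatten_contacts(csv_text):
    """Return normalized CSV text with one output row per contact.

    Hypothetical helper: unknown contact keys are dropped and missing
    ones are written as empty cells.
    """
    out = StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDNAMES,
        restval="",              # value for columns missing from a row
        extrasaction="ignore",   # silently skip keys not in FIELDNAMES
        lineterminator="\n",
    )
    writer.writeheader()
    for row in csv.DictReader(StringIO(csv_text)):
        base = {k: v for k, v in row.items() if k != "contact"}
        # An empty cell becomes a single "blank" contact so the row is kept
        contacts = json.loads(row["contact"]) if row["contact"] else [{}]
        for contact in contacts:
            prefixed = {"contact_" + k: v for k, v in contact.items()}
            writer.writerow({**base, **prefixed})
    return out.getvalue()


# Illustrative input: one contact uses "role" rather than "relation"
sample = '''id,name,contact,location
1,Julie,"[{""name"":""Jorge"",""role"":""brother"",""number"":121313131}]",US
3,Alice,,UK
'''
result = flatten_contacts(sample)
print(result)
```

Dropping unknown keys loses information, so if such keys matter it may be better to collect all keys in a first pass and extend the field list instead.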