Converting a nested list containing multilevel dictionaries in Python
I have a JSON file that contains a nested list of multilevel dictionaries. I am trying to create a pandas DataFrame from this data.
Loading data:
import json

data = []
# The file holds one JSON object per line
with open('TREC_blog_2012.json') as f:
    for line in f:
        data.append(json.loads(line))
Data output:
IN LIST FORMAT: data[0]
{'id': '1d3bc37004e71da2816dbfda8df90746',
'article_url': 'https://www.washingtonpost.com/express/wp/2012/01/03/month-of-muscle/',
'title': 'Month of Muscle',
'author': 'Vicky Hallett',
'published_date': 1325608933000,
'contents': [{'content': 'Express', 'mime': 'text/plain', 'type': 'kicker'},
{'content': 'Month of Muscle', 'mime': 'text/plain', 'type': 'title'},
{'content': 'By Vicky Hallett', 'mime': 'text/plain', 'type': 'byline'},
{'content': 1325608933000, 'mime': 'text/plain', 'type': 'date'},
{'content': 'SparkPeople trainer Nicole Nichols asks for only 28 days to get you into shape',
'mime': 'text/plain',
'type': 'deck'},
{'fullcaption': 'Nicole Nichols, front, chose backup exercisers with strong but realistic physiques to make the program less intimidating.',
'imageURL': 'http://www.expressnightout.com/wp-content/uploads/2012/01/SparkPeople28DayBootcamp.jpg',
'mime': 'image/jpeg',
'imageHeight': 201,
'imageWidth': 300,
'type': 'image',
'blurb': 'Nicole Nichols, front, chose backup exercisers with strong but realistic physiques to make the program less intimidating.'},
{'content': 'If you’ve seen a Nicole Nichols workout before, chances are it was on YouTube. The fitness expert, known as just Coach Nicole to the millions of members of <a href="http://www.sparkpeople.com" target="_blank">SparkPeople.com</a>, has filmed dozens of routines for the free health website. The popular videos showcasing her girl-next-door style, gentle encouragement and clear cueing have built such a devoted following that the American Council on Exercise and Life Fitness just named her “America’s top personal trainer to watch.”',
'subtype': 'paragraph',
'type': 'sanitized_html',
'mime': 'text/html'},
{'content': '<strong>3. Prioritize.</strong> When people say they can’t fit exercise in their schedule, Nichols always asks, “How much TV do you watch?” Use your shows as a reward for your workout instead of the replacement, she suggests.',
'subtype': 'paragraph',
'type': 'sanitized_html',
'mime': 'text/html'},
{'role': '',
'type': 'author_info',
'name': 'Vicky Hallett',
'bio': 'Vicky Hallett is a freelancer and former MisFits columnist.'}],
'type': 'blog',
'source': 'The Washington Post'}
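I want to convert this data into a DataFrame, with the keys as columns and their respective values as row values.
The problem I am facing is that the key "contents" holds a list of multilevel dictionaries, and I do not understand how to convert it into proper DataFrame values.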
The method I tried:
import pandas as pd

df = pd.DataFrame(data)
# Expand the 'contents' list of the first record into its own frame
test = pd.DataFrame(df['contents'][0])
test.head()
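With the approach above, the output for df['contents'] is not aligned or assigned correctly. Any suggestions on how to parse the list of dictionaries under the 'contents' key into a proper DataFrame?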
TIA:)
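You will probably have to extract the relevant information from each sub-dictionary separately and assign it to the appropriate column of the DataFrame.
This part can be assigned to DataFrame columns directly: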
{'id': '1d3bc37004e71da2816dbfda8df90746',
'article_url': 'https://www.washingtonpost.com/express/wp/2012/01/03/month-of-muscle/',
'title': 'Month of Muscle',
'author': 'Vicky Hallett',
'published_date': 1325608933000}
This part, however, first needs to be assigned to a Python dictionary, from which you can then extract columns into the pandas DataFrame.
{'contents': [{'content': 'Express', 'mime': 'text/plain', 'type': 'kicker'}]}
So your code might look like this:
import pandas as pd
json_file = {'id': '1d3bc37004e71da2816dbfda8df90746',
             'article_url': 'https://www.washingtonpost.com/express/wp/2012/01/03/month-of-muscle/',
             'title': 'Month of Muscle',
             'author': 'Vicky Hallett',
             'published_date': 1325608933000,
             'contents': [{'content': 'Express', 'mime': 'text/plain', 'type': 'kicker'}]
             }
# The single-element 'contents' list gives the frame exactly one row
df = pd.DataFrame.from_dict(json_file)
# Pull the sub-dictionary out of the 'contents' cell
my_dict = df['contents'].values[0]
# Copy each sub-dictionary entry into a column of its own
for key in my_dict.keys():
    df[key] = my_dict[key]
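You will have to extend this procedure to the other sub-dictionaries of the JSON file, if there are any.
As long as no key/node in the original JSON file is also a key in a sub-dictionary, this code assigns every element of the sub-dictionary to the appropriate column of the DataFrame. If your dataset has multiple rows/JSON files, you can use this procedure to first convert each JSON into a pandas DataFrame and then append each converted DataFrame to a main global DataFrame, each row of which holds the information extracted from a single JSON file.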
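The "convert each JSON, then append to a global DataFrame" step described above could be sketched roughly as follows. This is only an illustration under assumptions: the two records are made-up samples (not taken from the question's file), each has exactly one element in 'contents', and none of its sub-dictionary keys clash with a top-level key. In the question's real data, 'type' appears at both levels, so it would need renaming first.

```python
import pandas as pd

# Hypothetical sample records, shaped like the question's data but simplified
records = [
    {'id': 'a1', 'title': 'First',
     'contents': [{'content': 'Express', 'mime': 'text/plain'}]},
    {'id': 'b2', 'title': 'Second',
     'contents': [{'content': 'Hello', 'mime': 'text/html'}]},
]

frames = []
for record in records:
    # One-row frame; the 'contents' cell holds the sub-dictionary
    frame = pd.DataFrame.from_dict(record)
    sub_dict = frame['contents'].values[0]
    # Copy each sub-dictionary entry into its own column
    for key in sub_dict:
        frame[key] = sub_dict[key]
    # The nested cell is no longer needed once its entries are columns
    frames.append(frame.drop(columns='contents'))

# Stack the per-record frames into one global frame
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list and calling pd.concat once tends to be cheaper than growing a DataFrame inside the loop.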
I would do something like this:
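new_data = []
for row in data:
    if 'contents' in row:
        # Emit one flat record per element of 'contents'
        for content in row['contents']:
            new_dict = dict(row)  # shallow copy of the top-level fields
            del new_dict['contents']
            # Prefix the nested keys so they cannot clash with top-level keys
            for key, value in content.items():
                new_dict['content_{}'.format(key)] = value
            new_data.append(new_dict)
    else:
        new_data.append(row)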
Note that I create one DataFrame row for each element of 'contents', so you will get 9 rows for the elements in data[0].
pd.DataFrame.from_dict(new_data)
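As a side note, the same one-row-per-element layout can probably also be produced with pandas.json_normalize, which is not what the loop above uses, so treat this as an alternative sketch. It assumes pandas 1.0 or later and that every record has a 'contents' key (a record without it would raise a KeyError here). The sample record is a shortened, made-up version of data[0].

```python
import pandas as pd

# Hypothetical, shortened record modelled on data[0] from the question
records = [
    {
        'id': 'a1',
        'title': 'Month of Muscle',
        'type': 'blog',
        'contents': [
            {'content': 'Express', 'mime': 'text/plain', 'type': 'kicker'},
            {'content': 'Month of Muscle', 'mime': 'text/plain', 'type': 'title'},
        ],
    },
]

long_df = pd.json_normalize(
    records,
    record_path='contents',        # one output row per element of this list
    meta=['id', 'title', 'type'],  # top-level fields repeated on every row
    record_prefix='content_',      # keeps nested 'type' apart from top-level 'type'
)
print(long_df.columns.tolist())
```

The record_prefix plays the same role as the 'content_{}' key formatting in the loop: without it, the nested 'type' would collide with the top-level 'type'.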
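Basically, there are two ways to turn nested dictionaries into a two-dimensional DataFrame: you can keep one row per element of your list, but then you need to add a lot of columns (one for every element of the dictionaries contained in 'contents'; the number of columns can vary a lot, which is a headache), or you can add one row for each element of 'contents'. I think the latter fits your case well.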
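For comparison, the first option (one row per record, many columns) might look something like the sketch below. The records and the 'contents_<position>_<key>' column naming are invented for illustration; the point is only to show how the column count grows with the longest 'contents' list and how shorter records end up with missing values.

```python
import pandas as pd

# Hypothetical records with 'contents' lists of different lengths
records = [
    {'id': 'a1', 'contents': [{'content': 'Express', 'type': 'kicker'},
                              {'content': 'Month of Muscle', 'type': 'title'}]},
    {'id': 'b2', 'contents': [{'content': 'Hello', 'type': 'kicker'}]},
]

wide_rows = []
for record in records:
    # Start from the top-level fields, leaving the nested list out
    row = {k: v for k, v in record.items() if k != 'contents'}
    # Spread every nested entry into its own positional column
    for i, item in enumerate(record.get('contents', [])):
        for key, value in item.items():
            row['contents_{}_{}'.format(i, key)] = value
    wide_rows.append(row)

wide_df = pd.DataFrame(wide_rows)
print(wide_df)
```

Here the second record has no second element, so its 'contents_1_*' columns are filled with NaN, which is the kind of headache mentioned above.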