How to quickly un-nest a Pandas dataframe
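I need to process a JSON file that imports into a dataframe with nested lists inside it. Before conversion to a dataframe it is a list of nested dicts; the file itself is nested.
Sample JSON: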
{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -34.77694
      ]
    }
  ]
}
The dataframe looks like:
                 sensor                        ts        value
0  [accBodyX_ftPerSec2]  2018-04-11T21:37:05.901Z   [-3.38919]
1  [accBodyY_ftPerSec2]  2018-04-11T21:37:05.901Z  [-2.004781]
2  [accBodyZ_ftPerSec2]  2018-04-11T21:37:05.901Z  [-34.77694]
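Ultimately, I want to either remove the nesting or find a way to work with it. The goal is to extract the list of values for a given sensor name, along with the accompanying timestamps, into another dataframe for processing/plotting, like this:
                         ts     value
0  2018-04-11T21:37:05.901Z  -3.38919
1  2018-04-11T21:37:06.401Z  -3.00241
2  2018-04-11T21:37:06.901Z  -3.87694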
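To remove the nesting I did the following. It is slow on only 100,000 rows, though thankfully much faster than a for loop. (Thanks to this post: python pandas operations on columns)
def func(row):
    # Replace each one-element list with its single element
    row.sensor = row.sensor[0]
    if type(row.value) is list:
        row.value = row.value[0]
    return row
df.apply(func, axis=1)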
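To work with the nesting instead, I was able to extract a single value. For example:
print( df.iloc[:,2].iloc[1][0] )
-2.004781
However, trying to return a list of the index-0 values from the list in every row returns only the first value:
print( df.iloc[:,2].iloc[:][0] )
-3.38919
Of course I could do this with a for loop, but I know there is a way to do it with Pandas functions that I have not found yet.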
You probably just need to do a bit of manual cleanup before reading the data into a DataFrame:
>>> import json
>>> import pandas as pd
>>> def collapse_lists(data):
...     return [{k: v[0] if (isinstance(v, list) and len(v) == 1)
...              else v for k, v in d.items()} for d in data]
...
>>> with open('state.json') as f:
...     data = pd.DataFrame(collapse_lists(json.load(f)['State']))
...
>>> data
               sensor                        ts       value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z         NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z   -3.389190
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z   -2.004781
3  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z  -34.776940
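This loads the JSON file into a list of Python dicts, converts any list of length 1 into a scalar value, and then loads the result into a DataFrame. Admittedly this is not the most efficient approach, but the alternatives, which involve parsing the JSON yourself, are probably overkill unless the file is very large.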
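If the data is already in a DataFrame with one-element lists in each cell, as in the question, a pandas-only alternative may also work. The sketch below is not part of the answer above: it relies on the `.str` accessor's element indexing, which also applies to list elements in an object column, and the small `df` is made up to mirror the question's layout. Missing values come through as NaN.

```python
import pandas as pd

# Hypothetical frame shaped like the one in the question:
# each cell holds a one-element list, or None where the JSON had null.
df = pd.DataFrame({
    "sensor": [["accBodyX_ftPerSec2"], ["accBodyX_ftPerSec2"], ["accBodyY_ftPerSec2"]],
    "ts": ["2018-04-11T21:37:05.401Z", "2018-04-11T21:37:05.901Z", "2018-04-11T21:37:05.901Z"],
    "value": [None, [-3.38919], [-2.004781]],
})

flat = df.copy()
# .str[0] takes element 0 of each list; non-list entries such as None become NaN.
flat["sensor"] = flat["sensor"].str[0]
flat["value"] = flat["value"].str[0].astype(float)

print(flat)
```

This only takes the first element, so it silently drops data if a list ever holds more than one item.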
Finally, convert to datetime:
>>> data['ts'] = pd.to_datetime(data['ts'])
>>> data.dtypes
sensor            object
ts        datetime64[ns]
value            float64
dtype: object
You may also want to consider converting sensor to a categorical dtype, which can save a potentially large amount of memory:
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data. (source)
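As a rough illustration of that point, the sketch below builds a made-up `sensor` column with three repeated names and compares memory use before and after the conversion. The row count is arbitrary, and the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: three sensor names repeated many times.
names = ["accBodyX_ftPerSec2", "accBodyY_ftPerSec2", "accBodyZ_ftPerSec2"]
data = pd.DataFrame({"sensor": names * 1000})

# deep=True counts the string payloads, not just the pointers to them.
as_object_bytes = data["sensor"].memory_usage(deep=True)

data["sensor"] = data["sensor"].astype("category")
as_category_bytes = data["sensor"].memory_usage(deep=True)

print(as_object_bytes, as_category_bytes)
```

The saving comes from storing each distinct name once and keeping only small integer codes per row, so it shrinks as the number of distinct sensors approaches the number of rows.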
In explicit loop form, this looks like:
def collapse_lists(data):
    result = []
    for d in data:
        entry = {}
        for k, v in d.items():
            # Unwrap one-element lists; leave everything else untouched
            if isinstance(v, list) and len(v) == 1:
                entry.update({k: v[0]})
            else:
                entry.update({k: v})
        result.append(entry)
    return result
In case you run into entries with multiple values/sensors, here is some code that may help.
Test JSON (modified to have multiple values/sensors):
{
  "State": [
    {
      "ts": "2018-04-11T21:37:05.401Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.100Z",
      "sensor": [
        "accBodyX_ftPerSec2",
        "accBodyY_ftPerSec2"
      ],
      "value": null
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2"
      ],
      "value": [
        -3.38919
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyY_ftPerSec2"
      ],
      "value": [
        -2.004781
      ]
    },
    {
      "ts": "2018-04-11T21:37:05.901Z",
      "sensor": [
        "accBodyX_ftPerSec2",
        "accBodyY_ftPerSec2",
        "accBodyZ_ftPerSec2"
      ],
      "value": [
        -1.234567,
        4.56789,
        -34.77694
      ]
    }
  ]
}
Some code to flatten it into a df so that each timestamp/sensor combination becomes its own row:
import json
import pandas as pd

def grab_json(json_filename):
    # Read the file and parse it into a Python dict
    with open(json_filename, 'r') as f:
        json_str = f.read()
    json_dict = json.loads(json_str)
    return json_dict

def create_row_per_timestamp_and_sensor(data):
    result = []
    for sub_dict in data:
        # Make sure we have an equal number of sensors/values
        values = [None]*len(sub_dict['sensor']) if sub_dict['value'] is None else sub_dict['value']
        # Zip and iterate over each sensor/value respectively
        for sensor, value in zip(sub_dict['sensor'], values):
            result.append({'ts': sub_dict['ts'],
                           'sensor': sensor,
                           'value': value})
    return result

json_dict = grab_json("df.json")  # replace "df.json" with your filename
df_list = create_row_per_timestamp_and_sensor(json_dict['State'])
new_df = pd.DataFrame(df_list)
print(new_df)
Output:
               sensor                        ts       value
0  accBodyX_ftPerSec2  2018-04-11T21:37:05.401Z         NaN
1  accBodyX_ftPerSec2  2018-04-11T21:37:05.100Z         NaN
2  accBodyY_ftPerSec2  2018-04-11T21:37:05.100Z         NaN
3  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z   -3.389190
4  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z   -2.004781
5  accBodyX_ftPerSec2  2018-04-11T21:37:05.901Z   -1.234567
6  accBodyY_ftPerSec2  2018-04-11T21:37:05.901Z    4.567890
7  accBodyZ_ftPerSec2  2018-04-11T21:37:05.901Z  -34.776940
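From a flat frame like this, getting the timestamps and values for one sensor (the goal stated in the question) is a boolean filter. The sketch below is only an illustration: `values_for_sensor` is a hypothetical helper name, the three rows are a hand-copied subset of the output above, and dropping NaN rows is an assumption about what is wanted for plotting.

```python
import pandas as pd

# A small stand-in for new_df, using rows taken from the output above.
new_df = pd.DataFrame([
    {"ts": "2018-04-11T21:37:05.401Z", "sensor": "accBodyX_ftPerSec2", "value": None},
    {"ts": "2018-04-11T21:37:05.901Z", "sensor": "accBodyX_ftPerSec2", "value": -3.38919},
    {"ts": "2018-04-11T21:37:05.901Z", "sensor": "accBodyY_ftPerSec2", "value": -2.004781},
])
new_df["ts"] = pd.to_datetime(new_df["ts"])

def values_for_sensor(df, sensor_name):
    """Return the ts/value columns for one sensor, skipping missing readings."""
    mask = (df["sensor"] == sensor_name) & df["value"].notna()
    return df.loc[mask, ["ts", "value"]].reset_index(drop=True)

acc_x = values_for_sensor(new_df, "accBodyX_ftPerSec2")
print(acc_x)
```

If the missing readings matter, remove the `notna()` condition and handle them downstream instead.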