Convert Geo json with nested lists to pandas dataframe
I have a massive GeoJSON in this form:
{'features': [{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
[-74.262041, 42.425856, 0],
[-74.261175, 42.427631, 0],
[-74.260384, 42.429253, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
[-73.929114, 42.081658, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albuquerque',
'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
[-74.76786, 43.114496, 0],
[-74.766474, 43.114656, 0]]],
'type': 'Polygon'}}}],
'type': 'FeatureCollection'}
After reading the json:
import json
with open('x.json') as f:
    data = json.load(f)
I read the values into a list and then into a dataframe:
from itertools import chain
import pandas as pd

# to get the set of all markets
mkt = set(f['properties']['MARKET'] for f in data['features'])
# to create a list of (market, flattened list of lat/long points) pairs
markets = [(market, list(chain.from_iterable(f['geometry']['coordinates']))) for f in data['features'] for market in mkt if f['properties']['MARKET'] == market]
df = pd.DataFrame(markets[0:], columns=['a','b'])
The first few rows of df are:
a b
0 Albany [[-74.264948, 42.419877, 0], [-74.262041, 42.4...
1 Albany [[-73.929627, 42.078788, 0], [-73.929114, 42.0...
2 Albany [[-74.769198, 43.114089, 0], [-74.76786, 43.11...
Then, to unnest the nested lists in column b, I used pandas concat:
df1 = pd.concat([df.iloc[:,0:1], df['b'].apply(pd.Series)], axis=1)
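But this creates 8070 columns with many NaNs. Is there a way to group all the lats and longs by market (column a)? What I need is a dataframe of about a million rows by two columns.
The desired output is: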
mkt lat long
Albany 42.419877 -74.264948
Albany 42.078788 -73.929627
..
Albuquerque 35.105361 -106.640342
Note that the zero in each list element (e.g. [-74.769198, 43.114089, 0]) needs to be ignored.
Something like this?
import pandas as pd

df = pd.json_normalize(geojson["features"])
coords = 'properties.geometry.coordinates'
# keep only the first two values of each point in the outer ring, dropping the trailing 0
df2 = (df[coords].apply(lambda r: [(i[0], i[1]) for i in r[0]])
       .apply(pd.Series).stack()
       .reset_index(level=1).rename(columns={0: coords, "level_1": "point"})
       .join(df.drop(columns=coords), how='left')).reset_index(level=0)
# GeoJSON stores each point as [longitude, latitude]
df2[['long', 'lat']] = df2[coords].apply(pd.Series)
df2
Output:
   index  point properties.geometry.coordinates properties.MARKET  \
0      0      0         (-74.264948, 42.419877)            Albany
1      0      1         (-74.262041, 42.425856)            Albany
2      0      2         (-74.261175, 42.427631)            Albany
3      0      3         (-74.260384, 42.429253)            Albany
4      1      0         (-73.929627, 42.078788)            Albany
5      1      1         (-73.929114, 42.081658)            Albany
6      2      0         (-74.769198, 43.114089)       Albuquerque
7      2      1          (-74.76786, 43.114496)       Albuquerque
8      2      2         (-74.766474, 43.114656)       Albuquerque

  properties.geometry.type       long        lat
0                  Polygon -74.264948  42.419877
1                  Polygon -74.262041  42.425856
2                  Polygon -74.261175  42.427631
3                  Polygon -74.260384  42.429253
4                  Polygon -73.929627  42.078788
5                  Polygon -73.929114  42.081658
6                  Polygon -74.769198  43.114089
7                  Polygon -74.767860  43.114496
8                  Polygon -74.766474  43.114656
where geojson is:
geojson = {'features': [{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
[-74.262041, 42.425856, 0],
[-74.261175, 42.427631, 0],
[-74.260384, 42.429253, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albany',
'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
[-73.929114, 42.081658, 0]]],
'type': 'Polygon'}}},
{'properties': {'MARKET': 'Albuquerque',
'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
[-74.76786, 43.114496, 0],
[-74.766474, 43.114656, 0]]],
'type': 'Polygon'}}}],
'type': 'FeatureCollection'}
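If only the three columns from the question (mkt, lat, long) are needed, a shorter route may be DataFrame.explode. The following is a minimal sketch, not a drop-in solution: it assumes the layout of the sample above (geometry nested inside properties, one outer ring per Polygon), uses a shortened copy of the sample data, and the names rows and out are illustrative only.

```python
import pandas as pd

# Shortened copy of the sample; "geometry" sits inside "properties" as in the question.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"properties": {"MARKET": "Albany",
                        "geometry": {"type": "Polygon",
                                     "coordinates": [[[-74.264948, 42.419877, 0],
                                                      [-74.262041, 42.425856, 0]]]}}},
        {"properties": {"MARKET": "Albuquerque",
                        "geometry": {"type": "Polygon",
                                     "coordinates": [[[-74.769198, 43.114089, 0]]]}}},
    ],
}

# One row per feature: the market name plus the outer ring of its polygon.
rows = pd.DataFrame({
    "mkt": [f["properties"]["MARKET"] for f in geojson["features"]],
    "point": [f["properties"]["geometry"]["coordinates"][0]
              for f in geojson["features"]],
})

# explode() gives every [lon, lat, 0] triple its own row and repeats mkt.
rows = rows.explode("point").reset_index(drop=True)

# GeoJSON order is [longitude, latitude, ...]; the trailing 0 is simply not read.
out = pd.DataFrame({
    "mkt": rows["mkt"],
    "lat": [p[1] for p in rows["point"]],
    "long": [p[0] for p in rows["point"]],
})
print(out)
```

Because the result is built row-wise, its width stays at three columns however long the longest polygon is, which avoids the NaN-padded wide frame described in the question.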
@Anton_vBR gave a great answer!
However, also consider the "geopandas" library as an alternative:
import geopandas
df = geopandas.read_file("yourfile.geojson")
Here df will be a geopandas.GeoDataFrame, which lets you work with the GeoJSON much as you would with an ordinary pandas DataFrame (GeoDataFrame is a subclass of pandas.DataFrame).
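The answers above are great, but here's something different. The Awkward Array library (note: I'm the author) is meant for working with nested data structures like this at large scale. As a coincidence, I used a GeoJSON file as a motivating example in the documentation, though I'm writing more tutorials that use larger Parquet files as example data and have nothing to do with geography.
(That's how this differs from @kamal-barshevich's geopandas answer: geopandas is a domain-specific library that "knows" about geography and probably has features that matter to experts in that field. Awkward Array is a generic library for manipulating data structures that knows nothing about geography.)
The documentation linked above has some examples of manipulating the GeoJSON file with the array functions alone, without Pandas, starting like this: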
>>> import urllib.request
>>> import awkward as ak
>>>
>>> url = "https://raw.githubusercontent.com/Chicago/osd-bike-routes/master/data/Bikeroutes.geojson"
>>> bikeroutes_json = urllib.request.urlopen(url).read()
>>> bikeroutes = ak.from_json(bikeroutes_json)
>>> bikeroutes
<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>
But in this answer, I'll build the Pandas structure you want. The ak.to_pandas function (named ak.to_dataframe in Awkward Array 2.x) turns nested lists into a MultiIndex. Applying it to "coordinates" inside "geometry" inside "features":
>>> bikeroutes.features.geometry.coordinates
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>
>>>
>>> ak.to_pandas(bikeroutes.features.geometry.coordinates)
values
entry subentry subsubentry subsubsubentry
0 0 0 0 -87.788573
1 41.923652
1 0 -87.788646
1 41.923651
2 0 -87.788845
... ...
1060 0 8 1 41.950493
9 0 -87.714819
1 41.950724
10 0 -87.715284
1 41.951042
[96724 rows x 1 columns]
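The lists are nested three levels deep, and the innermost level is a longitude, latitude pair (e.g. [-87.788573, 41.923652]). You want these in separate columns: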
>>> bikeroutes.features.geometry.coordinates[..., 0]
<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>
>>> bikeroutes.features.geometry.coordinates[..., 1]
<Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>
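This uses NumPy-like slicing (Awkward Array is a generalization of NumPy), taking everything in all dimensions except the last (...); the first expression extracts item 0 (longitude) and the second extracts item 1 (latitude).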
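For readers who know NumPy better than Awkward Array, the same [..., 0] idiom can be tried on a regular array. This is only an analogy: the array below is a small rectangular stand-in built from a few of the values printed above, whereas the real data has variable-length lists that plain NumPy cannot hold.

```python
import numpy as np

# Rectangular stand-in: 2 features x 1 list x 3 points x (longitude, latitude).
coords = np.array([
    [[[-87.788573, 41.923652], [-87.788646, 41.923651], [-87.788845, 41.923650]]],
    [[[-87.714486, 41.950493], [-87.714819, 41.950724], [-87.715284, 41.951042]]],
])

# "..." stands for "all of every leading dimension", so these pick
# item 0 or item 1 of the last dimension only.
lon = coords[..., 0]   # equivalent to coords[:, :, :, 0]
lat = coords[..., 1]

print(coords.shape)  # (2, 1, 3, 2)
print(lon.shape)     # (2, 1, 3)
```

The last dimension disappears from the result, just as the Awkward type goes from var * var * var * float64 to var * var * float64.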
We can use ak.zip to combine these into a new record type, giving them column names:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... }))
longitude latitude
entry subentry subsubentry
0 0 0 -87.788573 41.923652
1 -87.788646 41.923651
2 -87.788845 41.923650
3 -87.788951 41.923649
4 -87.789092 41.923648
... ... ...
1060 0 6 -87.714026 41.950199
7 -87.714335 41.950388
8 -87.714486 41.950493
9 -87.714819 41.950724
10 -87.715284 41.951042
[48362 rows x 2 columns]
This is very close to what you're looking for. The last thing you want to do is match each of these points with one of the "properties" in "features". My GeoJSON file doesn't have "MARKET":
>>> bikeroutes.features.properties.type
1061 * {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}
but "STREET" is probably a good stand-in. These properties are at a different level of nesting from the coordinates:
>>> bikeroutes.features.geometry.coordinates[..., 0].type
1061 * var * var * float64
>>> bikeroutes.features.properties.STREET.type
1061 * string
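The longitude points are nested two levels deeper than the street names, but ak.zip broadcasts the street names down to them (similar to NumPy's concept of broadcasting, with the extensions needed for variable-length lists).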
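To make the idea of broadcasting down concrete, here is a pure-Python sketch of what it means for one street name per feature to be repeated for every nested point. It illustrates the concept only and is not how Awkward Array implements it; broadcast_down is a hypothetical helper, and the lists are a tiny made-up subset of the values shown above.

```python
def broadcast_down(value, nested):
    """Repeat `value` so that it mirrors the list structure of `nested`."""
    if isinstance(nested, list):
        return [broadcast_down(value, item) for item in nested]
    return value  # reached a number: put the value in its place


# Two features, each holding one variable-length list of longitudes.
longitudes = [[[-87.788573, -87.788646]],
              [[-87.714819, -87.715284, -87.714486]]]
streets = ["W FULLERTON AVE", "N ELSTON AVE"]

# One street per feature is expanded to one street per point.
street_per_point = [broadcast_down(s, lon) for s, lon in zip(streets, longitudes)]
print(street_per_point)
```

After this step the street names have the same nesting as the longitudes, which is what allows them to sit side by side as columns.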
The final expression is:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... "street": bikeroutes.features.properties.STREET,
... }))
longitude latitude street
entry subentry subsubentry
0 0 0 -87.788573 41.923652 W FULLERTON AVE
1 -87.788646 41.923651 W FULLERTON AVE
2 -87.788845 41.923650 W FULLERTON AVE
3 -87.788951 41.923649 W FULLERTON AVE
4 -87.789092 41.923648 W FULLERTON AVE
... ... ... ...
1060 0 6 -87.714026 41.950199 N ELSTON AVE
7 -87.714335 41.950388 N ELSTON AVE
8 -87.714486 41.950493 N ELSTON AVE
9 -87.714819 41.950724 N ELSTON AVE
10 -87.715284 41.951042 N ELSTON AVE
[48362 rows x 3 columns]
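Since you only want to relate markets to longitude and latitude points, you can ignore the MultiIndex, or you can use Pandas functions to turn the components of that MultiIndex into columns.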
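Both options can be sketched with DataFrame.reset_index. The frame below is a hand-built stand-in with the same index level names and a few of the rows printed above, so that the example runs without Awkward Array; flat and with_levels are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for the ak.to_pandas result shown above.
index = pd.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 0, 1), (1060, 0, 10)],
    names=["entry", "subentry", "subsubentry"],
)
df = pd.DataFrame(
    {
        "longitude": [-87.788573, -87.788646, -87.715284],
        "latitude": [41.923652, 41.923651, 41.951042],
        "street": ["W FULLERTON AVE", "W FULLERTON AVE", "N ELSTON AVE"],
    },
    index=index,
)

# Option 1: ignore the MultiIndex and fall back to a plain 0..n-1 index.
flat = df.reset_index(drop=True)

# Option 2: turn the MultiIndex levels into ordinary columns.
with_levels = df.reset_index()

print(flat)
print(with_levels.columns.tolist())
```

Option 2 keeps "entry", which identifies the feature each point came from and can be handy for grouping later.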