Transforming JSON data into a pandas DataFrame

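I'm using the Python package censusgeocode to geocode street addresses and get the corresponding geographic IDs, which I can then use to merge in other census data.

I have a CSV file containing all of my street addresses, and this code works fine: it loads the packages, brings in the data, and loops through each address using the geocode function: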

# For geocoding:
import censusgeocode as cg

# For data handling:
import pandas as pd

addresses = pd.read_csv('addresslist.csv')
geo_set = []
# Just test it on the first two addresses (iloc[0:2] excludes row 2)
for index, row in addresses.iloc[0:2].iterrows():
    try:
        nextline = cg.address(str(row['residential_address']), city=str(row['mailing_city']), state=str(row['mailing_state']), zipcode=str(row['mailing_zip_code']))
        geo_set.append(nextline)
    except Exception:
        # Skip addresses that fail to geocode
        pass

That's the context; everything above works fine. What I'm struggling with is turning the resulting output into a pandas DataFrame. Here is my code:

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[]})
for p in geo_set:
    for i in p['addressComponents']:
        new_result = pd.DataFrame({
            "fromAddress":[i['fromAddress']],
            "streetName":[i['streetName']],
            "suffixType":[i['suffixType']],
            "state":[i['state']],
            "city":[i['city']],
            "zip":[i['zip']]
        })
        emptydata = emptydata.append(new_result)

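I've tried changing a million different things and keep getting error messages. Can anyone suggest where my code is going wrong? I'm fairly sure it has to do with how I'm trying to work through the nested structure. The error I get is: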

TypeError: list indices must be integers or slices, not str

Here is the data I want to turn into a DataFrame:

[[{'addressComponents': {'city': 'BOULDER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80211'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '080300028024003',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5040',
      'NAME': 'Block 4113',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'status': 'Layer query encountered an error: java.lang.RuntimeException: Failed to return'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198131',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+38.9976179',
      'CENTLON': '-105.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+38.9938482',
      'INTPTLON': '-105.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E BAYAUD AVE, DENVER, CO, 80209',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}],
 [{'addressComponents': {'city': 'DENVER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80209'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '033',
      'FUNCSTAT': 'S',
      'GEOID': '080330028024113',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5041',
      'NAME': 'Block 4233',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'AREALAND': 886991,
      'AREAWATER': 0,
      'BASENAME': '32.02',
      'CENTLAT': '+43.7177365',
      'CENTLON': '-135.9841763',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '08033002802',
      'INTPTLAT': '+43.7177365',
      'INTPTLON': '-135.9841763',
      'LSADC': 'CT',
      'MTFCC': 'G5020',
      'NAME': 'Census Tract 41.02',
      'OBJECTID': 65498,
      'OID': 20790703831619,
      'STATE': '08',
      'TRACT': '002802'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198133',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+43.9976179',
      'CENTLON': '-135.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+43.9938482',
      'INTPTLON': '-135.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E REVEREND AVE, BOULDER, CO, 88090',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}]]

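ADDITION TO ORIGINAL POST

I'm trying to pull a few more variables from a different part of the JSON. They are all in the '2010 Census Blocks' part of the tree. By running this code (adapted from what you shared with me):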

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            print(g)

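I can print all the extra parts of the tree that I want. But when I try to integrate that into the part that extracts the variables and appends them to my DataFrame, I get the same TypeError message as before.

Here is my code: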

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            new_result = pd.DataFrame({
                "fromAddress":[d['fromAddress']],
                "streetName":[d['streetName']],
                "suffixType":[d['suffixType']],
                "state":[d['state']],
                "city":[d['city']],
                "zip":[d['zip']],
                "BASENAME":[g['BASENAME']],
                "CENTLAT":[g['CENTLAT']], 
                "COUNTY":[g['COUNTY']], 
                "GEOID":[g['GEOID']], 
                "NAME":[g['NAME']], 
                "BLKGRP":[g['BLKGRP']], 
                "BLOCK":[g['BLOCK']] 
            })
            emptydata = emptydata.append(new_result)

You can simply do:

# geo_set is a list of lists of result dicts, so flatten it first
components = [r['addressComponents'] for p in geo_set for r in p]
emptydata = pd.DataFrame([{
        "fromAddress": i['fromAddress'],
        "streetName": i['streetName'],
        "suffixType": i['suffixType'],
        "state": i['state'],
        "city": i['city'],
        "zip": i['zip']
    } for i in components])

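The problem here is the depth of the nesting: your nested for loop doesn't reach the innermost level. Your output is a list that contains lists of nested dictionaries. When you iterate over geo_set only one level deep, p['addressComponents'] fails because p is a list of nested dictionaries, not the dictionary you expected. You need to iterate over p once more to reach the dictionary i that holds the key 'addressComponents', which contains all the items you want to retrieve: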

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        add_comp = i['addressComponents']
        census_block = i['geographies']['2010 Census Blocks'][0]
        new_result = pd.DataFrame({
            "fromAddress":[add_comp['fromAddress']],
            "streetName":[add_comp['streetName']],
            "suffixType":[add_comp['suffixType']],
            "state":[add_comp['state']],
            "city":[add_comp['city']],
            "zip":[add_comp['zip']],
            "BASENAME": [census_block['BASENAME']],
            "CENTLAT": [census_block['CENTLAT']],
            "COUNTY": [census_block['COUNTY']],
            "GEOID": [census_block['GEOID']],
            "NAME": [census_block['NAME']],
            "BLKGRP": [census_block['BLKGRP']],
            "BLOCK": [census_block['BLOCK']]
        })
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        emptydata = pd.concat([emptydata, new_result])

Output of emptydata:

  BASENAME BLKGRP BLOCK      CENTLAT COUNTY            GEOID        NAME  \
0     4003      4  4003  +43.7156677    031  080300028024003  Block 4113   
0     4003      4  4003  +43.7156677    033  080330028024113  Block 4233   

      city fromAddress state streetName suffixType    zip  
0  BOULDER           1    CO   REVEREND        AVE  80211  
0   DENVER           1    CO   REVEREND        AVE  80209
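
The loop above builds one small DataFrame per result. A lighter pattern is to collect plain dicts first and build the DataFrame once at the end. The sketch below covers only the collection step, using the standard library; extract_rows, ADDRESS_KEYS and BLOCK_KEYS are hypothetical names, and the sample is a trimmed stand-in for geo_set. Passing the resulting list to pd.DataFrame(rows) should then give one row per result.

```python
ADDRESS_KEYS = ["fromAddress", "streetName", "suffixType", "state", "city", "zip"]
BLOCK_KEYS = ["BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]


def extract_rows(geo_set):
    """Flatten list -> list -> dict results into one flat dict per result."""
    rows = []
    for results in geo_set:        # outer list: one entry per address
        for result in results:     # inner list: matches for that address
            address = result["addressComponents"]
            # Assumption: the first census block record is the one wanted
            block = result["geographies"]["2010 Census Blocks"][0]
            row = {key: address[key] for key in ADDRESS_KEYS}
            row.update({key: block[key] for key in BLOCK_KEYS})
            rows.append(row)
    return rows


# Trimmed stand-in with the same shape as the data in the question
sample = [[{
    "addressComponents": {
        "fromAddress": "1", "streetName": "REVEREND", "suffixType": "AVE",
        "state": "CO", "city": "BOULDER", "zip": "80211",
    },
    "geographies": {"2010 Census Blocks": [{
        "BASENAME": "4003", "CENTLAT": "+43.7156677", "COUNTY": "031",
        "GEOID": "080300028024003", "NAME": "Block 4113",
        "BLKGRP": "4", "BLOCK": "4003",
    }]},
}]]

rows = extract_rows(sample)
print(rows[0]["city"], rows[0]["GEOID"])
```

Because every row is an ordinary dict, it is easy to print and inspect before any DataFrame is involved.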

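For reference, these are simple to debug: the TypeError: list indices must be integers or slices, not str you received is an excellent hint that an indexing operation went wrong. Indexing and slicing use the [] syntax, so what else uses the same syntax? Dictionary key lookups, i.e. p['addressComponents']. If you had tried: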

for p in geo_set:
    print(p['addressComponents'])

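you would have received the same error. At that point you have narrowed down the source of the error and can solve the problem by stepping through the data one level at a time.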
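
One way to step through the data is a small helper that reports the container type at each nesting level. This is only a minimal sketch: describe_nesting is a hypothetical helper, not part of pandas or censusgeocode, and the sample is a tiny stand-in with the same shape as geo_set.

```python
def describe_nesting(obj, max_depth=4):
    """Return lines describing the container type at each nesting level.

    Only the first element of each list is followed, which is enough to
    see the overall shape. The walk stops at the first dict or scalar.
    """
    lines = []
    depth = 0
    while depth < max_depth:
        if isinstance(obj, list):
            lines.append(f"level {depth}: list of {len(obj)} item(s)")
            if not obj:
                break
            obj = obj[0]
        elif isinstance(obj, dict):
            lines.append(f"level {depth}: dict with keys {sorted(obj)}")
            break
        else:
            lines.append(f"level {depth}: {type(obj).__name__}")
            break
        depth += 1
    return lines


# Tiny stand-in for geo_set: list -> list -> dict
sample = [[{"addressComponents": {"city": "BOULDER"}, "geographies": {}}]]
for line in describe_nesting(sample):
    print(line)
```

Two list levels before the first dict means two for loops are needed before any string key can be used.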


Alternative:

If you don't want your code to be so heavy, you can use a dictionary-driven approach:

df_dict = {}
df_cols = ["fromAddress", "streetName", "suffixType", "state", "city", "zip", "BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]
for p in geo_set:
    for i in p:
        for key, item in i['addressComponents'].items():
            if key in df_cols:
                df_dict.setdefault(key,[]).append(item)
        for d in i['geographies']['2010 Census Blocks']:
            for key, item in d.items():
                if key in df_cols:
                    df_dict.setdefault(key,[]).append(item)
emptydata = pd.DataFrame.from_dict(df_dict)

The output is the same, and you don't end up creating as many temporary DataFrame objects. The caveat is that the setup of the DataFrame is now less readable.
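
A further caveat of building per-column lists is that they only line up if every result has every key. In the data above, the 'Census Tracts' layer of the first record came back with only a 'status' entry, and a key missing in the same way from a block record would leave the lists with different lengths. The sketch below shows one way to keep rows aligned. It assumes that a 'status' key marks a failed layer and that missing values should become None; first_layer_record and block_columns are hypothetical helpers.

```python
def first_layer_record(result, layer):
    """Return the first record of a geography layer, or {} if unusable."""
    records = result.get("geographies", {}).get(layer, [])
    # Assumption: a record carrying a 'status' key signals a failed query
    if not records or "status" in records[0]:
        return {}
    return records[0]


def block_columns(result, keys):
    """Build one value per requested key, using None where data is absent."""
    record = first_layer_record(result, "2010 Census Blocks")
    # dict.get returns None for absent keys, so every row has every column
    return {key: record.get(key) for key in keys}


ok = {"geographies": {"2010 Census Blocks": [
    {"GEOID": "080330028024113", "BLOCK": "4003"}]}}
failed = {"geographies": {"2010 Census Blocks": [
    {"status": "Layer query encountered an error"}]}}

print(block_columns(ok, ["GEOID", "BLKGRP"]))
print(block_columns(failed, ["GEOID", "BLKGRP"]))
```

Rows built this way always have the same keys, so they can be collected in a list and turned into a DataFrame in one step.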

Again, keep track of the lists and dictionaries in your data and iterate accordingly.