Extracting key-value counts from JSON data in a pandas column

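I have been trying to extract key/value counts from JSON data stored in a pandas column, without success. The data can be reproduced with the following DataFrame: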

    data = [['ID_1', '{\'RestaurantsTakeOut\': \'True\', \'BusinessParking\': "{\'garage\': False, \'street\': False, \'validated\': False, \'lot\': False, \'valet\': False}", \'WiFi\': "u\'no\'", \'RestaurantsDelivery\': \'False\', \'OutdoorSeating\': \'False\', \'RestaurantsAttire\': "u\'casual\'", \'BusinessAcceptsCreditCards\': \'True\', \'RestaurantsGoodForGroups\': \'True\', \'RestaurantsReservations\': \'False\', \'HasTV\': \'False\', \'Ambience\': "{\'romantic\': False, \'intimate\': False, \'touristy\': False, \'hipster\': False, \'divey\': False, \'classy\': False, \'trendy\': False, \'upscale\': False, \'casual\': False}", \'Alcohol\': "u\'none\'", \'RestaurantsPriceRange2\': \'1\', \'GoodForKids\': \'True\'}'], 
        ['ID_2','{\'RestaurantsTakeOut\': \'True\', \'HasTV\': \'True\', \'NoiseLevel\': "u\'average\'", \'Alcohol\': "u\'full_bar\'", \'BusinessAcceptsCreditCards\': \'True\', \'RestaurantsAttire\': "u\'casual\'", \'Caters\': \'False\', \'RestaurantsDelivery\': \'False\', \'RestaurantsTakeOut\': \'True\', \'Ambience\': "{\'romantic\': False, \'intimate\': True, \'classy\': False, \'hipster\': False, \'divey\': False, \'touristy\': False, \'trendy\': False, \'upscale\': False, \'casual\': False}", \'RestaurantsGoodForGroups\': \'True\', \'BusinessParking\': "{\'garage\': False, \'street\': True, \'validated\': False, \'lot\': False, \'valet\': False}", \'GoodForKids\': \'False\', \'RestaurantsPriceRange2\': \'2\', \'WiFi\': "u\'free\'", \'BikeParking\': \'True\', \'RestaurantsReservations\': \'True\'}' ], 
        ['ID_3','{\'RestaurantsTakeOut\': \'False\', \'GoodForKids\': \'True\', \'NoiseLevel\': "u\'average\'", \'RestaurantsPriceRange2\': \'2\', \'BusinessAcceptsCreditCards\': \'True\', \'HasTV\': \'False\', \'OutdoorSeating\': \'False\', \'RestaurantsTakeOut\': \'True\', \'RestaurantsTableService\': \'True\', \'RestaurantsDelivery\': \'False\', \'BusinessParking\': "{\'garage\': False, \'street\': False, \'validated\': False, \'lot\': True, \'valet\': False}", \'RestaurantsReservations\': \'True\', \'BikeParking\': \'True\', \'GoodForMeal\': "{\'dessert\': False, \'latenight\': False, \'lunch\': True, \'dinner\': True, \'brunch\': False, \'breakfast\': False}", \'Ambience\': "{\'romantic\': False, \'intimate\': False, \'touristy\': False, \'hipster\': False, \'divey\': False, \'classy\': False, \'trendy\': False, \'upscale\': False, \'casual\': True}", \'WiFi\': "u\'no\'", \'Alcohol\': "\'beer_and_wine\'", \'RestaurantsGoodForGroups\': \'True\', \'RestaurantsAttire\': "\'casual\'"}']] 

    import pandas as pd

    df = pd.DataFrame(data, columns=['business_id', 'attributes'])

I have been trying to extract the keys, values, and counts and put the results into a format similar to the following:

Key1 Value1 Count
Key1 Value2 Count
Key2 Value1 Count
Key2 Value2 Count 
Key3 Value1 Count  

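After that, I would like to select some of the keys and add them as new columns in the DataFrame, with each row's value for that key filled into the column.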

    business_id attributes                     RestaurantsTakeOut
0   ID_1        same as in original dataframe  True 
1   ID_2        same as in original dataframe  True 
2   ID_3        same as in original dataframe  False 

Any ideas on how to get these results would be much appreciated.

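IIUC, you just need to unnest the JSON with the help of the ast module and pandas' json_normalize:

    from ast import literal_eval

    # json_normalize is a top-level pandas function since pandas 1.0;
    # the older pandas.io.json import path is deprecated.
    from pandas import json_normalize

    def unnest_json(dataframe, column):
        # Parse each dict-like string into a real dict, then flatten it into columns.
        dataframe_new = json_normalize(dataframe[column].apply(literal_eval))
        return dataframe_new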



    df1 = unnest_json(df, 'attributes')

    # going a level further
    print(unnest_json(df1, 'BusinessParking'))


       garage  street  validated    lot  valet
    0   False   False      False  False  False
    1   False    True      False  False  False
    2   False   False      False   True  False
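For the second part of the question (promoting selected keys to new columns), one possible approach is to join the chosen columns of the unnested frame back onto the original. This is only a sketch: the shortened attribute strings and the name selected_keys are stand-ins made up for illustration, and it assumes both frames keep the same default index.

```python
from ast import literal_eval

import pandas as pd

# Small stand-in for the question's frame (shortened attribute strings).
df = pd.DataFrame(
    [
        ["ID_1", "{'RestaurantsTakeOut': 'True', 'HasTV': 'False'}"],
        ["ID_2", "{'RestaurantsTakeOut': 'True', 'HasTV': 'True'}"],
        ["ID_3", "{'RestaurantsTakeOut': 'False'}"],
    ],
    columns=["business_id", "attributes"],
)

# Parse each string into a dict and flatten the dicts into columns.
df1 = pd.json_normalize(df["attributes"].apply(literal_eval).tolist())

# Hypothetical selection of keys to promote to columns.
selected_keys = ["RestaurantsTakeOut"]

# df and df1 share the same default 0..n-1 index, so an index join
# lines the new columns up row by row.
result = df.join(df1[selected_keys])
print(result)
```

Note that the values stay as the strings 'True' and 'False', exactly as they appear in the source data; converting them to booleans would be a separate step.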

Note that some of your JSON fields will be NaN; you can use fillna('{}') to remap them to empty JSON objects.
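A small illustration of why the fillna step matters, using a made-up two-row column rather than the real data: literal_eval raises a ValueError when it is handed a NaN, while the string '{}' parses to an empty dict that json_normalize turns into a row of missing values.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Stand-in column: the second row has no nested dict, so it is NaN.
col = pd.Series(["{'lunch': True, 'dinner': False}", np.nan])

try:
    col.apply(literal_eval)
except ValueError as exc:
    # literal_eval only accepts strings or AST nodes, not a float NaN.
    print("literal_eval failed on NaN:", exc)

# Replace NaN with the string '{}' so every row parses to a dict.
parsed = col.fillna("{}").apply(literal_eval)

# The empty dict becomes a row whose columns are all missing.
flat = pd.json_normalize(parsed.tolist())
print(flat)
```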

With a simple loop, you can build a dictionary of DataFrames keyed by field name:

    json_fields = ['BusinessParking', 'Ambience', 'GoodForMeal']
    dfs = {}
    for field in json_fields:
        try:
            dataframe = unnest_json(df1, field)
        except ValueError:
            # literal_eval raises ValueError on NaN, so retry with NaN
            # replaced by the string '{}' (an empty dict).
            dataframe = unnest_json(df1.fillna('{}'), field)

        dfs[field] = dataframe

    print(dfs['Ambience'])

       romantic  intimate  touristy  hipster  divey  classy  trendy  upscale  \
    0     False     False     False    False  False   False   False    False   
    1     False      True     False    False  False   False   False    False   
    2     False     False     False    False  False   False   False    False   

       casual  
    0   False  
    1   False  
    2    True
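To get from an unnested frame to the Key / Value / Count layout asked for in the question, one option is to melt the frame into long form and count each (key, value) pair. The snippet below is a sketch on a shortened stand-in for the data, and the column names key, value, and count are my own choice. On the real data, nested fields such as BusinessParking are still strings at the top level, so they would be counted as whole strings unless they are unnested first.

```python
from ast import literal_eval

import pandas as pd

# Small stand-in for the question's frame (shortened attribute strings).
df = pd.DataFrame(
    [
        ["ID_1", "{'RestaurantsTakeOut': 'True', 'HasTV': 'False'}"],
        ["ID_2", "{'RestaurantsTakeOut': 'True', 'HasTV': 'True'}"],
        ["ID_3", "{'RestaurantsTakeOut': 'False', 'HasTV': 'False'}"],
    ],
    columns=["business_id", "attributes"],
)

df1 = pd.json_normalize(df["attributes"].apply(literal_eval).tolist())

# Wide -> long: one row per (row, key) pair; drop keys a row does not have.
long_form = df1.melt(var_name="key", value_name="value").dropna(subset=["value"])

# Count how often each (key, value) pair occurs.
counts = (
    long_form.groupby(["key", "value"])
    .size()
    .reset_index(name="count")
)
print(counts)
```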