How do I apply a custom window function to a JSON object with pandas?

I have a fairly nested JSON object (below), and I'm trying to find the user (i.e. 'profileId') with the most events (i.e. the length of the 'parameters' key).

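I have the code below to get the length of the parameters, but I can't get the calculation to be correct for each record: the way I have it set up now assigns the same value to every record. I looked at the pandas window functions (https://pandas.pydata.org/docs/user_guide/window.html) but couldn't get the right result.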

    response = response.json()
    df = pd.json_normalize(response['items'])
    df['calcfield'] = len(df["events"].iloc[0][0].get('parameters'))
    

The output of df['arrayfield'] looks like this:

[
        {
           "type":"auth",
           "name":"activity",
           "parameters":[
              {
                 "name":"api_name",
                 "value":"admin"
              },
              {
                 "name":"method_name",
                 "value":"directory.users.list"
              },
              {
                 "name":"client_id",
                 "value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
              },
              {
                 "name":"num_response_bytes",
                 "intValue":"7158"
              },
              {
                 "name":"product_bucket",
                 "value":"GSUITE_ADMIN"
              },
              {
                 "name":"app_name",
                 "value":"Untitled project"
              },
              {
                 "name":"client_type",
                 "value":"WEB"
              }
           ]
        }
     ]
  },
  {
     "kind":"admin#reports#activity",
     "id":{
        "time":"2022-05-05T23:58:48.914Z",
        "uniqueQualifier":"-4002873813067783265",
        "applicationName":"token",
        "customerId":"C02f6wppb"
     },
     "etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
     "actor":{
        "email":"nancy.admin@hyenacapital.net",
        "profileId":"100230688039070881323"
     },
     "ipAddress":"54.80.168.30",
     "events":[
        {
           "type":"auth",
           "name":"activity",
           "parameters":[
              {
                 "name":"api_name",
                 "value":"gmail"
              },
              {
                 "name":"method_name",
                 "value":"gmail.users.messages.list"
              },
              {
                 "name":"client_id",
                 "value":"927538837578.apps.googleusercontent.com"
              },
              {
                 "name":"num_response_bytes",
                 "intValue":"2"
              },
              {
                 "name":"product_bucket",
                 "value":"GMAIL"
              },
              {
                 "name":"app_name",
                 "value":"Zapier"
              },
              {
                 "name":"client_type",
                 "value":"WEB"
              }
           ]

The original JSON blob I read in:

{
   "kind":"admin#reports#activities",
   "etag":"\"5g8\"",
   "nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
   "items":[
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:59:39.421Z",
            "uniqueQualifier":"5526793068617678141",
            "applicationName":"token",
            "customerId":"cds"
         },
         "etag":"\"jkYcURYoi8\"",
         "actor":{
            "email":"blah@blah.net",
            "profileId":"1323"
         },
         "ipAddress":"107.178.193.87",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"admin"
                  },
                  {
                     "name":"method_name",
                     "value":"directory.users.list"
                  },
                  {
                     "name":"client_id",
                     "value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"7158"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GSUITE_ADMIN"
                  },
                  {
                     "name":"app_name",
                     "value":"Untitled project"
                  },
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      },
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:58:48.914Z",
            "uniqueQualifier":"-4002873813067783265",
            "applicationName":"token",
            "customerId":"df"
         },
         "etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
         "actor":{
            "email":"blah.blah@bebe.net",
            "profileId":"1324"
         },
         "ipAddress":"54.80.168.30",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"gmail"
                  },
                  {
                     "name":"method_name",
                     "value":"gmail.users.messages.list"
                  },
                  {
                     "name":"client_id",
                     "value":"927538837578.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"2"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GMAIL"
                  },
                  
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      }
   ]
}

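Your requirements aren't entirely clear, and df['arrayfield'] doesn't appear in the example you provided. However, if you look at the events column after json_normalize, you can use the following lines to extract the length of each record's parameters list. This assumes the blob you gave as an example is assigned to response: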

df = pd.json_normalize(response['items'])
df['calcfield'] = df['events'].str[0].str.get('parameters').str.len()
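
To illustrate why the original attempt gave every record the same value, here is a minimal, self-contained sketch. The `items` list is a hypothetical, trimmed-down stand-in for `response['items']`, and the `broadcast` column name is made up for the comparison. It uses `apply` rather than the `.str` accessor and, as an assumption beyond the original code, sums over every event in a record instead of looking only at the first one.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for response['items']
items = [
    {
        "actor": {"profileId": "1323"},
        "events": [
            {
                "name": "activity",
                "parameters": [
                    {"name": "api_name"},
                    {"name": "method_name"},
                    {"name": "client_id"},
                ],
            }
        ],
    },
    {
        "actor": {"profileId": "1324"},
        "events": [{"name": "activity", "parameters": [{"name": "api_name"}]}],
    },
]

df = pd.json_normalize(items)

# A single scalar computed from row 0 only; pandas broadcasts it to every row
df["broadcast"] = len(df["events"].iloc[0][0].get("parameters"))

# One value per row: count the parameters of every event in that row's list
df["calcfield"] = df["events"].apply(
    lambda events: sum(len(e.get("parameters", [])) for e in events)
)

print(df[["actor.profileId", "broadcast", "calcfield"]])
```

The `broadcast` column holds the same number in every row, whereas `calcfield` varies with each record's own parameters.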

Since every parameters list has 7 elements, it's hard to tell whether this is what you actually want.

Alternatively, use:

df.groupby('actor.profileId')['events'].apply(lambda x: [len(events[0]['parameters']) for events in x])

which returns a list of parameter counts for each profileId. Output for the sample data:

actor.profileId
1323    [7]
1324    [7]
Name: events, dtype: object
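
If the goal is the single user with the most parameters overall, one possible follow-up is to sum the per-record counts for each profileId and take the largest. This is only a sketch: the `records` frame and its values are hypothetical and merely mimic a frame that already has `actor.profileId` and a per-record `calcfield` count.

```python
import pandas as pd

# Hypothetical per-record counts; the values are invented for illustration
records = pd.DataFrame(
    {
        "actor.profileId": ["1323", "1324", "1323"],
        "calcfield": [7, 6, 2],
    }
)

# Total number of parameters per user
totals = records.groupby("actor.profileId")["calcfield"].sum()

# Index label (profileId) of the largest total, and the total itself
top_user = totals.idxmax()
top_count = totals.max()

print(top_user, top_count)
```

Note that `idxmax` returns only the first label when several users tie for the maximum.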