Pandas: group records by a column value and timestamp and apply a function on each record

I have a JSON file of Ethereum transactions, structured as follows:

  ...,
  {
    "blockNumber": "14492022",
    "timeStamp": "1648703953",
    "hash": "0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e",
    "nonce": "89",
    "blockHash": "0xa804b6c72753657275c58b00bf17cd2ca7e2be3cbf4a6615f4cc7175c3c76aea",
    "transactionIndex": "162",
    "from": "0x4ca43dc185ff11844e448604cd11409a92a3794b",
    "to": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "value": "0",
    "gas": "130000",
    "gasPrice": "40449366242",
    "isError": "0",
    "txreceipt_status": "1",
    "input": "0x23b872dd0000000000000000000000004ca43dc185ff11844e448604cd11409a92a3794b00000000000000000000000034380456f50e013f1b8b2b9b5dc9d55fb0ca9c2b00000000000000000000000000000042ffffffffffffffffffffffffffffff76",
    "contractAddress": "",
    "cumulativeGasUsed": "9370793",
    "gasUsed": "106123",
    "confirmations": "180974",
  },
  ...

A pandas DataFrame should be used to do the following.

First, timeStamp should be converted into a proper datetime format. Then I need to group the data by the from value and by 1-week intervals. Then, for each element of a group, I want to add a new key (the name does not matter) whose value identifies the interval the record falls into:

So, something like the following: 0xaaaaa_2022-03-31T05:19:13_0 (the from value joined with the first timestamp of the group).

I need this new key to identify the elements that fall in the same time interval, so that I can run additional analysis with an external tool.
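
For reference, once such a key exists, handing the records to the external tool could be a simple export (a minimal sketch; the tag column and file name below are assumptions, not part of the original question):

# Export the tagged records for the external tool.
df.to_csv('tagged_transactions.csv', index=False)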

So far, I have tried the following:

import pandas as pd

df = pd.read_json(file_path)
df['timeStamp'] = pd.to_datetime(df['timeStamp'], unit='s')

df_grouped = df.groupby(['from', pd.Grouper(key="timeStamp", freq="1W")])

# from here I don't know how to apply the changes described above

Here is a possible answer:

Data:

Here is some sample data. Unnecessary columns have been removed for the purpose of this answer.

     timeStamp     from
0   1648703953  0xaaaaa
1   1648779553  0xaaaaa
2   1648855153  0xaaaaa
3   1648930753  0xaaaaa
4   1649006353  0xaaaaa
5   1649081953  0xaaaaa
6   1649157553  0xaaaaa
7   1649233153  0xaaaaa
8   1649308753  0xaaaaa
9   1649384353  0xaaaaa
10  1649459953  0xaaaaa
11  1649535553  0xaaaaa
12  1649611153  0xaaaaa
13  1649686753  0xaaaaa
14  1649762353  0xaaaaa
15  1648703953  0xFFFFF
16  1648779553  0xFFFFF
17  1648855153  0xFFFFF
18  1648930753  0xFFFFF
19  1649006353  0xFFFFF
20  1649081953  0xFFFFF
21  1649157553  0xFFFFF
22  1649233153  0xFFFFF
23  1649308753  0xFFFFF
24  1649384353  0xFFFFF

Code:

import pandas as pd

# Convert the Unix timestamp to a readable datetime.
df['timeStamp'] = pd.to_datetime(df['timeStamp'].astype(int), unit='s')

# Group by from and calendar week (weeks anchored to Monday).
df['week_id'] = df.groupby(['from', pd.Grouper(key='timeStamp', freq='1W-MON')]).ngroup()
# ngroup() numbers groups globally, so subtract each address's minimum
# to make week_id restart at 0 for every from value.
df['week_id'] -= df.groupby('from')['week_id'].transform('min')

# Tag: from address + first timestamp in the group + per-address week number.
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)

print(df.drop(columns=['week_id', 'first_timeStamp']))

Result:

             timeStamp     from                            tag
0  2022-03-31 05:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
1  2022-04-01 02:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
2  2022-04-01 23:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
3  2022-04-02 20:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
4  2022-04-03 17:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
5  2022-04-04 14:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
6  2022-04-05 11:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
7  2022-04-06 08:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
8  2022-04-07 05:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
9  2022-04-08 02:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
10 2022-04-08 23:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
11 2022-04-09 20:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
12 2022-04-10 17:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
13 2022-04-11 14:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
14 2022-04-12 11:19:13  0xaaaaa  0xaaaaa_2022-04-12T11:19:13_2
15 2022-03-31 05:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
16 2022-04-01 02:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
17 2022-04-01 23:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
18 2022-04-02 20:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
19 2022-04-03 17:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
20 2022-04-04 14:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
21 2022-04-05 11:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
22 2022-04-06 08:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
23 2022-04-07 05:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
24 2022-04-08 02:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
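
As a quick sanity check (not shown in the original answer), the number of rows per tag can be counted:

# Count how many transactions fall into each tag.
print(df.groupby('tag').size())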

Note that this groups by calendar week (anchored to Monday by freq='1W-MON'), so the first week does not necessarily start on the first day of the data.
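
If weeks should instead be counted from each address's first transaction rather than from calendar boundaries, a possible variation (a sketch, not part of the answer above) is to build the bins by hand:

# 7-day bins measured from each from address's first transaction.
first_seen = df.groupby('from')['timeStamp'].transform('min')
df['week_id'] = (df['timeStamp'] - first_seen).dt.days // 7

The tag column can then be rebuilt from week_id exactly as above.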