Pandas: group records by a column value and timestamp and apply a function on each record
I have a JSON file of Ethereum transactions with the following structure:
...,
{
"blockNumber": "14492022",
"timeStamp": "1648703953",
"hash": "0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e",
"nonce": "89",
"blockHash": "0xa804b6c72753657275c58b00bf17cd2ca7e2be3cbf4a6615f4cc7175c3c76aea",
"transactionIndex": "162",
"from": "0x4ca43dc185ff11844e448604cd11409a92a3794b",
"to": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"value": "0",
"gas": "130000",
"gasPrice": "40449366242",
"isError": "0",
"txreceipt_status": "1",
"input": "0x23b872dd0000000000000000000000004ca43dc185ff11844e448604cd11409a92a3794b00000000000000000000000034380456f50e013f1b8b2b9b5dc9d55fb0ca9c2b00000000000000000000000000000042ffffffffffffffffffffffffffffff76",
"contractAddress": "",
"cumulativeGasUsed": "9370793",
"gasUsed": "106123",
"confirmations": "180974",
},
...
A Pandas DataFrame should be used to do the following.
First, timeStamp should be converted to a proper datetime format. Then I need to group the data by the from value and a 1-week interval. Then, for each element of a group, I want to add a new key (the name doesn't matter) whose value is:
f"{from}_{timestamp_of_first_record_of_the_group}_{time_interval}"
So, something like the following:
0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e_2022-03-31T05:19:13_1W
I need this new key to identify elements that occur within the same time interval, so that I can perform additional analysis with an external tool.
This is what I have tried so far:
import pandas as pd

df = pd.read_json(file_path)
df['timeStamp'] = pd.to_datetime(df['timeStamp'], unit='s')
df_grouped = df.groupby(['from', pd.Grouper(key="timeStamp", freq="1W")])
# from here I don't know how to apply the changes described above
Here is a possible answer:
Data:
Here is some sample data. I dropped the columns that are not needed for this answer.
timeStamp from
0 1648703953 0xaaaaa
1 1648779553 0xaaaaa
2 1648855153 0xaaaaa
3 1648930753 0xaaaaa
4 1649006353 0xaaaaa
5 1649081953 0xaaaaa
6 1649157553 0xaaaaa
7 1649233153 0xaaaaa
8 1649308753 0xaaaaa
9 1649384353 0xaaaaa
10 1649459953 0xaaaaa
11 1649535553 0xaaaaa
12 1649611153 0xaaaaa
13 1649686753 0xaaaaa
14 1649762353 0xaaaaa
15 1648703953 0xFFFFF
16 1648779553 0xFFFFF
17 1648855153 0xFFFFF
18 1648930753 0xFFFFF
19 1649006353 0xFFFFF
20 1649081953 0xFFFFF
21 1649157553 0xFFFFF
22 1649233153 0xFFFFF
23 1649308753 0xFFFFF
24 1649384353 0xFFFFF
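For reference, the sample frame above can be rebuilt directly; this is a hypothetical reconstruction based on the 75600-second (21-hour) spacing visible in the table, whereas the real data would come from pd.read_json:
import pandas as pd

# Hypothetical reconstruction of the sample frame above; timestamps step by
# 75600 s (21 h), matching the table. Real data would come from pd.read_json.
df = pd.DataFrame({
    'timeStamp': [1648703953 + 75600 * i for i in range(15)]
               + [1648703953 + 75600 * i for i in range(10)],
    'from': ['0xaaaaa'] * 15 + ['0xFFFFF'] * 10,
})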
Code:
# Convert Unix timestamps (seconds) to datetimes.
df['timeStamp'] = pd.to_datetime(df['timeStamp'].astype(int), unit='s')
# Number each (address, calendar week) pair, then renumber so that each
# address's first week is 0.
df['week_id'] = df.groupby(['from', pd.Grouper(key='timeStamp', freq='1W-MON')]).ngroup()
df['week_id'] -= df.groupby('from')['week_id'].transform('min')
# Tag: address, earliest timestamp in the group, and week id.
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)
print(df.drop(columns=['week_id', 'first_timeStamp']))
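A note on the design: ngroup() assigns one global id across all (from, week) pairs, so subtracting each address's minimum id renumbers that address's weeks from 0. The id counts occupied weeks only, so a week in which an address has no transactions does not leave a gap in the numbering.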
Result:
timeStamp from tag
0 2022-03-31 05:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
1 2022-04-01 02:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
2 2022-04-01 23:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
3 2022-04-02 20:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
4 2022-04-03 17:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
5 2022-04-04 14:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
6 2022-04-05 11:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
7 2022-04-06 08:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
8 2022-04-07 05:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
9 2022-04-08 02:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
10 2022-04-08 23:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
11 2022-04-09 20:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
12 2022-04-10 17:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
13 2022-04-11 14:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
14 2022-04-12 11:19:13 0xaaaaa 0xaaaaa_2022-04-12T11:19:13_2
15 2022-03-31 05:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
16 2022-04-01 02:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
17 2022-04-01 23:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
18 2022-04-02 20:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
19 2022-04-03 17:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
20 2022-04-04 14:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
21 2022-04-05 11:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
22 2022-04-06 08:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
23 2022-04-07 05:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
24 2022-04-08 02:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
Note that this groups by calendar weeks (ending on Monday here), so the first week does not start on the first day of the data.
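If you instead want each address's weeks anchored to its own first transaction, a minimal sketch (not the method above, and assuming df['timeStamp'] has already been converted to datetimes) is to bin on the days elapsed since the address's first record:
# Sketch: 7-day bins starting at each address's first transaction.
first = df.groupby('from')['timeStamp'].transform('min')
df['week_id'] = (df['timeStamp'] - first).dt.days // 7
# first_timeStamp and tag can then be built exactly as above.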