Resampling, grouping, and pivoting a pandas DataFrame
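I have a log file containing a timestamp and two other columns. I would now like to resample and "pivot" the DataFrame created from that log file.
Example raw DataFrame/log file: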
timestamp            colA  colB
2015-01-01 00:10:01  a     x
2014-01-01 00:10:01  b     y
2015-01-01 00:10:03  a     x
2015-01-01 00:10:03  a     x
2015-01-01 00:10:03  a     y
2015-01-01 00:10:04  b     x
2014-01-01 00:10:04  b     y
2014-01-01 00:10:04  b     y
2014-01-01 00:10:04  a     x
2014-01-01 00:10:05  a     x
2014-01-01 00:10:05  a     x
2014-01-01 00:10:07  a     y
2014-01-01 00:10:08  a     x
Example result when resampling by second:
                     a     b
timestamp            x  y  x  y
2015-01-01 00:10:01  1  0  0  1
2015-01-01 00:10:02  0  0  0  0
2015-01-01 00:10:03  2  1  0  0
2015-01-01 00:10:04  1  0  1  2
2014-01-01 00:10:05  2  0  0  0
2014-01-01 00:10:06  0  0  0  0
2014-01-01 00:10:07  0  1  0  0
2014-01-01 00:10:08  1  0  0  0
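How would I achieve this? Resample first and then groupby/pivot, or the other way around? More specifically, each cell should contain the count of a colA/colB combination within a given resampling interval. The example uses seconds, but it could just as well be minutes, hours, and so on.
I am not fixed on this format. I could also work with a result that is resampled and grouped by timestamp/colA, such as: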
                           colB
timestamp            colA  x  y
2015-01-01 00:10:01  a     1  0
                     b     0  1
2015-01-01 00:10:02  a     0  0
                     b     0  0
2015-01-01 00:10:03  a     2  1
                     b     0  0
2015-01-01 00:10:04  a     1  0
                     b     1  2
2014-01-01 00:10:05  a     2  0
                     b     0  0
2014-01-01 00:10:06  a     0  0
                     b     0  0
2014-01-01 00:10:07  a     0  1
                     b     0  0
2014-01-01 00:10:08  a     1  0
                     b     0  0
The end goal is to plot the different count values.
Thanks.
You can use pd.crosstab:
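import pandas as pd

# Columns in 'data' are assumed to be separated by two or more spaces,
# so the single space inside each timestamp is not treated as a delimiter.
df = pd.read_table('data', sep=r'\s{2,}', engine='python', parse_dates=[0])
# Count each colA/colB combination for every distinct timestamp.
table = pd.crosstab(index=[df['timestamp']], columns=[df['colA'], df['colB']])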
which yields
colA                 a     b
colB                 x  y  x  y
timestamp
2014-01-01 00:10:01  0  0  0  1
2014-01-01 00:10:04  1  0  0  2
2014-01-01 00:10:05  2  0  0  0
2014-01-01 00:10:07  0  1  0  0
2014-01-01 00:10:08  1  0  0  0
2015-01-01 00:10:01  1  0  0  0
2015-01-01 00:10:03  2  1  0  0
2015-01-01 00:10:04  0  0  1  0
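The crosstab only lists timestamps that actually occur in the log, so seconds without any rows (00:10:02, for example) are missing. One possible way to get a regular grid is to resample the crosstab and sum each bin. The following is only a sketch under an assumption: it uses a hypothetical copy of the question's rows with every timestamp set to 2015, because with the mixed 2014/2015 data a one-second grid would span an entire year.

```python
import io

import pandas as pd

# Hypothetical sample: the question's rows with every timestamp set to 2015,
# so the resampled range stays a few seconds long.
raw = """timestamp,colA,colB
2015-01-01 00:10:01,a,x
2015-01-01 00:10:01,b,y
2015-01-01 00:10:03,a,x
2015-01-01 00:10:03,a,x
2015-01-01 00:10:03,a,y
2015-01-01 00:10:04,b,x
2015-01-01 00:10:04,b,y
2015-01-01 00:10:04,b,y
2015-01-01 00:10:04,a,x
2015-01-01 00:10:05,a,x
2015-01-01 00:10:05,a,x
2015-01-01 00:10:07,a,y
2015-01-01 00:10:08,a,x
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["timestamp"])

# Count each colA/colB combination per distinct timestamp.
table = pd.crosstab(index=df["timestamp"], columns=[df["colA"], df["colB"]])

# Resample onto a regular one-second grid; bins without rows sum to 0.
per_second = table.resample("1s").sum()

# Coarser intervals work the same way, e.g. one-minute bins.
per_minute = table.resample("1min").sum()

# Alternative layout from the question: move colA from the columns
# into the row index, leaving colB as the only column level.
stacked = per_second.stack("colA")

print(per_second)
print(stacked)
```

Under that assumption, per_second has one row per second from 00:10:01 to 00:10:08, and stacked has one row per timestamp/colA pair. For the plotting goal, calling per_second.plot() should draw one line per colA/colB combination, provided matplotlib is installed.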