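Pandas resample function issue when resampling from minutes to milliseconds
I am having trouble with the pandas resample function. I have minute-sampled data and I am trying to resample it at a 0.7 second frequency. I tried resample with the '700L' rule, but it does not behave as expected.
Here is an example: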
import pandas as pd
from datetime import datetime
import pytz
import numpy as np
import matplotlib.pyplot as plt

def convert_2_datetime(timestamp, timezoneid):
    """
    :param timestamp: UTC format in milliseconds (data.index.values)
    :param timezoneid: timezone object from CTX (for example pytz.timezone(ctx.inp.assets[0].properties['timezoneid']))
    :return: vector of datetimes
    """
    if isinstance(timestamp, (int, float)):
        utctime = datetime.utcfromtimestamp(timestamp / 1000).replace(tzinfo=pytz.utc)
        output = utctime.astimezone(pytz.timezone(timezoneid.zone))
    else:
        utctime = [datetime.utcfromtimestamp(i / 1000).replace(tzinfo=pytz.utc) for i in timestamp]
        output = [i.astimezone(pytz.timezone(timezoneid.zone)) for i in utctime]
    return output

# minute sampled data
v1 = [0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0]
data = pd.DataFrame({'v1': np.array(v1)}, index=np.arange(start=1, stop=len(v1)+1)*60000)
data['ts'] = convert_2_datetime(timestamp=data.index.values, timezoneid=pytz.timezone('UTC'))
data.set_index('ts', inplace=True)

# `closed` takes 'right' or 'left' (not a set), so it is left at its default here
data07 = data.resample(rule='700L').interpolate(method='linear')
data06 = data.resample(rule='600L').interpolate(method='linear')
data11 = data.resample(rule='1100L').interpolate(method='linear')

data07.v1.plot(style='.', label='700 ms')
data06.v1.plot(style='.', label='600 ms')
data11.v1.plot(style='.', label='1100 ms')  # label now matches the 1100L rule
data.v1.plot(style='x', label='original')
plt.legend()
plt.show()  # show the figure only after everything has been drawn
print('Finish!')
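If I resample with '600L' (data06 in the example), the final result is correct; with '700L' (data07 in the example) it is not. The plot produced by the code above shows the difference.
Am I missing something about how the resample function works?
Thanks a lot, everyone!
Solution
For your case, I think you should apply an aggregation method such as mean after resampling and before interpolating. I believe this only has to do with the output of resample and the way interpolate reads it. For example, the following seems to work: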
data07 = data.resample('700L').mean().interpolate()
data06 = data.resample('600L').mean().interpolate()
data10 = data.resample('1000L').mean().interpolate()
This plot shows that it works:
data07.v1.plot(style='.',label='700 ms', alpha=0.75, ms=3,zorder=2)
data06.v1.plot(style='^',label='600 ms', alpha=0.5, zorder=1)
data10.v1.plot(style='^',label='1000 ms', alpha=0.5, zorder=0, ms=10)
data.v1.plot(style='x',label='original', ms=10)
plt.legend()
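Explanation (sort of...):
When you resample the data with any aggregation method (including mean()), you get NaNs at the new timestamps where no original sample exists: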
>>> data.resample('700L').mean().head()
v1
ts
1970-01-01 00:00:59.500000+00:00 0.0
1970-01-01 00:01:00.200000+00:00 NaN
1970-01-01 00:01:00.900000+00:00 NaN
1970-01-01 00:01:01.600000+00:00 NaN
1970-01-01 00:01:02.300000+00:00 NaN
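To make the output above a bit more concrete, here is a minimal sketch that counts how many 700 ms bins actually receive a value. It rebuilds the question's data with a timezone-naive index and passes the rule as a `pd.Timedelta` instead of the '700L' alias; both choices are assumptions made to keep the example short and independent of alias spelling, not something taken from the thread.

```python
import numpy as np
import pandas as pd

# Minute-sampled data with the same values as in the question
# (timezone-naive index, an assumption made for brevity).
v1 = [0.0] * 5 + [1.0] * 8 + [0.0] * 5
index = pd.to_datetime(np.arange(1, len(v1) + 1) * 60_000, unit="ms")
data = pd.DataFrame({"v1": v1}, index=index)

# Bin into 700 ms intervals; a bin with no sample gets NaN from mean().
binned = data.resample(pd.Timedelta(milliseconds=700)).mean()

# Each original sample lands in exactly one bin, so only 18 bins are filled.
n_filled = int(binned["v1"].notna().sum())
print(n_filled, "filled bins out of", len(binned))
print(binned.index[0])  # label of the bin that holds the 00:01:00 sample
```

Note that the first label is 00:00:59.500 while the sample it holds was taken at 00:01:00, so with this approach every value is moved to the left edge of its bin, a shift of up to 700 ms.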
When you then call interpolate, it fills the NaNs with the appropriate linear interpolation.
>>> data.resample('700l').mean().interpolate().head()
v1
ts
1970-01-01 00:00:59.500000+00:00 0.0
1970-01-01 00:01:00.200000+00:00 0.0
1970-01-01 00:01:00.900000+00:00 0.0
1970-01-01 00:01:01.600000+00:00 0.0
1970-01-01 00:01:02.300000+00:00 0.0
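When you call interpolate directly on the output of resample, interpolate does not seem to behave as expected: it gives a run of NaNs at the start and then declines gradually from the maximum value (1). Not quite sure why: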
>>> data.resample('700l').interpolate().head()
v1
ts
1970-01-01 00:00:59.500000+00:00 NaN
1970-01-01 00:01:00.200000+00:00 NaN
1970-01-01 00:01:00.900000+00:00 NaN
1970-01-01 00:01:01.600000+00:00 NaN
1970-01-01 00:01:02.300000+00:00 NaN
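One plausible explanation, which the thread itself does not confirm: in pandas versions where `resample(...).interpolate()` first reindexes the data onto the new grid, only the original samples that fall exactly on that grid survive. Samples 60 s apart hit a 600 ms grid every minute (60000 / 600 = 100), but they hit a 700 ms grid only every 7 minutes (the least common multiple of 60000 and 700 is 420000 ms), so most samples would be dropped before interpolation. The sketch below illustrates the idea with a plain `reindex`, and then shows one way around it: interpolate on the union of the original and target timestamps, then keep only the target grid. The grid start, the timezone-naive index, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Minute-sampled data with the same values as in the question
# (timezone-naive index, an assumption made for brevity).
v1 = [0.0] * 5 + [1.0] * 8 + [0.0] * 5
index = pd.to_datetime(np.arange(1, len(v1) + 1) * 60_000, unit="ms")
data = pd.DataFrame({"v1": v1}, index=index)

# Hypothetical target grid: 700 ms steps starting at the first sample.
grid = pd.date_range(index[0], index[-1], freq=pd.Timedelta(milliseconds=700))

# A plain reindex keeps only the samples that sit exactly on the grid.
aligned = data.reindex(grid)
print(int(aligned["v1"].notna().sum()), "of", len(data), "samples survive")

# Interpolate over the union of both indexes so every original sample is
# used, weighting by time, then keep only the 700 ms grid points.
union = data.index.union(grid)
data07 = data.reindex(union).interpolate(method="time").reindex(grid)
print(data07.head())
```

Unlike the `mean()` approach, this keeps each original sample at its own timestamp while interpolating, so no value is shifted to a bin edge.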