Filling the missing points in time series data with pandas.date_range and pandas.reindex (Python)
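I am trying to fill the missing points in time series data from an ASCII file using pandas. Everything else seems fine, but the first row is filled with NaN even though the original file has data there.
My data sample is: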
"2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
.
.
"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1
I used the following code:
t1 = np.genfromtxt(INPUT,dtype=None,delimiter=',',usecols=[0])
start = t1[0].strip('\'"')
end = t1[-1].strip('\'"')
data=pd.read_csv(INPUT,sep=',',index_col=[0],parse_dates=[0])
index = pd.date_range(start,end,freq="30S")
df = data
sk_f = df.reindex(index)
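With this code, I wanted to read the first and last strings of the first column and build an index from them, so that any missing points are filled in as NaN. The problem is that the first row is also filled with NaN, as in the result below: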
2011-08-26 00:00:00,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
2011-08-26 00:00:30,1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
.
.
2011-08-26 23:59:30,1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1
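That is, the first row is unexpectedly filled with NaN even though the original file has data for it. From the second row on, everything looks fine, and the filling of the missing data also seems to work. I have tried to work out why this happens but, honestly, I have not found the reason yet.
Any ideas or help would be greatly appreciated.
Thanks,
Isaac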
I think you can omit genfromtxt for reading the file and use only read_csv, then find the min and max dates for the reindex method.
Or use resample:
import pandas as pd
import numpy as np
import io
temp=u""""2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0], parse_dates=[0], header=None)
print(df)
1 2 3 4 5 6 7 \
0
2011-08-26 00:00:00 1155179 3.232 23.70 3.281 0.386 25.27 111.5665
2011-08-26 00:00:30 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
2011-08-26 23:59:30 1155297 12.620 28.06 3.162 1.356 24.30 111.4614
8 9 10 11 12 13 14 15
0
2011-08-26 00:00:00 28.92 29.83 19.13 0 111.5 13.02 29.77 345.7
2011-08-26 00:00:30 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
2011-08-26 23:59:30 28.65 29.84 19.53 0 111.4 13.06 29.50 350.1
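A likely explanation for the NaN first row in the question, offered as a reading of the posted code rather than something the poster confirmed: the read_csv call in the question has no header=None, so pandas takes the first line of the file as column names. That timestamp then never reaches the index, and reindex has nothing to match it against. The sketch below compares both calls on the sample rows, shortened to three columns for brevity; the variable names are only illustrative.

```python
import io

import pandas as pd

# Sample rows from the question, shortened to the first three columns.
raw = (
    '"2011-08-26 00:00:00",1155179,3.232\n'
    '"2011-08-26 00:00:30",1155180,3.289\n'
    '"2011-08-26 23:59:30",1155297,12.62\n'
)

# Default header handling: the first line is consumed as column names.
with_header = pd.read_csv(io.StringIO(raw), index_col=[0], parse_dates=[0])

# header=None: every line, including the first, is treated as data.
no_header = pd.read_csv(
    io.StringIO(raw), index_col=[0], parse_dates=[0], header=None
)

# Full 30-second grid for the day covered by the sample.
full_index = pd.date_range(
    "2011-08-26 00:00:00", "2011-08-26 23:59:30", freq="30s"
)

first_row_lost = with_header.reindex(full_index)
first_row_kept = no_header.reindex(full_index)

print(first_row_lost.head(3))  # 00:00:00 has no source row, so it is all NaN
print(first_row_kept.head(3))  # 00:00:00 keeps its original values
```

If the real file does have a header line, keeping the default and building the index from the parsed DataFrame (rather than from a separate pass over the file) avoids the mismatch as well.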
start = df.index.min()
end = df.index.max()
print(start)
2011-08-26 00:00:00
print(end)
2011-08-26 23:59:30
index = pd.date_range(start,end,freq="30S")
sk_f = df.reindex(index)
print(sk_f.head())
1 2 3 4 5 6 7 \
2011-08-26 00:00:00 1155179 3.232 23.70 3.281 0.386 25.27 111.5665
2011-08-26 00:00:30 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
2011-08-26 00:01:00 NaN NaN NaN NaN NaN NaN NaN
2011-08-26 00:01:30 NaN NaN NaN NaN NaN NaN NaN
2011-08-26 00:02:00 NaN NaN NaN NaN NaN NaN NaN
8 9 10 11 12 13 14 15
2011-08-26 00:00:00 28.92 29.83 19.13 0 111.5 13.02 29.77 345.7
2011-08-26 00:00:30 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
2011-08-26 00:01:00 NaN NaN NaN NaN NaN NaN NaN NaN
2011-08-26 00:01:30 NaN NaN NaN NaN NaN NaN NaN NaN
2011-08-26 00:02:00 NaN NaN NaN NaN NaN NaN NaN NaN
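To see which timestamps reindex had to invent, one option is to compare the full 30-second grid with the timestamps that were actually read. This is a minimal sketch using only the three sample timestamps from the question; the names observed, expected and missing are assumptions made for illustration.

```python
import pandas as pd

# Timestamps that are present in the sample data.
observed = pd.DatetimeIndex(
    ["2011-08-26 00:00:00", "2011-08-26 00:00:30", "2011-08-26 23:59:30"]
)

# Every 30-second slot between the first and last observation.
expected = pd.date_range(observed.min(), observed.max(), freq="30s")

# Slots with no observation; these are the rows reindex fills with NaN.
missing = expected.difference(observed)

print(len(expected), len(missing))
print(missing[:3])
```

With a real file, df.index would take the place of observed.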
print(df.resample('30S').ffill().head())
1 2 3 4 5 6 7 \
0
2011-08-26 00:00:00 1155179 3.232 23.70 3.281 0.386 25.27 111.5665
2011-08-26 00:00:30 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
2011-08-26 00:01:00 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
2011-08-26 00:01:30 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
2011-08-26 00:02:00 1155180 3.289 20.44 2.153 0.222 25.25 111.5735
8 9 10 11 12 13 14 15
0
2011-08-26 00:00:00 28.92 29.83 19.13 0 111.5 13.02 29.77 345.7
2011-08-26 00:00:30 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
2011-08-26 00:01:00 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
2011-08-26 00:01:30 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
2011-08-26 00:02:00 28.94 29.82 19.53 0 111.5 13.02 29.79 342.4
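The fill_method argument of resample was removed in later pandas releases, where the fill is chained as a method on the resampler instead. The sketch below assumes a recent pandas and again uses the sample rows shortened to three columns: asfreq() leaves the gaps as NaN, which matches the reindex result, while ffill() carries the last observation forward.

```python
import io

import pandas as pd

# Sample rows from the question, shortened to the first three columns.
raw = (
    '"2011-08-26 00:00:00",1155179,3.232\n'
    '"2011-08-26 00:00:30",1155180,3.289\n'
    '"2011-08-26 23:59:30",1155297,12.62\n'
)
df = pd.read_csv(io.StringIO(raw), index_col=[0], parse_dates=[0], header=None)

# Lowercase "s" is the spelling recent pandas versions prefer for seconds.
gaps_as_nan = df.resample("30s").asfreq()   # missing slots stay NaN
gaps_ffilled = df.resample("30s").ffill()   # missing slots repeat the last row

print(gaps_as_nan.head())
print(gaps_ffilled.head())
```

Which of the two is appropriate depends on whether a gap should stay visible as missing data or be papered over with the previous reading.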