Probabilistic prediction based on occurrence frequency
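I have a rainfall time series for 2011-2013 in which rainfall is coded as 1 (rain) and 0 (no rain). The raw data is at 1-hour intervals, from 10 am to 3 pm each day. I don't want to forecast the rainfall amount for 2014; instead, I want to predict the likelihood of rain for the same time interval throughout the year, based on how often 1 or 0 occurs in the rain column. At the moment I estimate the likelihood of rain by counting the occurrences of 1 and 0 with the following code: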
import pandas as pd
b = {'year':[2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,
2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,
2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013],
'month': [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12],
'rain':[1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0]}
b = pd.DataFrame(b,columns = ['year','month','rain'])
def X(b):
    if (b['month'] == 1):
        return 'Jan'
    elif (b['month']==2):
        return 'Feb'
    elif (b['month']==3):
        return 'Mar'
    elif (b['month']==4):
        return 'Apr'
    elif (b['month']==5):
        return 'May'
    elif (b['month']==6):
        return 'Jun'
    elif (b['month']==7):
        return 'Jul'
    elif (b['month']==8):
        return 'Aug'
    elif (b['month']==9):
        return 'Sep'
    elif (b['month']==10):
        return 'Oct'
    elif (b['month']==11):
        return 'Nov'
    elif (b['month']==12):
        return 'Dec'
b['X'] = b.apply(X,axis =1)
mask_x = (b['X']=='Jul')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
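I don't think this approach scales to large datasets. Can anyone suggest an efficient and robust way to predict the likelihood of rain for this kind of dataset?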
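The data here is created by randomly choosing a value from [0,1] for every hour. We group by the time components of the date column and compute, for each group, the total (the number of rainy observations) and the number of cases. You can then compute the rain rate as total / number of cases. I followed your code in creating the year, month, and abbreviated month name, but that is not really necessary.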
import pandas as pd
import numpy as np
import random
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
# Aggregations are passed as strings; the resulting columns are still named 'sum' and 'size'
hour_rain = b.groupby([b.date.dt.month, b.date.dt.day, b.date.dt.hour])['rain'].agg(['sum','size'])
hour_rain.index.names = ['month','day','hour']
hour_rain.reset_index()
      month  day  hour  sum  size
0         1    1     0    0     4
1         1    1     1    2     3
2         1    1     2    3     3
3         1    1     3    1     3
4         1    1     4    1     3
...     ...  ...   ...  ...   ...
8755     12   31    19    2     3
8756     12   31    20    2     3
8757     12   31    21    2     3
8758     12   31    22    0     3
8759     12   31    23    0     3
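The table above holds the counts but not yet the rate itself. The step "total / number of cases" can be sketched as follows. This is only an illustration under a few assumptions: the data is synthetic, 1 is taken to mean rain, the column name `p_rain` is made up for this example, and the hourly frequency is passed as a `pd.Timedelta` so the snippet does not depend on a particular frequency alias.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly 0/1 rain series (1 = rain); not real data.
rng = np.random.default_rng(20200817)
dates = pd.date_range("2013-01-01", "2015-12-31 23:00", freq=pd.Timedelta(hours=1))
b = pd.DataFrame({"date": dates, "rain": rng.integers(0, 2, size=len(dates))})

# Count rainy hours and observations for every (month, day, hour) slot.
keys = [b["date"].dt.month, b["date"].dt.day, b["date"].dt.hour]
hour_rain = b.groupby(keys)["rain"].agg(["sum", "size"])
hour_rain.index.names = ["month", "day", "hour"]

# Empirical probability of rain = rainy observations / total observations.
hour_rain["p_rain"] = hour_rain["sum"] / hour_rain["size"]

# Look up the slot for 20 January, 15:00.
print(hour_rain.loc[(1, 20, 15)])
```

With only three years of data each slot has three observations, so `p_rain` can only take the values 0, 1/3, 2/3, or 1; more years give a finer estimate.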
What I am trying to do is shown below:
import pandas as pd
import numpy as np
import random
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2015-12-31', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
b['year'] = b['date'].dt.year
b['month'] = b['date'].dt.month
b['day'] = b['date'].dt.day
b['hour'] = b['date'].dt.hour
b['X'] = b['date'].dt.strftime('%b')
b['hour']= b['hour'].astype(str).str.zfill(2)
b['day']= b['day'].astype(str).str.zfill(2)
# Join the month abbreviation, day, and hour into one key, e.g. 'Jan2015'
b['var'] = b['X']+b['day'].astype(str)+b['hour'].astype(str)
cols = b.columns.tolist()
cols = cols[-1:] + cols[:-1]
b = b[cols]
# drop the unwanted columns
b = b.drop(["date","month","X","hour","day","year"], axis=1)
# now lets say I wanna predict 20 January 15.00 chance of rain
mask_x = (b['var']=='Jan2015')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
output:
0 2
1 1
# means the chance of rain is 33.33% and the chance of no rain is 66.67%
When I run this on a large dataset (more than 20 years), it does not seem to work very well.
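One possible way to avoid building a string key and filtering the whole frame for every query is to compute all slot probabilities in a single grouped pass and then look them up by index. The sketch below is not the method from the posts above; `rain_probability` is a hypothetical helper name, the sample data is hand-made, and it again assumes that 1 means rain. It relies on the fact that the mean of a 0/1 column equals the share of 1s.

```python
import pandas as pd

def rain_probability(df: pd.DataFrame) -> pd.Series:
    """Share of rainy observations per (month, day, hour); assumes 1 = rain."""
    ts = df["date"].dt
    keys = [ts.month.rename("month"), ts.day.rename("day"), ts.hour.rename("hour")]
    # The mean of a 0/1 column equals count(1) / count(all).
    return df.groupby(keys)["rain"].mean()

# Tiny hand-made sample: 20 January 15:00 in three years, one of them rainy.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-20 15:00", "2014-01-20 15:00", "2015-01-20 15:00"]),
    "rain": [0, 1, 0],
})
prob = rain_probability(sample)
print(prob.loc[(1, 20, 15)])  # 1 rainy hour out of 3
```

The grouping runs once over the data regardless of how many slots are queried afterwards, and each lookup is an index access on the resulting Series rather than a new boolean mask over all rows.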