计算组内移动中位数
Calculating moving median within group
我想对 4 天前的价格列执行滚动中位数,数据将按日期分组。所以基本上我想获取给定日期的价格和 4 天前的所有价格,并计算这些值的中位数。
示例数据如下:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
我已经接近结合滚动和 groupby 了:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
但这似乎为每个索引值添加一行,并且根据中值定义,我无法以某种方式合并这些行以每行产生一个结果。
结果现在看起来像这样:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
它似乎只是删除了前 3 个值,然后只是打印了价格值。
是否有可能每个日期得到一个滞后/移动中值?
您可以使用 rolling
以 5 天的频率 window 获取今天和最后 4 天,然后 drop_duplicates
保留每天的最后一行。首先创建一个 copy
(如果你想保留原来的),每个日期 sort_values
并确保日期列是 datetime
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
这是一个循序渐进的过程。可能有更有效的方法来获得你想要的东西。请注意,如果您有日期的时间信息,则需要在按日期分组之前删除该信息。
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
x = i - 1
y = i - 2
z = i - 3
four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last for days for the current date
for i in range(len(four_days)):
medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
我想对 4 天前的价格列执行滚动中位数,数据将按日期分组。所以基本上我想获取给定日期的价格和 4 天前的所有价格,并计算这些值的中位数。
示例数据如下:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
我已经接近结合滚动和 groupby 了:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
但这似乎为每个索引值添加一行,并且根据中值定义,我无法以某种方式合并这些行以每行产生一个结果。
结果现在看起来像这样:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
它似乎只是删除了前 3 个值,然后只是打印了价格值。
是否有可能每个日期得到一个滞后/移动中值?
您可以使用 rolling
以 5 天的频率 window 获取今天和最后 4 天,然后 drop_duplicates
保留每天的最后一行。首先创建一个 copy
(如果你想保留原来的),每个日期 sort_values
并确保日期列是 datetime
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
这是一个循序渐进的过程。可能有更有效的方法来获得你想要的东西。请注意,如果您有日期的时间信息,则需要在按日期分组之前删除该信息。
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
x = i - 1
y = i - 2
z = i - 3
four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last for days for the current date
for i in range(len(four_days)):
medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)