Value from a past, potentially missing month in a DataFrame
Let's say I have a DataFrame like the following:
Month, Gender, State, Value
2010-01, M, S1, 10
2010-02, M, S1, 20
2010-05, M, S1, 26
2010-03, F, S2, 11
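I'd like to add another column holding the value from the previous month (or from X months in the past) for the same gender and state, if it exists, i.e.: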
Month, Gender, State, Value, Last Value
2010-01, M, S1, 10, NaN
2010-02, M, S1, 20, 10
2010-05, M, S1, 26, NaN (there is no 2010-04 for M, S1)
2010-03, F, S2, 11, NaN
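I know I have to groupby(['Gender', 'State']), but shift() doesn't work because it just shifts the data by a number of rows; it is not aware of the period itself (if a month is missing).
I have found a way to do it, though I'm not too happy with it: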
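To illustrate the problem, here is a minimal sketch built from the sample data above (the variable name `naive` is only illustrative). A plain grouped shift() pairs 2010-05 with 2010-02, because that is simply the previous row in the group:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Month": ["2010-01", "2010-02", "2010-05", "2010-03"],
    "Gender": ["M", "M", "M", "F"],
    "State": ["S1", "S1", "S1", "S2"],
    "Value": [10, 20, 26, 11],
})

# Row-based shift within each (Gender, State) group
naive = df.groupby(["Gender", "State"])["Value"].shift(1)
print(naive.tolist())
# 2010-05 gets 20 (the 2010-02 value) instead of NaN,
# even though 2010-04 is missing for (M, S1)
```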
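import pandas as pd

# all_genders, all_states and all_months are assumed to be lists holding
# every gender, state and month that should appear in the result
full_index = pd.MultiIndex.from_product(
    [all_genders, all_states, all_months],
    names=['Gender', 'State', 'Month'])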
df = df.set_index(['Gender', 'State', 'Month'])
df = df.reindex(full_index) # fill in all missing values
So basically, instead of dealing with rows that are missing from the data, we create the missing rows so that shift() works as expected, i.e.:
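# shift within each (Gender, State) group so values never leak across groups
df['Last Value'] = df.groupby(level=['Gender', 'State'])['Value'].shift(1)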
...
df = df.reset_index() # go back to tabular format from this hierarchy
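For comparison, one possible alternative that avoids creating the missing rows is to merge the frame with a copy of itself whose month key has been moved forward. This is only a sketch under assumptions: the helper name `add_lagged_value`, the temporary `_month_no` column and the `months` parameter are hypothetical, Month is assumed to be a "YYYY-MM" string, and each (Gender, State, Month) combination is assumed to appear at most once.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Month": ["2010-01", "2010-02", "2010-05", "2010-03"],
    "Gender": ["M", "M", "M", "F"],
    "State": ["S1", "S1", "S1", "S2"],
    "Value": [10, 20, 26, 11],
})

def add_lagged_value(frame, months=1, name="Last Value"):
    """Add the Value from `months` months earlier for the same Gender/State."""
    keys = ["Gender", "State", "_month_no"]
    out = frame.copy()
    # Turn "YYYY-MM" into a running month count so that lags are plain integers
    parsed = pd.to_datetime(out["Month"], format="%Y-%m")
    out["_month_no"] = parsed.dt.year * 12 + parsed.dt.month
    # Lookup table: each row offers its value to the month `months` later
    lookup = out[keys + ["Value"]].rename(columns={"Value": name})
    lookup["_month_no"] = lookup["_month_no"] + months
    # Left merge keeps every original row; unmatched rows get NaN.
    # validate= raises if the uniqueness assumption does not hold.
    merged = out.merge(lookup, on=keys, how="left", validate="many_to_one")
    return merged.drop(columns="_month_no")

result = add_lagged_value(df, months=1)
print(result)
```

Only the 2010-02 row for (M, S1) finds a match here (the value 10 from 2010-01); the other rows, including 2010-05, get NaN. Passing a different `months` value covers the "X months in the past" case.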