pandas

Question

I have this DataFrame df:

U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00

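It consists of U and a Datetime object. What I want to do is keep only the U values that occur in at least three consecutive months within a year. So far I have grouped by U, year and month: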

m = df.groupby(['U',df.index.year,df.index.month]).size()

which gives:

U          
1  2015  1     1
         2     1
         4     1
         5     1
         7     1
2  2014  1     1
         2     1
         3     1
   2015  1     1
         8     1
         9     1
3  2014  1     1
         2     1
   2015  1     1
         2     1
         5     1
         10    1
         11    1
         12    1

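The third column is the number of occurrences in each month/year. In this case, only the U values 02 and 03 contain at least three consecutive months. Now I can't figure out how to select these users and extract them into a list, or simply keep them in the original DataFrame df and drop the other users. I also tried: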

g = m.groupby(level=[0,1]).diff()

but I could not get anything useful out of it.

Answer 1

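I finally figured out a solution :)

To give you an idea of how the custom function works: it subtracts the previous month value from each month value. For consecutive months the result is 1, and that has to happen twice in a row. For example, with the list [5, 6, 7], 7 - 6 = 1 and 6 - 5 = 1; the value 1 appears twice, so the condition is met.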
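As a small illustration of that differencing step, the sketch below applies `Series.diff()` to two hand-made month lists. The variable names are made up for this example; the second list is the set of months for U 01 in 2015 from the question.

```python
import pandas as pd

# Three consecutive months: two differences of 1 in a row.
months_ok = pd.Series([5, 6, 7])
# Months of U 01 in 2015: never two differences of 1 in a row.
months_gap = pd.Series([1, 2, 4, 5, 7])

diffs_ok = months_ok.diff().tolist()    # first entry is NaN, then 1.0, 1.0
diffs_gap = months_gap.diff().tolist()  # NaN, 1.0, 2.0, 1.0, 2.0

print(diffs_ok)
print(diffs_gap)
```

The leading NaN comes from the first element having no predecessor, which is why the custom function skips index 0.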

In [80]:
df.reset_index(inplace=True)

In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
            Datetime    U   month   year
0   2015-01-01 20:00:00 1   1       2015
1   2015-02-01 20:05:00 1   2       2015
2   2015-04-01 21:00:00 1   4       2015
3   2015-05-01 22:00:00 1   5       2015
4   2015-07-01 22:05:00 1   7       2015
5   2015-08-01 20:00:00 2   8       2015
6   2015-09-01 21:00:00 2   9       2015
7   2014-01-01 23:00:00 2   1       2014
8   2014-02-01 22:05:00 2   2       2014
9   2015-01-01 20:00:00 2   1       2015
10  2014-03-01 21:00:00 2   3       2014
11  2015-10-01 20:00:00 3   10      2015
12  2015-11-01 21:00:00 3   11      2015
13  2015-12-01 23:00:00 3   12      2015
14  2015-01-01 22:05:00 3   1       2015
15  2015-02-01 20:00:00 3   2       2015
16  2015-05-01 21:00:00 3   5       2015
17  2014-01-01 20:00:00 3   1       2014
18  2014-02-01 21:00:00 3   2       2014

In [284]:
g = df.groupby([df['U'] , df.year])

In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
      Datetime          U   month   year
7   2014-01-01 23:00:00 2   1       2014
8   2014-02-01 22:05:00 2   2       2014
10  2014-03-01 21:00:00 2   3       2014
11  2015-10-01 20:00:00 3   10      2015
12  2015-11-01 21:00:00 3   11      2015
13  2015-12-01 23:00:00 3   12      2015
14  2015-01-01 22:05:00 3   1       2015
15  2015-02-01 20:00:00 3   2       2015
16  2015-05-01 21:00:00 3   5       2015

If you want to see the result of the custom function for each group:

In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U  year
1  2015    False
2  2014     True
   2015    False
3  2014    False
   2015     True
Name: month, dtype: bool
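To get from this boolean result to the list of users the question asks for, one option is to index the Series by itself and read the U level of the remaining index. The sketch below rebuilds the Series shown above by hand so that it runs on its own; `flags` and `users` are names chosen for this example.

```python
import pandas as pd

# Rebuild the boolean result shown above, indexed by (U, year).
idx = pd.MultiIndex.from_tuples(
    [(1, 2015), (2, 2014), (2, 2015), (3, 2014), (3, 2015)],
    names=["U", "year"],
)
flags = pd.Series([False, True, False, False, True], index=idx, name="month")

# Keep only the True entries, then take the distinct U values.
users = flags[flags].index.get_level_values("U").unique().tolist()
print(users)  # [2, 3]
```

With that list, `df[df['U'].isin(users)]` would keep every row of those users rather than only the rows of the qualifying (U, year) groups.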

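The custom function is implemented as follows. Note that it relies on the rows of each group already being ordered by month, and because the groups are per year it does not count a run that crosses from December into January.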

In [53]:    
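def is_at_least_three_consec(month_diff):
    # month_diff is the list produced by diff(); its first entry is NaN.
    consec_count = 0
    for index, val in enumerate(month_diff):
        # A difference of 1 means this month directly follows the previous one.
        if index != 0 and val == 1:
            consec_count += 1
            # Two differences of 1 in a row mean three consecutive months.
            if consec_count == 2:
                return True
        else:
            consec_count = 0

    return False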
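A variant that does not depend on row order and also treats December to January as consecutive can be sketched by converting each timestamp to a month ordinal (year * 12 + month) and looking for the longest run per user. This is not the method from the answer above: it groups by U only, so it keeps all rows of a qualifying user. The helper name `longest_month_run` is hypothetical.

```python
import io
import pandas as pd

CSV = """U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=["Datetime"])


def longest_month_run(dates):
    """Length of the longest run of consecutive calendar months in `dates`."""
    # year * 12 + month makes December -> January a step of exactly 1.
    ordinals = sorted(set(dates.dt.year * 12 + dates.dt.month))
    best = run = 0
    prev = None
    for m in ordinals:
        run = run + 1 if prev is not None and m - prev == 1 else 1
        best = max(best, run)
        prev = m
    return best


runs = df.groupby("U")["Datetime"].apply(longest_month_run)
keep = runs[runs >= 3].index.tolist()   # users with a run of 3+ months
filtered = df[df["U"].isin(keep)]       # all rows of those users
print(keep)
```

On the sample data this keeps users 2 and 3, matching what the question expects. Duplicate months are collapsed by the `set`, so several events in the same month count once.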

pandas - groupby and filtering for consecutive values

python

time-series

dataframe