Bokeh Graph for Resampled, Hierarchical, Categorical+Time Data

Question

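I know I'm getting close on this, but I just can't get Bokeh to do what I'm looking for. I need to resample time data into 15-minute intervals, group it by a hierarchical, categorical type, and plot the results across the time groups. Any help would be appreciated.

I have data like this: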

    basket_id   food_type               classified_time             dipped_time                 slot_number
0   185261      CHICKEN FILLETS         2019-07-07 11:38:23.153858  2019-07-07 11:38:40.271070  8
1   185263      CHICKEN FILLETS         2019-07-07 11:38:25.831668  2019-07-07 11:38:53.265553  4
2   185273      CRISPY CHICKEN TENDERS  2019-07-07 11:39:26.184932  2019-07-07 11:39:58.164302  5
3   185276      CRISPY CHICKEN TENDERS  2019-07-07 11:39:30.178273  2019-07-07 11:39:46.076617  1
...

I can resample this data to get the following result, which looks very much like it is on the right track:

    # set_index takes a single key here; a second positional argument
    # would be read as `drop`, not as another index level
    agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type': 'COUNT'}) \
                .reset_index()
    display(agg_15m)

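Then I can use groupby to get what I believe is the right structure:

    group = agg_15m.groupby(['dipped_time', 'food_type'])
    display(group.sum())

Getting this far took a lot of figuring out with DataFrames, since I'm not very familiar with working with multi-indexed data.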
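As a point of comparison, the resample-then-group steps above can be sketched as a single groupby using `pd.Grouper`. This is only a sketch on made-up rows shaped like the question's data, not the asker's pipeline. Unlike the per-group `resample`, it does not emit zero-count rows for empty bins.

```python
import pandas as pd

# Made-up rows in the same shape as the question's data
df = pd.DataFrame({
    "food_type": ["CHICKEN FILLETS", "CHICKEN FILLETS",
                  "CRISPY CHICKEN TENDERS", "CRISPY CHICKEN TENDERS"],
    "dipped_time": pd.to_datetime([
        "2019-07-07 11:38:40", "2019-07-07 11:38:53",
        "2019-07-07 11:39:58", "2019-07-07 11:46:02",
    ]),
})

# Bin the timestamps into 15-minute buckets and count the rows for each
# (bucket, food_type) pair in one groupby call
counts = (
    df.groupby([pd.Grouper(key="dipped_time", freq="15min"), "food_type"])
      .size()
      .rename("COUNT")
)
print(counts)
```

The result is a Series with a two-level index of (dipped_time, food_type), the same shape that `group.sum()` displays.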

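Now for the fun part: trying to get Bokeh to do something with this data. One set of instructions from Bokeh seems to point in the right direction, but it uses only a single groupby. Another set of instructions from Bokeh gives some direction for hierarchical categorical data, but its example is built entirely from literals.

This is what I've tried: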

    p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
    )
    p.vbar(x='dipped_time_food_type', top='COUNT', width=1e3*60*15, source=self.group.sum() )

This gives an empty plot.

If I try to pass the group object as x_range, as per these instructions:

    self.p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
        x_range=group
    )

I get the following error when setting up the figure, even though this looks like the format explained in the documentation:

ValueError: expected an element of either Seq(String), Seq(Tuple(String, String)) or Seq(Tuple(String, String, String)), got [(Timestamp('2019-07-07 11:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 11:45:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:00:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:00:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:15:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:15:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:15:00'), 'POTATO FRIES')]
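The error message lists the accepted element types: strings, or tuples of two or three strings. The `Timestamp` objects in the first position of each tuple match none of them. Below is a minimal sketch of converting such pairs into string tuples. It uses plain `datetime` objects as stand-ins (pandas `Timestamp` subclasses `datetime`, so `strftime` behaves the same) and assumes that `"%H:%M"` labels are enough, which drops the date.

```python
from datetime import datetime

# Index entries shaped like the ones in the error message; plain
# datetime objects stand in for pandas Timestamps here
index = [
    (datetime(2019, 7, 7, 11, 30), "CHICKEN FILLETS"),
    (datetime(2019, 7, 7, 11, 30), "POTATO FRIES"),
    (datetime(2019, 7, 7, 11, 45), "CHICKEN FILLETS"),
]

# Format each timestamp so that every factor is a (str, str) tuple
factors = [(ts.strftime("%H:%M"), food) for ts, food in index]
print(factors)
```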

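I've also tried a few other approaches, but this seems to be the closest I've gotten. I'd love any insight into the DataFrame structure, or into any other silly mistakes I'm missing.

Thanks for your help!

Edit: I noticed that the last error has nothing to do with the data structure; it is about the data type. So I converted the datetimes to strings: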

    agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type': 'COUNT'}) \
                .reset_index()
    # Note: to_string() renders the whole Series as one string,
    # which is then assigned to every row
    agg_15m['dipped_time'] = agg_15m['dipped_time'].to_string()
    self.group = agg_15m.groupby(['dipped_time', 'food_type'])
    self.p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
        x_range=self.group
    )
    self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))

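This now gives me a rather ugly chart that doesn't seem to represent the underlying data.

I'm trying to do something more like this:

Edit

The string conversion in the previous version was incorrect. Updated to: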

    agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type': 'COUNT'}) \
                .reset_index()
    agg_15m['dipped_time'] = agg_15m['dipped_time'].astype(str)
    self.group = agg_15m.groupby(['dipped_time', 'food_type'])
    self.p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
        x_range=self.group
    )
    self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))

This gives the correct data, but now the chart is empty, with some artifacts in the corner.
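For comparison, the nested-category pattern these attempts are aiming for can be sketched with explicit string tuples instead of a GroupBy object. This is a sketch under assumptions, not a drop-in fix: the counts are made up, and `to_columns` and `make_plot` are hypothetical helper names. `FactorRange`, `ColumnDataSource`, `figure` and `vbar` are the standard Bokeh names.

```python
def to_columns(counts):
    """Split {(time_label, food_type): count} into two parallel lists."""
    factors = sorted(counts)  # sorts by time label, then by food type
    tops = [counts[f] for f in factors]
    return factors, tops


def make_plot(counts):
    # Imported here so the data preparation above runs without Bokeh
    from bokeh.models import ColumnDataSource, FactorRange
    from bokeh.plotting import figure

    factors, tops = to_columns(counts)
    source = ColumnDataSource(data=dict(x=factors, top=tops))
    # FactorRange takes the (str, str) tuples as nested categories
    p = figure(x_range=FactorRange(*factors),
               title="Baskets Cooked per 15min",
               y_axis_label="Count")
    p.vbar(x="x", top="top", width=0.9, source=source)
    p.xaxis.major_label_orientation = 1  # rotate food-type labels (radians)
    return p


# Made-up counts keyed by (time label, food type)
counts = {
    ("11:30", "CHICKEN FILLETS"): 2,
    ("11:30", "POTATO FRIES"): 4,
    ("11:45", "POTATO FRIES"): 3,
}
factors, tops = to_columns(counts)
```

With Bokeh installed, `from bokeh.io import show; show(make_plot(counts))` should render one group of bars per time label.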

Edit

I couldn't get it to work, so I went with a manual approach. This code works:

    import pandas as pd
    from bokeh.core.properties import value
    from bokeh.models import ColumnDataSource
    from bokeh.palettes import Spectral5
    from bokeh.plotting import figure

    # Convert to datetime so we can resample
    df['dipped_time'] = pd.to_datetime(df['dipped_time'], errors='coerce')
    # Group by food type and resample to 15-minute intervals
    self.agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type': 'COUNT'}) \
                .reset_index()
    # A categorical x_range needs strings, not Timestamps
    self.agg_15m['dipped_time'] = self.agg_15m['dipped_time'].astype(str)
    plot_width  = 800
    plot_height = 600
    toolbar_loc = 'above'

    agg = self.agg_15m
    times = sorted(agg.dipped_time.unique())

    # plot_width/plot_height and legend= are the Bokeh 1.x spellings;
    # newer releases use width/height and legend_label instead
    self.p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
        x_range=times
    )
    self.food_types = agg.food_type.unique()
    self.data_source = dict(x=times)
    for food_type in self.food_types:
        arr = []
        for time in times:
            counts = agg.loc[(agg["dipped_time"] == time) & (agg["food_type"] == food_type), "COUNT"]
            # Missing (time, food_type) combinations count as 0
            arr.append(0 if counts.empty else counts.values[0])
        self.data_source[food_type] = arr

    # One colour per food type (Spectral5 covers up to five types)
    fill_colors = [Spectral5[i] for i in range(len(self.food_types))]

    self.p.vbar_stack(self.food_types,
                      x='x',
                      width=0.9, alpha=0.5,
                      source=ColumnDataSource(self.data_source),
                      fill_color=fill_colors,
                      legend=[value(x) for x in self.food_types])

I'm still open to more idiomatic solutions.
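One possibly more idiomatic way to build the same wide, per-food-type columns is `pivot_table`, sketched below on made-up long-format counts. It is an alternative to the manual loop rather than the asker's method, and the sample values are assumptions.

```python
import pandas as pd

# Made-up long-format counts shaped like agg_15m after astype(str)
agg_15m = pd.DataFrame({
    "food_type": ["CHICKEN FILLETS", "CHICKEN FILLETS", "POTATO FRIES"],
    "dipped_time": ["2019-07-07 11:30:00", "2019-07-07 11:45:00",
                    "2019-07-07 11:45:00"],
    "COUNT": [2, 1, 4],
})

# One row per time bin, one column per food type, 0 where a pair is missing
wide = agg_15m.pivot_table(index="dipped_time", columns="food_type",
                           values="COUNT", aggfunc="sum", fill_value=0)

food_types = list(wide.columns)
data = {"x": list(wide.index)}
for food_type in food_types:
    data[food_type] = wide[food_type].tolist()
print(data)
```

The resulting `data` dict has the same layout as the hand-built `self.data_source`, so it could be wrapped in a `ColumnDataSource` and passed to `vbar_stack` with `food_types` as the stackers.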

Answer 1

You are trying to plot COUNT_std as the top of the bars, but if you actually look at the data in the ColumnDataSource, you will see that it contains only NaN values:

 'COUNT_std': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),

Indeed, if you go back to the group and look at the output of group.describe(), you will see that the NaNs come from there:

In [40]: group.describe()
Out[40]:
                                           COUNT
                                           count mean std  min  25%  50%  75%  max
dipped_time         food_type
2019-07-07 12:30:00 POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 12:45:00 CRISPY CHICKEN TENDERS   1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POPCORN CHICKEN          1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POTATO FRIES             1.0  4.0 NaN  4.0  4.0  4.0  4.0  4.0
2019-07-07 13:00:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
2019-07-07 13:15:00 CRISPY CHICKEN TENDERS   1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
                    POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 13:30:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  1.0 NaN  1.0  1.0  1.0  1.0  1.0
2019-07-07 13:45:00 POTATO FRIES             1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
2019-07-07 14:00:00 POTATO FRIES             1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
2019-07-07 14:15:00 POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0

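I'm not sure why that column ends up full of NaNs, but it is the immediate cause of the problem with the last plot. If you instead use a column with valid numeric values, e.g. COUNT_max:

    p.vbar(x='dipped_time_food_type', top='COUNT_max', width=0.9, source=group)

Then you get the plot you want, modulo any visual styling:

Note that I set the bar width to 0.9 so that there is actually space between the bars.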
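On the NaN column: the `count` column in the `describe()` output above is 1.0 for every group, and pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single observation. The sketch below reproduces this on made-up counts. The sample values are assumptions, not the asker's data.

```python
import math
import pandas as pd

# Made-up aggregated counts: each (time, food_type) pair occurs once
agg = pd.DataFrame({
    "dipped_time": ["12:30", "12:45", "12:45"],
    "food_type": ["POTATO FRIES", "POPCORN CHICKEN", "POTATO FRIES"],
    "COUNT": [5, 3, 4],
})
stats = agg.groupby(["dipped_time", "food_type"])["COUNT"].agg(["count", "std", "max"])
print(stats)

# The default is the sample standard deviation (ddof=1), which is
# undefined for one value; the population version (ddof=0) is 0
single = pd.Series([5.0])
print(single.std())
print(single.std(ddof=0))
```

Because every group here holds exactly one already-aggregated row, the max, min and mean columns all equal the count itself, which is why COUNT_max works as the bar height.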

python

hierarchical-data

dataframe

bokeh