Bokeh Graph for Resampled, Hierarchical, Categorical+Time Data
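I know I'm getting close, but I just can't get Bokeh to do what I'm looking for. I need to resample time data into 15-minute intervals, then group it by a hierarchical, categorical type and plot the results across the time groups. Any help would be greatly appreciated.
I have data like this: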
   basket_id               food_type             classified_time                 dipped_time  slot_number
0     185261         CHICKEN FILLETS  2019-07-07 11:38:23.153858  2019-07-07 11:38:40.271070            8
1     185263         CHICKEN FILLETS  2019-07-07 11:38:25.831668  2019-07-07 11:38:53.265553            4
2     185273  CRISPY CHICKEN TENDERS  2019-07-07 11:39:26.184932  2019-07-07 11:39:58.164302            5
3     185276  CRISPY CHICKEN TENDERS  2019-07-07 11:39:30.178273  2019-07-07 11:39:46.076617            1
...
I can resample this data to get the following result, which looks like it is on the right track:
# set_index takes a single key here; food_type stays a column for the groupby
agg_15m = df[['dipped_time', 'food_type']] \
    .set_index('dipped_time') \
    .groupby('food_type') \
    .resample('15Min') \
    .agg({'food_type': 'count'}) \
    .rename(columns={'food_type': 'COUNT'}) \
    .reset_index()
display(agg_15m)
Then I can use groupby to get what I think is the right structure:
group = agg_15m.groupby(['dipped_time', 'food_type'])
display(group.sum())
Getting this far took a lot of wrangling with the DataFrame, as I'm not very familiar with working with multi-indexed data.
Now for the fun part: trying to get Bokeh to do something with this data. This instruction from Bokeh seems to point in the right direction; however, it uses only a single groupby. This other instruction from Bokeh gives some direction for hierarchical categorical data, but the example is built entirely from literals.
This is what I've tried.
p = figure(
    title="Baskets Cooked per 15min",
    y_axis_label="Count",
    plot_width=plot_width,
    plot_height=plot_height,
    toolbar_location=toolbar_loc,
)
p.vbar(x='dipped_time_food_type', top='COUNT', width=1e3*60*15, source=self.group.sum())
This gives an empty plot.
If I try to put the group object into x_range, as per these instructions:
self.p = figure(
    title="Baskets Cooked per 15min",
    y_axis_label="Count",
    plot_width=plot_width,
    plot_height=plot_height,
    toolbar_location=toolbar_loc,
    x_range=group
)
I get the following error when setting up the figure, even though this appears to be the format explained here:
ValueError: expected an element of either Seq(String), Seq(Tuple(String, String)) or Seq(Tuple(String, String, String)), got [(Timestamp('2019-07-07 11:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 11:45:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:00:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:00:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:15:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:15:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:15:00'), 'POTATO FRIES')]
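The error message above asks for factors that are strings or tuples of strings, whereas the first element of each tuple here is a pandas Timestamp. A minimal sketch of one way to build string-only factors follows; the three-row index and the "%H:%M" label format are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical stand-in for the index of group.sum(): (Timestamp, food_type) pairs.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2019-07-07 11:30:00"), "CHICKEN FILLETS"),
        (pd.Timestamp("2019-07-07 11:30:00"), "POTATO FRIES"),
        (pd.Timestamp("2019-07-07 11:45:00"), "POTATO FRIES"),
    ],
    names=["dipped_time", "food_type"],
)

# Format each Timestamp as text so every factor is a (str, str) tuple.
factors = [(ts.strftime("%H:%M"), food) for ts, food in index]
print(factors)
```

A list like this could then be handed to Bokeh as `x_range=FactorRange(*factors)`, provided the x column of the data source holds the same string tuples.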
I've also tried a few other approaches, but this seems to be the closest I've gotten. I'd love any insight into the DataFrame structure, or any other silly mistake I'm missing.
Thanks for your help!
EDIT
So I noticed that the last error is not about the data structure but about the data type. So I converted the datetimes to strings:
agg_15m = df[['dipped_time', 'food_type']] \
    .set_index('dipped_time') \
    .groupby('food_type') \
    .resample('15Min') \
    .agg({'food_type': 'count'}) \
    .rename(columns={'food_type': 'COUNT'}) \
    .reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].to_string()
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
    title="Baskets Cooked per 15min",
    y_axis_label="Count",
    plot_width=plot_width,
    plot_height=plot_height,
    toolbar_location=toolbar_loc,
    x_range=self.group
)
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))
This now gives me a rather ugly chart that doesn't seem to represent the underlying data.
I'm trying to produce something more like this:
EDIT
The string conversion in the previous version was incorrect. Updated to:
agg_15m = df[['dipped_time', 'food_type']] \
    .set_index('dipped_time') \
    .groupby('food_type') \
    .resample('15Min') \
    .agg({'food_type': 'count'}) \
    .rename(columns={'food_type': 'COUNT'}) \
    .reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].astype(str)
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
    title="Baskets Cooked per 15min",
    y_axis_label="Count",
    plot_width=plot_width,
    plot_height=plot_height,
    toolbar_location=toolbar_loc,
    x_range=self.group
)
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))
This gives the correct data, but now the chart is empty, with some artifacts in the corner.
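For reference, the difference between the two string conversions tried above can be seen on a small Series. This is only an illustrative sketch; the two timestamps and the `frame` variable are made up.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2019-07-07 11:30:00", "2019-07-07 11:45:00"]))

# Series.to_string() renders the whole Series as ONE multi-line string.
rendered = times.to_string()

# Series.astype(str) converts element by element, keeping one value per row.
per_row = times.astype(str)

# Assigning the single rendered string to a column repeats it on every row,
# so all rows end up with the same "time" label.
frame = pd.DataFrame({"dipped_time": times})
frame["as_to_string"] = rendered
frame["as_astype"] = per_row
print(frame["as_to_string"].nunique(), frame["as_astype"].nunique())
```

With `to_string()` every row carries the same label, which would be consistent with the first chart not representing the underlying data.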
EDIT
I couldn't get it to work, so I went with a manual approach. This code works:
import pandas as pd
from bokeh.core.properties import value
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral5
from bokeh.plotting import figure

# Convert to datetime so we can resample
df['dipped_time'] = pd.to_datetime(df['dipped_time'], errors='coerce')

# Group by food type and resample to 15-minute intervals
self.agg_15m = df[['dipped_time', 'food_type']] \
    .set_index('dipped_time') \
    .groupby('food_type') \
    .resample('15Min') \
    .agg({'food_type': 'count'}) \
    .rename(columns={'food_type': 'COUNT'}) \
    .reset_index()
# Categorical ranges need string factors
self.agg_15m['dipped_time'] = self.agg_15m['dipped_time'].astype(str)

plot_width = 800
plot_height = 600
toolbar_loc = 'above'
self.p = figure(
    title="Baskets Cooked per 15min",
    y_axis_label="Count",
    plot_width=plot_width,
    plot_height=plot_height,
    toolbar_location=toolbar_loc,
    x_range=sorted(self.agg_15m.dipped_time.unique())
)
self.food_types = self.agg_15m.food_type.unique()
self.data_source = dict(
    x=sorted(self.agg_15m.dipped_time.unique())
)
df = self.agg_15m
# Build one list of counts per food type, with 0 where a time bin has no row
for food_type in self.food_types:
    arr = []
    for time in sorted(self.agg_15m.dipped_time.unique()):
        counts = df.loc[(df["dipped_time"] == time) & (df["food_type"] == food_type), "COUNT"]
        if counts.empty:
            arr.append(0)
        else:
            arr.append(counts.values[0])
    self.data_source[food_type] = arr

# One color per food type (Spectral5 covers at most five types)
fill_colors = [Spectral5[i] for i in range(len(self.food_types))]
self.p.vbar_stack(
    self.food_types,
    x='x',
    width=0.9, alpha=0.5,
    source=ColumnDataSource(self.data_source),
    fill_color=fill_colors,
    legend=[value(x) for x in self.food_types]
)
I'm still open to more idiomatic solutions.
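One possibly more idiomatic route, offered as a sketch rather than a tested replacement, is to let pandas do the pivoting that the nested loops above do by hand. The sample rows, the `wide` and `data` names, and the "%H:%M" label format below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample rows shaped like the question's data.
df = pd.DataFrame(
    {
        "food_type": ["CHICKEN FILLETS", "CHICKEN FILLETS", "POTATO FRIES", "POTATO FRIES"],
        "dipped_time": pd.to_datetime(
            [
                "2019-07-07 11:38:40",
                "2019-07-07 11:38:53",
                "2019-07-07 11:39:58",
                "2019-07-07 11:46:10",
            ]
        ),
    }
)

# Count rows per 15-minute bin and food type, then pivot food types into columns,
# filling missing (bin, food type) combinations with 0.
wide = (
    df.groupby([pd.Grouper(key="dipped_time", freq="15min"), "food_type"])
    .size()
    .unstack("food_type", fill_value=0)
)

# Categorical axes need string factors, so format the bin labels as text.
wide.index = wide.index.strftime("%H:%M")

# Same shape as the hand-built data_source dict: one "x" list plus one list per food type.
data = {"x": list(wide.index), **{col: wide[col].tolist() for col in wide.columns}}
print(data)
```

The resulting dict has the same layout as `self.data_source` above, so it could be wrapped in a `ColumnDataSource` and passed to `vbar_stack` with `list(wide.columns)` as the stackers. Time bins in which no food type has any rows may need to be added back separately, for example by reindexing against a full `pd.date_range`.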
You are trying to plot COUNT_std as the top of the bars, but if you actually look at the data in the ColumnDataSource, you will see that it is nothing but NaN values:
'COUNT_std': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
Indeed, if you go back to the group and look at the output of group.describe(), you can see that the NaNs come from there:
In [40]: group.describe()
Out[40]:
                                            COUNT
                                            count  mean   std   min   25%   50%   75%   max
dipped_time         food_type
2019-07-07 12:30:00 POTATO FRIES              1.0   5.0   NaN   5.0   5.0   5.0   5.0   5.0
2019-07-07 12:45:00 CRISPY CHICKEN TENDERS    1.0   3.0   NaN   3.0   3.0   3.0   3.0   3.0
                    POPCORN CHICKEN           1.0   3.0   NaN   3.0   3.0   3.0   3.0   3.0
                    POTATO FRIES              1.0   4.0   NaN   4.0   4.0   4.0   4.0   4.0
2019-07-07 13:00:00 CRISPY CHICKEN TENDERS    1.0   6.0   NaN   6.0   6.0   6.0   6.0   6.0
                    POTATO FRIES              1.0   3.0   NaN   3.0   3.0   3.0   3.0   3.0
2019-07-07 13:15:00 CRISPY CHICKEN TENDERS    1.0   0.0   NaN   0.0   0.0   0.0   0.0   0.0
                    POTATO FRIES              1.0   5.0   NaN   5.0   5.0   5.0   5.0   5.0
2019-07-07 13:30:00 CRISPY CHICKEN TENDERS    1.0   6.0   NaN   6.0   6.0   6.0   6.0   6.0
                    POTATO FRIES              1.0   1.0   NaN   1.0   1.0   1.0   1.0   1.0
2019-07-07 13:45:00 POTATO FRIES              1.0   6.0   NaN   6.0   6.0   6.0   6.0   6.0
2019-07-07 14:00:00 POTATO FRIES              1.0   0.0   NaN   0.0   0.0   0.0   0.0   0.0
2019-07-07 14:15:00 POTATO FRIES              1.0   3.0   NaN   3.0   3.0   3.0   3.0   3.0
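I'm not sure why that column ends up full of NaNs, but it is the direct cause of the problem with the last plot. If you instead use a column with valid numeric values, e.g. COUNT_max:
p.vbar(x='dipped_time_food_type', top='COUNT_max', width=0.9, source=group)
then you get the plot you want, modulo any visual styling:
Note that I set the bar width to 0.9 so that there is actually some space between the bars.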
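As for where the NaNs might come from: the `count` column in the `describe()` output is 1.0 for every group, and pandas computes the sample standard deviation (`ddof=1` by default), which is undefined for a single observation. The sketch below reproduces that on a made-up two-row frame; the `agg` and `stats` names and values are hypothetical.

```python
import pandas as pd

# Hypothetical aggregated frame: exactly one row per (dipped_time, food_type) pair.
agg = pd.DataFrame(
    {
        "dipped_time": ["2019-07-07 12:30:00", "2019-07-07 12:45:00"],
        "food_type": ["POTATO FRIES", "POTATO FRIES"],
        "COUNT": [5, 4],
    }
)

# With one row per group, std has nothing to measure spread against and is NaN,
# while max (like min, mean, and sum) simply returns the single value.
stats = agg.groupby(["dipped_time", "food_type"])["COUNT"].agg(["count", "std", "max"])
print(stats)
```

Under that reading, any of the single-value statistics would serve as the bar height, since each group already holds one pre-aggregated count.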