python pandas: transform start and end datetime range (stored as 2 columns) to individual rows (eqpt utilisation)
Hi, I have a dataset like the df below. I am providing an image and a sample dataframe separately.
I want to transform the original dataframe (df) into a transformed dataframe (dft) so that I can see the utilisation of each equipment over a 24-hour period (or even longer, up to 9 days) at 5-minute intervals. The dft can then be used for plotting, with the desc shown as a tooltip, and so on.
Of course, if you have any alternative, simpler solution than my outline below, that would be fine too.
Original dataframe (df)
Here is the dataframe (df) above, which you can copy-paste into jupyter to create it:
from io import StringIO
import pandas as pd
dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")
df = pd.read_csv(dfstr, sep=";")
df
I want to transform df into individual rows per equipment, say with start and end times spanning 2017-04-01 00:00 to 23:55, so that I know the equipment utilisation on a 5-minute grid, and can plot and resample it to, say, the max within each 1-hour window for summarising, etc.
Transformed dataframe (dft)
Here is an image of the resulting transformation, and a sample resulting dataframe (dft) below:
The columns of this dataframe come from the 'eqpt' of the original dataframe.
Just realised that the desc columns cannot live in the same dataframe dft if the use_counts are to stay aggregated as a single number. So please feel free to provide any alternative solution that achieves the same thing but keeps the columns as floats for the counts only, with the description text aggregated elsewhere; it can be merged or looked up later.
Here is the dataframe (dft) above:
dftstr = StringIO(u"""
datetime;Item;AT1;AT2;AT3;CBL1;AT_n
2017-04-01 10:30;use_count;;;;;
2017-04-01 10:35;use_count;2;;;;
2017-04-01 10:40;use_count;2;;;;
2017-04-01 10:45;use_count;2;;3;;
2017-04-01 10:50;use_count;2;;3;;
2017-04-01 10:55;use_count;2;;3;;
2017-04-01 11:00;use_count;6;5;3;;
2017-04-01 11:05;use_count;4;5;3;;
2017-04-01 11:10;use_count;4;5;3;4;
2017-04-01 11:15;use_count;4;5;3;4;
2017-04-01 11:20;use_count;4;5;3;4;
2017-04-01 11:25;use_count;4;5;3;4;
2017-04-01 11:30;use_count;;;3;4;
2017-04-01 11:35;use_count;;;3;4;
2017-04-01 11:40;use_count;;;3;;
2017-04-01 11:45;use_count;;;;;
2017-04-01 11:50;use_count;;;;;
2017-04-01 11:55;use_count;;;;;
2017-04-01 12:00;use_count;;;;;
2017-04-01 10:30;desc;;;;;
2017-04-01 10:35;desc;2: test_adf1;similar desc;;;
2017-04-01 10:40;desc;2: test_adf1;for;;;
2017-04-01 10:45;desc;2: test_adf1;the;;;
2017-04-01 10:50;desc;2: test_adf1;rest;;;
2017-04-01 10:55;desc;2: test_adf1;of;;;
2017-04-01 11:00;desc;"2: test_asdf1
4: test_asdf3";the;;;
2017-04-01 11:05;desc;4: test_asdf3;columns;;;
2017-04-01 11:10;desc;4: test_asdf3;;;;
2017-04-01 11:15;desc;4: test_asdf3;;;;
2017-04-01 11:20;desc;4: test_asdf3;;;;
2017-04-01 11:25;desc;4: test_asdf3;;;;
2017-04-01 11:30;desc;;;;;
2017-04-01 11:35;desc;;;;;
2017-04-01 11:40;desc;;;;;
2017-04-01 11:45;desc;;;;;
2017-04-01 11:50;desc;;;;;
2017-04-01 11:55;desc;;;;;
2017-04-01 12:00;desc;;;;;
;;and so on from 00:00 to 23:55;;;;
""")
dft = pd.read_csv(dftstr, sep=";")
dft
Several steps are needed here. I used your setup, but immediately converted the timestamps into pandas datetime objects via parse_dates:
from io import StringIO
import pandas as pd
dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")
df = pd.read_csv(dfstr, sep=";", parse_dates=["starttm", "endtm"])
print(df)
eqpt starttm endtm use_count desc
0 AT1 2017-04-01 10:35:00 2017-04-01 11:05:00 2 test asdf1
1 AT2 2017-04-01 11:00:00 2017-04-01 11:30:00 5 test asdf2
2 AT1 2017-04-01 11:00:00 2017-04-01 11:30:00 4 test asdf3
3 AT3 2017-04-01 10:45:00 2017-04-01 11:45:00 3 test asdf4
4 CBL1 2017-04-01 11:10:00 2017-04-01 11:40:00 4 test asdf5
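As a quick sanity check (a small illustrative snippet, not part of the original answer), the parsed columns are now real datetime64 values rather than plain strings, which is what the date-range expansion below relies on:

```python
# Re-read a single sample row with parse_dates and inspect the dtypes.
from io import StringIO
import pandas as pd

dfstr = StringIO("""eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
""")
df = pd.read_csv(dfstr, sep=";", parse_dates=["starttm", "endtm"])

# starttm and endtm should show as datetime64, not object.
print(df.dtypes)
```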
Now, here are 3 functions that do the work:
expand takes a single row of the input df and creates a dataframe whose DatetimeIndex ranges from starttm to endtm at 5-minute intervals. Additionally, the actual use_count and desc values are added.
summarize takes care of the overlap, combining the desc strings and summing the use_counts whenever an equipment is used multiple times at once. It has to type check because the input may be a pandas Series or a DataFrame. A Series is passed if only a single row exists for a piece of equipment; otherwise, a DataFrame is passed.
aggregate combines expand and summarize. First, all entries (rows) of a single equipment are expanded and concatenated. Then, the expanded columns are summarized.
That's it. Finally, you group the equipment with groupby and apply the aggregate function:
def expand(row):
    # One booking row -> a 5-minute grid from starttm to endtm (both inclusive)
    index = pd.date_range(row["starttm"], row["endtm"], freq="5min")
    use_count = row["use_count"]
    desc = "{}:{}".format(use_count, row["desc"])
    return pd.DataFrame(index=index).assign(use_count=use_count, desc=desc)

def summarize(index, use_count, desc):
    # Overlapping bookings: sum the counts and join the descriptions
    if isinstance(use_count, pd.DataFrame):
        use_count = use_count.sum(axis=1)
    if isinstance(desc, pd.DataFrame):
        desc = desc.apply(lambda x: ", ".join(x.dropna()), axis=1)
    return pd.DataFrame({"use_count": use_count, "desc": desc}, index=index)

def aggregate(sub_df):
    # Expand every row of one equipment, then summarize the overlaps
    dfs = pd.concat([expand(series) for idx, series in sub_df.iterrows()], axis=1)
    return summarize(dfs.index, dfs["use_count"], dfs["desc"])

transformed = df.groupby("eqpt").apply(aggregate).unstack("eqpt")
The resulting dataframe has multiindex columns to distinguish between desc and use_counts, allowing proper data types:
print(transformed["use_count"])
eqpt AT1 AT2 AT3 CBL1
2017-04-01 10:35:00 2.0 NaN NaN NaN
2017-04-01 10:40:00 2.0 NaN NaN NaN
2017-04-01 10:45:00 2.0 NaN 3.0 NaN
2017-04-01 10:50:00 2.0 NaN 3.0 NaN
2017-04-01 10:55:00 2.0 NaN 3.0 NaN
2017-04-01 11:00:00 6.0 5.0 3.0 NaN
2017-04-01 11:05:00 6.0 5.0 3.0 NaN
2017-04-01 11:10:00 4.0 5.0 3.0 4.0
2017-04-01 11:15:00 4.0 5.0 3.0 4.0
2017-04-01 11:20:00 4.0 5.0 3.0 4.0
2017-04-01 11:25:00 4.0 5.0 3.0 4.0
2017-04-01 11:30:00 4.0 5.0 3.0 4.0
2017-04-01 11:35:00 NaN NaN 3.0 4.0
2017-04-01 11:40:00 NaN NaN 3.0 4.0
2017-04-01 11:45:00 NaN NaN 3.0 NaN
print(transformed)
desc use_count
eqpt AT1 AT2 AT3 CBL1 AT1 AT2 AT3 CBL1
2017-04-01 10:35:00 2:test asdf1 None None None 2.0 NaN NaN NaN
2017-04-01 10:40:00 2:test asdf1 None None None 2.0 NaN NaN NaN
2017-04-01 10:45:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 10:50:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 10:55:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 11:00:00 2:test asdf1, 4:test asdf3 5:test asdf2 3:test asdf4 None 6.0 5.0 3.0 NaN
2017-04-01 11:05:00 2:test asdf1, 4:test asdf3 5:test asdf2 3:test asdf4 None 6.0 5.0 3.0 NaN
2017-04-01 11:10:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:15:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:20:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:25:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:30:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:35:00 None None 3:test asdf4 4:test asdf5 NaN NaN 3.0 4.0
2017-04-01 11:40:00 None None 3:test asdf4 4:test asdf5 NaN NaN 3.0 4.0
2017-04-01 11:45:00 None None 3:test asdf4 None NaN NaN 3.0 NaN
To span the datetime index over the whole day, you can use reindex:
transformed.reindex(pd.date_range("2017-04-01 00:00", "2017-04-01 23:55", freq="5min"))
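Building on that, here is a minimal self-contained sketch of the reindex-then-resample step the question asks about (the AT1 values are made up, standing in for one column of transformed["use_count"]):

```python
# Stand-in for transformed["use_count"]: one equipment on a 5-minute grid.
import pandas as pd

# AT1 in use (count 2) from 10:35 to 11:05, inclusive.
idx = pd.date_range("2017-04-01 10:35", "2017-04-01 11:05", freq="5min")
use_count = pd.DataFrame({"AT1": [2.0] * len(idx)}, index=idx)

# Span the whole day; slots outside the booking become NaN.
full_day = pd.date_range("2017-04-01 00:00", "2017-04-01 23:55", freq="5min")
use_count = use_count.reindex(full_day)

# Resample to the max within each 1-hour window for summarising/plotting.
hourly_max = use_count.resample("1h").max()
print(hourly_max.loc["2017-04-01 10:00", "AT1"])  # 2.0
```

Hours with no bookings stay NaN after the resample, so idle periods remain visibly empty in a plot.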