python pandas: transform start and end datetime range (stored as 2 columns) to individual rows (eqpt utilisation)

Hi, I have a dataset like the df below. I am providing an image and a sample dataframe separately.

I would like to transform the original dataframe (df) into a transformed dataframe (dft) so that I can see the utilisation of each equipment over 24 hours (or even longer, up to 9 days) at 5-minute intervals. The dft can then be used for plotting, with the descriptions as tooltips, etc.

Of course, any alternative, simpler solution than what I outline below would also be welcome.

Original dataframe (df)

Here is the dataframe (df) above, which you can copy-paste into jupyter to create it:

from io import StringIO
import pandas as pd

dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")
df = pd.read_csv(dfstr, sep=";")
df

I would like to transform df into individual rows per equipment, say with start and end times spanning 2017-04-01 00:00 to 23:55, so that I know the equipment utilisation on a 5-minute grid and can plot it, and also resample to, say, the max within each 1 hour for summaries, etc.

Transformed dataframe (dft)

Here is an image of the resulting transformation, and the sample resulting dataframe (dft) is below:

The columns of this dataframe come from 'eqpt' in the original dataframe.

Just realised that the desc columns cannot live in the same dataframe dft if the use_counts are to stay aggregated as plain numbers. So please feel free to offer any alternative solution that achieves the same thing but keeps the columns float-only for the counts, with the desc text aggregated elsewhere; the two can be merged or looked up later.

Here is the dataframe (dft) above:

dftstr = StringIO(u"""
datetime;Item;AT1;AT2;AT3;CBL1;AT_n
2017-04-01 10:30;use_count;;;;;
2017-04-01 10:35;use_count;2;;;;
2017-04-01 10:40;use_count;2;;;;
2017-04-01 10:45;use_count;2;;3;;
2017-04-01 10:50;use_count;2;;3;;
2017-04-01 10:55;use_count;2;;3;;
2017-04-01 11:00;use_count;6;5;3;;
2017-04-01 11:05;use_count;4;5;3;;
2017-04-01 11:10;use_count;4;5;3;4;
2017-04-01 11:15;use_count;4;5;3;4;
2017-04-01 11:20;use_count;4;5;3;4;
2017-04-01 11:25;use_count;4;5;3;4;
2017-04-01 11:30;use_count;;;3;4;
2017-04-01 11:35;use_count;;;3;4;
2017-04-01 11:40;use_count;;;3;;
2017-04-01 11:45;use_count;;;;;
2017-04-01 11:50;use_count;;;;;
2017-04-01 11:55;use_count;;;;;
2017-04-01 12:00;use_count;;;;;
2017-04-01 10:30;desc;;;;;
2017-04-01 10:35;desc;2: test_asdf1;similar desc;;;
2017-04-01 10:40;desc;2: test_asdf1;for;;;
2017-04-01 10:45;desc;2: test_asdf1;the;;;
2017-04-01 10:50;desc;2: test_asdf1;rest;;;
2017-04-01 10:55;desc;2: test_asdf1;of;;;
2017-04-01 11:00;desc;"2: test_asdf1
4: test_asdf3";the;;;
2017-04-01 11:05;desc;4: test_asdf3;columns;;;
2017-04-01 11:10;desc;4: test_asdf3;;;;
2017-04-01 11:15;desc;4: test_asdf3;;;;
2017-04-01 11:20;desc;4: test_asdf3;;;;
2017-04-01 11:25;desc;4: test_asdf3;;;;
2017-04-01 11:30;desc;;;;;
2017-04-01 11:35;desc;;;;;
2017-04-01 11:40;desc;;;;;
2017-04-01 11:45;desc;;;;;
2017-04-01 11:50;desc;;;;;
2017-04-01 11:55;desc;;;;;
2017-04-01 12:00;desc;;;;;
;;and so on from 00:00 to 23:55;;;;
""")
dft = pd.read_csv(dftstr, sep=";")
dft

Several steps are necessary here. I used your setup, but immediately converted the timestamps into pandas datetime objects via parse_dates:
from io import StringIO
import pandas as pd

dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")

df = pd.read_csv(dfstr, sep=";", parse_dates=["starttm", "endtm"])
print(df)

   eqpt             starttm               endtm  use_count        desc
0   AT1 2017-04-01 10:35:00 2017-04-01 11:05:00          2  test asdf1
1   AT2 2017-04-01 11:00:00 2017-04-01 11:30:00          5  test asdf2
2   AT1 2017-04-01 11:00:00 2017-04-01 11:30:00          4  test asdf3
3   AT3 2017-04-01 10:45:00 2017-04-01 11:45:00          3  test asdf4
4  CBL1 2017-04-01 11:10:00 2017-04-01 11:40:00          4  test asdf5
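As a quick sanity check, starttm and endtm should now have dtype datetime64[ns]:

# Both timestamp columns should report datetime64[ns] after parse_dates
print(df.dtypes)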

Now, here are 3 functions which do the job:

  • expand takes a single row of the input df and creates a dataframe whose DatetimeIndex ranges from starttm to endtm at 5-minute intervals. In addition, the actual use_count and desc values are added.
  • summarize handles overlaps by combining the desc strings and summing the use_counts whenever equipment is used multiple times simultaneously. It has to type-check because its input may be either a pandas Series or a DataFrame. A Series is passed if only a single row exists for an equipment; otherwise, a DataFrame is passed.
  • aggregate combines expand and summarize. First, all entries (rows) of a single equipment are expanded and concatenated. Then, the expanded columns are summarized.

That's it. Finally, you group on the equipment via groupby and apply the aggregate function:

def expand(row):
    # Build a 5-minute grid spanning the usage interval (endpoints included)
    index = pd.date_range(row["starttm"], row["endtm"], freq="5min")
    use_count = row["use_count"]
    desc = "{}:{}".format(use_count, row["desc"])

    # Broadcast the row's use_count and desc over the whole grid
    return pd.DataFrame(index=index).assign(use_count=use_count, desc=desc)


def summarize(index, use_count, desc):
    # A DataFrame means overlapping usages: sum the counts across them ...
    if isinstance(use_count, pd.DataFrame):
        use_count = use_count.sum(axis=1)

    # ... and join their descriptions row-wise, skipping missing values
    if isinstance(desc, pd.DataFrame):
        desc = desc.apply(lambda x: ", ".join(x.dropna()), axis=1)

    return pd.DataFrame({"use_count": use_count, "desc": desc}, index=index)


def aggregate(sub_df):
    # Expand every usage row of one equipment and align the results side by side
    dfs = pd.concat([expand(series) for idx, series in sub_df.iterrows()], axis=1)
    return summarize(dfs.index, dfs["use_count"], dfs["desc"])


transformed = df.groupby("eqpt").apply(aggregate).unstack("eqpt")
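To see the mechanics in isolation, here is a small illustrative check (not required for the pipeline) that expands the first row on its own:

# Illustrative only: the first row (AT1, 10:35-11:05) becomes a 5-minute grid,
# with every grid point carrying use_count=2 and desc "2:test asdf1"
print(expand(df.loc[0]))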

The resulting dataframe has multiindex columns to distinguish desc and use_counts, which allows proper data types:

print(transformed["use_count"])

eqpt                 AT1  AT2  AT3  CBL1
2017-04-01 10:35:00  2.0  NaN  NaN   NaN
2017-04-01 10:40:00  2.0  NaN  NaN   NaN
2017-04-01 10:45:00  2.0  NaN  3.0   NaN
2017-04-01 10:50:00  2.0  NaN  3.0   NaN
2017-04-01 10:55:00  2.0  NaN  3.0   NaN
2017-04-01 11:00:00  6.0  5.0  3.0   NaN
2017-04-01 11:05:00  6.0  5.0  3.0   NaN
2017-04-01 11:10:00  4.0  5.0  3.0   4.0
2017-04-01 11:15:00  4.0  5.0  3.0   4.0
2017-04-01 11:20:00  4.0  5.0  3.0   4.0
2017-04-01 11:25:00  4.0  5.0  3.0   4.0
2017-04-01 11:30:00  4.0  5.0  3.0   4.0
2017-04-01 11:35:00  NaN  NaN  3.0   4.0
2017-04-01 11:40:00  NaN  NaN  3.0   4.0
2017-04-01 11:45:00  NaN  NaN  3.0   NaN


print(transformed)

                        desc                                                                        use_count
eqpt                    AT1                         AT2             AT3             CBL1            AT1     AT2     AT3     CBL1
2017-04-01 10:35:00     2:test asdf1                None            None            None            2.0     NaN     NaN     NaN
2017-04-01 10:40:00     2:test asdf1                None            None            None            2.0     NaN     NaN     NaN
2017-04-01 10:45:00     2:test asdf1                None            3:test asdf4    None            2.0     NaN     3.0     NaN
2017-04-01 10:50:00     2:test asdf1                None            3:test asdf4    None            2.0     NaN     3.0     NaN
2017-04-01 10:55:00     2:test asdf1                None            3:test asdf4    None            2.0     NaN     3.0     NaN
2017-04-01 11:00:00     2:test asdf1, 4:test asdf3  5:test asdf2    3:test asdf4    None            6.0     5.0     3.0     NaN
2017-04-01 11:05:00     2:test asdf1, 4:test asdf3  5:test asdf2    3:test asdf4    None            6.0     5.0     3.0     NaN
2017-04-01 11:10:00     4:test asdf3                5:test asdf2    3:test asdf4    4:test asdf5    4.0     5.0     3.0     4.0
2017-04-01 11:15:00     4:test asdf3                5:test asdf2    3:test asdf4    4:test asdf5    4.0     5.0     3.0     4.0
2017-04-01 11:20:00     4:test asdf3                5:test asdf2    3:test asdf4    4:test asdf5    4.0     5.0     3.0     4.0
2017-04-01 11:25:00     4:test asdf3                5:test asdf2    3:test asdf4    4:test asdf5    4.0     5.0     3.0     4.0
2017-04-01 11:30:00     4:test asdf3                5:test asdf2    3:test asdf4    4:test asdf5    4.0     5.0     3.0     4.0
2017-04-01 11:35:00     None                        None            3:test asdf4    4:test asdf5    NaN     NaN     3.0     4.0
2017-04-01 11:40:00     None                        None            3:test asdf4    4:test asdf5    NaN     NaN     3.0     4.0
2017-04-01 11:45:00     None                        None            3:test asdf4    None            NaN     NaN     3.0     NaN

To span the datetime index across an entire day, you can use reindex:

transformed.reindex(pd.date_range("2017-04-01 00:00", "2017-04-01 23:55", freq="5min"))
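
And since the question also asks about resampling to, say, the max within each hour for summaries, here is a minimal sketch building on the reindexed result (the variable names and the hourly frequency are just illustrative choices):

# Fill the full day first, then take the per-hour maximum use_count per eqpt
full_day = transformed.reindex(
    pd.date_range("2017-04-01 00:00", "2017-04-01 23:55", freq="5min"))
hourly_max = full_day["use_count"].resample("1h").max()
print(hourly_max)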