Grouping Data by Search Algorithms
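I have a sample dataset in Python where each piece of data has 3 values:
[ string date, integer 24-hour time (first two digits = hour, last two digits = minutes), integer duration (always 15 minutes) ]
My goal is to group the pieces of data that share the same date and have adjacent 24-hour times. Adjacent 24-hour time values are 15 minutes apart. Grouping pieces with adjacent times increases the resulting duration by 15 minutes for every interval that is merged. I have provided the list final_dataset below to better show what the final dataset should look like.
I tested some code that searches initial_dataset linearly. Here is the rough pseudocode: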
# -- Start at first data piece (call this previous)
# -- Check next data piece (call this current)
# -- Subtract 24hr time values for current and previous
# -- If difference is 15, append to a separate list the combined data piece
# Check next data piece (call this next)
# Subtract 24hr time values for next and current
# Repeat
# Check next data piece (call this next next)
# Repeat this linear iteration until the difference > 15
# Store last position of no adjacency
# -- Continue at the last position of no adjacency and repeat this entire process until end of initial_dataset is reached
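One caveat about the subtraction step in the pseudocode: the raw 24-hour integers only differ by 15 when both times fall within the same hour. A minimal sketch of a conversion to minutes since midnight, which makes the difference reliable across hour boundaries, follows. The helper names hhmm_to_minutes and gap_in_minutes are hypothetical and not part of the original code.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert an integer 24-hour time such as 1045 to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


def gap_in_minutes(earlier: int, later: int) -> int:
    """Elapsed minutes between two HHMM values on the same day (hypothetical helper)."""
    return hhmm_to_minutes(later) - hhmm_to_minutes(earlier)


print(1100 - 1045)                 # raw subtraction gives 55
print(gap_in_minutes(1045, 1100))  # converted values give 15
```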
Is there a more efficient way to achieve this, through a data structure or a search algorithm?
# -- Example Dataset
initial_dataset = [ ['July 26, 2021', 1000, 15],
                    ['July 26, 2021', 1015, 15],
                    ['July 26, 2021', 1030, 15],
                    ['July 26, 2021', 1045, 15],
                    ['July 26, 2021', 1500, 15],
                    ['July 27, 2021', 1400, 15], ]
final_dataset = [ ['July 26, 2021', 1000, 60],
                  ['July 26, 2021', 1500, 15],
                  ['July 27, 2021', 1400, 15] ]
By using collections.defaultdict, the grouping needs only a single pass over your data:
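import collections

data = [['July 26, 2021', 1000, 15], ['July 26, 2021', 1015, 15], ['July 26, 2021', 1030, 15], ['July 26, 2021', 1045, 15], ['July 26, 2021', 1500, 15], ['July 27, 2021', 1400, 15]]

# Map each date to a dict of {hour: total duration in minutes}.
d = collections.defaultdict(dict)
for a, b, c in data:
    # Bucket by hour, e.g. 1015 // 100 == 10 (the := operator needs Python 3.8+).
    if (v := b // 100) in d[a]:
        d[a][v] += c
    else:
        d[a][v] = c

# Rebuild rows as [date, hour * 100, summed duration]. Because the key is the
# hour, slots adjacent across an hour boundary (1045, 1100) stay separate.
result = [[a, j*100, k] for a, b in d.items() for j, k in b.items()]
print(result)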
Output:
[['July 26, 2021', 1000, 60], ['July 26, 2021', 1500, 15], ['July 27, 2021', 1400, 15]]
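The snippet above keys on the hour (int(b/100)), so it matches the sample data but does not test adjacency itself. As an alternative, here is a minimal sketch of a single-pass merge that extends a group only when the next slot starts exactly where the previous group ends. It assumes the rows are already sorted by date and time, as in the example. The function names to_minutes and merge_adjacent are hypothetical and do not come from the original answer.

```python
def to_minutes(hhmm):
    """Convert an integer 24-hour time such as 1045 to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


def merge_adjacent(rows):
    """Merge [date, hhmm, duration] rows whose time slots touch on the same date.

    Assumption: rows are sorted by date, then by time.
    """
    merged = []
    for date, start, duration in rows:
        if merged:
            last_date, last_start, last_duration = merged[-1]
            # Adjacent means: same date, and this slot starts when the last group ends.
            if last_date == date and to_minutes(last_start) + last_duration == to_minutes(start):
                merged[-1][2] += duration
                continue
        # Start a new group (a fresh list, so the input rows are not modified).
        merged.append([date, start, duration])
    return merged


data = [['July 26, 2021', 1000, 15], ['July 26, 2021', 1015, 15],
        ['July 26, 2021', 1030, 15], ['July 26, 2021', 1045, 15],
        ['July 26, 2021', 1500, 15], ['July 27, 2021', 1400, 15]]
print(merge_adjacent(data))
# [['July 26, 2021', 1000, 60], ['July 26, 2021', 1500, 15], ['July 27, 2021', 1400, 15]]
```

This is still one linear pass, and each merged row keeps the start time of its first slot rather than the top of the hour. If the input is not guaranteed to be ordered, it would need to be sorted first, which requires parsing the date strings.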