Finding the minimum and maximum value of a cluster for cyclic data
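How can I determine the minimum and maximum value of a cluster in cyclic data, here in the range 0 to 24, given that a cluster can extend across the edge of the value range?
Looking at the blue cluster, I would like to determine the values 22 and 2 as the boundaries of the cluster. Which algorithm can solve this problem?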
I found a solution to the problem.
Assume the data has the following format:
#!/usr/bin/env python3
import numpy as np
data = np.array([0, 1, 2, 12, 13, 14, 15, 21, 22, 23])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
bounds = get_cluster_bounds(data, labels)
print(bounds) # {0: array([21, 2]), 1: array([12, 15])}
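A cluster that wraps around the edge of the cycle is reported with a start larger than its end, as in `[21, 2]` above. As a minimal sketch of how such a pair could be read, the helpers below test whether a value falls inside the bounds and compute the length the cluster covers. The names `in_cyclic_interval` and `cyclic_span` and the `period` parameter are assumptions made for illustration and are not part of the original answer.

```python
def in_cyclic_interval(x: float, start: float, end: float) -> bool:
    """Return True if x lies in [start, end] on a cyclic axis."""
    if start <= end:
        # Ordinary interval that does not cross the edge of the cycle.
        return start <= x <= end
    # Wrapped interval: from start up to the edge, then from the edge to end.
    return x >= start or x <= end


def cyclic_span(start: float, end: float, period: float = 24) -> float:
    """Return the length covered when walking forward from start to end."""
    return (end - start) % period


print(in_cyclic_interval(23, 21, 2))  # True
print(in_cyclic_interval(12, 21, 2))  # False
print(cyclic_span(21, 2))             # 5
print(cyclic_span(12, 15))            # 3
```

The modulo in `cyclic_span` is what makes the wrapped case work: `(2 - 21) % 24` evaluates to 5 in Python, not -19.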
Here is the function:
#!/usr/bin/env python3
import numpy as np


def get_cluster_bounds(data: np.ndarray, labels: np.ndarray) -> dict:
    """
    There are five ways in which the points of the cluster can be cyclically
    considered. The points to be determined are marked with an arrow.

    In the first case, the cluster data is distributed beyond the edge of
    the cycle:
         ↓B           ↓A
    |#####____________#####|

    In the second case, the data lies exactly at the beginning of the value
    range, but without exceeding it.
     ↓A       ↓B
    |##########____________|

    In the third case, the data lies exactly at the end of the value
    range, but without exceeding it.
                 ↓A       ↓B
    |____________##########|

    In the fourth, the data lies within the value range
    without touching a border.
            ↓A       ↓B
    |_______##########_____|

    In the fifth and simplest case, the data lies in the entire area without
    another label existing.
     ↓A                   ↓B
    |######################|

    Args:
        data: (n,) numpy array containing all data points.
        labels: (n,) numpy array containing all data labels.

    Returns:
        bounds: A dictionary whose key is the label of the cluster and
                whose value specifies the start and end point of the
                cluster.
    """
    # Sort the data in ascending order and reorder the labels to match.
    shuffle = data.argsort()
    data = data[shuffle]
    labels = labels[shuffle]

    # Get the unique cluster labels.
    labels_unique = np.unique(labels)

    bounds = {}
    # Iterate over the labels themselves, so labels that do not run from
    # 0 to k - 1 are handled as well.
    for c_index in labels_unique.tolist():
        mask = labels == c_index

        # Case 1 or 5: the cluster touches both ends of the sorted data.
        if mask[0] and mask[-1]:
            # Case 5: every point belongs to this cluster.
            if np.all(mask):
                start = data[0]
                end = data[-1]
            # Case 1: the cluster wraps around the edge of the cycle.
            else:
                edges = np.where(np.invert(mask))[0]
                start = data[edges[-1] + 1]
                end = data[edges[0] - 1]

        # Case 2: the cluster starts at the beginning of the value range.
        elif mask[0] and not mask[-1]:
            edges = np.where(np.invert(mask))[0]
            start = data[0]
            end = data[edges[0] - 1]

        # Case 3: the cluster ends at the end of the value range.
        elif not mask[0] and mask[-1]:
            edges = np.where(np.invert(mask))[0]
            start = data[edges[-1] + 1]
            end = data[-1]

        # Case 4: the cluster touches neither end.
        elif not mask[0] and not mask[-1]:
            edges = np.where(mask)[0]
            start = data[edges[0]]
            end = data[edges[-1]]

        else:
            raise ValueError('This should not happen.')

        bounds[c_index] = np.array([start, end])

    return bounds
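The function above derives the bounds of a cluster from where the other labels sit in the sorted data. A different technique, not the one used in this answer, is to look at each cluster on its own and cut the cycle at the cluster's largest empty gap. The sketch below assumes that the period (24 here) is known and that all values lie in `[0, period)`. The name `get_cluster_bounds_by_gap` and the `period` parameter are hypothetical.

```python
import numpy as np


def get_cluster_bounds_by_gap(data, labels, period=24.0):
    """Bounds per cluster, found by cutting the cycle at the largest gap."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    bounds = {}
    for label in np.unique(labels).tolist():
        values = np.sort(data[labels == label])
        # Gap from each value to the next one, going forward around the
        # cycle. The last entry is the wrap-around gap from the largest
        # value back to the smallest one.
        gaps = (np.roll(values, -1) - values) % period
        # Prefer the wrap-around gap on ties, so a cluster that fills the
        # whole cycle evenly is reported as [min, max].
        if gaps[-1] >= gaps.max():
            cut = values.size - 1
        else:
            cut = int(np.argmax(gaps))
        # The cluster ends just before the largest gap and starts after it.
        end = values[cut]
        start = values[(cut + 1) % values.size]
        bounds[label] = np.array([start, end])
    return bounds


data = np.array([0, 1, 2, 12, 13, 14, 15, 21, 22, 23])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
# Cluster 0 is bounded by 21 and 2, cluster 1 by 12 and 15.
print(get_cluster_bounds_by_gap(data, labels))
```

Because each cluster is treated independently, this variant does not depend on how the other clusters are arranged, but it does need the period as an explicit input.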