根据均值差阈值对排序数组进行分组

Group sorted array according a mean difference threshold

问题陈述:我有一个 N 排序整数数组和一个阈值 K。我想以这样的方式对它们进行分组,即对于每个元素,组均值与元素之间的差异为 <= K。最好使用什么算法?

我研究了 Jenks 的自然中断和 k 均值聚类,但这两种方法似乎更适合您拥有所需数量的聚类的情况,而我有每个聚类所需的最大方差。

// example
const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167]
const threshold = 10

// desired output:
// cluster(distances) =>
// [
//   [8, 8, 9, 5, 16, 20]
//   [29, 42]
//   [56, 57, 57, 58]
//   [103, 104]
//   [150, 167]
// ]

这是我目前的进度:https://gist.github.com/qrohlf/785c667735171b7353702cc74c10857d

我可能会尝试某种分而治之的方法来纠正 'ballpark' 我从目前的实施中获得的结果,但我并没有真正看到一个好的、干净的立即执行此操作的方法。

我搜索并找到了这个:带算术平均值的未加权对组法。 这是一篇带有示例的文章:link。我认为它会对你有所帮助,看起来很容易确认你的目的。

UPGMA 算法生成有根树状图并需要一个恒定速率假设 - 也就是说,它假设一个超度量树,其中从根到每个分支尖端的距离相等.

对于遇到此问题的任何其他人,这是我对上述 UPGMA 算法的(未优化的)实现:

const head = array => array[0]
const tail = array => array.slice(1)
const last = array => array[array.length - 1]
const sum = array => array.reduce((a, b) => a + b)
const avg = array => sum(array) / array.length
const minIndex = array => array.reduce((iMin, x, i) => x < array[iMin] ? i : iMin, 0)
const range = length => Array.apply(null, Array(length)).map((_, i) => i)
const isArray = Array.isArray

const distances = [5, 8, 8, 9, 16, 20, 29, 42, 56, 57, 57, 58, 103, 104, 150, 167, 800]

// cluster an array of numeric values such that the mean difference of each
// point within each cluster is within a threshold value
const cluster = (points, threshold = 10) => {
  return _cluster(points, range(points.length).map(i => [i]), threshold).map(c =>
    isArray(c) ? c.map(i => points[i]) : [points[c]])
}

// recursive call
const _cluster = (points, clusters, threshold) => {
  const matrix = getDistanceMatrix(points, clusters)
  // get the minimum col index for each row in the matrix
  const rowMinimums = matrix.map(minIndex)
  // get the index for the column containing the smallest distance
  const bestRow = minIndex(rowMinimums.map((col, row) => matrix[row][col]))
  const bestCol = rowMinimums[bestRow]
  const isValid = isValidCluster(points, mergeClusters(clusters[bestRow], clusters[bestCol]), threshold)

  if (!isValid) {
    return clusters
  }

  return _cluster(points, merge(clusters, bestRow, bestCol), threshold)
}

const isValidCluster = (points, cluster, threshold) => {
  // at this point, cluster is guaranteed to be an array, not a single point
  const distances = cluster.map(i => points[i])
  const mean = avg(distances)
  return distances.every(d => Math.abs(mean - d) <= threshold)
}

// immutable merge of indices a and b in clusters
const merge = (clusters, a, b) => {
  // merge two clusters by index
  const clusterA = clusters[a]
  const clusterB = clusters[b]
  // optimization opportunity: this filter is causing *another* iteration
  // of clusters.
  const withoutPoints = clusters.filter(c => c !== clusterA && c !== clusterB)

  return [mergeClusters(clusterA, clusterB)].concat(withoutPoints)
}

const mergeClusters = (clusterA, clusterB) => clusterA.concat(clusterB)

// optimization opportunity: this currently does 2x the work needed, since the
// distance from a->b is the same as the distance from b->a
const getDistanceMatrix = (points, clusters) => {
  // reduce clusters to distance/average distance
  const reduced = clusters.map(c => Array.isArray(c) ? avg(c.map(i => points[i])) : points[c])
  return reduced.map((i, row) => reduced.map((j, col) => (row === col) ? Infinity : Math.abs(j - i)))
}

const log2DArray = rows => console.log('[\n' + rows.map(row => '  [' + row.join(', ') + ']').join('\n') + '\n]')

console.log('clustered points:')
log2DArray(cluster(distances))