Best approach to splitting up clusters of data

Question

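I'm working on a way to split up the data in a CSV file based on timestamps.

For example, for a given object ID, check the date of each entry and see whether it falls within a given allowed range. So, if a set of rows in the table is: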

OBJECT ID   -   Info    -   Date
obj1           xyz         1/1/12
obj1           xyw         1/2/12
obj1           cya         1/3/12
obj1           abc         2/1/12
...

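In this example, the fourth entry falls well outside the time range of the other entries. The behaviour I want is for the script to assign that entry to a new object, say 'obj2', so that it ends up separated from the rest in its own cluster. Note that the data sets this will be applied to are fairly large, at least several thousand rows, so I don't know whether a manual algorithm would be fast enough.

I'm currently using R and trying to do this with the PAM and PAMK functions from the fpc package. This gives me a cluster plot (I think), but I don't know how to apply that information to the actual data.

Any ideas or suggestions on the best way to do this?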

Answer 1

I found a solution with the following steps:

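library(fpc)  # provides pamk()

# Parse the date strings into POSIXct timestamps (seconds since the epoch);
# "date_format_here" is a placeholder for the actual format string
data$time <- as.POSIXct(data$date, format = "date_format_here")

# Split the data using the object ID as the grouping factor
splitData <- split(data, f = data$id)

# Iterate over the groups, clustering each object's timestamps and
# appending the cluster number to the object ID using paste
for (i in seq_along(splitData)) {
    pamk.result <- pamk(as.numeric(splitData[[i]]$time))
    splitData[[i]]$id <- paste(splitData[[i]]$id,
                               pamk.result$pamobject$clustering,
                               sep = "delimiter_here")
}

# Recombine the groups into a single data frame
newData <- do.call(rbind, splitData)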

Anyway, that's a rough outline of how I handled the problem. Maybe it will give others some ideas.
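
For comparison, here is a minimal base R sketch of a different technique, assuming that "outside the allowed range" can be expressed as a maximum gap between consecutive dates of the same object. It is not the pamk approach above. The helper name split_by_gap, the 7-day threshold, the new_id column and the "_" separator are all assumptions made for illustration, and the sample dates are read as month/day/year.

```r
# Hypothetical helper: start a new sub-object whenever the gap between
# consecutive dates of the same object exceeds max_gap_days (assumed threshold).
split_by_gap <- function(df, max_gap_days = 7, date_format = "%m/%d/%y") {
  df$date <- as.Date(df$date, format = date_format)
  df <- df[order(df$id, df$date), ]  # sort rows by object, then by date
  # Within each object, number the runs of rows separated by large gaps
  grp <- ave(as.numeric(df$date), df$id, FUN = function(d) {
    cumsum(c(0, diff(d) > max_gap_days)) + 1
  })
  df$new_id <- paste(df$id, grp, sep = "_")
  df
}

# Sample rows from the question
data <- data.frame(
  id   = c("obj1", "obj1", "obj1", "obj1"),
  info = c("xyz", "xyw", "cya", "abc"),
  date = c("1/1/12", "1/2/12", "1/3/12", "2/1/12"),
  stringsAsFactors = FALSE
)

result <- split_by_gap(data)
print(result$new_id)
```

On the sample rows, the first three entries get the ID obj1_1 and the fourth gets obj1_2. Because the work is done with vectorized diff and cumsum calls rather than a row-by-row loop, a few thousand rows should not be a problem, though the right threshold depends entirely on the data.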

csv

r

cluster-computing