Mapping the clusters of two files based on two columns

I have two files. The first file has three columns: Site_ID, Time, and ClusterNo.

The second file has four columns: SiteA_ID, SiteB_ID, Time, and ClusterNo.

file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

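We need to find which clusters of file 1 and file 2 map to each other, where a mapping requires two things: the Site_ID in file 1 matches a site in file 2 (either A or B), and the times in file 1 and file 2 differ by no more than 2 units.

The required output is a file with three columns: ClusterNoOfFile1, ClusterNoOfFile2, and CommonSite.

[Note: CommonSite is the site shared by the mapped clusters of file1 and file2.]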

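Here is one way to get something along the lines of what you are after (I'm not entirely sure what output your input is supposed to produce). You can adapt it to your specific needs.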

library(dplyr)
library(tidyr)

# Generate the data (your code)
file1 <- data.frame("Site_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))
file2 <- data.frame("SiteA_ID" = sample(74000:74500, 1000, replace = TRUE), "SiteB_ID" = sample(74000:74500, 1000, replace = TRUE), "Time" = runif(1000) * 100, "ClusterNo." = sample(1:500, 1000, replace = TRUE))

# Convert file2 to long format so there is only one site id
file2Long <- gather(file2, Site_Type, Site_ID, -Time, -ClusterNo.)

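# Inner join with file1 so you retain all rows with matching site id.
# Site ids repeat in both tables, so this is a many-to-many join;
# dplyr >= 1.1.0 emits a warning about that, which is expected here.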
file12 <- inner_join(file1, file2Long, by = 'Site_ID')

# Compute time difference and store whether it is within range
file12$TimeDiff2 <- abs(file12$Time.x - file12$Time.y) <= 2

# Filter the ones that meet the threshold criteria of 2, and retain only
# columns of interest.
file12Diff2 <- filter(file12, TimeDiff2 == TRUE)
file12Diff2 <- select(file12Diff2, ClusterNo..x, ClusterNo..y, Site_ID)

The output will look like this (.x denotes file 1 and .y denotes file 2; you can rename these columns to whatever you need):

  ClusterNo..x ClusterNo..y Site_ID
1          400           96   74308
2          298          438   74027
3          397          137   74265
4          420          286   74395
5          280           77   74097
6          176          333   74303
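
Because the sample data is random, the rows above will differ from run to run. To check the logic on something predictable, here is a minimal sketch on a tiny hand-made input that also renames the columns to the ones requested in the question. The input values are made up for illustration, the `_file1`/`_file2` suffixes and the names `file2_long`, `joined`, and `result` are my own choices, and `pivot_longer()` is swapped in for `gather()` (it is the newer tidyr equivalent; `gather()` works as well).

```r
library(dplyr)
library(tidyr)

# Tiny hand-made inputs (hypothetical values, chosen so the result is easy to verify)
file1 <- data.frame("Site_ID" = c(1, 2, 3),
                    "Time" = c(10, 20, 30),
                    "ClusterNo." = c(101, 102, 103))
file2 <- data.frame("SiteA_ID" = c(1, 9, 3),
                    "SiteB_ID" = c(8, 2, 7),
                    "Time" = c(11, 25, 30.5),
                    "ClusterNo." = c(201, 202, 203))

# Stack SiteA_ID and SiteB_ID into a single Site_ID column
file2_long <- pivot_longer(file2,
                           cols = c(SiteA_ID, SiteB_ID),
                           names_to = "Site_Type",
                           values_to = "Site_ID")

# Join on the site id; the suffixes label which file each column came from
joined <- inner_join(file1, file2_long,
                     by = "Site_ID",
                     suffix = c("_file1", "_file2"))

# Keep pairs whose times differ by at most 2 units and rename
# the columns to the ones requested in the question
result <- joined %>%
  filter(abs(Time_file1 - Time_file2) <= 2) %>%
  transmute(ClusterNoOfFile1 = ClusterNo._file1,
            ClusterNoOfFile2 = ClusterNo._file2,
            CommonSite = Site_ID) %>%
  distinct()

print(result)
```

Here site 1 (times 10 and 11) and site 3 (times 30 and 30.5) pass the threshold, while site 2 (times 20 and 25) is dropped, so `result` has two rows: clusters 101/201 at site 1 and clusters 103/203 at site 3. The final `distinct()` is optional; it only collapses repeated cluster pairs that share the same site.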