Transitive relations in R: finding all linked records of a value

I have a data frame that shows linked records:

df <- data.frame(case = c(1,2,3,4,5,6), linked_to = c("2,4", 3,NA,NA,6,NA), stringsAsFactors = F)

# case linked_to
# 1          2,4
# 2            3
# 3         <NA>
# 4         <NA>
# 5            6
# 6         <NA>

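In the example, case 1 is linked to cases 2 and 4. Because case 2 is also linked to case 3, case 1 is transitively linked to cases 2, 3, and 4. I want to create a new column that lists all linked cases: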

# case linked_to all_linked
# 1          2,4    1,2,3,4
# 2            3    1,2,3,4
# 3         <NA>    1,2,3,4
# 4         <NA>    1,2,3,4
# 5            6        5,6
# 6         <NA>        5,6

I can do this with the decompose.graph function from igraph, which splits the graph into its connected components, but the solution seems rather convoluted:

library(igraph)

# Transform to igraph format    

to <- sapply(df$linked_to, function(x) unlist(strsplit(x,",")) )

from <- rep(rownames(df), sapply(to, length) )

to <- unlist(to)

from <- from[!is.na(to)]
to <- to[!is.na(to)]

d <- data.frame(from,to)

gr <- graph.data.frame(d)

# Split into components
grs <- decompose.graph(gr)

comp <- sapply(grs, function(x) V(x)$name)

matches <-  sapply(df$case, function(case) {
  sapply(comp, function(comp) {
    case %in% comp
  })
})

matches <- as.data.frame(matches)

ind <- sapply(matches, which)

# Assign all members of the component they belong to to each vertex
df$all_linked <- sapply(ind, function(x) {
  paste(comp[[x]], collapse = ",")
})

Is there a simpler, more efficient solution?

Yes, although it still relies on network analysis tools.

This version is somewhat more efficient and uses the kpath.census function from the sna package. (Alternatively, you could use distances in igraph to the same effect.)

library(sna)
df <- data.frame(case = c(1,2,3,4,5,6), 
                 linked_to = c("2,4", 3,NA,NA,6,NA), 
                 stringsAsFactors = F)

# Edge list with one row per link, expanded by hand from df$linked_to
net <- data.frame(case = c(1,1,2,3,4,5,6), 
                 linked_to = c(2, 4, 3,NA,NA,6,NA), 
                 stringsAsFactors = F)

g <- network(net[complete.cases(net),], directed = FALSE)

comemb <- kpath.census(g, maxlen = 10, mode = "digraph",  tabulate.by.vertex = TRUE, 
                       path.comembership = "sum")$path.comemb

comemb_names <- sapply(1:ncol(comemb), 
                       function(x) ifelse(comemb[x,] > 0 , 
                                          colnames(comemb)[x], 0))

comemb_names <- lapply(1:nrow(comemb_names), function(x) comemb_names[x,][comemb_names[x,] != "0"])

df$all_linked <- sapply(comemb_names, function(x) paste(x,collapse = ","))

Result:

> df
  case linked_to all_linked
1    1       2,4    1,2,3,4
2    2         3    1,2,3,4
3    3      <NA>    1,2,3,4
4    4      <NA>    1,2,3,4
5    5         6        5,6
6    6      <NA>        5,6
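
For comparison, here is a minimal sketch of the igraph alternative mentioned above, using distances. It is not part of the original answer. It assumes the links should be treated as undirected and that every linked_to value also appears in the case column. The names targets, edges and reach are illustrative only.

library(igraph)

df <- data.frame(case = c(1, 2, 3, 4, 5, 6),
                 linked_to = c("2,4", "3", NA, NA, "6", NA),
                 stringsAsFactors = FALSE)

# One row per link: split the comma-separated targets and repeat the source case
targets <- strsplit(df$linked_to, ",")
edges <- data.frame(from = rep(as.character(df$case), lengths(targets)),
                    to = unlist(targets),
                    stringsAsFactors = FALSE)
edges <- edges[!is.na(edges$to), ]

# Passing all cases as vertices keeps unlinked cases in the graph
g <- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = as.character(df$case)))

# A finite shortest-path distance means two cases are transitively linked;
# the diagonal is 0, so each case is listed with itself
reach <- is.finite(distances(g))

df$all_linked <- unname(apply(reach, 1, function(r) {
  paste(colnames(reach)[r], collapse = ",")
}))

For large graphs, grouping on components(g)$membership would avoid building the full distance matrix.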