Identifying relative size of overlapping groups based on information in 2 vectors

Question

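I'm working with very messy family data, in that kids can potentially be grouped into multiple families. The data is structured as follows: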

famid <- c("A","A","B","C","C","D","D")
kidid <- c("1","2","1","3","4","4","5")
df <- as.data.frame(cbind(famid, kidid))

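I'd like to identify which families I can drop, based on the criterion that all of the kids in that family are also grouped together in another, larger family.

For example, family A contains Kid 1 and Kid 2. Family B contains Kid 1. Because family B is entirely contained within family A, I want to drop family B.

Alternatively, family C contains Kid 3 and Kid 4. Family D contains Kid 4 and Kid 5. Neither family is entirely contained within the other, so I don't want to drop either of them for now.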
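As a minimal sketch of the criterion in these two examples, assuming the kids are first split into one vector per family: the helper name `is_contained` is made up for illustration and is not part of the original post.

```r
# The question's example data, with character columns
df <- data.frame(famid = c("A", "A", "B", "C", "C", "D", "D"),
                 kidid = c("1", "2", "1", "3", "4", "4", "5"),
                 stringsAsFactors = FALSE)

# One character vector of kid ids per family
kids <- split(df$kidid, df$famid)

# Hypothetical helper: TRUE when every kid of family `a` is also in family `b`
is_contained <- function(a, b) all(kids[[a]] %in% kids[[b]])

b_in_a <- is_contained("B", "A")   # B = {1} sits inside A = {1, 2}
c_in_d <- is_contained("C", "D")   # kid 3 is missing from D
d_in_c <- is_contained("D", "C")   # kid 5 is missing from C

# intersect() can compare two families at a time once the kids are split
shared_cd <- intersect(kids[["C"]], kids[["D"]])
```

Here `b_in_a` is TRUE while `c_in_d` and `d_in_c` are both FALSE, which matches the drop/keep decisions described above.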

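In my data, each kid can have up to 6 families, and each family can have up to 8 kids. There are thousands of families and thousands of kids.

I tried to solve this by creating a very wide data.frame with one row per student, a column for each family the kid is associated with, a column for each sibling in each of those families, and an additional column (sibgrp) for each associated family that concatenates all of the siblings together. But when I tried to search for individual siblings within the concatenated string, I found that I didn't know how to do it: grepl won't take a vector as the pattern argument.

I then started looking at intersect and similar functions, but those compare entire vectors to one another, not observations within a vector to other observations within that vector. (Meaning, I can't look for the intersection between string df[1,2] and string df[1,3]; intersect instead identifies the intersection between df[2] and df[3].)

I tried to shift my thinking to fit this approach, so that I could compare vectors of siblings with one another, assuming I already know that at least one sibling is shared. Given how many different families there are, and how many of them don't have even one kid in common, I don't even know how to begin doing that.

What am I missing here? Any feedback would be much appreciated. Thank you!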

Answer 1

This function can also do the job. It returns a character vector containing the names of the families that can be removed.

test_function <- function(dataset) {

  ## split the kidid on the basis of famid: one vector of kids per family
  kids_family <- split(dataset[["kidid"]], dataset[["famid"]])
  family <- names(kids_family)

  ## with fewer than two families there is nothing to compare
  if (length(family) < 2) return(character(0))

  ## all possible combinations of two families, one pair per column
  combn_family <- combn(family, 2)

  family_removed <- character(0)
  for (j in seq_len(ncol(combn_family))) {
    x <- combn_family[, j]
    ## setdiff(a, b) is empty when every kid in a is also in b
    if (length(setdiff(kids_family[[x[1]]], kids_family[[x[2]]])) == 0) {
      family_removed <- c(family_removed, x[1])
    } else if (length(setdiff(kids_family[[x[2]]], kids_family[[x[1]]])) == 0) {
      family_removed <- c(family_removed, x[2])
    }
  }

  ## a family contained in several other families is reported only once
  unique(family_removed)
}
> df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
+                  kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
> test_function(df)
[1] "B" "F"
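
With thousands of families, combn() enumerates every pair of families, including the many pairs that share no kid at all. The following is only a rough sketch of an alternative, not part of the answer above: it joins the data to itself on kidid so that only families sharing at least one kid are ever compared. The name `contained_families` is made up for illustration, and the columns are assumed to be called famid and kidid. Unlike test_function, it keeps both families when two families have exactly the same kids, because it requires the containing family to be strictly larger.

```r
contained_families <- function(df) {
  # One row per (family, kid); ids as character so they can index by name
  df <- unique(data.frame(famid = as.character(df$famid),
                          kidid = as.character(df$kidid),
                          stringsAsFactors = FALSE))
  size <- table(df$famid)  # number of kids per family

  # Self-join on kidid: each row pairs two families that share that kid
  other <- data.frame(kidid = df$kidid, other = df$famid,
                      stringsAsFactors = FALSE)
  pairs <- merge(df, other, by = "kidid")
  pairs <- pairs[pairs$famid != pairs$other, ]
  if (nrow(pairs) == 0) return(character(0))

  # Number of shared kids for each ordered pair (famid, other)
  shared <- aggregate(kidid ~ famid + other, data = pairs, FUN = length)
  n_self  <- as.vector(size[as.character(shared$famid)])
  n_other <- as.vector(size[as.character(shared$other)])

  # famid is contained in other when all of its kids are shared
  # and the other family is strictly larger
  is_contained <- shared$kidid == n_self & n_other > n_self
  sort(unique(as.character(shared$famid[is_contained])))
}

df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
to_drop <- contained_families(df)
kept <- df[!df$famid %in% to_drop, ]
```

On the example data, `to_drop` holds "B" and "F", and `kept` is the data without those two families.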

Answer 2

I tried setdiff without success, so I'm posting this laborious solution in the hope that there is a better way.

# dependencies for melting tables and handling data.frames
library(reshape2)
library(dplyr)


# I have added two more cases to your data.frame
# kidid is passed as numeric (quoted values would have been converted to factor by default before R 4.0)
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))

# let's have a look at it
df
#    famid kidid
# 1      A     1
# 2      A     2
# 3      B     1
# 4      C     3
# 5      C     4
# 6      D     4
# 7      D     5
# 8      E     7
# 9      E     8
# 10     E     9
# 11     F     7
# 12     F     9

# we build a contingency table
m <- table(df$famid, df$kidid)

# a family A contains a family B only if A has all the elements of B
# and at least one that B doesn't have
m
#
#     1 2 3 4 5 7 8 9
#   A 1 1 0 0 0 0 0 0
#   B 1 0 0 0 0 0 0 0
#   C 0 0 1 1 0 0 0 0
#   D 0 0 0 1 1 0 0 0
#   E 0 0 0 0 0 1 1 1
#   F 0 0 0 0 0 1 0 1

# a helper function to implement that and return a friendly data.frame
family_contained <- function(m){
  fams <- rownames(m)
  # for each selected family (one column per family), we calculate the
  # difference to every line in m and test if all values are 0+ (ie if the
  # selected family has all elements of the other) and if at least one is >=1
  # (ie if the selected family has at least one element that the other
  # doesn't have); a family compared with itself gives all zeros, hence FALSE
  tab <- sapply(fams, function(i)
    apply(m, 1, function(row) all(m[i, ] - row >= 0) && any(m[i, ] - row >= 1)))
  # rows of tab are the possibly contained families, columns the selected
  # (containing) families, both labelled with the family names
  tab %>% 
    # we melt it into a data.frame
    melt() %>% 
    # only select TRUE and get rid of this column
    filter(value) %>% select(-value) %>% 
    # to make things clear we name columns
    `colnames<-`(c("this_family_is_contained", "this_family_contains"))
}

family_contained(m)
#   this_family_is_contained this_family_contains
# 1                        B                    A
# 2                        F                    E

# finally you can filter them with
filter(df, !(famid %in% family_contained(m)$this_family_is_contained))
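
The same contingency-table idea can also be written with a single matrix product. This is only a sketch under the assumptions stated in the comments, not the method of the answer above; the names `shared`, `size`, `contained` and `to_drop` are made up for illustration. It uses a dense matrix, so memory grows with the number of families times the number of kids.

```r
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))

# 0/1 membership matrix: one row per family, one column per kid
# (assumption: repeated family/kid rows should count only once)
m <- (unclass(table(df$famid, df$kidid)) > 0) * 1

# shared[i, j] = number of kids that families i and j have in common
shared <- tcrossprod(m)
size <- diag(shared)   # number of kids in each family

# contained[i, j] is TRUE when every kid of family i is also in family j
# and family j is strictly larger than family i
contained <- (shared == size) & outer(size, size, "<")

to_drop <- rownames(contained)[rowSums(contained) > 0]
df_kept <- df[!df$famid %in% to_drop, ]
```

Each TRUE cell of `contained` names a contained family (row) and a family that contains it (column), so the matrix carries the same information as the two-column result of family_contained(m).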
