How to find relationships among combinations within one variable

I want to count how often different authors collaborate on a title. The given dataset looks like this:

Title  | Author
------ | ------
A      | ABC  
A      | DEF  
B      | ABC  
B      | GHI  
B      | JKL  
C      | ABC  
C      | JKL  
D      | GHI  
D      | DEF
E      | ABC
E      | JKL
F      | ABC
F      | JKL

My target table should look like this, where Count is the number of titles the two authors collaborated on.

Author | Works with | Count
------ | ---------- | -----
ABC    | DEF        |     1
ABC    | GHI        |     1
ABC    | JKL        |     4
DEF    | ABC        |     1
DEF    | GHI        |     1
...    | ...        |   ...

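You can use the sqldf package:

library(sqldf)
# self-join df on title, count shared titles per ordered author pair, then drop self-pairs
target <- sqldf("select a.author as a1, b.author as a2, count(*) as count from df a inner join df b on a.title = b.title group by a.author, b.author")
target <- target[target$a1 != target$a2, ]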
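For readers who would rather not depend on sqldf, the same self-join can be sketched in base R with merge and aggregate. This is only an illustration of what the query does, not part of the original answer; the column names Author_a and Author_b are made up here via the suffixes argument.

```r
# Same data as in the question
df <- data.frame(
  Title  = c("A","A","B","B","B","C","C","D","D","E","E","F","F"),
  Author = c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL"),
  stringsAsFactors = FALSE
)

# Self-join on Title: one row per ordered (author, co-author) pair on each title
pairs <- merge(df, df, by = "Title", suffixes = c("_a", "_b"))

# Drop the rows that pair an author with themselves
pairs <- pairs[pairs$Author_a != pairs$Author_b, ]

# Count the shared titles for each ordered pair
target <- aggregate(Title ~ Author_a + Author_b, data = pairs, FUN = length)
names(target) <- c("a1", "a2", "count")
```

As with the inner join in SQL, pairs that never share a title (DEF and JKL in this data) do not appear in the result at all, rather than appearing with a count of 0.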

A solution using base functions:

Title <- c("A","A","B","B","B","C","C","D","D","E","E","F","F")
Author <- c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL")

df <- data.frame(Title, Author, stringsAsFactors = FALSE)

# all ordered pairs of authors, including each author paired with themselves;
# keep them as character so the lookup below is by name, not by factor code
authors <- unique(df$Author)
df2 <- expand.grid(Var1 = authors, Var2 = authors, stringsAsFactors = FALSE)

# list of the titles each author worked on, named by author
lauth <- split(df$Title, df$Author)

# number of titles that authors x and y have in common
myfun <- function(x, y) sum(lauth[[x]] %in% lauth[[y]])

# apply the function to every row of author pairs
df2$count <- mapply(myfun, x = df2$Var1, y = df2$Var2, USE.NAMES = FALSE)
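The counting step inside myfun can be checked by hand on a single pair of authors. The sketch below uses the title lists of ABC and JKL copied from the question's data; the variable names abc and jkl are only for illustration.

```r
# Titles for two authors, taken from the question's data
abc <- c("A", "B", "C", "E", "F")
jkl <- c("B", "C", "E", "F")

# %in% marks which of ABC's titles also belong to JKL
shared_flags <- abc %in% jkl

# Summing the logical vector counts the shared titles
n_shared <- sum(shared_flags)

# Equivalent as long as each author lists a title at most once
n_shared_alt <- length(intersect(abc, jkl))
```

Here n_shared is 4 (titles B, C, E and F). Pairing an author with themselves gives that author's total number of titles, so those rows are usually filtered out afterwards.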

Another base R solution using user36's table and crossprod approach.

# get counts of author interactions
counts <- crossprod(table(dat))

# construct data.frame from count results
mydf <- data.frame(author=rep(rownames(counts), each=nrow(counts)),
                   worksWith=rownames(counts),
                   count=c(counts))

# drop same author observations (equal to total number of pubs by author)
mydf <- mydf[mydf$author != mydf$worksWith,]

The first 6 rows of the resulting data.frame are

head(mydf)
  author worksWith count
2    ABC       DEF     1
3    ABC       GHI     1
4    ABC       JKL     4
5    DEF       ABC     1
7    DEF       GHI     1
8    DEF       JKL     0
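To see why crossprod yields these counts, it may help to look at the intermediate matrix. The sketch below rebuilds the question's data as a plain data frame (an assumption, so the block runs on its own) and compares crossprod with the explicit matrix product.

```r
# Same data as in the question
dat <- data.frame(
  Title  = c("A","A","B","B","B","C","C","D","D","E","E","F","F"),
  Author = c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL")
)

# Title-by-Author incidence matrix: 1 if the author worked on the title, else 0
inc <- unclass(table(dat))

# crossprod(inc) is t(inc) %*% inc: entry [i, j] sums, over all titles,
# the product of author i's and author j's 0/1 indicators
counts <- crossprod(inc)
same_as_product <- all(counts == t(inc) %*% inc)

# Off-diagonal entries are shared titles; the diagonal is titles per author
abc_jkl <- counts["ABC", "JKL"]
per_author <- diag(counts)
```

The product of two indicators is 1 only when both authors appear on the same title, which is why each off-diagonal entry counts collaborations. This relies on each author being listed at most once per title.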

Data

dat <-
structure(list(Title = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 
4L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", "B", "C", "D", "E", 
"F"), class = "factor"), Author = structure(c(1L, 2L, 1L, 3L, 
4L, 1L, 4L, 3L, 2L, 1L, 4L, 1L, 4L), .Label = c("ABC", "DEF", 
"GHI", "JKL"), class = "factor")), .Names = c("Title", "Author"
), class = "data.frame", row.names = c(NA, -13L))