How to find relationships in combinations in one variable
I want to count how often different authors collaborate on a title. The given dataset looks like this:
Title | Author
------ | ------
A | ABC
A | DEF
B | ABC
B | GHI
B | JKL
C | ABC
C | JKL
D | GHI
D | DEF
E | ABC
E | JKL
F | ABC
F | JKL
My target table should look like this, where Count is the number of titles on which the two authors collaborated.
Author | Works with | Count
------ | ---------- | -----
ABC | DEF | 1
ABC | GHI | 0
ABC | JKL | 3
DEF | ABC | 1
DEF | GHI | 2
... | ... | ...
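You can use the sqldf package:
library(sqldf)  # df is the Title/Author data frame from the question
target <- sqldf("select a.author as a1, b.author as a2, count(*) as count from df a inner join df b on a.title = b.title group by a.author, b.author")
# drop rows that pair an author with themselves
target <- target[target$a1 != target$a2, ]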
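To make the self-join easier to follow, here is a rough base R sketch of the same idea using merge() and aggregate(). The names pairs, target_base, and the suffixes _a and _b are made up for illustration, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
# Toy data from the question
df <- data.frame(
  Title  = c("A","A","B","B","B","C","C","D","D","E","E","F","F"),
  Author = c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL"),
  stringsAsFactors = FALSE
)

# Self-join on Title: one row per ordered pair of authors that share a title
pairs <- merge(df, df, by = "Title", suffixes = c("_a", "_b"))

# Drop rows that pair an author with themselves
pairs <- pairs[pairs$Author_a != pairs$Author_b, ]

# Count the shared titles for each (author, co-author) pair
target_base <- aggregate(Title ~ Author_a + Author_b, data = pairs, FUN = length)
names(target_base) <- c("a1", "a2", "count")
```

As with the inner join in SQL, pairs that never share a title (DEF and JKL here) do not appear at all instead of showing up with a count of 0.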
A solution using base functions:
Title <- c("A","A","B","B","B","C","C","D","D","E","E","F","F")
Author <- c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL")
df <- data.frame(Title, Author, stringsAsFactors = FALSE)
# all ordered pairs of authors, kept as character so they index lauth by name
df2 <- expand.grid(Var1 = unique(df$Author), Var2 = unique(df$Author), stringsAsFactors = FALSE)
# named list: the titles each author worked on
lauth <- split(df$Title, df$Author)
# number of titles by author x that author y also worked on
myfun <- function(x, y) sum(lauth[[x]] %in% lauth[[y]])
# apply the function to every pair (self-pairs give the author's own title count)
df2$count <- mapply(myfun, x = df2$Var1, y = df2$Var2)
Another base R solution, using user36's table and crossprod approach.
# get counts of author interactions
counts <- crossprod(table(dat))
# construct data.frame from count results
mydf <- data.frame(author = rep(rownames(counts), each = nrow(counts)),
                   worksWith = rownames(counts),
                   count = c(counts))
# drop same author observations (equal to total number of pubs by author)
mydf <- mydf[mydf$author != mydf$worksWith,]
The first 6 rows of the resulting data.frame are
head(mydf)
  author worksWith count
2    ABC       DEF     1
3    ABC       GHI     1
4    ABC       JKL     4
5    DEF       ABC     1
7    DEF       GHI     1
8    DEF       JKL     0
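For anyone wondering why crossprod(table(dat)) gives co-authorship counts, the sketch below walks through it on the same data. It assumes each Title/Author combination appears at most once, so the table is a 0/1 incidence matrix. The names m and manual are only for illustration.

```r
# Toy data from the question, rebuilt so the snippet runs on its own
dat <- data.frame(
  Title  = c("A","A","B","B","B","C","C","D","D","E","E","F","F"),
  Author = c("ABC","DEF","ABC","GHI","JKL","ABC","JKL","GHI","DEF","ABC","JKL","ABC","JKL")
)

# Title-by-Author incidence matrix: 1 if the author is on the title, else 0
m <- table(dat)

# crossprod(m) is t(m) %*% m, an Author-by-Author matrix
counts <- crossprod(m)

# Entry [i, j] sums, over titles, the product of the two authors' 0/1 columns,
# i.e. the number of titles both authors appear on
manual <- sum(m[, "ABC"] * m[, "JKL"])

# The diagonal holds each author's own number of titles
diag(counts)
```

Unlike the join-based approach, the matrix keeps pairs with no shared titles as explicit zeros, which matches the layout of the target table in the question.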
Data
dat <-
structure(list(Title = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L), .Label = c("A", "B", "C", "D", "E",
"F"), class = "factor"), Author = structure(c(1L, 2L, 1L, 3L,
4L, 1L, 4L, 3L, 2L, 1L, 4L, 1L, 4L), .Label = c("ABC", "DEF",
"GHI", "JKL"), class = "factor")), .Names = c("Title", "Author"
), class = "data.frame", row.names = c(NA, -13L))