R: correlations between pairs of variables
I have a data frame that looks like this:
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
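In reality there are many more columns, but the idea is that there are many traits (bmi, height and IQ in the example above), followed by the same number of columns holding the standardized residuals after regressing on some variables (the columns named bmi.residuals, height.residuals and IQ.residuals above). I would like to create an object containing the correlation between each trait and its residuals, which would look like this: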
trait correlation
bmi 0.85
height 0.90
IQ 0.75
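where the "bmi" correlation is the correlation between bmi and bmi.residuals, the "height" correlation is the correlation between height and height.residuals, the "IQ" correlation is the correlation between IQ and IQ.residuals, and so on.
I could compute all the correlations one by one, but with many columns (many traits) in the data frame there must be a way to automate this. Any ideas how? I suspect lapply could come in handy, but I'm not sure how...
Maybe this works for you: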
bmi <- c(26, 27, 23)
height <- c(187, 176, 189)
bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)
df <- data.frame(bmi, height, bmi.residuals, height.residuals)
corr_df <- data.frame(cor(df))
names <- colnames(df)
names <- names[!grepl("residuals", names)]
# pre-allocate the result: one row per trait
cors <- data.frame(
  traits = character(length(names)),
  correlation = numeric(length(names)),
  stringsAsFactors = FALSE
)
for (i in seq_along(names)) {
  cors$traits[i] <- names[i]
  # second column whose name matches the trait, i.e. its .residuals column
  cors$correlation[i] <-
    corr_df[i, which(grepl(names[i], names(corr_df)))[2]]
}
Input:
> df
bmi height bmi.residuals height.residuals
1 26 187 0.1 0.3
2 27 176 0.3 0.2
3 23 189 0.4 0.1
Correlation matrix:
> corr_df
bmi height bmi.residuals height.residuals
bmi 1.0000000 -0.78920304 -0.57655666 0.7205767
height -0.7892030 1.00000000 -0.04676098 -0.1428571
bmi.residuals -0.5765567 -0.04676098 1.00000000 -0.9819805
height.residuals 0.7205767 -0.14285714 -0.98198051 1.0000000
Output:
> cors
traits correlation
1 bmi -0.5765567
2 height -0.1428571
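Note that this only works if each original column appears before its .residuals column, because the lookup takes the second column whose name matches the trait.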
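If the column order cannot be relied on, the pairs can be matched by name instead. The following is only a sketch under the assumption that every residual column is named exactly trait plus ".residuals", as in the question; the names suffix, res_cols and traits are introduced here for illustration and are not part of the answer above.

```r
# Example data from the question
df <- read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header = TRUE)

suffix <- ".residuals"  # assumed naming convention for the residual columns

# residual columns, and the trait names obtained by stripping the suffix
res_cols <- grep("\\.residuals$", names(df), value = TRUE)
traits <- sub("\\.residuals$", "", res_cols)
traits <- traits[traits %in% names(df)]  # keep only traits that really exist

# one correlation per (trait, trait.residuals) pair, looked up by name
cors <- data.frame(
  trait = traits,
  correlation = unname(vapply(
    traits,
    function(tr) cor(df[[tr]], df[[paste0(tr, suffix)]]),
    numeric(1)
  )),
  stringsAsFactors = FALSE
)
cors
```

Because the columns are addressed by name, the result does not depend on where the residual columns sit in the data frame.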
Here is a short solution:
Suppose you have a data frame with the variables a, b, a.resi and b.resi:
df <- data.frame(a=c(1:10), b=c(1:10),
a.resi=c(-1:-10), b.resi=c(-1:-10))
First, create a vector (named 'core') with all the core variables, i.e. those without the suffix .resi:
core <- names(df) [1:2]
Then use paste0() to create another vector (named core.resi) holding the core variable names with the suffix .resi:
core.resi <- paste0(core, '.resi')
Define a function that takes 3 arguments: a data frame (Data), x and y. It computes the correlation between the given columns x and y of the data frame Data:
MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])
Finally, apply the function to the vectors core and core.resi:
library(magrittr)  # provides the %>% pipe used below
mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>%
data.frame()
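As a possible alternative to the helper function, cor() also accepts two sets of columns and returns their cross-correlation matrix, whose diagonal holds the paired correlations when both sets are in the same order. This is a sketch on the same toy data; the name cross is introduced here only for illustration.

```r
# Toy data from the answer above
df <- data.frame(a = 1:10, b = 1:10,
                 a.resi = -1:-10, b.resi = -1:-10)

core <- names(df)[1:2]               # "a", "b"
core.resi <- paste0(core, ".resi")   # "a.resi", "b.resi"

# rows are core variables, columns are their .resi counterparts
cross <- cor(df[core], df[core.resi])

# the diagonal pairs a with a.resi and b with b.resi
result <- data.frame(trait = core, correlation = diag(cross))
result
```

Note that this computes every core-by-residual correlation and then discards the off-diagonal ones, so it trades some extra work for brevity.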
You can try a tidyverse solution (d is the data frame from the question):
library(tidyverse)
cor(d[,-1]) %>%
as_tibble() %>%
add_column(Trait=colnames(.)) %>%
gather(key, value, -Trait) %>%
rowwise() %>%
filter(grepl(paste(Trait, collapse = "|"), key)) %>%
filter(Trait != key) %>%
ungroup()
# A tibble: 3 x 3
Trait key value
<chr> <chr> <dbl>
1 bmi bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3 IQ IQ.residuals 4.487375e-01
Or you can start directly from the data.frame:
d %>%
gather(key, value, -ID) %>%
mutate(gr=strtrim(key,2)) %>%
split(.$gr) %>%
map(~spread(.,key, value)) %>%
map(~cor(.[-1:-2])[,2]) %>%
map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = F)) %>%
bind_rows()
Trait1 Trait2 cor
1 bmi bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3 IQ IQ.residuals 4.487375e-01
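Another solution using dplyr and tidyr. The idea is to compute all correlations first, since that is simple and fast enough, then reshape them into a dataset and keep only the rows where the variable names match but are not identical: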
df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)
library(dplyr)
library(tidyr)
# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)
df %>%
select(-ID) %>% # remove unnecessary columns
cor() %>% # get all correlations (even ones you don't care about)
data.frame() %>% # save result as a dataframe
mutate(v1 = row.names(.)) %>% # add row names as a column
gather(v2,cor, -v1) %>% # reshape data
filter(f(v1,v2) & v1 != v2) # keep pairs that v1 matches v2, but are not the same
# v1 v2 cor
# 1 bmi bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3 IQ IQ.residuals 4.487375e-01
An alternative approach is to find the pairs of interest first and then compute only their correlations:
df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)
library(dplyr)
library(tidyr)
# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)
# function to get cor between two variables
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)
# keep only columns that you want to get correlations
df2 = df %>% select(-ID)
expand.grid(v1=names(df2), v2=names(df2)) %>% # get all possible combinations of names
filter(f(v1,v2) & v1 != v2) %>% # keep pairs of names where v1 matches v2, but are not the same
mutate(cor = f2(v1,v2)) # for those pairs (only) obtain correlation value
# v1 v2 cor
# 1 bmi bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3 IQ IQ.residuals 4.487375e-01
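I suggest you pick whichever is faster for you, since the number of rows and columns you have may affect the speed of the approaches above.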
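One way to check which strategy is faster for a given size is to time both on simulated data. The sketch below is an illustration only: the data are random, the dimensions (500 rows, 200 traits) are arbitrary assumptions, and the function names full_matrix and pairs_only are made up here. Replace the dimensions with ones close to your real data before drawing conclusions.

```r
set.seed(1)
n_rows <- 500     # assumed number of observations
n_traits <- 200   # assumed number of traits

traits <- paste0("t", seq_len(n_traits))
resids <- paste0(traits, ".residuals")

# simulated data frame: trait columns followed by residual columns
sim <- as.data.frame(matrix(rnorm(n_rows * n_traits * 2), nrow = n_rows))
names(sim) <- c(traits, resids)

# strategy 1: full correlation matrix, then pick the wanted cells by name
full_matrix <- function(d) {
  m <- cor(d)
  m[cbind(traits, resids)]
}

# strategy 2: compute only the wanted pairs
pairs_only <- function(d) {
  unname(mapply(function(x, y) cor(d[[x]], d[[y]]), traits, resids))
}

system.time(res_full <- full_matrix(sim))
system.time(res_pairs <- pairs_only(sim))

# both strategies should agree up to floating point error
all.equal(res_full, res_pairs)
```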