R: correlations between pairs of variables

I have a data frame that looks like this:

ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4

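There are actually more columns, but the idea is that there are many traits (bmi, height, and IQ in the example above), followed by the same number of columns holding the standardized residuals of each trait after regressing on some variables (the columns named bmi.residuals, height.residuals, and IQ.residuals above). I want to create an object containing the correlation between each trait and its residuals, which would look like this: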

trait correlation 
bmi 0.85
height 0.90
IQ 0.75

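where the "bmi" correlation is the correlation between bmi and bmi.residuals, the "height" correlation is the one between height and height.residuals, the "IQ" correlation is the one between IQ and IQ.residuals, and so on.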

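I could compute all the correlations one by one, but with many columns (many traits) in the data frame there must be some way to automate this. Any ideas how? I suspect lapply could come in handy, but I'm not sure how...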

Maybe this will work for you:

bmi <- c(26, 27, 23)
height <- c(187, 176, 189)

bmi.residuals <- c(0.1, 0.3, 0.4)
height.residuals <- c(0.3, 0.2, 0.1)

df <- data.frame(bmi, height, bmi.residuals, height.residuals)

corr_df <- data.frame(cor(df))

# Trait names are the columns without "residuals" in their name
traits <- colnames(df)
traits <- traits[!grepl("residuals", traits)]

cors <- data.frame(
  traits = character(length(traits)),
  correlation = numeric(length(traits)),
  stringsAsFactors = FALSE
)

for (i in seq_along(traits)) {
  cors$traits[i] <- traits[i]
  # The second column matching the trait name is taken to be its residual column
  cors$correlation[i] <-
    corr_df[i, which(grepl(traits[i], names(corr_df)))[2]]
}

Input:

> df
  bmi height bmi.residuals height.residuals
1  26    187           0.1              0.3
2  27    176           0.3              0.2
3  23    189           0.4              0.1

Correlation matrix:

> corr_df
                        bmi      height bmi.residuals height.residuals
bmi               1.0000000 -0.78920304   -0.57655666        0.7205767
height           -0.7892030  1.00000000   -0.04676098       -0.1428571
bmi.residuals    -0.5765567 -0.04676098    1.00000000       -0.9819805
height.residuals  0.7205767 -0.14285714   -0.98198051        1.0000000

Output:

> cors
  traits correlation
1    bmi  -0.5765567
2 height  -0.1428571

Note that this only works if the original columns come before the .residuals columns.
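If the column order cannot be guaranteed, one way around that restriction is to look up each residual column by name instead of by position. The following is only a minimal sketch, assuming every residual column is named exactly `<trait>.residuals`; the `suffix` variable and the shuffled column order are illustrative choices of mine, not part of the original data:

```r
# Same values as above, with the column order deliberately shuffled
df <- data.frame(
  bmi.residuals    = c(0.1, 0.3, 0.4),
  height           = c(187, 176, 189),
  bmi              = c(26, 27, 23),
  height.residuals = c(0.3, 0.2, 0.1)
)

suffix <- ".residuals"  # assumed naming convention for residual columns

# Traits are the columns whose name does not end in the suffix
traits <- names(df)[!endsWith(names(df), suffix)]

# Keep only traits that actually have a matching residual column
traits <- traits[paste0(traits, suffix) %in% names(df)]

# Correlate each trait with its own residual column, found by name
cors <- data.frame(
  traits = traits,
  correlation = unname(vapply(
    traits,
    function(tr) cor(df[[tr]], df[[paste0(tr, suffix)]]),
    numeric(1)
  )),
  stringsAsFactors = FALSE
)
cors
```

Because the pairing is done with `paste0()` and an exact name lookup, it does not depend on where the columns sit in the data frame.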

Here is a short solution:

Suppose you have a data frame with the variables a, a.resi, b, b.resi:

df <- data.frame(a=c(1:10), b=c(1:10),
              a.resi=c(-1:-10), b.resi=c(-1:-10))

First, create a vector (named 'core') with all the core variables (i.e. those without the .resi suffix):

core <- names(df) [1:2]

Then use paste0() to create another vector (named core.resi) holding the core variable names with the .resi suffix:

core.resi <- paste0(core, '.resi')

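Define a function that takes 3 arguments: a data frame (Data), x, and y. The function computes the correlation between the given columns x and y of the data frame Data:
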
MyFun <- function(Data, x,y) cor(Data[,x], Data[,y])

Finally, apply the function to the vectors core and core.resi:

library(magrittr)  # provides %>% (also available via dplyr)
mapply(MyFun, x=core, y=core.resi, MoreArgs = list(Data=df)) %>% 
  data.frame() 
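With many columns, picking the core variables by position (`[1:2]`) gets tedious. As a rough sketch, the two name vectors could instead be derived from the suffix itself. This assumes every residual column ends in `.resi` and has a matching core column; the `suffix` variable and the `stopifnot()` guard are my additions:

```r
# Same toy data as above
df <- data.frame(a = 1:10, b = 1:10,
                 a.resi = -(1:10), b.resi = -(1:10))

suffix <- ".resi"  # assumed suffix of the residual columns

# Residual columns are those whose name ends in the suffix
core.resi <- names(df)[endsWith(names(df), suffix)]

# Strip the suffix to recover the core names, in the same order
core <- substr(core.resi, 1, nchar(core.resi) - nchar(suffix))

# Stop early if a residual column has no matching core column
stopifnot(all(core %in% names(df)))

MyFun <- function(Data, x, y) cor(Data[[x]], Data[[y]])

# mapply names the result after the first vector argument (core)
res <- mapply(MyFun, x = core, y = core.resi, MoreArgs = list(Data = df))
data.frame(trait = names(res), correlation = unname(res))
```

The rest is the same `mapply()` call as before, only without hard-coded column positions.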

You could try a tidyverse solution:

library(tidyverse)
# d is the data frame from the question
cor(d[,-1]) %>% 
  as_tibble() %>% 
  add_column(Trait=colnames(.)) %>% 
  gather(key, value, -Trait) %>% 
  rowwise() %>% 
  filter(grepl(paste(Trait, collapse = "|"), key)) %>% 
  filter(Trait != key) %>% 
  ungroup()
# A tibble: 3 x 3
   Trait              key         value
   <chr>            <chr>         <dbl>
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

Or start directly from the data.frame:

d %>% 
  gather(key, value, -ID) %>% 
  mutate(gr=strtrim(key,2)) %>% 
  split(.$gr) %>% 
  map(~spread(.,key, value)) %>%
  map(~cor(.[-1:-2])[,2]) %>% 
  map(~data.frame(Trait1=names(.)[1], Trait2=names(.)[2], cor=.[1],stringsAsFactors = F)) %>% 
  bind_rows()  
  Trait1           Trait2           cor
1    bmi    bmi.residuals -3.248544e-17
2 height height.residuals -7.837838e-01
3     IQ     IQ.residuals  4.487375e-01

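Another solution using dplyr and tidyr. The idea is to compute all the correlations first, since that is simple and fast enough, then build a dataset and keep only the rows where the variable names match but are not identical: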

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)


df %>% 
  select(-ID) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)    # keep pairs that v1 matches v2, but are not the same

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01
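One caveat with matching names through `grepl()`: a trait name that is a substring of another column name (say, a hypothetical `IQ2.residuals`) would also match `IQ`. If that is a concern, the same "all correlations first, then filter" idea can be sketched with an exact name comparison. The version below uses base R only and assumes the residual columns are named exactly `<trait>.residuals`:

```r
df <- read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header = TRUE)

# All pairwise correlations (ID dropped), reshaped to one row per pair
long <- as.data.frame(as.table(cor(df[, -1])), stringsAsFactors = FALSE)
names(long) <- c("v1", "v2", "cor")

# Keep only pairs where v2 is exactly v1 plus the ".residuals" suffix
pairs <- long[long$v2 == paste0(long$v1, ".residuals"), ]
rownames(pairs) <- NULL
pairs
```

This should return the same three trait/residual pairs as the pipeline above.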

An alternative is to identify the pairs of interest first, and then compute the correlations:

df = read.table(text = "
ID bmi height IQ bmi.residuals height.residuals IQ.residuals
a 26 187 110 0.1 0.3 0.4
b 27 176 115 0.3 0.2 0.7
c 23 189 108 0.4 0.1 0.5
d 25 168 101 0.6 0.6 0.6
e 24 190 99 -0.1 0.2 0.4
", header=T)

library(dplyr)
library(tidyr)

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)

# function to get cor between two variables
f2 = function(x,y) cor(df2[,x], df2[,y])
f2 = Vectorize(f2)

# keep only columns that you want to get correlations
df2 = df %>% select(-ID)

expand.grid(v1=names(df2), v2=names(df2), stringsAsFactors = FALSE) %>%  # get all possible combinations of names (as character, not factor)
  filter(f(v1,v2) & v1 != v2) %>%              # keep pairs of names where v1 matches v2, but are not the same
  mutate(cor = f2(v1,v2))                      # for those pairs (only) obtain correlation value

#       v1               v2           cor
# 1    bmi    bmi.residuals -3.248544e-17
# 2 height height.residuals -7.837838e-01
# 3     IQ     IQ.residuals  4.487375e-01

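I suggest checking which of the two is faster for you, since the number of rows and columns you have may affect the speed of the methods above.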
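A quick way to check on your own data is `system.time()`. The sketch below is not a benchmark of the exact pipelines above. It compares base R stand-ins for the two strategies on simulated data, where the sizes `n_rows` and `n_traits`, the column names, and both helper functions are hypothetical:

```r
set.seed(1)       # reproducible simulated data
n_rows   <- 1000  # hypothetical number of rows
n_traits <- 50    # hypothetical number of traits

traits <- paste0("trait", seq_len(n_traits))
m <- matrix(rnorm(n_rows * n_traits * 2), nrow = n_rows)
colnames(m) <- c(traits, paste0(traits, ".residuals"))
big <- as.data.frame(m)

# Strategy 1: full correlation matrix, then pick the trait/residual cells
all_then_filter <- function(d, traits) {
  cm <- cor(d)
  cm[cbind(traits, paste0(traits, ".residuals"))]
}

# Strategy 2: compute only the pairs of interest
pairs_only <- function(d, traits) {
  vapply(
    traits,
    function(tr) cor(d[[tr]], d[[paste0(tr, ".residuals")]]),
    numeric(1),
    USE.NAMES = FALSE
  )
}

system.time(r1 <- all_then_filter(big, traits))
system.time(r2 <- pairs_only(big, traits))

# Both strategies should agree up to floating-point tolerance
all.equal(r1, r2)
```

The timings depend on your machine and data shape, so try it with sizes close to your real data.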