Fit a model to multiple groupings or subsets and extract the original factor columns for data frame output

Question

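I want to fit a model and extract specific parameters by grouping factors (fac1 and fac2 below) or by subset. My problem is that although sapply returns the correct parameters, I end up with an object whose elements are named after the factor combinations. What I want is a data.frame in which each factor has its own properly labeled column. I would like to do this in base R.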

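Note that the answer must be general, not tied to the specific names used in this example. It should not break if the factor names contain periods ('.'). I will eventually be writing something that can be used with any data, so the answer needs to work that way too, and it must also work with any number of factors. I am actually using a custom function on a much larger data set, but this example represents my problem.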

Here is reproducible code:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 
#run regression by these 4 groups, pull out slope

The output is:

A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
1.861290 2.131431 1.590733 1.746169

What I want is:

fac1 fac2 slope
A    X    1.861290 
B    X    2.131431 
A    Y    1.590733 
B    Y    1.746169

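The following code could probably be made more general to achieve this, but I am worried about the case where expand.grid produces every possible combination while some combinations are missing from the user's data, and about whether the order will stay the same. Does expand.grid order its rows the same way split orders the subsets it returns?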

slopes <- sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) 

dataframeplz <- as.data.frame(expand.grid(unique(fac1), unique(fac2))) 

dataframeplz$slope <- slopes

dataframeplz
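
The ordering concern can be checked directly. The sketch below is only an illustration on made-up labels (fac1 is deliberately given in non-alphabetical order): split() names its groups from the sorted factor levels, with the first factor varying fastest, whereas unique() keeps order of appearance, so the two approaches can disagree. Building the grid from levels(factor(...)) lines up in this example, but it still would not handle combinations that are missing from the data.

```r
# Made-up labels, deliberately not in alphabetical order
fac1 <- c(rep("B", 10), rep("A", 10))
fac2 <- rep(c(rep("X", 5), rep("Y", 5)), 2)

# split() orders groups by sorted factor levels, first factor varying fastest
split_names <- names(split(seq_along(fac1), list(fac1, fac2)))

# expand.grid() also varies its first argument fastest,
# but the input order decides the row order
grid_unique <- expand.grid(unique(fac1), unique(fac2))                  # order of appearance
grid_levels <- expand.grid(levels(factor(fac1)), levels(factor(fac2)))  # sorted levels

names_unique <- paste(grid_unique[[1]], grid_unique[[2]], sep = ".")
names_levels <- paste(grid_levels[[1]], grid_levels[[2]], sep = ".")

identical(names_unique, split_names)  # FALSE here: labels would be misassigned
identical(names_levels, split_names)  # TRUE here
```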

In case it helps, here is a plyr solution. It is very simple, but it is not base R. Does anyone know where in Hadley's code this magic happens? GitHub users?

library("plyr")
neatdata <- data.frame(fac1,fac2,x,y)
ddply(neatdata, c("fac1", "fac2"), function(c) coef(lm(c$y~c$x))[2])

Answer 1

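I used base R and focused on your specific example. This approach has limitations, because it treats the group names as strings and pulls the information you need out of them.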

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

dt_res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

dt_res

# A.X.c$x  B.X.c$x  A.Y.c$x  B.Y.c$x 
# 1.861290 2.131431 1.590733 1.746169


dt_res = data.frame(dt_res)
dt_res = data.frame(names=rownames(dt_res),   # save the names as a column
                    slope=dt_res$dt_res,
                    row.names = NULL)

dt_res$names = gsub(".c[$]x","",dt_res$names)  # delete the unnecessary characters from names
dt_res$fac1 = substr(dt_res$names,1,1)       # pick first character
dt_res$fac2 = substr(dt_res$names,3,3)       # pick 3rd character
dt_res[,c("fac1","fac2","slope")]

#    fac1 fac2    slope
# 1    A    X 1.861290
# 2    B    X 2.131431
# 3    A    Y 1.590733
# 4    B    Y 1.746169

I have updated this to something more general:

#create data
fac1 <- c(rep("A", 10), rep("B",10))
fac2 <- rep(c(rep("X", 5), rep("Y",5)),2)
x <- rep(1:5,4)
set.seed(1337)
y <- rep(seq(2, 10, 2), 4) * runif(20, .8, 1.2)

xy <- data.frame(x,y) #bind parameters for regression

factors <- list(fac1, fac2) #split by 2 factors

res = sapply(split(xy, factors), function(c) coef(lm(c$y~c$x))[2]) #run regression by these 4 groups, pull out slope

# split names by . (that will be the split symbol always)
names = strsplit(names(split(xy, factors)), split ="[.]")

# create empty vectors to store names
fac1 = vector()
fac2 = vector()

for (i in seq_along(names)){

# iterate through the list of names and keep values from the corresponding position
  fac1 = c(fac1, names[[i]][1])
  fac2 = c(fac2, names[[i]][2])
}


data.frame(fac1,
           fac2,
           slope = res,
           row.names = NULL)
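
Splitting the names on "." still breaks when a level itself contains a period, and the loop is written for exactly two factors. One possible way around both, shown here only as a sketch: split the row indices instead of the data, and read the labels from the first row of each group rather than parsing the names. The "site.A"/"site.B" labels and the names facs, groups, first_rows and result are made up for illustration.

```r
# Example data; the "site.A"/"site.B" labels are invented to include a period
set.seed(1337)
fac1 <- rep(c("site.A", "site.B"), each = 10)
fac2 <- rep(rep(c("X", "Y"), each = 5), 2)
xy <- data.frame(x = rep(1:5, 4),
                 y = rep(seq(2, 10, 2), 4) * runif(20, 0.8, 1.2))

# Any number of grouping vectors can go in this data frame
facs <- data.frame(fac1, fac2, stringsAsFactors = FALSE)

# Split row indices rather than the data, dropping unobserved combinations
groups <- split(seq_len(nrow(xy)), facs, drop = TRUE)

# One slope per group
slopes <- vapply(groups,
                 function(i) coef(lm(y ~ x, data = xy[i, ]))[[2]],
                 numeric(1))

# Take the labels from the first row of each group instead of parsing names
first_rows <- vapply(groups, function(i) i[1], integer(1))
result <- data.frame(facs[first_rows, , drop = FALSE],
                     slope = unname(slopes),
                     row.names = NULL)
result
```

Because the labels come from the data itself, the separator used in the group names no longer matters.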

Answer 2

For base R, aggregate is a user-friendly function for this kind of situation.

aggregate(cbind(slope=1:nrow(xy))~fac1+fac2,FUN=function(r) coef(lm(y~x,data=xy[r,]))[2])

  fac1 fac2    slope
1    A    X 1.861290
2    B    X 2.131431
3    A    Y 1.590733
4    B    Y 1.746169

This can also be done with by, in a way that is closer to your original approach.

setNames(as.data.frame.table(
  by(xy,list(fac1,fac2),FUN=function(c) coef(lm(c$y~c$x))[2])),
  c("fac1","fac2","slope"))
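
If the number of factors is not known in advance, the same row-index trick appears to work with aggregate's by argument, which takes a named list of grouping vectors in place of a formula. A minimal sketch, where facs and res are names chosen here for illustration:

```r
# Same data as in the question
set.seed(1337)
fac1 <- rep(c("A", "B"), each = 10)
fac2 <- rep(rep(c("X", "Y"), each = 5), 2)
xy <- data.frame(x = rep(1:5, 4),
                 y = rep(seq(2, 10, 2), 4) * runif(20, 0.8, 1.2))

# A named list of grouping vectors; add or remove entries as needed
facs <- list(fac1 = fac1, fac2 = fac2)

# Aggregate over row indices, then fit the model on the rows of each group
res <- aggregate(list(slope = seq_len(nrow(xy))), by = facs,
                 FUN = function(r) coef(lm(y ~ x, data = xy[r, ]))[[2]])
res
```

The list names become the column names of the result, and only combinations that occur in the data produce a row.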

Answer 3

A. Webb's answer is more elegant, but this lapply / arbitrary function / do.call / rbind workflow has been what I fall back on for this kind of thing for years:

# Move the factors inside your data frame, so they'll be available after the split()
xy <- data.frame(x, y, fac1, fac2)

# Iterate over the split
reglist <- lapply(split(xy, factors), FUN = function(group) {

    # Get the current factor levels
    group_levels <- unique(group[c("fac1", "fac2")])

    # Mash it all into a data.frame
    data.frame(group_levels, slope = coef(lm(y ~ x, data = group))[2])

})

# Collapse the list into a data.frame
do.call("rbind", reglist)
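
Regarding the missing-combination worry in the question: by default split() keeps a zero-row group for every combination of levels, which would make lm() fail. A sketch of the same workflow with drop = TRUE, on made-up data where the B/Y combination never occurs (dat, group_cols, pieces and out are illustrative names):

```r
# Made-up data in which the combination B/Y is absent
fac1 <- rep(c("A", "B"), times = c(10, 5))
fac2 <- c(rep(c("X", "Y"), each = 5), rep("X", 5))
x <- rep(1:5, 3)
y <- 2 * x + rep(c(0.1, -0.1, 0, 0.1, -0.1), 3)
dat <- data.frame(x, y, fac1, fac2)

group_cols <- c("fac1", "fac2")  # grouping columns; any number would do

# Without drop = TRUE the split contains an empty "B.Y" group
names(split(dat, dat[group_cols]))

# With drop = TRUE only the observed combinations remain
pieces <- split(dat, dat[group_cols], drop = TRUE)

reglist <- lapply(pieces, function(group) {
  # Labels from the first row of the group, plus the fitted slope
  data.frame(group[1, group_cols, drop = FALSE],
             slope = coef(lm(y ~ x, data = group))[[2]])
})

out <- do.call(rbind, reglist)
rownames(out) <- NULL
out
```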
