在不使用 R 中的 lme4 包的情况下估计混合级逻辑回归系数

Question

我有一个包含 37000 个实例的 2 级数据集，代表 199 个主题的选择。我必须为 199 个人中的每一个人估计逻辑回归系数。我已经通过子集手动完成了 199 次，但我想知道是否有更有效的方法通过循环获取系数而不使用 lme4 包。另外，我应该将系数计算为每个主题的变量。

这是我的代码。

### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)

### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]

### Logistic regression

glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, data = df1)   

### Extracting the coefficients
reward_transition <- coef(glm1)[4] 

reward <- coef(glm1)[2] 

transition <- coef(glm1)[3] 

reward<- as.numeric(reward)

reward_transition <- as.numeric(reward_transition)

transition <- as.numeric(transition)

omega <- reward_transition - reward

### Computing the constant coefficients as variables

df1$rewardmix <- 1

df1$rewardmix <- reward

df1$omega <- 1

df1$omega <- omega
df1$transmix <- 1

df1$transmix <- transition

df1$reward_transitionmix <- reward_transition

Answer 1

您可以使用基础包中的 by() 函数，其简短描述为 "Apply a Function to a Data Frame Split by Factors"（参考：help(by)）

这是一个使用您的数据框术语和主题 ID 变量名称的示例：

# Make the simulated data reproducible
set.seed(1717)

# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')

# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind=ids,
                        y=rnorm(length(ids)),
                        x=rnorm(length(ids)))
head(df_merged2)

数据的前 6 行如下所示：

  sjind          y          x
1     A -1.4548934  1.1004932
2     B -1.7084245 -0.7731208
3     B  2.1004557 -1.6229203
4     A -1.0283021  0.4233806
5     A  0.4133888  1.2398577
6     B -1.4104637  0.3746706

现在使用 by() 函数为由 sjind 个唯一值定义的每个组拟合 GLM 模型：

glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
                              function(df) glm(y ~ x, data=df))

输出对象glm_by_sjind是一个具有以下属性的列表：

它的元素数量与 sjind 中唯一值的数量一样多（在本例中为 3）
它由 sjind 变量的唯一值索引（在本例中为 "A"、"B"、"C"）
每个元素包含 glm() 运行对输入数据框的每个分割的回归输出（其中分割由 sjind 唯一值明确定义）

因此，例如，您可以请求主题 "B" 的回归输出摘要，如下所示：

> summary(glm_by_sjind[["B"]])

Call:
glm(formula = y ~ x, data = df)

Deviance Residuals: 
       2         3         6         7         8        12  
-1.40226   1.59040  -0.00186   0.06400  -1.93118   1.68091  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0487     0.7472  -1.404    0.233
x            -0.9605     0.9170  -1.047    0.354

(Dispersion parameter for gaussian family taken to be 2.763681)

    Null deviance: 14.087  on 5  degrees of freedom
Residual deviance: 11.055  on 4  degrees of freedom
AIC: 26.694

Number of Fisher Scoring iterations: 2

如果我们更进一步，我们还可以执行 健全性检查，即每个 GLM 模型都基于预期的案例数（即每个模型中的案例数应该等于输入数据帧中 sjind 变量的频率分布。

freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_results, function(glm) NROW(glm$data) ) 
all.equal( freq_sjind_in_data,
           ncases_in_each_glm )

其中 returns TRUE.

或者也可以目视检查：

as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)

哪个return

  A B C
1 3 6 4

在这两种情况下。

在不使用 R 中的 lme4 包的情况下估计混合级逻辑回归系数

Estimating mix-level logistic regression coefficients without using lme4 package in R

r

subset

lme4

mixed-models

logistic-regression