如何计算出我的数据的 theta 值以用于负二项式 GLM?
How to work out theta value of my data for use in negative binomial GLM?
我正在尝试对计数数据集执行 GLM,但发现我的数据过于分散,因此不适合使用泊松 GLM。我知道我必须改用负二项式 GLM,这需要一个 theta 值。但是,当我尝试 运行 我的模型摘要时,出现以下一系列错误,并且无法找到 theta 值。对此的任何帮助将不胜感激。我将总结我的数据集和我的代码,用于生成模型的总结和下面的错误。
数据集摘要:
用于 GLM 的数据是总数(计数数据)和治疗(代表不同治疗的字母,例如 C、M、F)
用于生成 theta 的代码:
summary(m1 <- glm.nb(Total ~ Treatment, data = twohour))
此代码的输出,底部有错误:
对于生成 theta 值的任何帮助,我们将不胜感激。提前致谢。
根据要求,摘要和模型输出为文本:
总结:
> summary(twohour)
Treatment | Length | ID | Block1 Block2 | Fertility | Notes | Total
Length:252 | Length:252 | Min. : 1.00 | Min. : 0.0 Min. : 0.00 Min. :0.0000 Length:252 Min. : 0.0
Class :character Class :character 1st Qu.:10.00 1st Qu.:125.8 1st Qu.: 39.50 1st Qu.:1.0000 Class :character 1st Qu.:172.2
Mode :character Mode :character Median :19.50 Median :154.0 Median :104.50 Median :1.0000 Mode :character Median :263.0
Mean :19.89 Mean :143.5 Mean : 94.66 Mean :0.9683 Mean :238.1
3rd Qu.:30.00 3rd Qu.:179.2 3rd Qu.:146.00 3rd Qu.:1.0000 3rd Qu.:309.5
Max. :40.00 Max. :227.0 Max. :228.00 Max. :1.0000 Max. :434.0
模型输出:
> Call: glm.nb(formula = Total ~ Treatment, data = twohour, init.theta =
> 2055605.705,
> link = log)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -23.001 -4.624 1.650 4.567 12.571
>
> Coefficients:
Estimate Std. Error z value Pr(>|z|) (Intercept) 5.577987 0.009846 566.534 < 2e-16 *** TreatmentC -0.102625 0.014394 -7.130 1.01e-12 *** TreatmentF -0.154580 0.014396 -10.737 < 2e-16 *** TreatmentF30 -0.298972 0.019920 -15.008 < 2e-16 *** TreatmentM -0.158733 0.014613 -10.862 < 2e-16 ***
TreatmentM30 -0.044795 0.013992 -3.201 0.00137 ** TreatmentMxF
-0.105191 0.014211 -7.402 1.34e-13 ***
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(2055606) family taken to
be 1)
Null deviance: 15127 on 251 degrees of freedom Residual deviance: 14799 on 245 degrees of freedom AIC: 16542
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width,
3L, : invalid 'nsmall' argument In addition: Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =control$trace : iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
4: In sqrt(1/i) : NaNs produced
tl;dr 我怀疑这是由离群值引起的,尤其是 (??) 一些与数据集其余部分不一致的零值。如果零值不是 errors/weird 情况,您可能会考虑 零膨胀 模型 ...???
我们可能需要您的数据才能确定发生了什么。以下是我目前可以收集到的信息:
- 您的结果的某些方面看起来像 分散不足(theta 估计大得离谱,“达到迭代限制”警告...
- ...但我同意您的数据似乎 过度分散 (剩余偏差与剩余 df 的比率很大;范围从 0 到 434,平均值为 238)
- ...偏差残差的极端范围(-23 到 +12)表明异常值(偏差残差基本上在对数尺度上...)
我可以通过构建一个主要是泊松分布但有一些极端异常值的数据集来完成大部分工作:
n <- 252 ## total number of obs
ng <- 7 ## number of groups/treatments
mu <- exp(6) ## mean response
## NOTE: this doesn't match your data, I did it accidentally,
## but it does reproduce the errors.
set.seed(101)
dd <- data.frame(
## mostly Poisson, but with 2 values at the min and max values
y = c(rpois(n-4, lambda=mu), rep(c(0,434), each=2)),
f = factor(rep(1:ng, length.out = n))
)
summary(dd)
library(MASS)
m2 <- glm.nb(y~f, data = dd)
(零值看起来是最大的问题。我可以用 2 个(但不是 1 个)零异常值重现该问题,其余数据泊松均值很大...)
Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
4: In sqrt(1/i) : NaNs produced
结果:
Call:
glm.nb(formula = y ~ f, data = dd, init.theta = 12474197.56,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-28.1363 -0.4887 0.0444 0.7153 3.4771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.0005206 0.0082958 723.319 < 2e-16 ***
f2 0.0006192 0.0117302 0.053 0.95790
f3 0.0015129 0.0117276 0.129 0.89736
f4 -0.0328793 0.0118297 -2.779 0.00545 **
f5 -0.0195274 0.0117898 -1.656 0.09766 .
f6 0.0068583 0.0117120 0.586 0.55816
f7 -0.0087784 0.0117579 -0.747 0.45531
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(12474198) family taken to be 1)
Null deviance: 1837.2 on 251 degrees of freedom
Residual deviance: 1820.0 on 245 degrees of freedom
AIC: 3795.6
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument
一点点挖掘表明这个特定错误是由于无法计算 theta 估计的标准误差(它是 NaN
)造成的...
查看诊断 (plot(m2)
) 清楚地显示异常值:
以下工作正常(或多或少:它给出了荒谬的theta
估计,因为数据不是过度分散一旦零-inflation被考虑).
library(pscl)
zeroinfl(y~f, dist="negbin",data = dd)
我正在尝试对计数数据集执行 GLM,但发现我的数据过于分散,因此不适合使用泊松 GLM。我知道我必须改用负二项式 GLM,这需要一个 theta 值。但是,当我尝试 运行 我的模型摘要时,出现以下一系列错误,并且无法找到 theta 值。对此的任何帮助将不胜感激。我将总结我的数据集和我的代码,用于生成模型的总结和下面的错误。
数据集摘要:
用于 GLM 的数据是总数(计数数据)和治疗(代表不同治疗的字母,例如 C、M、F)
用于生成 theta 的代码:
summary(m1 <- glm.nb(Total ~ Treatment, data = twohour))
此代码的输出,底部有错误:
对于生成 theta 值的任何帮助,我们将不胜感激。提前致谢。
根据要求,摘要和模型输出为文本:
总结:
> summary(twohour)
Treatment | Length | ID | Block1 Block2 | Fertility | Notes | Total
Length:252 | Length:252 | Min. : 1.00 | Min. : 0.0 Min. : 0.00 Min. :0.0000 Length:252 Min. : 0.0
Class :character Class :character 1st Qu.:10.00 1st Qu.:125.8 1st Qu.: 39.50 1st Qu.:1.0000 Class :character 1st Qu.:172.2
Mode :character Mode :character Median :19.50 Median :154.0 Median :104.50 Median :1.0000 Mode :character Median :263.0
Mean :19.89 Mean :143.5 Mean : 94.66 Mean :0.9683 Mean :238.1
3rd Qu.:30.00 3rd Qu.:179.2 3rd Qu.:146.00 3rd Qu.:1.0000 3rd Qu.:309.5
Max. :40.00 Max. :227.0 Max. :228.00 Max. :1.0000 Max. :434.0
模型输出:
> Call: glm.nb(formula = Total ~ Treatment, data = twohour, init.theta =
> 2055605.705,
> link = log)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -23.001 -4.624 1.650 4.567 12.571
>
> Coefficients:
Estimate Std. Error z value Pr(>|z|) (Intercept) 5.577987 0.009846 566.534 < 2e-16 *** TreatmentC -0.102625 0.014394 -7.130 1.01e-12 *** TreatmentF -0.154580 0.014396 -10.737 < 2e-16 *** TreatmentF30 -0.298972 0.019920 -15.008 < 2e-16 *** TreatmentM -0.158733 0.014613 -10.862 < 2e-16 ***
TreatmentM30 -0.044795 0.013992 -3.201 0.00137 ** TreatmentMxF
-0.105191 0.014211 -7.402 1.34e-13 ***
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(2055606) family taken to
be 1)
Null deviance: 15127 on 251 degrees of freedom Residual deviance: 14799 on 245 degrees of freedom AIC: 16542
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width,
3L, : invalid 'nsmall' argument In addition: Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =control$trace : iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
4: In sqrt(1/i) : NaNs produced
tl;dr 我怀疑这是由离群值引起的,尤其是 (??) 一些与数据集其余部分不一致的零值。如果零值不是 errors/weird 情况,您可能会考虑 零膨胀 模型 ...???
我们可能需要您的数据才能确定发生了什么。以下是我目前可以收集到的信息:
- 您的结果的某些方面看起来像 分散不足(theta 估计大得离谱,“达到迭代限制”警告...
- ...但我同意您的数据似乎 过度分散 (剩余偏差与剩余 df 的比率很大;范围从 0 到 434,平均值为 238)
- ...偏差残差的极端范围(-23 到 +12)表明异常值(偏差残差基本上在对数尺度上...)
我可以通过构建一个主要是泊松分布但有一些极端异常值的数据集来完成大部分工作:
n <- 252 ## total number of obs
ng <- 7 ## number of groups/treatments
mu <- exp(6) ## mean response
## NOTE: this doesn't match your data, I did it accidentally,
## but it does reproduce the errors.
set.seed(101)
dd <- data.frame(
## mostly Poisson, but with 2 values at the min and max values
y = c(rpois(n-4, lambda=mu), rep(c(0,434), each=2)),
f = factor(rep(1:ng, length.out = n))
)
summary(dd)
library(MASS)
m2 <- glm.nb(y~f, data = dd)
(零值看起来是最大的问题。我可以用 2 个(但不是 1 个)零异常值重现该问题,其余数据泊松均值很大...)
Warning messages:
1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
2: In sqrt(1/i) : NaNs produced
3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
4: In sqrt(1/i) : NaNs produced
结果:
Call:
glm.nb(formula = y ~ f, data = dd, init.theta = 12474197.56,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-28.1363 -0.4887 0.0444 0.7153 3.4771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.0005206 0.0082958 723.319 < 2e-16 ***
f2 0.0006192 0.0117302 0.053 0.95790
f3 0.0015129 0.0117276 0.129 0.89736
f4 -0.0328793 0.0118297 -2.779 0.00545 **
f5 -0.0195274 0.0117898 -1.656 0.09766 .
f6 0.0068583 0.0117120 0.586 0.55816
f7 -0.0087784 0.0117579 -0.747 0.45531
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(12474198) family taken to be 1)
Null deviance: 1837.2 on 251 degrees of freedom
Residual deviance: 1820.0 on 245 degrees of freedom
AIC: 3795.6
Number of Fisher Scoring iterations: 1
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : invalid 'nsmall' argument
一点点挖掘表明这个特定错误是由于无法计算 theta 估计的标准误差(它是 NaN
)造成的...
查看诊断 (plot(m2)
) 清楚地显示异常值:
以下工作正常(或多或少:它给出了荒谬的theta
估计,因为数据不是过度分散一旦零-inflation被考虑).
library(pscl)
zeroinfl(y~f, dist="negbin",data = dd)