Frequency table with a second variable as an "analytic weight" in R
I would like to create a frequency table in R that treats another variable as a weight.
More precisely, as an "analytic weight", as in Stata. According to its help file:
aweights, or analytic weights, are weights that are inversely
proportional to the variance of an observation; i.e., the variance of
the jth observation is assumed to be sigma^2/w_j, where w_j are the
weights. Typically, the observations represent averages and the
weights are the number of elements that gave rise to the average.
For most Stata commands, the recorded scale of aweights is
irrelevant; Stata internally rescales them to sum to N, the number of
observations in your data, when it uses them.
A valuable contribution from a Stack Overflow member was:
> Table_WEIGHT <- xtabs(WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
> Prop <- prop.table(Table_WEIGHT)
> Cum <- cumsum(100 * Prop / sum(Prop))
> Cum
1 2 3 4 5 6 7
14.35397 29.14973 43.23935 57.31355 71.50782 85.80359 100.00000
> out <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
+ Prop = as.numeric(Prop), Cum = as.numeric(Cum))
> out
  INTERVIEW_DAY        Freq      Prop       Cum
1             1 11803438268 0.1435397  14.35397
2             2 12166729888 0.1479576  29.14973
3             3 11586059070 0.1408962  43.23935
4             4 11573379591 0.1407420  57.31355
5             5 11672116808 0.1419427  71.50782
6             6 11755579310 0.1429577  85.80359
7             7 11673877965 0.1419641 100.00000
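Nevertheless, the frequencies are still not what I expect, because this uses the plain sum of the second variable as the weight rather than the "analytic weight" described above.
The desired table should be: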
     (mean) |
interview_d |
         ay |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 | 2,974.1424       14.35       14.35
          2 | 3,065.6819       14.80       29.15
          3 | 2,919.3688       14.09       43.24
          4 |2,916.17392       14.07       57.31
          5 |2,941.05299       14.19       71.51
          6 | 2,962.0832       14.30       85.80
          7 | 2,941.4968       14.20      100.00
------------+-----------------------------------
      Total |     20,720      100.00
Note that the "Freq." column is completely different.
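One way to read the difference, offered as a rough check rather than a statement of what Stata does internally: the proportions in the two tables agree, and the Stata frequencies look like those proportions scaled by the total of 20,720 shown in the Stata output. The sketch below uses the Prop values printed in the R output above; the names `prop`, `n_obs`, and `freq_rescaled` are made up for illustration.

```r
# Proportions copied from the R output above (rounded to 7 digits).
prop <- c(0.1435397, 0.1479576, 0.1408962, 0.1407420,
          0.1419427, 0.1429577, 0.1419641)

# Total taken from the "Total" row of the Stata table.
n_obs <- 20720

# Scale the proportions so that they sum to (approximately) n_obs.
freq_rescaled <- prop * n_obs
round(freq_rescaled, 2)
```

The scaled values agree with the Stata "Freq." column to about two decimal places; the small remaining gap comes from the rounding of the printed proportions.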
Here are samples of the two variables, INTERVIEW_DATE and WEIGHT: the interview date and the weight, which were not specified in the original post.
> timeuse_2003$INTERVIEW_DATE[1:15]
[1] "2003-01-03" "2003-01-04" "2003-01-04" "2003-01-02" "2003-01-09" "2003-01-02" "2003-01-06"
[8] "2003-01-07" "2003-01-04" "2003-01-09" "2003-01-04" "2003-01-05" "2003-01-04" "2003-01-01"
[15] "2003-01-04"
> timeuse_2003$WEIGHT[1:15]
[1] 8155462.7 1735322.5 3830527.5 6622023.0 3068387.3 3455424.9 1637826.3 6574426.8 1528296.3
[10] 4277052.8 1961482.3 505227.2 2135476.8 5366309.3 1058351.1
I would appreciate any contribution.
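What you are asking for can be done as follows:
library(tidyverse)
# Frequencies typed in from the desired Stata table
a <- tibble(interview_day = 1:7,
            frequency = c(2974.1424, 3065.6819, 2919.3688, 2916.17392,
                          2941.05299, 2962.0832, 2941.4968)) %>%
  mutate(percent = frequency / sum(frequency),  # share of the total (0 to 1)
         cum_pct = cumsum(percent)) %>%         # running share
  # append a totals row holding the column sums of frequency and percent
  bind_rows(t(colSums(.)[2:3]) %>% as.data.frame())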
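Here is a solution using only base R:
# Frequencies typed in from the desired Stata table
df <- data.frame(frequency = c(2974.1424, 3065.6819, 2919.3688, 2916.17392,
                               2941.05299, 2962.0832, 2941.4968))
df$interview_day <- seq_len(nrow(df))
df$percent <- df$frequency / sum(df$frequency)  # share of the total (0 to 1)
df$cum_pct <- cumsum(df$percent)                # running share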
I found an inelegant solution based on the Stata help file.
I just added the line
timeuse_2003$N_WEIGHT <- timeuse_2003$WEIGHT * 20720 / sum(timeuse_2003$WEIGHT)
and kept the rest of the code, now using N_WEIGHT:
Table_WEIGHT <- xtabs(N_WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
Prop <- prop.table(Table_WEIGHT)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
Freq_Table <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
Prop = as.numeric(Prop), Cum = as.numeric(Cum))
Freq_Table
The table is then correct:
> Freq_Table
  INTERVIEW_DAY      Freq       Prop        Cum
1             1 2974.1424 0.14353969  14.353969
2             2 3065.6819 0.14795762  29.149731
3             3 2919.3688 0.14089618  43.239349
4             4 2916.1739 0.14074198  57.313547
5             5 2941.0530 0.14194271  71.507819
6             6 2962.0832 0.14295769  85.803587
7             7 2941.4968 0.14196413 100.000000
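It would be great if someone could shed light on how to replace the number of observations I typed in by hand with one that is computed automatically. This code will be used on different datasets, so I cannot edit each one and change the number of observations every time. Something like ".N" would be great!
Thanks!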
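A possible way to avoid hard-coding 20720 is to let R count the observations, for example with `length()` or `nrow()`. The following is a minimal sketch on made-up data, not a verified replication of Stata's `aweight` handling. The function name `aweight_freq_table`, its arguments, the treatment of missing values, and the toy data frame are all assumptions for illustration.

```r
# Hypothetical helper: frequency table with weights rescaled to sum to N,
# where N is counted from the data instead of being typed in by hand.
aweight_freq_table <- function(data, group, weight) {
  # Keep only rows where both the group and the weight are observed
  # (an assumption about how missing values should be treated).
  ok <- !is.na(data[[group]]) & !is.na(data[[weight]])
  g <- data[[group]][ok]
  w <- data[[weight]][ok]

  n_obs <- length(w)               # plays the role of the hard-coded 20720
  w_scaled <- w * n_obs / sum(w)   # rescaled weights now sum to n_obs

  freq <- tapply(w_scaled, g, sum) # weighted frequency per group
  prop <- freq / sum(freq)

  data.frame(group = names(freq),
             Freq  = as.numeric(freq),
             Prop  = as.numeric(prop),
             Cum   = 100 * cumsum(as.numeric(prop)),
             row.names = NULL)
}

# Toy data, invented for this example.
toy <- data.frame(INTERVIEW_DAY = c(1, 1, 2, 2, 2, 3),
                  WEIGHT        = c(10, 30, 20, 20, 40, 80))

res <- aweight_freq_table(toy, "INTERVIEW_DAY", "WEIGHT")
res
```

With the data from the question, the call would presumably be `aweight_freq_table(timeuse_2003, "INTERVIEW_DAY", "WEIGHT")`. Counting inside the function means the same code can be reused on other datasets without editing the number of observations.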