Maintain data frame rows after subset
I am trying to calculate a percentage yield for some data based on a subset:
library(plyr)     # for ddply
library(ggplot2)  # for ggplot
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Subset data based on measurement range
pass <- subset(df, Measurement > 6 & Measurement < 7)
# Calculate number of rows in data frame (by Batch then ID)
ac <- ddply(df, c("Batch", "ID"), nrow)
colnames(ac) <- c("Batch", "ID", "Total")
# Calculate number of rows in subset (by Batch then ID)
bc <- ddply(pass, c("Batch", "ID"), nrow)
colnames(bc) <- c("Batch", "ID", "Pass")
# Calculate yield
bc$Yield <- (bc$Pass / ac$Total) * 100
# plot yield
ggplot(bc, aes(ID, Yield, colour=Batch)) + geom_point()
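My problem is that, because of my filter range (between 6 and 7), the per-group counts from my subset (pass) have fewer rows than those from my full data frame (df):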
nrow(ac)
[1] 100
nrow(bc)
[1] 83
So I cannot use
bc$Yield <- (bc$Pass / ac$Total) * 100
without getting the error
replacement has 100 rows, data has 83
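The reason I am trying to keep this generic is that my real data has varying numbers of batches and IDs (otherwise I could just divide by a constant in my yield calculation). Can anyone tell me how to put a 0 in my subset when the data falls outside the limits (6 to 7 in this case), or point me to a more elegant way to calculate yield? Thanks.
Update: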
str(df)
'data.frame': 1000 obs. of 3 variables:
$ Batch : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: num 5.04 4.63 2.26 3.8 5.59 ...
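I think this is what you want. I've done it here using dplyr's group_by and summarize.
For each Batch/ID, it calculates the number of observations, the number of observations with a measurement between 6 and 7, and the ratio of the two. Because the passing count is computed inside each group rather than from a pre-filtered subset, groups with no passing measurements get a count of 0 instead of dropping out.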
library(dplyr)
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Count the measurements that fall strictly between 6 and 7
countFunc <- function(x) sum((x > 6) & (x < 7))
# Calculate number of rows, rows that meet criteria, and yield
# (yield is a proportion here; multiply by 100 for a percentage).
totals <- df %>%
  group_by(Batch, ID) %>%
  summarize(total = length(Measurement), x = countFunc(Measurement)) %>%
  mutate(yield = x / total) %>%
  as.data.frame()
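If you would rather keep the original two-table approach and literally put a 0 where a Batch/ID has no passing rows, one possible sketch is a left join of the totals onto the pass counts. This is only an illustration under some assumptions: it swaps base R's aggregate in for ddply so that it runs without plyr, and the name res is made up for the example.

```r
# Same example data as in the question
set.seed(10)
df <- data.frame(
  Batch = factor(rep(1:10, each = 100)),
  ID = factor(rep(1:100, each = 10)),
  Measurement = rnorm(1000, 5, 2)
)

# Total rows per Batch/ID (aggregate stands in for ddply here)
ac <- aggregate(Measurement ~ Batch + ID, data = df, FUN = length)
names(ac)[3] <- "Total"

# Passing rows per Batch/ID; groups with no passing rows are absent
pass <- subset(df, Measurement > 6 & Measurement < 7)
bc <- aggregate(Measurement ~ Batch + ID, data = pass, FUN = length)
names(bc)[3] <- "Pass"

# Left join keeps every row of ac; unmatched groups get NA in Pass
res <- merge(ac, bc, by = c("Batch", "ID"), all.x = TRUE)

# Replace the NAs with 0, then compute the percentage yield
res$Pass[is.na(res$Pass)] <- 0
res$Yield <- res$Pass / res$Total * 100
```

Matching on the Batch and ID keys, rather than relying on row position, is what avoids the length mismatch between the two tables.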