如何使用 ifelse 语句以 data.table 语法分组取值?
How to use an ifelse statement to take means by groups with data.table syntax?
我使用的 data.table 代码工作正常,但我无法转换以包含 ifelse
语句。我使用以下代表:
set.seed(1645)
Place <- c(rep("Copenhagen",7),rep("Berlin",11),rep("Roma",12))
Year <- c(rep("2020",4),rep("2021",3),rep("2020",6),rep("2021",5),rep("2019",4),rep("2020",4),rep("2021",4))
Value1 <- c(runif(3),NA,runif(8),NA,runif(9),NA,runif(7))
Value2 <- c(runif(4),NA,runif(2),runif(6),NA,NA,runif(11),NA,NA,runif(2))
df <- data.frame(Place,Year,Value1,Value2)
> df
Place Year Value1 Value2
1 Copenhagen 2020 0.10517697 0.865935100
2 Copenhagen 2020 0.96597760 0.579956282
3 Copenhagen 2020 0.47262307 0.346569960
4 Copenhagen 2020 NA 0.478763951
5 Copenhagen 2021 0.90030423 NA
6 Copenhagen 2021 0.14444142 0.280377315
7 Copenhagen 2021 0.73801550 0.302816525
8 Berlin 2020 0.13961383 0.641314310
9 Berlin 2020 0.40221211 0.756374251
10 Berlin 2020 0.49613139 0.070459347
11 Berlin 2020 0.95190545 0.184497038
12 Berlin 2020 0.40182901 0.407892240
13 Berlin 2020 NA 0.002209376
14 Berlin 2021 0.38310025 NA
15 Berlin 2021 0.76417492 NA
16 Berlin 2021 0.29001287 0.632133629
17 Berlin 2021 0.84478784 0.365406326
18 Berlin 2021 0.55547323 0.493870653
19 Roma 2019 0.44198733 0.067744090
20 Roma 2019 0.50403809 0.847876518
21 Roma 2019 0.85358805 0.952393606
22 Roma 2019 0.74996137 0.887583928
23 Roma 2020 NA 0.631937527
24 Roma 2020 0.08303509 0.993400333
25 Roma 2020 0.74205719 0.589183185
26 Roma 2020 0.27552659 0.522451407
27 Roma 2021 0.39518410 NA
28 Roma 2021 0.38390124 NA
29 Roma 2021 0.36605674 0.942102065
30 Roma 2021 0.32014949 0.375689863
如果组中的 NA <= 25%,我想按地点和年份计算平均值。没有我的条件,这工作正常:
setDT(df)
df_means <- df[,.(Value1_mean = mean(Value1),Value2_mean = mean(Value2)), by = .(Place,Year)]
> df_means
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 NA 0.4257258
2: Copenhagen 2021 0.3581245 NA
3: Berlin 2020 NA 0.3935807
4: Berlin 2021 0.3729461 NA
5: Roma 2019 0.4572996 0.3956536
6: Roma 2020 NA 0.6494491
7: Roma 2021 0.4142637 NA
我没有包含 ifelse
语句,这不起作用:
df_means2 <- df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>=0.25,NA,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>=0.25,NA,mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
我检查了这些帖子 , , and 但没有解决我的问题。我的预期结果应该是这样的:
> df_means2
Place Year Value1_mean Value2_mean
1 Copenhagen 2020 mean mean
2 Copenhagen 2021 mean <NA>
3 Berlin 2020 mean mean
4 Berlin 2021 mean <NA>
5 Roma 2019 mean mean
6 Roma 2020 mean mean
7 Roma 2021 mean <NA>
如何转换我的代码?
我们可以用if/else
library(data.table)
df[, lapply(.SD, function(x) if(mean(is.na(x)) <= 0.25)
mean(x, na.rm = TRUE) else NA_real_), by = .(Place, Year)]
-输出
Place Year Value1 Value2
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA
在 OP 的代码中,使用 NA_real_
而不是 NA
,默认情况下是 logical
这会在 class
中产生冲突
df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>0.25,
NA_real_,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>0.25,
NA_real_,
mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
-输出
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA
我使用的 data.table 代码工作正常,但我无法转换以包含 ifelse
语句。我使用以下代表:
set.seed(1645)
Place <- c(rep("Copenhagen",7),rep("Berlin",11),rep("Roma",12))
Year <- c(rep("2020",4),rep("2021",3),rep("2020",6),rep("2021",5),rep("2019",4),rep("2020",4),rep("2021",4))
Value1 <- c(runif(3),NA,runif(8),NA,runif(9),NA,runif(7))
Value2 <- c(runif(4),NA,runif(2),runif(6),NA,NA,runif(11),NA,NA,runif(2))
df <- data.frame(Place,Year,Value1,Value2)
> df
Place Year Value1 Value2
1 Copenhagen 2020 0.10517697 0.865935100
2 Copenhagen 2020 0.96597760 0.579956282
3 Copenhagen 2020 0.47262307 0.346569960
4 Copenhagen 2020 NA 0.478763951
5 Copenhagen 2021 0.90030423 NA
6 Copenhagen 2021 0.14444142 0.280377315
7 Copenhagen 2021 0.73801550 0.302816525
8 Berlin 2020 0.13961383 0.641314310
9 Berlin 2020 0.40221211 0.756374251
10 Berlin 2020 0.49613139 0.070459347
11 Berlin 2020 0.95190545 0.184497038
12 Berlin 2020 0.40182901 0.407892240
13 Berlin 2020 NA 0.002209376
14 Berlin 2021 0.38310025 NA
15 Berlin 2021 0.76417492 NA
16 Berlin 2021 0.29001287 0.632133629
17 Berlin 2021 0.84478784 0.365406326
18 Berlin 2021 0.55547323 0.493870653
19 Roma 2019 0.44198733 0.067744090
20 Roma 2019 0.50403809 0.847876518
21 Roma 2019 0.85358805 0.952393606
22 Roma 2019 0.74996137 0.887583928
23 Roma 2020 NA 0.631937527
24 Roma 2020 0.08303509 0.993400333
25 Roma 2020 0.74205719 0.589183185
26 Roma 2020 0.27552659 0.522451407
27 Roma 2021 0.39518410 NA
28 Roma 2021 0.38390124 NA
29 Roma 2021 0.36605674 0.942102065
30 Roma 2021 0.32014949 0.375689863
如果组中的 NA <= 25%,我想按地点和年份计算平均值。没有我的条件,这工作正常:
setDT(df)
df_means <- df[,.(Value1_mean = mean(Value1),Value2_mean = mean(Value2)), by = .(Place,Year)]
> df_means
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 NA 0.4257258
2: Copenhagen 2021 0.3581245 NA
3: Berlin 2020 NA 0.3935807
4: Berlin 2021 0.3729461 NA
5: Roma 2019 0.4572996 0.3956536
6: Roma 2020 NA 0.6494491
7: Roma 2021 0.4142637 NA
我没有包含 ifelse
语句,这不起作用:
df_means2 <- df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>=0.25,NA,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>=0.25,NA,mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
我检查了这些帖子
> df_means2
Place Year Value1_mean Value2_mean
1 Copenhagen 2020 mean mean
2 Copenhagen 2021 mean <NA>
3 Berlin 2020 mean mean
4 Berlin 2021 mean <NA>
5 Roma 2019 mean mean
6 Roma 2020 mean mean
7 Roma 2021 mean <NA>
如何转换我的代码?
我们可以用if/else
library(data.table)
df[, lapply(.SD, function(x) if(mean(is.na(x)) <= 0.25)
mean(x, na.rm = TRUE) else NA_real_), by = .(Place, Year)]
-输出
Place Year Value1 Value2
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA
在 OP 的代码中,使用 NA_real_
而不是 NA
,默认情况下是 logical
这会在 class
df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>0.25,
NA_real_,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>0.25,
NA_real_,
mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
-输出
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA