如何逐行计算与模式匹配并满足某些条件的特定列的列乘积?
How to calculate the column product for specific columns matching a pattern and meet some conditions, row-by-row?
假设我有以下 data.frame
:
dt = tibble::tibble(
id_0 = rep(123, 6),
name_0 = rep("A", 6),
id_1 = c(rep(321, 3), rep(322, 3)),
name_1 = c(rep("B", 3), rep("C", 3)),
p_1 = c(rep(0.7, 3), rep(0.3, 3)),
id_2 = c(NA, 323:326, NA),
name_2 = c(NA, "D", "E", "J", "G", NA),
p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
id_3 = c(NA, NA, 323, NA, NA, NA),
na_3 = c(NA, NA, "H", NA, NA, NA),
p_3 = c(NA, NA, 1, NA, NA, NA),
)
看起来像这样:
# A tibble: 6 x 11
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr>
1 123 A 321 B 0.7 NA NA NA NA NA
2 123 A 321 B 0.7 323 D 0.8 NA NA
3 123 A 321 B 0.7 324 E 0.2 323 H
4 123 A 322 C 0.3 325 J 0.9 NA NA
5 123 A 322 C 0.3 326 G 0.1 NA NA
6 123 A 322 C 0.3 NA NA NA NA NA
我需要从所有 p_*
列中逐行乘积。在这种情况下,它将是 Product = p_1 * p_2 * p_3
,但通常它可以是从 p_1 到 p_* 的任何产品(这个 data.frame
因情况而异,我的意思是 Product = product(p_1, p_2, ..., p_n)
) . 请注意, p_*
始终大于 cero 且小于或等于 1 (p_ > 0 & p_ <= 1)。所以我需要完成的任务有两件事:Product
1) 必须省略 NAs
和 2) 对于 data.frame
中存在的任意数量的 p_*
是通用的。
理想的输出应如下所示:
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 J 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
我建议整形手术:
library(dplyr)
library(tidyr) # pivot_longer
# preserve a row-wise "id"
dt <- mutate(dt, rn = row_number())
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
# # A tibble: 24 x 6
# rn num id name p na
# <int> <chr> <dbl> <chr> <dbl> <chr>
# 1 1 0 123 A NA <NA>
# 2 1 1 321 B 0.7 <NA>
# 3 1 2 NA <NA> NA <NA>
# 4 1 3 NA <NA> NA <NA>
# 5 2 0 123 A NA <NA>
# 6 2 1 321 B 0.7 <NA>
# 7 2 2 323 D 0.8 <NA>
# 8 2 3 NA <NA> NA <NA>
# 9 3 0 123 A NA <NA>
# 10 3 1 321 B 0.7 <NA>
# # ... with 14 more rows
有了这个,我们可以很容易地group_by
,计算乘积,...
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE))
# # A tibble: 6 x 2
# rn Product
# <int> <dbl>
# 1 1 0.7
# 2 2 0.560
# 3 3 0.140
# 4 4 0.27
# 5 5 0.03
# 6 6 0.3
...然后加入到 dt
。
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE)) %>%
left_join(dt, ., by = "rn") %>%
select(-rn)
# # A tibble: 6 x 12
# id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
# <dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 123 A 321 B 0.7 NA <NA> NA NA <NA> NA 0.7
# 2 123 A 321 B 0.7 323 D 0.8 NA <NA> NA 0.560
# 3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.140
# 4 123 A 322 C 0.3 325 F 0.9 NA <NA> NA 0.27
# 5 123 A 322 C 0.3 326 G 0.1 NA <NA> NA 0.03
# 6 123 A 322 C 0.3 NA <NA> NA NA <NA> NA 0.3
(旁注:根据您对 “任意数量的 p_*
” 的评论,将数据保存为更长的格式(出于pivot_longer
) 进行更多处理。)
一个衬里(也许可以改进为更 dplyr 的形式):
> dt$Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T)
> dt
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 F 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
利用magrittr
%<>%
:
可以这样写
dt %<>% mutate(Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T))
假设我有以下 data.frame
:
dt = tibble::tibble(
id_0 = rep(123, 6),
name_0 = rep("A", 6),
id_1 = c(rep(321, 3), rep(322, 3)),
name_1 = c(rep("B", 3), rep("C", 3)),
p_1 = c(rep(0.7, 3), rep(0.3, 3)),
id_2 = c(NA, 323:326, NA),
name_2 = c(NA, "D", "E", "J", "G", NA),
p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
id_3 = c(NA, NA, 323, NA, NA, NA),
na_3 = c(NA, NA, "H", NA, NA, NA),
p_3 = c(NA, NA, 1, NA, NA, NA),
)
看起来像这样:
# A tibble: 6 x 11
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr>
1 123 A 321 B 0.7 NA NA NA NA NA
2 123 A 321 B 0.7 323 D 0.8 NA NA
3 123 A 321 B 0.7 324 E 0.2 323 H
4 123 A 322 C 0.3 325 J 0.9 NA NA
5 123 A 322 C 0.3 326 G 0.1 NA NA
6 123 A 322 C 0.3 NA NA NA NA NA
我需要从所有 p_*
列中逐行乘积。在这种情况下,它将是 Product = p_1 * p_2 * p_3
,但通常它可以是从 p_1 到 p_* 的任何产品(这个 data.frame
因情况而异,我的意思是 Product = product(p_1, p_2, ..., p_n)
) . 请注意, p_*
始终大于 cero 且小于或等于 1 (p_ > 0 & p_ <= 1)。所以我需要完成的任务有两件事:Product
1) 必须省略 NAs
和 2) 对于 data.frame
中存在的任意数量的 p_*
是通用的。
理想的输出应如下所示:
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 J 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
我建议整形手术:
library(dplyr)
library(tidyr) # pivot_longer
# preserve a row-wise "id"
dt <- mutate(dt, rn = row_number())
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
# # A tibble: 24 x 6
# rn num id name p na
# <int> <chr> <dbl> <chr> <dbl> <chr>
# 1 1 0 123 A NA <NA>
# 2 1 1 321 B 0.7 <NA>
# 3 1 2 NA <NA> NA <NA>
# 4 1 3 NA <NA> NA <NA>
# 5 2 0 123 A NA <NA>
# 6 2 1 321 B 0.7 <NA>
# 7 2 2 323 D 0.8 <NA>
# 8 2 3 NA <NA> NA <NA>
# 9 3 0 123 A NA <NA>
# 10 3 1 321 B 0.7 <NA>
# # ... with 14 more rows
有了这个,我们可以很容易地group_by
,计算乘积,...
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE))
# # A tibble: 6 x 2
# rn Product
# <int> <dbl>
# 1 1 0.7
# 2 2 0.560
# 3 3 0.140
# 4 4 0.27
# 5 5 0.03
# 6 6 0.3
...然后加入到 dt
。
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE)) %>%
left_join(dt, ., by = "rn") %>%
select(-rn)
# # A tibble: 6 x 12
# id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
# <dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 123 A 321 B 0.7 NA <NA> NA NA <NA> NA 0.7
# 2 123 A 321 B 0.7 323 D 0.8 NA <NA> NA 0.560
# 3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.140
# 4 123 A 322 C 0.3 325 F 0.9 NA <NA> NA 0.27
# 5 123 A 322 C 0.3 326 G 0.1 NA <NA> NA 0.03
# 6 123 A 322 C 0.3 NA <NA> NA NA <NA> NA 0.3
(旁注:根据您对 “任意数量的 p_*
” 的评论,将数据保存为更长的格式(出于pivot_longer
) 进行更多处理。)
一个衬里(也许可以改进为更 dplyr 的形式):
> dt$Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T)
> dt
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 F 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
利用magrittr
%<>%
:
dt %<>% mutate(Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T))