Filter rows based on colSums for groups

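In the data frame below, I want to remove every ID whose Freq is above 0.5 in more than 50% of its loc values. For example, all rows containing G__Achromobacter should be removed, because its Freq exceeds 0.5 in more than 50% of its loc values (4 of 5).

I have tried tidyverse group_by on loc together with colSums, but could not work it out.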

                  ID              loc absolute      Freq   variable     value
2     G__Abiotrophia            Brain        9 0.2294118 NotPresent 0.4705882
11    G__Abiotrophia      Gallbladder       13 0.1652174 NotPresent 0.4347826
12    G__Abiotrophia Gastroesophageal        7 0.1750000 NotPresent 0.1250000
31    G__Abiotrophia Urothelial tract       82 0.5503356 NotPresent 0.4496644
82  G__Achromobacter            Brain       11 0.1470588 NotPresent 0.3529412
93  G__Achromobacter    Head and neck       33 0.5409836 NotPresent 0.4590164
95  G__Achromobacter           Kidney       66 0.5365854 NotPresent 0.4634146
99  G__Achromobacter      Mesothelium       19 0.5135135 NotPresent 0.4864865
102 G__Achromobacter         Pancreas       63 0.5575221 NotPresent 0.4424779

Input

df <- structure(list(ID = c("G__Abiotrophia", "G__Abiotrophia", "G__Abiotrophia", 
"G__Abiotrophia", "G__Achromobacter", "G__Achromobacter", "G__Achromobacter", 
"G__Achromobacter", "G__Achromobacter"), loc = c("Brain", "Gallbladder", 
"Gastroesophageal", "Urothelial tract", "Brain", "Head and neck", 
"Kidney", "Mesothelium", "Pancreas"), absolute = c(9L, 13L, 7L, 
82L, 11L, 33L, 66L, 19L, 63L), Freq = c(0.229411764705882, 0.165217391304348, 
0.175, 0.550335570469799, 0.147058823529412, 0.540983606557377, 
0.536585365853659, 0.513513513513513, 0.557522123893805), variable = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("NotPresent", "Present"
), class = "factor"), value = c(0.470588235294118, 0.434782608695652, 
0.125, 0.449664429530201, 0.352941176470588, 0.459016393442623, 
0.463414634146341, 0.486486486486487, 0.442477876106195)), row.names = c(2L, 
11L, 12L, 31L, 82L, 93L, 95L, 99L, 102L), class = "data.frame")

I found a solution to your problem. It may not be the simplest one, but it works.

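library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(condition = ifelse(Freq > 0.5, 0, 1)) %>%   # 1 when Freq is at or below 0.5, else 0
  mutate(selected = sum(condition) / n()) %>%        # share of such rows within the ID
  filter(selected > 0.5) %>%                         # keep IDs where that share exceeds 50%
  select(!c(condition, selected))                    # drop the helper columns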
# A tibble: 4 × 6
# Groups:   ID [1]
  ID             loc              absolute  Freq variable   value
  <chr>          <chr>               <int> <dbl> <fct>      <dbl>
1 G__Abiotrophia Brain                   9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder            13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal        7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract       82 0.550 NotPresent 0.450

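Let me explain what is happening. First, the data is grouped by ID so that each function runs independently on every group. Each row is then checked for whether its Freq is above 0.5: rows that meet this condition get the value 0, and all other rows get 1. The sum of those 1s is divided by the number of rows in the group, and if that share is above 0.5 the group is kept. Finally, the helper columns I created are dropped, which leaves the data frame without the rows you wanted to remove. Note that an ID with exactly half of its rows above 0.5 is also dropped by this filter, because its share is not strictly greater than 0.5.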
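To make the per-group arithmetic visible, here is a minimal sketch that summarises each ID instead of filtering. The name df_small and the columns n_loc, n_high and share_high are made up for illustration; df_small is a reduced copy of the question's data holding only the columns the rule needs.

```r
library(dplyr)

# Hypothetical reduced copy of the question's data (ID, loc and Freq only).
df_small <- data.frame(
  ID = rep(c("G__Abiotrophia", "G__Achromobacter"), times = c(4, 5)),
  loc = c("Brain", "Gallbladder", "Gastroesophageal", "Urothelial tract",
          "Brain", "Head and neck", "Kidney", "Mesothelium", "Pancreas"),
  Freq = c(0.2294118, 0.1652174, 0.1750000, 0.5503356,
           0.1470588, 0.5409836, 0.5365854, 0.5135135, 0.5575221)
)

# One row per ID: number of rows, rows with Freq > 0.5, and their share.
share <- df_small %>%
  group_by(ID) %>%
  summarise(
    n_loc      = n(),
    n_high     = sum(Freq > 0.5),
    share_high = mean(Freq > 0.5)
  )

print(share)
```

G__Abiotrophia has 1 of 4 rows above 0.5 and G__Achromobacter has 4 of 5, so only the latter crosses the 50% threshold.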

Count the rows with Freq > 0.5 for each ID, and remove the ID if that count is greater than 50% of its number of unique loc values.

library(tidyverse)

df %>% 
  group_by(ID) %>% 
  # keep an ID only if at most half of its unique loc values have Freq > 0.5
  filter(sum(Freq > 0.5) <= n_distinct(loc) / 2)

# A tibble: 4 x 6
# Groups:   ID [1]
  ID             loc              absolute  Freq variable   value
  <chr>          <chr>               <int> <dbl> <fct>      <dbl>
1 G__Abiotrophia Brain                   9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder            13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal        7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract       82 0.550 NotPresent 0.450
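For comparison, the same rule can probably be written in base R without dplyr. This is only a sketch under the assumption that each ID has one row per loc, so the share of rows equals the share of loc values; df_small, share_high and result are names made up for illustration.

```r
# Hypothetical reduced copy of the question's data (ID, loc and Freq only).
df_small <- data.frame(
  ID = rep(c("G__Abiotrophia", "G__Achromobacter"), times = c(4, 5)),
  loc = c("Brain", "Gallbladder", "Gastroesophageal", "Urothelial tract",
          "Brain", "Head and neck", "Kidney", "Mesothelium", "Pancreas"),
  Freq = c(0.2294118, 0.1652174, 0.1750000, 0.5503356,
           0.1470588, 0.5409836, 0.5365854, 0.5135135, 0.5575221)
)

# ave() returns, for every row, the mean of the 0/1 indicator within its ID,
# i.e. the share of that ID's rows with Freq > 0.5.
share_high <- ave(as.numeric(df_small$Freq > 0.5), df_small$ID, FUN = mean)

# Keep rows whose ID has at most 50% of its rows above 0.5.
result <- df_small[share_high <= 0.5, ]

print(result)
```

With `<=`, an ID that sits at exactly 50% is kept, which matches the "more than 50%" wording of the question.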