Filter rows based on colSums for groups
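In the data frame below, I want to remove every ID whose Freq is above 0.5 in more than 50% of its loc values. For example, all rows containing G__Achromobacter should be removed, because its Freq exceeds 0.5 in more than 50% of its loc entries (4 of 5).
I have tried tidyverse with group_by on loc and colSums, but could not figure it out.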
ID loc absolute Freq variable value
2 G__Abiotrophia Brain 9 0.2294118 NotPresent 0.4705882
11 G__Abiotrophia Gallbladder 13 0.1652174 NotPresent 0.4347826
12 G__Abiotrophia Gastroesophageal 7 0.1750000 NotPresent 0.1250000
31 G__Abiotrophia Urothelial tract 82 0.5503356 NotPresent 0.4496644
82 G__Achromobacter Brain 11 0.1470588 NotPresent 0.3529412
93 G__Achromobacter Head and neck 33 0.5409836 NotPresent 0.4590164
95 G__Achromobacter Kidney 66 0.5365854 NotPresent 0.4634146
99 G__Achromobacter Mesothelium 19 0.5135135 NotPresent 0.4864865
102 G__Achromobacter Pancreas 63 0.5575221 NotPresent 0.4424779
Input:
df <- structure(list(ID = c("G__Abiotrophia", "G__Abiotrophia", "G__Abiotrophia",
"G__Abiotrophia", "G__Achromobacter", "G__Achromobacter", "G__Achromobacter",
"G__Achromobacter", "G__Achromobacter"), loc = c("Brain", "Gallbladder",
"Gastroesophageal", "Urothelial tract", "Brain", "Head and neck",
"Kidney", "Mesothelium", "Pancreas"), absolute = c(9L, 13L, 7L,
82L, 11L, 33L, 66L, 19L, 63L), Freq = c(0.229411764705882, 0.165217391304348,
0.175, 0.550335570469799, 0.147058823529412, 0.540983606557377,
0.536585365853659, 0.513513513513513, 0.557522123893805), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("NotPresent", "Present"
), class = "factor"), value = c(0.470588235294118, 0.434782608695652,
0.125, 0.449664429530201, 0.352941176470588, 0.459016393442623,
0.463414634146341, 0.486486486486487, 0.442477876106195)), row.names = c(2L,
11L, 12L, 31L, 82L, 93L, 95L, 99L, 102L), class = "data.frame")
I found a solution to your problem. It may not be the simplest one, but it works.
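library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(condition = ifelse(Freq > 0.5, 0, 1)) %>%    # 0 if Freq is above 0.5, otherwise 1
  mutate(selected = sum(condition / length(ID))) %>%  # share of rows in the group with Freq <= 0.5
  filter(selected > 0.5) %>%                          # keep groups where that share exceeds 50%
  select(!c(condition, selected))                     # drop the two helper columns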
# A tibble: 4 × 6
# Groups: ID [1]
ID loc absolute Freq variable value
<chr> <chr> <int> <dbl> <fct> <dbl>
1 G__Abiotrophia Brain 9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder 13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal 7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract 82 0.550 NotPresent 0.450
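Let me explain what is going on.
First, the data are grouped by ID so that each function runs on every group independently. Each row is then checked for a Freq above 0.5: rows that meet this condition get the value 0 and all other rows get 1. The sum of those 1s is divided by the number of rows in the group, and a group is kept only if the resulting proportion is above 0.5. Finally, the two helper columns I created are dropped, which leaves the data frame without the rows you wanted removed.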
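The proportion described above can also be computed in one step, because mean() of a logical vector is the share of TRUE values. The following is only a minimal sketch, assuming dplyr is installed; df_small is a hypothetical, reduced copy of the question's data that keeps just the columns the filter needs. Like the code above, it keeps a group only when more than half of its rows have Freq at or below 0.5, so a group with exactly half of its rows above 0.5 would be dropped as well.

```r
library(dplyr)

# Hypothetical reduced copy of the question's data (only the columns the filter needs)
df_small <- data.frame(
  ID   = rep(c("G__Abiotrophia", "G__Achromobacter"), times = c(4, 5)),
  loc  = c("Brain", "Gallbladder", "Gastroesophageal", "Urothelial tract",
           "Brain", "Head and neck", "Kidney", "Mesothelium", "Pancreas"),
  Freq = c(0.2294118, 0.1652174, 0.1750000, 0.5503356,
           0.1470588, 0.5409836, 0.5365854, 0.5135135, 0.5575221)
)

kept <- df_small %>%
  group_by(ID) %>%
  # share of rows in the group whose Freq is NOT above 0.5
  filter(mean(Freq <= 0.5) > 0.5) %>%
  ungroup()  # remove the grouping so later steps work on the whole table

kept
```

Calling ungroup() at the end is optional; it only prevents the grouping from carrying over into later operations.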
Count the rows with Freq > 0.5 in each ID; if that count is greater than 50% of the number of unique loc values, remove the ID.
library(tidyverse)
df %>%
  group_by(ID) %>%
  # keep a group only if its count of rows with Freq > 0.5
  # is at most half its number of distinct loc values
  filter(sum(Freq > 0.5) <= n_distinct(loc) / 2)
# A tibble: 4 x 6
# Groups: ID [1]
ID loc absolute Freq variable value
<chr> <chr> <int> <dbl> <fct> <dbl>
1 G__Abiotrophia Brain 9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder 13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal 7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract 82 0.550 NotPresent 0.450
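For readers who prefer to avoid the tidyverse, the same rule can be sketched in base R with ave(), which applies a function within groups and returns the result on every row. This is an illustration rather than part of the answer above: df_small is a hypothetical, reduced copy of the question's data, and the sketch assumes each loc appears only once per ID, so that the share of rows equals the share of loc values.

```r
# Hypothetical reduced copy of the question's data (only the columns the filter needs)
df_small <- data.frame(
  ID   = rep(c("G__Abiotrophia", "G__Achromobacter"), times = c(4, 5)),
  loc  = c("Brain", "Gallbladder", "Gastroesophageal", "Urothelial tract",
           "Brain", "Head and neck", "Kidney", "Mesothelium", "Pancreas"),
  Freq = c(0.2294118, 0.1652174, 0.1750000, 0.5503356,
           0.1470588, 0.5409836, 0.5365854, 0.5135135, 0.5575221)
)

# Share of rows with Freq > 0.5 within each ID, repeated on every row of that ID
share_above <- ave(as.numeric(df_small$Freq > 0.5), df_small$ID, FUN = mean)

# Keep rows whose ID has at most half of its rows above 0.5
result <- df_small[share_above <= 0.5, ]
result
```

Here an ID with exactly half of its rows above 0.5 is kept, which matches the "more than 50%" wording of the question.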