Collapsing a basic dataframe based on adjacent rows
I am working with a large data frame that can be represented by the following example:
chromosome position position2 name Occup
Chr1 1 1 - 0.023
Chr1 2 2 - 0.023
Chr1 3 3 - 0.023
Chr1 4 4 - 0.023
Chr1 5 5 - 0.023
Chr1 6 6 - 0.069
Chr1 7 7 - 0.069
Chr1 8 8 - 0.069
Chr1 9 9 - 0.069
Chr1 10 10 - 0.116
Chr1 11 11 - 0.116
Chr1 12 12 - 0.116
Chr1 13 13 - 0.023
Chr1 14 14 - 0.023
Chr1 15 15 - 0.023
Chr1 16 16 - 0.023
Chr1 17 17 - 0.023
You can read it in as:
dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"),
position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"),
Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))
I would like to collapse it into a data frame like this:
chromosome position position2 name Occup
Chr1 1 5 - 0.023
Chr1 6 9 - 0.069
Chr1 10 12 - 0.116
Chr1 13 17 - 0.023
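The problem with a basic collapse is that all rows sharing the same Occup value end up in a single group. That is not what I want. I want rows to stay in one group only until the value changes on the next row.
If I do this: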
library(plyr)
test<-ddply(dtf, .(Occup), summarise,
position_start=min(position),
position_end= max(position2))
I get:
Occup position_start position_end
0.023 1 17
0.069 6 9
0.116 10 12
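So it is close to what I want, but not quite: the two separate runs of 0.023 (positions 1 to 5 and 13 to 17) are merged into one row.
There is no need to take column 1 or column 3 into account, since in this case those columns are arbitrary and contain the same information for all rows.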
This should work:
library(dplyr)
dtf_grouped <- dtf %>%
  arrange(position) %>% # to ensure data is sequential
  mutate(
    occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
    occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
    group_id = cumsum(occup_shift)
  ) %>%
  group_by(group_id) %>%
  summarize(
    Occup = min(Occup),
    position_start = position[1],
    position_end = position2[n()]
  ) %>%
  select(-group_id)
head(dtf_grouped)
# A tibble: 4 x 3
Occup position_start position_end
<dbl> <dbl> <dbl>
1 0.0230 1 5
2 0.0690 6 9
3 0.116 10 12
4 0.0230 13 17
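To make the grouping step easier to follow, here is a minimal base R sketch of how a running count of changes yields one id per run of equal adjacent values. The vector `occup` and the names `changed` and `group_id` are made up for illustration; they only stand in for the Occup column and the intermediate columns used above.

```r
# Toy vector standing in for the Occup column (illustrative values only)
occup <- c(0.023, 0.023, 0.069, 0.069, 0.116, 0.023)

# TRUE wherever a value differs from the one before it;
# the first element has no predecessor, so it is set to FALSE
changed <- c(FALSE, diff(occup) != 0)

# The running count of changes gives one id per run of equal adjacent values
group_id <- cumsum(changed)
group_id
# [1] 0 0 1 1 2 3
```

Note that the last 0.023 gets its own id (3) rather than sharing id 0 with the first two, which is the behaviour a plain group-by on Occup cannot give.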
We can group by runs of consecutive identical Occup values and then take the min and max:
library(dplyr)
res <- dtf %>%
  group_by(chromosome,
           # create group for consecutive numbers
           myGroup = cumsum(c(1, diff(Occup) != 0))) %>%
  summarise(position = min(position),
            position2 = max(position2),
            Occup = min(Occup)) %>%
  ungroup() %>%
  select(-myGroup)
res
# # A tibble: 4 x 4
# chromosome position position2 Occup
# <fct> <dbl> <dbl> <dbl>
# 1 Chr1 1. 5. 0.0230
# 2 Chr1 6. 9. 0.0690
# 3 Chr1 10. 12. 0.116
# 4 Chr1 13. 17. 0.0230
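If you would rather avoid extra packages, the same idea could probably be expressed with base R's `rle()`. This is only a sketch, not part of either answer above: it assumes the rows are already sorted by position and that the data holds a single chromosome, and the names `runs`, `start_idx`, `end_idx` and `collapsed` are chosen here for illustration.

```r
# Rebuild the example data from the question
dtf <- data.frame(
  chromosome = rep("Chr1", 17),
  position   = 1:17,
  position2  = 1:17,
  name       = rep("-", 17),
  Occup      = rep(c(0.023, 0.069, 0.116, 0.023), times = c(5, 4, 3, 5))
)

# Run-length encode Occup: one entry per run of equal adjacent values
runs <- rle(dtf$Occup)

# Row index of the last and first row of each run
end_idx   <- cumsum(runs$lengths)
start_idx <- end_idx - runs$lengths + 1

# Take the start position from the first row of each run
# and the end position from the last row
collapsed <- data.frame(
  chromosome = dtf$chromosome[start_idx],
  position   = dtf$position[start_idx],
  position2  = dtf$position2[end_idx],
  name       = dtf$name[start_idx],
  Occup      = runs$values
)
collapsed
```

With several chromosomes you would need to split the data by chromosome first (or sort by chromosome and position and break runs at chromosome boundaries), which this sketch does not handle.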