Collapsing a basic dataframe based on adjacent rows

I am working with a large data frame that can be represented by the following example:

chromosome  position    position2   name    Occup       
Chr1    1   1   -   0.023
Chr1    2   2   -   0.023
Chr1    3   3   -   0.023
Chr1    4   4   -   0.023
Chr1    5   5   -   0.023
Chr1    6   6   -   0.069
Chr1    7   7   -   0.069
Chr1    8   8   -   0.069
Chr1    9   9   -   0.069
Chr1    10  10  -   0.116
Chr1    11  11  -   0.116
Chr1    12  12  -   0.116
Chr1    13  13  -   0.023
Chr1    14  14  -   0.023
Chr1    15  15  -   0.023
Chr1    16  16  -   0.023
Chr1    17  17  -   0.023

You can read it in as:

dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"), 
                position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), 
                position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),        
                name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"), 
                Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))

I would like to collapse it into a data frame like this:

chromosome  position    position2   name    Occup       
Chr1    1   5   -   0.023
Chr1    6   9   -   0.069
Chr1    10  12  -   0.116
Chr1    13  17  -   0.023

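The problem with a basic collapse is that all rows with the same Occup value are pooled into a single group. That is not what I want. I want rows to stay in one group only until the value changes in the next row.

If I do this: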

library(plyr)
test<-ddply(dtf, .(Occup), summarise,
      position_start=min(position),
      position_end= max(position2))

I get:

Occup   position_start  position_end    
0.023   1   17
0.069   6   9
0.116   10  12

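So it is close to what I want, but not quite.

There is no need to consider column 1 or 3, since in this case those columns are arbitrary and contain the same information for all rows.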

This should work:

library(dplyr)

dtf_grouped <- dtf %>%
    arrange(position) %>% # to ensure data is sequential
    mutate(
        occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
        occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
        group_id = cumsum(occup_shift)
        ) %>%
    group_by(group_id) %>%
    summarize(
        Occup = min(Occup),
        position_start = position[1],
        position_end = position2[n()]
    ) %>%
    select(-group_id)

head(dtf_grouped)

# A tibble: 4 x 3
   Occup position_start position_end
   <dbl>          <dbl>        <dbl>
1 0.0230              1            5
2 0.0690              6            9
3 0.116              10           12
4 0.0230             13           17
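To make the grouping step easier to follow, here is a small base R sketch of the same idea on the bare Occup vector: flag each row whose value differs from the previous row, then take the cumulative sum of the flags to get a run id. The names `occup`, `changed` and `group_id` are only for illustration, and the sketch assumes the rows are already sorted by position.

```r
# Occup values from the example data, in position order
occup <- c(rep(0.023, 5), rep(0.069, 4), rep(0.116, 3), rep(0.023, 5))

# TRUE where a row differs from the row before it; the first row has
# no predecessor, so it is treated as "no change"
changed <- c(FALSE, diff(occup) != 0)

# Each TRUE starts a new run, so the running total labels the runs
group_id <- cumsum(changed)

print(group_id)
# 0 0 0 0 0 1 1 1 1 2 2 2 3 3 3 3 3
```

The last run of 0.023 gets id 3 rather than 0, which is why it is kept separate from the first run even though the values are equal.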

We can group by runs of consecutive equal values (Occup) and then take the min and max:

library(dplyr)

res <- dtf %>% 
  group_by(chromosome,
           # create group for consecutive numbers
           myGroup = cumsum(c(1, diff(Occup) != 0))) %>% 
  summarise(position = min(position),
            position2 = max(position2),
            Occup = min(Occup)) %>% 
  ungroup() %>% 
  select(-myGroup)


res

# # A tibble: 4 x 4
#   chromosome position position2  Occup
#   <fct>         <dbl>     <dbl>  <dbl>
# 1 Chr1             1.        5. 0.0230
# 2 Chr1             6.        9. 0.0690
# 3 Chr1            10.       12. 0.116 
# 4 Chr1            13.       17. 0.0230
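If you would rather avoid extra packages, the same collapse can probably be done in base R with `rle()`. This is only a sketch under a few assumptions: the data has a single chromosome, the rows are already sorted by position, and `collapsed`, `start_idx` and `end_idx` are names I made up here. The example data is rebuilt in a shorter form so the block runs on its own.

```r
# Rebuild the example data (same values as the dtf in the question)
dtf <- data.frame(
  chromosome = "Chr1",
  position   = 1:17,
  position2  = 1:17,
  name       = "-",
  Occup      = c(rep(0.023, 5), rep(0.069, 4), rep(0.116, 3), rep(0.023, 5))
)

# Run-length encode Occup: one entry per run of consecutive equal values
runs <- rle(dtf$Occup)

# Row index of the last and first row of each run
end_idx   <- cumsum(runs$lengths)
start_idx <- end_idx - runs$lengths + 1

# Take the start from the first row of each run and the end from the last
collapsed <- data.frame(
  chromosome = dtf$chromosome[start_idx],
  position   = dtf$position[start_idx],
  position2  = dtf$position2[end_idx],
  name       = dtf$name[start_idx],
  Occup      = runs$values
)

print(collapsed)
```

With several chromosomes you would need to apply this per chromosome (for example by splitting the data first), since `rle()` on Occup alone will not break a run when the chromosome changes.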