Subset only continuously increasing values to max value

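I am trying to find a way to subset numeric data by locating the starting point of a continuously increasing run of values and stopping at the maximum value.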

Some example data:

if(!require(data.table)) {install.packages("data.table"); library(data.table)}
if(!require(zoo)) {install.packages("zoo"); library(zoo)}
if(!require(dplyr)) {install.packages("dplyr"); library(dplyr)}

depth <- c(1.1, 2, 1.6, 1.2, 1.6, 1.2, 1.5, 1.7, 2.1, 3.1, 3.8, 5.2, 6.1, 7.0, 6.9, 6.9, 6.9, 6.0, 4.3, 2.1, 2.0)
temp <- c(17.9, 17.9, 17.8, 17.9, 17.7, 17.9, 17.9, 17.8, 17.7, 17.6, 17.5, 17.3, 17.2, 17.1, 17.0, 16.9, 16.7, 16.9, 17.2, 17.5, 17.9)
testdf <- data.frame(depth = depth, temp = temp)

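I have tried a couple of solutions. One does not work; the other works, but I feel it may have limitations in some cases.

This one only finds 1:max. Similar solutions suggest removing any decreasing values, i.e. rows where diff would be negative. Neither is what I want.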

setDT(testdf)[, .SD[1:which.max(depth)]]
    depth temp
 1:   1.1 17.9
 2:   2.0 17.9
 3:   1.6 17.8
 4:   1.2 17.9
 5:   1.6 17.7
 6:   1.2 17.9
 7:   1.5 17.9
 8:   1.7 17.8
 9:   2.1 17.7
10:   3.1 17.6
11:   3.8 17.5
12:   5.2 17.3
13:   6.1 17.2
14:   7.0 17.1
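
For comparison, here is a minimal sketch of one reading of the diff-based filtering mentioned above (drop every row that is lower than the row before it). The `keep` vector is only an illustrative name. It shows why that approach does not isolate the run leading up to the maximum: it keeps rows from unrelated runs and drops row 6, where the wanted run starts.

```r
depth <- c(1.1, 2, 1.6, 1.2, 1.6, 1.2, 1.5, 1.7, 2.1, 3.1, 3.8,
           5.2, 6.1, 7.0, 6.9, 6.9, 6.9, 6.0, 4.3, 2.1, 2.0)

# keep the first row, then every row that is not lower than the row before it
keep <- c(TRUE, diff(depth) >= 0)

which(keep)
# 1 2 5 7 8 9 10 11 12 13 14 16 17
```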

This is what I am trying to get back:

    depth temp
 6:   1.2 17.9
 7:   1.5 17.9
 8:   1.7 17.8
 9:   2.1 17.7
10:   3.1 17.6
11:   3.8 17.5
12:   5.2 17.3
13:   6.1 17.2
14:   7.0 17.1

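Using diff and rollapply with an arbitrary bin width (here n = 10). In this particular use, I pad one extra row past the max index; to keep it, its diff has to be set to 0, otherwise rollapply stops below the max value.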

testdf$diff <- c(diff(testdf$depth), NA) # add diff column and NA to empty cell
testdf <- testdf[1:(which(testdf$depth == max(testdf$depth)) + 1),] # subset to max depth row, plus one
testdf$diff[(which(testdf$depth == max(testdf$depth))) : (which(testdf$depth == max(testdf$depth)) + 1)] <- 0 # set any diff entry after max depth to 0, for rollapply to work

testdf <- testdf %>% 
mutate(diff = rollapply(diff, width = 10, min, align = "left", fill = 0, na.rm = TRUE)) %>% 
filter(diff >= 0)

This returns what I want:

   depth temp diff
1    1.2 17.9    0
2    1.5 17.9    0
3    1.7 17.8    0
4    2.1 17.7    0
5    3.1 17.6    0
6    3.8 17.5    0
7    5.2 17.3    0
8    6.1 17.2    0
9    7.0 17.1    0
10   6.9 17.0    0 # an extra padded row

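This solution may not always work, since the window is arbitrary. The ideal solution seems to be to find the max index, walk back to the last positive diff value, and subset that range, but I am trying to figure out a way that does not involve a loop.

Edit

A while loop works, but I was trying to avoid loops.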

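findmindepth <- function(x) {
  # x needs a diff column, built as above
  maxi <- which(x$depth == max(x$depth))[1]  # row of the (first) maximum depth
  maxdi <- maxi - 1
  # step back while the diff is still positive, without running past row 1
  while (maxdi >= 1 && x$diff[maxdi] > 0) {
    maxdi <- maxdi - 1
  }
  mindi <- maxdi + 1
  # return the increasing run plus the padded row after the maximum
  x[mindi:(maxi + 1), ]
}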

You can use run-length encoding together with diff to find the start/end points of all the decreasing/increasing runs:

which_max <- which.max(testdf$depth)
encoding <- rle(diff(testdf$depth) > 0)

# these contain the start/end indices of all continuously increasing/decreasing subsets
ends <- cumsum(encoding$lengths) + 1L
starts <- ends - encoding$lengths

# filter out the decreasing subsets
starts <- starts[encoding$values]
ends <- ends[encoding$values]

# find the one that contains the maximum
interval <- which(starts <= which_max & ends >= which_max)
out <- testdf[starts[interval]:ends[interval],]
out
   depth temp
6    1.2 17.9
7    1.5 17.9
8    1.7 17.8
9    2.1 17.7
10   3.1 17.6
11   3.8 17.5
12   5.2 17.3
13   6.1 17.2
14   7.0 17.1
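
To see how the run-length encoding maps onto the example data, here is a minimal sketch that recomputes the intermediate values. It redefines `depth` so that it runs on its own; `inc_starts` and `inc_ends` are just illustrative names, and the comments show the values for this particular vector.

```r
depth <- c(1.1, 2, 1.6, 1.2, 1.6, 1.2, 1.5, 1.7, 2.1, 3.1, 3.8,
           5.2, 6.1, 7.0, 6.9, 6.9, 6.9, 6.0, 4.3, 2.1, 2.0)

# TRUE where the next value is strictly larger than the current one
encoding <- rle(diff(depth) > 0)
encoding$lengths  # 1 2 1 1 8 7
encoding$values   # TRUE FALSE TRUE FALSE TRUE FALSE

# a run of k consecutive diffs spans k + 1 rows, hence the + 1L on the end index
ends <- cumsum(encoding$lengths) + 1L  # 2 4 5 6 14 21
starts <- ends - encoding$lengths      # 1 2 4 5 6 14

# keep only the increasing runs
inc_starts <- starts[encoding$values]  # 1 4 6
inc_ends <- ends[encoding$values]      # 2 5 14
```

The third increasing run (rows 6 to 14) is the only one that contains `which.max(depth)`, which is 14, so that is the range that gets subset.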

Edit: actually, if you only care about the subset that contains the maximum, you can do something simpler:

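which_max <- which.max(testdf$depth)
if (which_max == 1L) {
  out <- testdf[1L, , drop = FALSE]
} else {
  # walk backwards from the maximum
  subset1 <- testdf$depth[which_max:1L]
  # TRUE where, going backwards, the value rises again (the run has ended)
  rises <- diff(subset1) > 0
  # if the values never rise, the run extends back to the first row
  len <- if (any(rises)) which.max(rises) else which_max
  out <- testdf[(which_max - len + 1L):which_max,]
}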
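
If this is needed in more than one place, the simpler approach can be wrapped in a function. The sketch below is one possible packaging, not a drop-in replacement: the name `increasing_run_to_max` and its `column` argument are hypothetical, it looks for the last forward decrease with `which()` instead of reversing the vector, and it assumes the column has no NA values and that ties count as part of the run, as in the simpler variant above.

```r
# Hypothetical helper: rows of the non-decreasing run that ends at the
# (first) maximum of `column`
increasing_run_to_max <- function(df, column = "depth") {
  x <- df[[column]]
  which_max <- which.max(x)  # first occurrence of the maximum
  if (which_max == 1L) {
    return(df[1L, , drop = FALSE])
  }
  # positions of forward decreases before the maximum (empty if there are none)
  drops <- which(diff(x[seq_len(which_max)]) < 0)
  # the run starts right after the last decrease, or at row 1
  start <- if (length(drops) > 0) max(drops) + 1L else 1L
  df[start:which_max, , drop = FALSE]
}

# usage on the example data from the question
testdf <- data.frame(
  depth = c(1.1, 2, 1.6, 1.2, 1.6, 1.2, 1.5, 1.7, 2.1, 3.1, 3.8,
            5.2, 6.1, 7.0, 6.9, 6.9, 6.9, 6.0, 4.3, 2.1, 2.0),
  temp = c(17.9, 17.9, 17.8, 17.9, 17.7, 17.9, 17.9, 17.8, 17.7, 17.6, 17.5,
           17.3, 17.2, 17.1, 17.0, 16.9, 16.7, 16.9, 17.2, 17.5, 17.9)
)
out <- increasing_run_to_max(testdf)  # rows 6 to 14
```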