Excluding rows from a data.table based on sort order

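I need some help filtering a data.table in R. I have a file with millions of rows, each containing 4 words.

I want to remove the rows I don't need. Each row has 4 words and a frequency.

For each combination of the first 3 words, I only want to keep the 3 rows with the highest frequency.

Below is a sample of the data.table and the output I need.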

library(data.table)

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")

frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)

DT <- data.table(text = text, frequency = frequency)

#Original output:
                    text frequency
 1:     Run to the hills     0.100
 2: Run to the mountains     0.090
 3:   Run to the highway     0.200
 4:       Run to the top     0.050
 5:   Run to the horizon     0.001
 6:     Go away with him     0.050
 7:     Go away with her     0.040
 8:          I am a good     0.100
 9:           I am a bad     0.060
10:         I am a uggly     0.300
11:           I am a guy     0.050
12:         I am a woman     0.100
13:        I am the most     0.200

Desired output (only the top 3 frequencies within each group sharing the same first 3 words):

                   text frequency
1:     Go away with him      0.05
2:     Go away with her      0.04
3:         I am a uggly      0.30
4:         I am a woman      0.10
5:          I am a good      0.10
6:        I am the most      0.20
7:   Run to the highway      0.20
8:     Run to the hills      0.10
9: Run to the mountains      0.09

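So I only want to keep the top 3 rows, ordered by the frequency column, for each of: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX".

In this case I would remove: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy".

I was thinking of using a regular expression, but I'm a bit lost right now :-\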

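You can use sub() to create an id column containing the first three words, then use that column to take the top three rows by frequency within each id.

Easier done than said...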

library(data.table)

## add an id column containing only the first three words
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
  DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
#                    text frequency
# 1:     Go away with him      0.05
# 2:     Go away with her      0.04
# 3:         I am a uggly      0.30
# 4:          I am a good      0.10
# 5:         I am a woman      0.10
# 6:        I am the most      0.20
# 7:   Run to the highway      0.20
# 8:     Run to the hills      0.10
# 9: Run to the mountains      0.09
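
As a variation on the same idea, here is a minimal sketch that swaps .SD[1:3] for head(.SD, 3). The assumption is that head() returns at most three rows per group, so groups with fewer than three rows are not padded with NA and the na.omit() step is not needed. The small DT and the name top3 are made up for illustration.

```r
library(data.table)

# Small illustrative table (a subset of the question's data)
DT <- data.table(
  text = c("Run to the hills", "Run to the top", "Run to the highway",
           "Run to the horizon", "Go away with him"),
  frequency = c(0.1, 0.05, 0.2, 0.001, 0.05)
)

# sub() drops the last space-separated word, leaving the first three.
# Note the doubled backslash: "\S" on its own is not a valid escape in an R string.
DT[, id := sub(" \\S+$", "", text)]

# Sort by descending frequency, then keep at most 3 rows per id
top3 <- DT[order(-frequency), head(.SD, 3), keyby = id]
top3[, id := NULL][]
```

Here the "Go away with" group keeps its single row, and "Run to the horizon" is dropped as the fourth-ranked row of its group.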
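
## group by the first three words
DT[, group := sub(" \\S+$", "", text)]
## rank within each group, highest frequency first
DT[, grank := base::rank(-frequency), by = group]
## keep the rows ranked in the top three of their group
DT[grank <= 3]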

This uses the rank function so that the OP can specify how ties are handled (via its ties.method argument).
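
To make the tie handling concrete, here is a small sketch using the frequencies of the "I am a" group from the question, where two rows share the value 0.1. The variable names are made up for illustration. It compares three of the ties.method options of base rank().

```r
# Frequencies of the "I am a" group: good, bad, uggly, guy, woman
freq <- c(0.1, 0.06, 0.3, 0.05, 0.1)

# Default ("average"): the two tied rows share the mean of ranks 2 and 3
r_avg <- rank(-freq)

# "min": both tied rows get the lower rank
r_min <- rank(-freq, ties.method = "min")

# "first": ties are broken by position, so the earlier row ranks higher
r_first <- rank(-freq, ties.method = "first")

print(r_avg)    # 2.5 4.0 1.0 5.0 2.5
print(r_min)    # 2 4 1 5 2
print(r_first)  # 2 4 1 5 3
```

With "first", the filter grank <= 3 never keeps more than three rows per group. With "min", rows tied at rank 3 would all be kept, so a group could return more than three rows.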