Excluding rows from a data.table based on sort order
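I need some help filtering a data.table in R. I have a file with millions of rows, with 4 words on each row.
I want to remove the rows I don't need. Each row has 4 words and a frequency.
For each combination of the first 3 words, I want to keep only the 3 rows with the highest frequency.
Below is a sample of the data.table and the output I need.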
library(data.table)

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")
frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)
DT <- data.table(text = text, frequency = frequency)
#Original output:
                    text frequency
 1:     Run to the hills     0.100
 2: Run to the mountains     0.090
 3:   Run to the highway     0.200
 4:       Run to the top     0.050
 5:   Run to the horizon     0.001
 6:     Go away with him     0.050
 7:     Go away with her     0.040
 8:          I am a good     0.100
 9:           I am a bad     0.060
10:         I am a uggly     0.300
11:           I am a guy     0.050
12:         I am a woman     0.100
13:        I am the most     0.200
Needed output (only the top 3 frequencies among rows that share the same "first 3 words"):
                   text frequency
1:     Go away with him      0.05
2:     Go away with her      0.04
3:         I am a uggly      0.30
4:         I am a woman      0.10
5:          I am a good      0.10
6:        I am the most      0.20
7:   Run to the highway      0.20
8:     Run to the hills      0.10
9: Run to the mountains      0.09
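So I only want to keep the top 3, sorted by the frequency column, for each of: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX".
In this case, I would remove: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy".
I was thinking about using a regex, but I'm a bit lost right now :-\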
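You could use sub() to create an id column containing the first three words, then use it to get the top three frequency values for each id.
Easier done than said...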
library(data.table)
## add an id column containing only the first three words
## (the backslash must be doubled inside an R string literal)
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
  DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
#                    text frequency
# 1:     Go away with him      0.05
# 2:     Go away with her      0.04
# 3:         I am a uggly      0.30
# 4:          I am a good      0.10
# 5:         I am a woman      0.10
# 6:        I am the most      0.20
# 7:   Run to the highway      0.20
# 8:     Run to the hills      0.10
# 9: Run to the mountains      0.09
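As a possible variation on the approach above, head(.SD, 3) returns at most three rows per group, so groups with fewer than three rows are not padded with NA rows and na.omit() is not needed. This is only a sketch: it rebuilds the question's sample data so that it runs on its own, and the name top3 is made up for illustration. The row order of the result may differ from the keyby version.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(
  text = c("Run to the hills", "Run to the mountains", "Run to the highway",
           "Run to the top", "Run to the horizon",
           "Go away with him", "Go away with her",
           "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
           "I am the most"),
  frequency = c(0.1, 0.09, 0.2, 0.05, 0.001,
                0.05, 0.04,
                0.1, 0.06, 0.3, 0.05, 0.1,
                0.2)
)

# id = first three words: strip the last space-separated word
DT[, id := sub(" \\S+$", "", text)]

# Sort by descending frequency, then keep at most 3 rows per id.
# top3 is an illustrative name, not part of the original answer.
top3 <- DT[order(-frequency), head(.SD, 3), by = id][, id := NULL][]
top3
```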
DT[, group := sub(" \\S+$", "", text)]
DT[, grank := base::rank(-frequency), by = group]
DT[grank <= 3]
This uses a rank function, so the OP can specify how ties are handled (via rank()'s ties.method argument).
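To illustrate how the choice of tie handling changes the result, here is a small sketch on made-up data (the ties table and the r_first / r_min columns are hypothetical and not from the question), in which three rows tie for second place within a single group:

```r
library(data.table)

# Hypothetical toy data: three rows share the same frequency
ties <- data.table(
  text      = c("I am a uggly", "I am a good", "I am a woman", "I am a guy"),
  frequency = c(0.3, 0.1, 0.1, 0.1)
)
ties[, group := sub(" \\S+$", "", text)]

# "first" breaks ties by row order, so ranks are 1, 2, 3, 4
ties[, r_first := base::rank(-frequency, ties.method = "first"), by = group]
# "min" gives all tied rows the lowest rank of the tie, so ranks are 1, 2, 2, 2
ties[, r_min := base::rank(-frequency, ties.method = "min"), by = group]

n_first <- nrow(ties[r_first <= 3])  # 3 rows: strictly at most three per group
n_min   <- nrow(ties[r_min <= 3])    # 4 rows: every row tied at the cut-off is kept
```

With "first" the filter never returns more than three rows per group, while "min" keeps every row tied at the cut-off; which behaviour is wanted depends on the use case.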