Keep words after a keyword in a string in R
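Question: I am using a tokenizer for text mining, and I want to limit the length of the strings in my input data. The code below keeps the entire string if it contains the word.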
# create a data frame with the data
library(dplyr)  # provides filter()
dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE)
# keep only the strings that contain the term "how"
dd <- filter(dd, grepl('how', text))
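Question: how can I modify the code to keep only the N words after the keyword?
For example:
If N = 1 then dd would include: how are
If N = 2 then dd would include: how are you
If N = 3 then dd would include: how are you doing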
...
I also need the code to work when I include other keywords to keep:
# keep only the strings that contain the term "how" or "with"
dd <- filter(dd, grepl('how|with', text))
Here is a possible approach using the tidy text mining packages (so check the dependencies first):
library(tidytext) # install.packages("tidytext")
library(tidyr) # install.packages("tidyr")
library(dplyr) # install.packages("dplyr")
dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE)
I call your word-horizon parameter scope; it is easy to turn the code below into a function:
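scope <- 2
dd %>%
  # split the text into overlapping n-grams of scope + 1 words
  unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
  # spread each n-gram into columns word1, word2, ...
  separate(ngram, paste0("word", 1:(scope + 1)), sep = " ") %>%
  # keep only the n-grams that start with a keyword
  filter(word1 %in% c("how", "me"))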
# A tibble: 2 × 3
word1 word2 word3
<chr> <chr> <chr>
1 how are you
2 me with this
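As a rough illustration of turning the pipeline into a function, here is a minimal sketch. The name keep_after_keyword and its arguments are hypothetical, and it assumes the input has a column called text. Note that unnest_tokens() lowercases the text by default, so the keywords should be given in lowercase.

```r
library(tidytext)
library(tidyr)
library(dplyr)

# Hypothetical helper (not from the original answer): keep each keyword
# together with the `scope` words that follow it.
# Assumes `data` has a character column named `text`.
keep_after_keyword <- function(data, keywords, scope = 2) {
  word_cols <- paste0("word", seq_len(scope + 1))
  data %>%
    unnest_tokens(ngram, text, token = "ngrams", n = scope + 1) %>%
    separate(ngram, word_cols, sep = " ") %>%
    filter(word1 %in% keywords)
}

dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE
)

result <- keep_after_keyword(dd, c("how", "me"), scope = 2)
print(result)
```

Calling it with scope = 2 and the keywords "how" and "me" should reproduce the two rows shown above.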
If you want to end up with strings, you have to collapse the n-grams back together, as in this second example:
scope <- 3
dd %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
  separate(ngram, paste0("word", 1:(scope + 1)), sep = " ") %>%
  filter(word1 %in% c("how")) %>%
  # paste the word columns of each row back into one string
  apply(., 1, paste, collapse = " ")
[1] "how are you doing"
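Because each n-gram is already a single string, one possible shortcut is to skip separate() and compare only the first word of the n-gram with the keywords. This is a sketch of a variant, not the approach of the answer above; the variable names keywords and result are assumptions made for illustration.

```r
library(tidytext)
library(dplyr)

dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE
)

scope <- 3
keywords <- c("how")  # assumed keyword list, lowercase

result <- dd %>%
  unnest_tokens(ngram, text, token = "ngrams", n = scope + 1) %>%
  # sub() strips everything after the first space, leaving the first word
  filter(sub(" .*", "", ngram) %in% keywords) %>%
  # the n-gram column already holds the collapsed string
  pull(ngram)

print(result)
```

This avoids the round trip through separate columns when only the final strings are needed.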
Following your comment: if you want to process the data string by string (chunk by chunk), you have to do the grouping explicitly.
Here is an example:
scope <- 2
# give each string an id and split the data frame into one piece per string
subsets <-
  dd %>%
  mutate(id = seq_along(text)) %>%
  split(., .$id)
# process each string separately; vec[-1] drops the id before pasting
unlist(lapply(subsets, function(dd) {
  dd %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
    separate(ngram, paste0("word", 1:(scope + 1)), sep = " ") %>%
    filter(word1 %in% c("how", "problem")) %>%
    apply(., 1, FUN = function(vec) paste(vec[-1], collapse = " "))
}))
1
"how are you"
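For comparison, the same per-string idea can be sketched in base R without any packages. This is a different technique from the tidytext pipeline above: it splits on whitespace only, so punctuation is not stripped the way unnest_tokens() does it. The helper name words_after is hypothetical.

```r
# Hypothetical helper: for one string, return every keyword followed by
# the next n words. Windows that would run past the end of the string
# are dropped, which mirrors what the n-gram approach produces.
words_after <- function(string, keywords, n) {
  words <- strsplit(tolower(string), "\\s+")[[1]]
  starts <- which(words %in% keywords)
  starts <- starts[starts + n <= length(words)]
  vapply(starts,
         function(i) paste(words[i:(i + n)], collapse = " "),
         character(1))
}

texts <- c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk")

# One result per input string; strings without a match give character(0)
result <- lapply(texts, words_after, keywords = c("how", "problem"), n = 2)
print(result)
```

Keeping the results in a list preserves which input string each match came from; wrap the call in unlist() to get a flat character vector instead.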