R: Is it possible to extract groups of words from each sentence (row) and create a data frame (or matrix)?

Question

To extract words from the sentences, I created a separate list for each word, like this:

hello<- NULL
for (i in 1:length(text)){
hello[i]<-as.character(regmatches(text[i], gregexpr("[H|h]ello?", text[i])))
}

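But I have more than 25 word lists to extract, which makes for very long code. Is it possible to extract a whole group of strings (words) from the text data at once?

The following is just a dummy set: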

words<-c("[H|h]ello","you","so","tea","egg")

text=c("Hello! How's you and how did saturday go?",  
       "hello, I was just texting to see if you'd decided to do anything later",
       "U dun say so early.",
       "WINNER!! As a valued network customer you have been selected" ,
       "Lol you're always so convincing.",
       "Did you catch the bus ? Are you frying an egg ? ",
       "Did you make a tea and egg?"
)

subsets<-NULL
for ( i in 1:length(text)){
.....???
   }

The expected output is as follows:

[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg

Answer 1

In base R, you could do:

regmatches(text, gregexpr(sprintf("\\b(%s)\\b", paste0(words, collapse = "|")), text))
[[1]]
[1] "Hello" "you"  

[[2]]
[1] "hello" "you"  

[[3]]
[1] "so"

[[4]]
[1] "you"

[[5]]
[1] "you" "so" 

[[6]]
[1] "you" "you" "egg"

[[7]]
[1] "you" "tea" "egg"

Or, depending on the result you want:

trimws(gsub(sprintf(".*?\\b(%s).*?|.*$", paste0(words, collapse = "|")), "\\1 ", text))
[1] "Hello you"   "hello you"   "so"          "you"         "you so"      "you you egg"
[7] "you tea egg"
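Since the question also asks for a matrix or data frame, here is a minimal sketch of one way to get there with the same base R building blocks: count the whole-word hits of each pattern in each sentence. The helper name count_matches is an assumption made for illustration, and words and text are copied from the question.

```r
# Inputs copied from the question
words <- c("[H|h]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "hello, I was just texting to see if you'd decided to do anything later",
          "U dun say so early.",
          "WINNER!! As a valued network customer you have been selected",
          "Lol you're always so convincing.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")

# count_matches is a hypothetical helper: for one pattern, it returns the
# number of whole-word matches found in each element of x
count_matches <- function(pattern, x) {
  hits <- gregexpr(sprintf("\\b(%s)\\b", pattern), x)
  lengths(regmatches(x, hits))
}

# One row per sentence, one column per pattern
counts <- sapply(words, count_matches, x = text)

# The same counts as a data frame, next to the original sentences
counts_df <- data.frame(text = text, counts, check.names = FALSE)
```

Each column of counts is named after its pattern, so counts[, "you"] gives the per-sentence count for "you", and counts > 0 turns the table into a presence/absence matrix.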

Answer 2

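You say you have a long list of word sets. Here is a way to turn each word set into a regex, apply it to a corpus (a list of sentences), and pull out the hits as character vectors. It is case-insensitive and checks for word boundaries, so you won't extract age from agent or rage.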
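As a small illustration of the word-boundary and case-insensitivity point, the base R sketch below compares a bare pattern with a bounded one. The candidates vector is made up for the demonstration and is not part of the original data.

```r
# Illustrative inputs (an assumption, not from the original post)
candidates <- c("age", "Age", "agent", "rage")

# Without boundaries, "age" is found inside longer words as well
unbounded <- grepl("age", candidates, ignore.case = TRUE)

# With \\b on both sides, only the standalone word matches
bounded <- grepl("\\bage\\b", candidates, ignore.case = TRUE)

print(data.frame(candidates, unbounded, bounded))
```

Only "age" and "Age" survive the bounded pattern, which is the behaviour the word sets below rely on.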

wordsets <- c(
  "oak dogs cheese age",
  "fire open jail",
  "act speed three product"
)

library(tidyverse)

harvSent <- read_table("SENTENCE
Oak is strong and also gives shade.
Cats and dogs each hate the other.
The pipe began to rust while new.
Open the crate but don't break the glass.
Add the sum to the product of these three.
Thieves who rob friends deserve jail.
The ripe taste of cheese improves with age.
Act on these orders with great speed.
The hog crawled under the high fence.
Move the vat over the hot fire.") %>%
  pull(SENTENCE)

aWset builds regexes from the word sets and applies them to the sentences:

aWset <- function(harvSent, wordsets){
  # Turn out a vector of regex like "(?ix) \b (oak|dogs|cheese) \b"
  regexS <- paste0("(?ix) \\b (", str_replace_all(wordsets, " ", "|"), ") \\b")
  # Apply each regex to the sentences
  map(regexS, ~ str_extract_all(harvSent, .x, simplify = TRUE) %>%
        # str_extract_all returns a character matrix of hits. Paste it together by row.
        apply(MARGIN = 1, FUN = function(x){ str_trim(paste(x, collapse = " ")) }))
}

This gives us:

aWset(harvSent , wordsets)
[[1]]
 [1] "Oak" "dogs" "" "" "" "" "cheese age" ""
 [9] "" ""

[[2]]
 [1] "" "" "" "Open" "" "jail" "" "" "" "fire"

[[3]]
 [1] "" "" "" "" "product three" "" ""
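To get from a list like this to the data frame the question asks for, one option is to build one column per word set. The sketch below is a base R stand-in, not the aWset function above: extract_set and the set1, set2, ... column names are assumptions made for illustration, and only three of the sentences are used to keep it short.

```r
wordsets <- c("oak dogs cheese age", "fire open jail", "act speed three product")

# A subset of the sentences above, kept small for the example
sentences <- c("Oak is strong and also gives shade.",
               "Open the crate but don't break the glass.",
               "The ripe taste of cheese improves with age.")

# extract_set is a hypothetical helper: it turns one space-separated word set
# into a bounded, case-insensitive regex and returns, for each sentence,
# the hits pasted into a single string ("" when nothing matched)
extract_set <- function(wordset, x) {
  pattern <- sprintf("\\b(%s)\\b", gsub(" ", "|", wordset, fixed = TRUE))
  hits <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))
  vapply(hits, paste, character(1), collapse = " ")
}

# One named column per word set
cols <- lapply(setNames(wordsets, paste0("set", seq_along(wordsets))),
               extract_set, x = sentences)

result <- data.frame(sentence = sentences, cols, stringsAsFactors = FALSE)
print(result)
```

Keeping the sentence as the first column makes it easy to check each row by eye. An empty string marks a sentence with no hit for that set.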

Tags: r, extract, text-mining