Are there text processing functions in R that operate at the word level?

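I am trying to find a set of functions in R that operate at the word level, for example a function that returns the position of a word. Given the following sentence and query: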

sentence <- "A sample sentence for demo"
query <- "for"
  1. The function would return 4, because "for" is the 4th word.

  2. It would be great to have a utility function that lets me extend the query to the left or to the right. For example, extend(query, 'right') would return "for demo" and extend(query, 'left') would return "sentence for".

I have already tried functions such as grep, gregexpr, and word from the stringr package. They all seem to operate at the character level.

As I mentioned in the comments, stringr is useful in these cases.

library(stringr)

sentence <- "A sample sentence for demo"
wordNumber <- 4L

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                       start = wordNumber - 1L,
                       end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

This yields:

> fourthWord
[1] "for"
> previousWords
[1] "sentence for"
> laterWords
[1] "for demo"
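
The example above hard-codes the word number. A minimal sketch of how the position could instead be derived from the query, assuming words are separated by single spaces and that the query is not the first or last word (the variable names words, left and right are only illustrative):

```r
library(stringr)

sentence <- "A sample sentence for demo"
query <- "for"

# Split on single spaces and find the first word equal to the query
# (match() returns NA when the query does not occur)
words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
wordNumber <- match(query, words)

# Extend the query by one word in either direction with stringr::word()
right <- word(string = sentence, start = wordNumber, end = wordNumber + 1L)
left <- word(string = sentence, start = wordNumber - 1L, end = wordNumber)
```

Here wordNumber is 4, and right and left hold the same strings as laterWords and previousWords above.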

Hope this helps.

If you use scan, it splits the input on whitespace:

> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4

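The what="" argument is needed to tell scan to expect character rather than numeric input. If your input consists of full English sentences, you may first need to strip the punctuation using gsub with patt="[[:punct:]]". If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.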
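
A rough sketch of the punctuation step described above, assuming it is acceptable to delete every punctuation character before splitting (the sample sentence and the names clean, words and position are made up for illustration):

```r
sentence <- "A sample, sentence for demo!"
query <- "for"

# Remove all punctuation characters, then split on whitespace with scan()
clean <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence)
words <- scan(text = clean, what = "", quiet = TRUE)

# Word positions that exactly equal the query
position <- which(words == query)
```

Note that deleting punctuation also changes tokens such as "don't" or "e.g.", so this is only suitable when exact word forms do not matter.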

I wrote the functions myself. indexOf returns the index of word if it is found in sentence and -1 otherwise, much like Java's indexOf().

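indexOf <- function(sentence, word){
  # split on single spaces and flatten to a character vector of words
  sentenceAsVector <- unlist(strsplit(sentence, split = " "))

  if(!(word %in% sentenceAsVector)){
    # not found: mimic Java's indexOf()
    result <- -1
  } else {
    # holds several positions if the word occurs more than once
    result <- which(sentenceAsVector == word)
  }
  return(result)
}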

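The extend function works, but it is rather verbose and does not look like R code at all. If query is a word on the sentence boundary, i.e. the first or the last word, it returns the first two or the last two words.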

extend <- function(sentence, query, direction){
  listOfWords = strsplit(sentence, split = " ")
  sentenceAsVector = unlist(listOfWords)
  lengthOfSentence = length(sentenceAsVector)
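  # use the first match if query occurs more than once
  location = indexOf(sentence, query)[1]
  # query not found: nothing to extend
  if(location == -1){
    return(NA_character_)
  }
  # TRUE when query is the first or the last word of the sentence
  boundary = location == 1 || location == lengthOfSentence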
  if(!boundary){ 
    if(location> 1 & direction == "right"){  
      return(paste(sentenceAsVector[location], 
                   sentenceAsVector[location + 1],
                   sep=" ")
      )
    }
    else if(location < lengthOfSentence & direction == "left"){
      return(paste(sentenceAsVector[location - 1], 
                   sentenceAsVector[location],
                   sep=" ")
      )

    }
  }
  else{
    if(location == 1 ){
      return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if(location == lengthOfSentence){
      return(paste(sentenceAsVector[lengthOfSentence - 1],
                   sentenceAsVector[lengthOfSentence], sep = " "))
    }
  } 
}
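
For comparison, a more compact sketch of the same idea. This is not the function above: the name extendQuery is hypothetical, it uses only the first occurrence of the query, and it returns NA when the query is absent, which are assumptions rather than behaviour the original guarantees.

```r
extendQuery <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
  n <- length(words)

  # First position of the query among the words; NA if it does not occur
  location <- match(query, words)
  if (is.na(location)) return(NA_character_)
  # A one-word sentence cannot be extended
  if (n < 2L) return(query)

  # The two-word window starts at the query (right) or one word before (left)
  start <- if (direction == "right") location else location - 1L
  # Clamp the window so boundary words give the first or last two words
  start <- max(1L, min(start, n - 1L))
  paste(words[start:(start + 1L)], collapse = " ")
}

sentence <- "A sample sentence for demo"
extendedRight <- extendQuery(sentence, "for", "right")
extendedLeft <- extendQuery(sentence, "for", "left")
```

Clamping the start index replaces the separate boundary branches: for the first word the window is always words 1 to 2, and for the last word it is the final two words.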

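The answer depends on what you mean by a "word". If you mean space-separated tokens, then @imran-ali's answer works. If you mean words as defined by Unicode, with particular care taken over punctuation, then you need something more sophisticated.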

The following handles punctuation correctly:

library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"

# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
##   text             before              instance              after              
## 1 1                 A sample sentence    for     demo             

# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4

This also works when there are multiple matches:

sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1]  1  4  7 10

We can verify that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"