Are there text processing functions that operate at the word level in R?
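I am trying to find a set of functions in R that operate at the word level, for example a function that returns the position of a word. Given the following sentence and query: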
sentence <- "A sample sentence for demo"
query <- "for"
the function would return 4, because for is the 4th word.
It would be great to have a utility function that lets me extend query to the left or to the right. For example, extend(query, 'right') would return "for demo" and extend(query, 'left') would return "sentence for".
I have already looked at functions such as grep, gregexpr, and word from the stringr package. They all seem to operate at the character level.
As I mentioned in the comments, stringr is useful in these cases.
library(stringr)
sentence <- "A sample sentence for demo"
wordNumber <- 4L
fourthWord <- word(string = sentence,
                   start = wordNumber)
previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)
laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)
This produces:
> fourthWord
[1] "for"
> previousWords
[1] "sentence for"
> laterWords
[1] "for demo"
Hope that helps.
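The example above hard-codes the word number. As a rough sketch of how the position could be derived from the query instead, one option is to split the sentence on single spaces, look the query up with which(), and pass the result to word(). The variable names words, position, extended_right, and extended_left are made up for illustration, and the sketch assumes the query occurs exactly once and is not the first or last word.

```r
library(stringr)

sentence <- "A sample sentence for demo"
query <- "for"

# split on single spaces; str_split() returns a list, so take element 1
words <- str_split(sentence, " ")[[1]]

# position of the query among the space-separated words
position <- which(words == query)

# extend one word to the right and one word to the left
extended_right <- word(sentence, start = position, end = position + 1L)
extended_left <- word(sentence, start = position - 1L, end = position)
```

With the sample sentence, position is 4, extended_right is "for demo", and extended_left is "sentence for".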
If you use scan, it will split the input at whitespace:
> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4
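what="" is needed to tell scan to expect character rather than numeric input. If your input consists of full English sentences, you may need to strip the punctuation first, using gsub with patt="[[:punct:]]". If you are trying to classify parts of speech or to process large documents, you may also want to look at the tm (text mining) package.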
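A minimal sketch of the punctuation step described above, using only base R. The sample sentence is made up for illustration, and removing every punctuation character is an assumption that will also strip apostrophes and hyphens inside words.

```r
sentence <- "Hello, world! This is a demo, for sure."
query <- "for"

# drop all punctuation characters before tokenizing
cleaned <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence)

# quiet = TRUE suppresses the "Read N items" message
words <- scan(text = cleaned, what = "", quiet = TRUE)

# position of the query among the whitespace-separated words
position <- which(words == query)
```

Here position is 7, because "for" is the seventh word once the commas, the exclamation mark, and the period are gone.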
I wrote the functions myself. The indexOf method returns the index of word if it is found in sentence and -1 otherwise, much like Java's indexOf():
indexOf <- function(sentence, word){
  # split on single spaces and flatten the result to a character vector
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  if(!(word %in% sentenceAsVector)){
    result <- -1
  }
  else{
    # every position at which the word occurs
    result <- which(sentenceAsVector == word)
  }
  return(result)
}
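For comparison, a shorter sketch of the same lookup can lean on base R's match(), which takes a nomatch argument. The name index_of is hypothetical, and unlike the function above this version reports only the first occurrence of the word.

```r
index_of <- function(sentence, word) {
  # split on literal single spaces
  words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
  # index of the first match, or -1 when the word is absent
  match(word, words, nomatch = -1L)
}

found <- index_of("A sample sentence for demo", "for")
missing <- index_of("A sample sentence for demo", "absent")
```

found is 4 and missing is -1.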
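The extend method works, but it is rather verbose and does not look like R code at all. If query is a word on the sentence boundary, i.e. the first or the last word, it returns the first two or the last two words: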
extend <- function(sentence, query, direction){
  listOfWords <- strsplit(sentence, split = " ")
  sentenceAsVector <- unlist(listOfWords)
  lengthOfSentence <- length(sentenceAsVector)
  # use the first match only, so the comparisons below stay scalar
  location <- indexOf(sentence, query)[1]
  if(location == -1){
    # query not found in sentence
    return(NA_character_)
  }
  boundary <- location == 1 || location == lengthOfSentence
  if(!boundary){
    if(direction == "right"){
      return(paste(sentenceAsVector[location],
                   sentenceAsVector[location + 1],
                   sep = " "))
    }
    else if(direction == "left"){
      return(paste(sentenceAsVector[location - 1],
                   sentenceAsVector[location],
                   sep = " "))
    }
  }
  else{
    # on a boundary, return the first two or the last two words
    if(location == 1){
      return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if(location == lengthOfSentence){
      return(paste(sentenceAsVector[lengthOfSentence - 1],
                   sentenceAsVector[lengthOfSentence], sep = " "))
    }
  }
}
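A more compact, vectorized sketch of the same idea is below. The name extend_query is hypothetical. It deliberately differs from the function above in two ways: it handles every occurrence of the query, and it returns nothing for an occurrence that has no neighbour in the requested direction instead of falling back to the first or last two words.

```r
extend_query <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
  hits <- which(words == query)

  # index of the neighbouring word for each hit
  offset <- if (direction == "right") 1L else -1L
  neighbours <- hits + offset

  # keep only hits whose neighbour actually exists
  ok <- neighbours >= 1L & neighbours <= length(words)
  starts <- pmin(hits, neighbours)[ok]

  # each result is the pair of adjacent words starting at 'starts'
  paste(words[starts], words[starts + 1L])
}

right <- extend_query("A sample sentence for demo", "for", "right")
left <- extend_query("A sample sentence for demo", "for", "left")
none <- extend_query("A sample sentence for demo", "A", "left")
```

right is "for demo", left is "sentence for", and none is an empty character vector because "A" has no word to its left.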
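The answer depends on what you mean by "word". If you mean whitespace-separated tokens, then @imran-ali's answer is fine. If you mean words as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.
The following handles punctuation correctly: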
library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"
# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
## text before instance after
## 1 1 A sample sentence for demo
# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4
This also works when there are multiple matches:
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1] 1 4 7 10
We can verify that this is correct:
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"
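To see why the definition of "word" matters, here is a small base R sketch that splits the same sentence2 on spaces only. The variable names tokens and positions are made up for illustration.

```r
sentence2 <- "for one, for two! for three? for four"

# whitespace splitting leaves punctuation attached: "one,", "two!", "three?"
tokens <- strsplit(sentence2, split = " ", fixed = TRUE)[[1]]

# positions of the query among the space-separated tokens
positions <- which(tokens == "for")
```

positions is 1 3 5 7 here, whereas the corpus output above is 1 4 7 10. The two differ because each punctuation mark is counted as a token of its own in the corpus result.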