How to find trailing and leading words of a word using R?
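I have a text document with a million words. I need to know how to find the words that come immediately before (leading) and immediately after (trailing) a given word using R.
For example, suppose I want the words around the word "error". The leading words could be anything, such as:
"typo error"
"manual error"
"system error"
and the trailing words could be, for example:
"error corrected"
"error found"
"error occurred"
Any idea how to do this? Thanks in advance for your input.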
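For the words before "error":
x <- "no error and no error and some error" # input
library(gsubfn)
rx <- "(\\w+) error" # backslashes must be doubled inside an R string
table(strapplyc(x, rx)[[1]])
giving:
  no some
   2    1
For the words after "error", replace rx with:
rx <- "error (\\w+)"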
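If installing gsubfn is not an option, the same idea can be sketched in base R with regmatches() and gregexpr(). The helper names words_before and words_after are hypothetical (they are not part of the answer above), and the sketch assumes that target is a plain word without regex metacharacters. Unlike the pattern above, it adds a word boundary, so "errors" is not treated as "error".

```r
# Hypothetical helpers, not from the answer above.
# Assumption: `target` is a plain word with no regex metacharacters.
words_before <- function(x, target) {
  # a word followed by a space and the whole target word (lookahead, not consumed)
  rx <- paste0("\\w+(?= ", target, "\\b)")
  regmatches(x, gregexpr(rx, x, perl = TRUE))[[1]]
}

words_after <- function(x, target) {
  # a word preceded by the whole target word and a space (fixed-length lookbehind)
  rx <- paste0("(?<=\\b", target, " )\\w+")
  regmatches(x, gregexpr(rx, x, perl = TRUE))[[1]]
}

x <- "no error and no error and some error"
before <- words_before(x, "error")
after <- words_after(x, "error")

table(before)  # frequency of each leading word
table(after)   # frequency of each trailing word
```

Because the lookarounds do not consume the target word, one word can be reported both as the trailing word of one occurrence and as the leading word of the next.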
How about this:
test <- c("I don't want to match error this This is a random error what I want to match")
# split on spaces; strsplit() returns a list with one character vector per input string
words <- strsplit(test, ' ')
# get the indexes of the words that contain 'error'
indexes <- grep('error', words[[1]], perl=TRUE)
# select the words that come after 'error'
words[[1]][indexes+1]
# and the words that come before 'error'
words[[1]][indexes-1]
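Two edge cases are worth keeping in mind with the index approach. If "error" is the first word, indexes-1 contains 0, which R silently drops, so the before and after vectors no longer line up. If it is the last word, indexes+1 yields NA. Also, grep() matches substrings such as "errors". A minimal sketch that keeps the two vectors aligned is below. The function name neighbours is hypothetical, and it assumes exact, case-sensitive matching on whitespace-separated tokens.

```r
# Hypothetical helper, not from the answer above.
# Assumption: words are separated by whitespace and must equal `target` exactly.
neighbours <- function(text, target) {
  words <- strsplit(text, "\\s+")[[1]]
  idx <- which(words == target)
  # NA when the target is the first word; pmax() keeps the index valid
  before <- ifelse(idx > 1, words[pmax(idx - 1, 1)], NA_character_)
  # indexing past the end of a character vector already yields NA
  after <- words[idx + 1]
  data.frame(before = before, after = after, stringsAsFactors = FALSE)
}

res <- neighbours("error found after a manual error", "error")
res  # one row per occurrence of the target word
```

Each row describes one occurrence, so the leading and trailing word of the same occurrence always stay paired.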
My solution uses str_match_all:
library(stringr)
txt <- "system error corrected hardcore error detected wtf error holymoly"
str_match_all(txt, "\\s*(\\w+)\\serror\\s*(\\w+)")
[[1]]
     [,1]                       [,2]       [,3]
[1,] "system error corrected"   "system"   "corrected"
[2,] " hardcore error detected" "hardcore" "detected"
[3,] " wtf error holymoly"      "wtf"      "holymoly"
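One caveat with this pattern: it consumes both neighbouring words, so when the trailing word of one occurrence is also the leading word of the next, the second occurrence is skipped. A possible variant, sketched below, captures the trailing word inside a lookahead so that it remains available to the next match. The sample text is made up for illustration, and the sketch assumes that every occurrence of interest has a word on both sides.

```r
library(stringr)

# Made-up sample in which "then" sits between two occurrences of "error"
txt <- "system error then error found"

# Pattern from the answer above: both neighbours are consumed by each match
consuming <- str_match_all(txt, "\\s*(\\w+)\\serror\\s*(\\w+)")[[1]]

# Lookahead variant: the trailing word is captured but not consumed
overlapping <- str_match_all(txt, "(\\w+)\\s+error(?=\\s+(\\w+))")[[1]]

nrow(consuming)     # number of occurrences found by the consuming pattern
overlapping[, 2:3]  # leading and trailing word for each occurrence
```

In the variant, column 1 holds only the leading word plus "error", because the lookahead is not part of the matched text.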