Relative Position of Words in String in R

Question

I have the following term-document matrix and data frame.

tdm <- c('Free', 'New', 'Limited', 'Offer')



Subject                                                           Free  New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      0     0    0        0
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  0     0    0        0

I would like to derive the following data frame as output:

Subject                                                           Free   New  Limited  Offer
'Free Free Free! Clear Cover with New Phone'                      1,2,3  8    NA       NA
'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'  NA     3    12       1,13

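I tried the following code and got a result, but it only gives the character position of each word within the string. I need the position of the word among the words of the string, e.g. Hello - 1, new - 2.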

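for (i in seq_along(tdm)) {
  # gregexpr() returns character offsets of each match, not word positions
  word.locations <- gsub(")", "",
                         gsub("c(", "",
                              unlist(paste(gregexpr(pattern = tdm[i], DF$Subject))),
                              fixed = TRUE),
                         fixed = TRUE)
  # accumulate one column of locations per term
  DF <- cbind(DF, word.locations)
}
colnames(DF) <- c("text", tdm)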

Any help would be appreciated.
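To make the distinction concrete, here is a minimal sketch (not part of the original question) contrasting the character offset that gregexpr() reports with a word index. The names s, char_pos, words and word_pos are illustrative assumptions. Note that splitting on whitespace alone does not count "!" as a separate token, which the desired output in the question appears to do.

```r
s <- "Free Free Free! Clear Cover with New Phone"

# gregexpr() reports the character offset at which each match starts
char_pos <- as.integer(gregexpr("New", s, fixed = TRUE)[[1]])
char_pos
# 34: a character offset, not a word index

# Splitting into tokens first turns the task into finding an index in a vector
words <- strsplit(s, "\\s+")[[1]]
word_pos <- which(words == "New")
word_pos
# 7: "Free!" is still a single token here, so "New" is the 7th element
```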

Answer 1

Given the input:

tdm <- c('Free', 'New', 'Limited', 'Offer')
subject <- c("Free Free Free! Clear Cover with New Phone",
             "Offer ! Buy New phone and get earphone at 1000. Limited Offer!")

I would do this:

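sapply(tolower(tdm), function(x) {
    # Split on whitespace, and before any punctuation mark except an apostrophe
    lapply(strsplit(tolower(subject), "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE), 
      function(y) {
        y <- y[nzchar(y)]        # drop empty strings left by the split
        toString(grep(x, y))     # positions of tokens that contain the term
      })
})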
##      free      new limited offer  
## [1,] "1, 2, 3" "8" ""      ""     
## [2,] ""        "4" "12"    "1, 13"

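What is happening:

Use tolower on both the strings to search and the terms to match.
Use strsplit to split words and punctuation marks into separate items within each list element.
Remove any empty strings with nzchar().
Use grep() to find the positions of the matches.
Use toString() to paste the positions together as a comma-separated string.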
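The steps above can be traced on a single string and then wrapped in small helpers. The following is a sketch rather than the answerer's code: tokenize() and word_positions() are hypothetical names. It swaps grep() for exact matching with ==, so that a term such as "new" would not also match a token like "news", and it returns NA when a term is absent, as in the question's desired output.

```r
tdm <- c("Free", "New", "Limited", "Offer")
subject <- c("Free Free Free! Clear Cover with New Phone",
             "Offer ! Buy New phone and get earphone at 1000. Limited Offer!")

# Hypothetical helper: lower-case, split on whitespace and before punctuation
# (except apostrophes), then drop the empty strings that strsplit() leaves
tokenize <- function(s) {
  parts <- strsplit(tolower(s), "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)[[1]]
  parts[nzchar(parts)]
}

tokens <- tokenize(subject[1])
tokens
# "free" "free" "free" "!" "clear" "cover" "with" "new" "phone"

# Hypothetical helper: positions of one term among the tokens, NA if absent
word_positions <- function(term, s) {
  pos <- which(tokenize(s) == tolower(term))
  if (length(pos) == 0) NA_character_ else toString(pos)
}

# One column per term, one row per subject
result <- data.frame(
  Subject = subject,
  sapply(tdm, function(term) {
    vapply(subject, function(s) word_positions(term, s), character(1),
           USE.NAMES = FALSE)
  }),
  stringsAsFactors = FALSE
)
result
```

Exact matching is a design choice: if partial matches such as "offers" should count for "offer", grep() as in the answer is the closer fit.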


regex

position

r

text-mining