R: Counting frequency of words from predefined dictionary

Question

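I have a very large dataset that looks like this: one column contains names, and a second column contains their respective (long) texts. I also have a predefined dictionary of at least 20 terms. How can I count how many times these keywords occur in each row of my data frame? I have tried str_detect, grep(l), and %>%, as well as looping over each row, but the problem seems to be that I want to detect too many terms: these functions stop working once I use around 15 or more terms.

I would be very glad if someone could help me with this!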

col1 <- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me") #but my actual dictionary is much larger

Answer 1

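Create a unique identifier for your rows. Split col2 into words, one per row. Filter to keep only the words that are in your dictionary. Then count per row. Finally, join the counts back to the original df and set NA to zero for the rows that contain no words from your dictionary.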

library(dplyr)

col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")

df <- df %>% mutate(row=row_number()) %>% select(row, everything())

counts <- df %>% tidyr::separate_rows(col2) %>% filter(col2 %in% dict) %>% group_by(row) %>% count(name = "counts")

final <- left_join(df, counts, by="row") %>% tidyr::replace_na(list(counts=0L))
final
#>   row col1                        col2 counts
#> 1   1    A I am going to get groceries      2
#> 2   2    B        He called me at six.      1
#> 3   3    A              No, he did not      0

Answer 2

Here is a base R option using gregexpr

dfout <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), col2),
    function(x) sum(x > 0)
  )
)

or

dfout <- within(
  df,
  counts <- sapply(
    regmatches(col2, gregexpr("\\w+", col2)),
    function(v) sum(v %in% dict)
  )
)

which gives

> dfout
  col1                        col2 counts
1    1 I am going to get groceries      2
2    2        He called me at six.      1
3    3              No, he did not      0

Data

structure(list(col1 = 1:3, col2 = c("I am going to get groceries", 
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA, 
-3L))
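One caveat with the first gregexpr option above is that a plain alternation such as "groceries|going|me" also matches inside longer words (for example "me" inside "some"). The following is a minimal sketch, not part of the original answer, that wraps the alternation in word boundaries. The helper name count_terms is hypothetical, and the sketch assumes the dictionary terms contain no regex metacharacters.

```r
# Same toy data as in the question.
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
dict <- c("groceries", "going", "me")

# Wrap the alternation in \\b ... \\b so that "me" does not match inside "some".
# Assumption: dictionary terms contain no regex metacharacters.
pattern <- paste0("\\b(", paste(dict, collapse = "|"), ")\\b")

# Hypothetical helper: number of whole-word dictionary matches per text.
count_terms <- function(text, pattern) {
  matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  lengths(matches)
}

counts <- count_terms(col2, pattern)
counts
```

If case should be ignored as well, passing ignore.case = TRUE to gregexpr is one option.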

Answer 3

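I think my solution gives you the output you want, i.e. for each word in the "dict" vector you can see how many times it occurs in each sentence. Each row is one entry of df$col2, i.e. one sentence. "dict" is the vector of terms you want to match. We can loop over the vector and, for each entry, use stringr::str_count to count how many times that entry occurs in each row/sentence. Note the syntax of str_count: str_count(string being checked, pattern you want to match).

str_count returns a vector giving the number of times the word occurs in each row. I build a data frame from these vectors, with as many rows as there are entries in the dict vector. You can then bind "dict" to that data frame to see how many times each word is used in each sentence. I adjust the column names at the end so that you can match the words to the sentence numbers. Note that if you want to compute row means, you need to drop the "dict" column from the final data frame, because it is character.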

 library(stringr)
 col1<- c("Henrik", "Joseph", "Lucy")
 col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
 df <- data.frame(col1, col2)
 dict <- c("groceries", "going", "me")

 word_matches <- data.frame()
 for (i in dict) {
 word_tot<-(str_count(df$col2, i))
 word_matches <- rbind(word_matches,word_tot)
 }
 word_matches
 colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
 cbind(dict,word_matches)


        dict Sentence 1 Sentence 2 Sentence 3
 1 groceries          1          0          0
 2     going          1          0          0
 3        me          0          1          0
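The same per-word table can also be built without growing a data frame inside a loop. The sketch below is an assumption-laden alternative, not the answerer's code: it uses only base R, treats each term as a literal string rather than a regular expression, and the helper name count_one is hypothetical. It produces one row per sentence and one column per term, so rowSums gives the total per sentence.

```r
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
dict <- c("groceries", "going", "me")

# Hypothetical helper: literal (fixed-string) occurrences of one term in each text.
count_one <- function(term, texts) {
  lengths(regmatches(texts, gregexpr(term, texts, fixed = TRUE)))
}

# One column per dictionary term, one row per sentence.
word_matches <- vapply(dict, count_one, integer(length(col2)), texts = col2)
rownames(word_matches) <- paste("Sentence", seq_along(col2))

word_matches

# Total number of dictionary hits per sentence.
rowSums(word_matches)
```

Because the result is a numeric matrix with the terms as column names, there is no character column to drop before calling rowSums or rowMeans.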
