How to split a text into a vector, where each entry corresponds to an index value assigned to each unique word?
Suppose I have a document containing some text, for example from SO:
doc <- 'Questions with similar titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
I can then build a data frame in which each word has its own row:
library(stringi)
dfall <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc))))
I would then like to add a second column holding each word's unique ID. To get the IDs, I drop the duplicates:
library(dplyr)
uniquedf <- distinct(data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc)))))
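I am struggling to work out how to match rows between the two data frames, so that the row index from uniquedf is pulled in as a new column value in dfall:
alldf <- dfall %>% mutate(id = which(uniquedf$words == words))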
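A dplyr approach like this doesn't work.
Is there a more efficient way to do this?
To give a simpler example that shows the expected output, I would like a data frame that looks like this:
  words id
1    to  1
2   row  2
3   zip  3
4   zip  3
My starting word vector is doc <- c('to', 'row', 'zip', 'zip') or doc <- c('to row zip zip'). The id column adds a unique id for each unique word.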
A cheap way using sapply
Data
doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'
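Function
# dfall and uniquedf are rebuilt from the doc above, exactly as in the question
# For each row of dfall, look up the row of uniquedf that holds the same word
alldf <- cbind(dfall, sapply(seq_len(nrow(dfall)), function(x) which(uniquedf$words == dfall$words[x])))
colnames(alldf) <- c("words", "id")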
> alldf
        words id
1   questions  1
2        with  2
3        with  2
4      titles  3
5        have  4
6  frequently  5
7        been  6
8   downvoted  7
9         and  8
10         or  9
11     closed 10
12   consider 11
13      using 12
14          a 13
15      title 14
16       that 15
17       more 16
18 accurately 17
19  describes 18
20       your 19
21   question 20
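For comparison, the same ids can probably be obtained without an explicit loop. The following is only a sketch under the assumption that base R's match() and unique() are acceptable here; it uses the simplified word vector from the question (named words for illustration) rather than the stringi pipeline, and is not the method of the answer above.

```r
# Simplified input taken from the question's small example
words <- c("to", "row", "zip", "zip")

# match() gives, for each word, the position of its first occurrence
# in the de-duplicated vector; that position serves as the word's id
alldf <- data.frame(words = words, id = match(words, unique(words)))

# An equivalent formulation via factor levels kept in order of first appearance
id_factor <- as.integer(factor(words, levels = unique(words)))

print(alldf)
```

With the data frames from the question, the same idea would read match(dfall$words, uniquedf$words), since uniquedf holds the words in order of first appearance.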