Keep EXACT words from R corpus
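From the answer posted at: Keep document ID with R corpus by @MrFlick
I'm trying to slightly modify a great example.
Question: How can I modify the content_transformer function to keep only exact words? You can see in the inspect output that wonderful is counted as wonder, and rationale is counted as ratio. I don't have a strong understanding of gregexpr and regmatches.
Create the data frame: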
dd <- data.frame(
    id = 10:13,
    text = c("No wonderful, then, that ever",
             "So that in many cases such a ",
             "But there were still other and",
             "Not even at the rationale"),
    stringsAsFactors = FALSE
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader
library(tm)
myReader <- readTabular(mapping = list(content = "text", id = "id"))
This specifies the column to use for the content and the column to use for the ID in the data.frame. Now we read it in with DataframeSource, but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
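Now if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
    regmatches(x,
        gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x)
    , invert = TRUE) <- " "
    x
})
This replaces everything that is not in the word list with a space. Note that you may want to run stripWhitespace after this. Thus our transformations would look like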
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
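To see what the transformer does to a single string, here is a minimal base R sketch (no tm required). The pattern is built the same way as in keepOnlyWords; the names x, m, kept and dropped are mine and only serve the illustration.

```r
x <- "no wonderful, then, that ever"
keep <- c("wonder", "then", "that", "the")

# Same construction as in keepOnlyWords: the trailing \\b sits inside the
# group, so it only constrains the last alternative ("the").
pattern <- paste0("\\b(", paste(keep, collapse = "|"), "\\b)")
m <- gregexpr(pattern, x)

# Substrings that match the pattern and will therefore be kept.
kept <- regmatches(x, m)[[1]]
kept      # "wonder" "then" "that"

# With invert = TRUE we get the pieces between the matches instead.
dropped <- regmatches(x, m, invert = TRUE)[[1]]
dropped   # "no " "ful, " ", " " ever"

# Assigning " " to the inverted matches blanks out everything else.
regmatches(x, m, invert = TRUE) <- " "
x
```

Note that "wonder" is kept even though the text only contains "wonderful": nothing in the pattern requires a word boundary after "wonder".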
Inspecting the dtm matrix:
> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity           : 56%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0
Switching the grammar to tidytext, your current transformation would be
library(tidyverse)
library(tidytext)
library(stringr)
dd %>% unnest_tokens(word, text) %>%
    mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
    inner_join(tibble(word = keep))
## id word
## 1 10 wonder
## 2 10 the
## 3 10 that
## 4 11 that
## 5 12 the
## 6 12 the
## 7 13 the
Keeping exact matches is actually easier, because you can use joins (which use ==) instead of regex:
dd %>% unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep))
## id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13 the
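The same exact-match idea can be sketched in base R, without tidytext. This is only a stand-in: splitting on non-letters with strsplit is a much cruder tokenizer than unnest_tokens, and the names tokens and exact are mine, not part of the answer.

```r
dd <- data.frame(
    id = 10:13,
    text = c("No wonderful, then, that ever",
             "So that in many cases such a ",
             "But there were still other and",
             "Not even at the rationale"),
    stringsAsFactors = FALSE
)
keep <- c("wonder", "then", "that", "the")

# Crude tokenizer (assumption): lowercase, then split on runs of non-letters.
tokens <- strsplit(tolower(dd$text), "[^a-z]+")
names(tokens) <- dd$id

# %in% compares whole tokens, so "wonderful" can never count as "wonder".
exact <- lapply(tokens, function(w) w[w %in% keep])
exact
```

Because the comparison is on whole tokens, no word boundaries have to be encoded by hand, which is the point of using a join instead of a regex.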
To take it back to a document-term matrix,
library(tm)
dd %>% mutate(id = factor(id)) %>%    # to keep empty rows of DTM
    unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep)) %>%
    mutate(i = 1) %>%
    cast_dtm(id, word, i) %>%
    inspect()
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs then that the
## 10 1 1 0
## 11 0 1 0
## 12 0 0 0
## 13 0 0 1
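Why the factor(id) step keeps document 12 can be illustrated with plain table() in base R. This is a rough analogy under my own assumptions, not what cast_dtm does internally; the hits data frame simply retypes the four rows that survived the join, and the names hits and counts are hypothetical.

```r
# Tokens that survived the exact-match filter (ids and words as in the join output).
hits <- data.frame(id = c(10, 10, 11, 13),
                   word = c("then", "that", "that", "the"),
                   stringsAsFactors = FALSE)

# Without factor levels, document 12 (no hits) has no row at all.
nrow(table(hits$id, hits$word))   # 3

# Declaring all four ids as factor levels keeps an all-zero row for 12.
counts <- table(factor(hits$id, levels = 10:13), hits$word)
counts
```

The factor carries the full set of document ids even after the rows for document 12 have been filtered out, so the empty document still shows up in the result.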
Currently, your function is matching words with a boundary before or after. To change it to before and after, change the collapse parameter to include boundaries as well:
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
keepOnlyWords <- content_transformer(function(x, words) {
    regmatches(x,
        gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
    , invert = TRUE) <- " "
    x
})
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
inspect(DocumentTermMatrix(tm))
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs that the then
## 10 1 0 1
## 11 1 0 0
## 12 0 0 0
## 13 0 1 0
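The effect of putting a boundary on both sides of every word can be checked directly on the lowercased sample sentences, using base R only. A small sketch; docs and kept are names I introduced for the illustration.

```r
docs <- c("no wonderful, then, that ever",
          "so that in many cases such a ",
          "but there were still other and",
          "not even at the rationale")
keep <- c("wonder", "then", "that", "the")

# Every alternative carries its own leading and trailing \\b.
pattern <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

# One character vector of whole-word matches per document.
kept <- regmatches(docs, gregexpr(pattern, docs))
kept
```

On this data the first document keeps "then" and "that", the second "that", the third nothing and the fourth "the", which agrees with the matrix above: neither "wonderful" nor "there" or "other" slips through.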
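I got the same result as @alistaire using tm, with the following modified line in the keepOnlyWords content transformer first defined by @BEMR:
gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)
There is a misplaced ")" in the gregexpr first specified by @BEMR, i.e. it should be ")\\b" rather than "\\b)".
I think the gregexpr above is equivalent to the one specified by @alistaire:
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)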
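One way to check the claimed equivalence, at least on the sample data, is to compare what the two patterns match. A minimal base R sketch; the names grouped, repeated and same are mine.

```r
docs <- c("no wonderful, then, that ever",
          "so that in many cases such a ",
          "but there were still other and",
          "not even at the rationale")
keep <- c("wonder", "then", "that", "the")

# Boundaries factored out around a single group: \b(a|b|c)\b
grouped  <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
# Boundaries repeated around every alternative: (\ba\b|\bb\b|\bc\b)
repeated <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

# Compare the extracted matches document by document.
same <- identical(regmatches(docs, gregexpr(grouped, docs)),
                  regmatches(docs, gregexpr(repeated, docs)))
same
```

This only shows agreement on these four sentences, not a general proof, but both patterns require a boundary on each side of whichever word matches, so I would expect them to agree on other input as well.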