R: How do I keep punctuation with TermDocumentMatrix()?

Question

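I have a large data frame in which I identify patterns in strings and then extract them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multi-word terms. I then use these patterns with stri_extract and str_replace, from the stringi and stringr packages, to search within the 'punct_prob' data frame.

My problem is that I need to keep the punctuation in 'punct_prob$description' intact to preserve the literal meaning of each string. For example, I can't have 2.35mm turning into 235mm. However, the TermDocumentMatrix procedure I am using removes punctuation (or at least periods), so my pattern-search functions can't match the original strings.

In short: how do I keep punctuation when generating a TDM? I have tried passing removePunctuation=FALSE in the TermDocumentMatrix control argument, with no success.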

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

Inspection result - no punctuation....

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks in advance for your help :)

Answer 1

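The problem is not TermDocumentMatrix but the RWeka-based n-gram tokenizer. RWeka removes punctuation when it tokenizes.

If you use the NLP tokenizer instead, punctuation is preserved. See the code below.

P.S. I removed a space from your third line of text, so that "P. B" becomes "P.B", as in line 2.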

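library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

# number of words in the longest description; used as the n-gram size
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

# VCorpus rather than Corpus: in newer tm versions Corpus(VectorSource(...))
# may build a SimpleCorpus, for which custom tokenizers can be ignored
punct_prob_corpus = VCorpus(VectorSource(punct_prob$description))

# n-gram tokenizer built on NLP's words() and ngrams(); words() splits on
# whitespace only, so punctuation inside tokens is kept
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}

punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))
inspect(punct_prob_tdm)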

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0
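
To make the difference between the two tokenizers concrete, here is a minimal base R sketch (no tm or RWeka needed). It only imitates the two behaviours: the helper split_on() is hypothetical, and the delimiter set in the first call is an assumption chosen for illustration, not a statement of RWeka's actual configuration.

```r
# Hypothetical helper: split a string on a regex and drop empty pieces.
split_on <- function(x, pattern) {
  tokens <- strsplit(x, pattern)[[1]]
  tokens[nzchar(tokens)]
}

text <- "contra angle head 2:1 for 2.35mm bur"

# Punctuation treated as a delimiter (assumed set, for illustration only):
# "2:1" breaks into "2" "1", and "2.35mm" into "2" "35mm".
delim_tokens <- split_on(text, "[[:space:].,;:'\"()?!]+")

# Whitespace-only splitting: "2:1" and "2.35mm" stay intact.
ws_tokens <- split_on(text, "[[:space:]]+")

print(delim_tokens)
print(ws_tokens)
```

The first result lines up with the punctuation-free terms in the question's output, the second with the terms shown above.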

Answer 2

The quanteda package is smart enough to tokenize without treating intra-word punctuation as "punctuation". That makes building the matrix very easy:

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

require(quanteda)
myDfm <- dfm(txt, ngrams = 6:8, concatenator = " ")
t(myDfm)
#                                        docs
# features                                text1 text2 text3 text4
#   contra angle head for 2.35mm bur          1     0     0     0
#   titanium line mini p.b f.o trip           0     1     0     0
#   line mini p.b f.o trip spray              0     1     0     0
#   titanium line mini p.b f.o trip spray     0     1     0     0
#   titanium line power p.b f.o trip          0     0     1     0
#   line power p.b f.o trip spr               0     0     1     0
#   titanium line power p.b f.o trip spr      0     0     1     0

If you do want to keep the "punctuation", it is tokenized as a separate token where it ends a word:

myDfm2 <- dfm(txt, ngrams = 8, concatenator = " ", removePunct = FALSE)
t(myDfm2)
#                                          docs
# features                                  text1 text2 text3 text4
#   titanium line mini p.b f.o . trip spray     0     1     0     0
#   titanium line power p.b f.o . trip spr      0     0     1     0

Note that the ngrams argument is fully flexible and can take a vector of n-gram sizes, as in the first example, where ngrams = 6:8 means it should form 6-, 7-, and 8-grams.
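
As a rough illustration of what a vector of n-gram sizes means, here is a base R sketch. The helper make_ngrams() is hypothetical and not part of quanteda; it simply slides a window of each requested size over a token vector.

```r
# Hypothetical helper (not a quanteda function): build n-grams for each
# size in `sizes`, skipping sizes longer than the token vector.
make_ngrams <- function(tokens, sizes, sep = " ") {
  unlist(lapply(sizes, function(n) {
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
           character(1))
  }))
}

# Tokens of the second example text, with the trailing period already dropped.
tokens <- c("titanium", "line", "mini", "p.b", "f.o", "trip", "spray")

grams <- make_ngrams(tokens, 6:8)
print(grams)
```

With seven tokens this gives two 6-grams, one 7-gram, and no 8-gram, which matches the three text2 rows in the first matrix above. In more recent quanteda releases the n-gram step is, as far as I know, done with tokens_ngrams() on a tokens object rather than through dfm(), so the calls above may need adapting.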
