How to remove punctuation using the tm package without removing a period in R?
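I am using the tm package to remove punctuation. When there is no space between a period and the word that follows it, removing punctuation simply deletes the period and joins the two words together.
For example: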
"transactions.Proceessed"
"trnsaction.It"
After applying "remove punctuation" with the tm package, I get the following output:
"transactionsProceessed"
"trnsactionIt"
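Is it possible to add a space between the words, or to keep the period, while still using the remove-punctuation function?
Update
The examples above are only samples. The input is a huge text file. I am using the tm_map function to remove punctuation. This is the code I use: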
library(tm)

# set parameters
candidates <- c("Obama", "Romney")
pathname <- "H:/datasets/"

# clean texts
cleanCorpus <- function(corpus){
  #corpus.tmp <- tm_map(corpus, removePunctuation)
  ##corpus.tmp <- gsub(".", " ", corpus, fixed = TRUE)
  # the transformer must work on its own argument x, and tm_map on the
  # corpus passed in; s.cor only exists inside generateTDM
  f <- content_transformer(function(x, pattern) sub(pattern, " ", x))
  corpus.tmp <- tm_map(corpus, f, "[[:punct:]]")
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  ##corpus.tmp <- tm_map(corpus.tmp, stemDocument)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}

# create text document matrix
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "UTF-8"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
  return(result)
}

# execute function and create a Text Document Matrix
tdm <- lapply(candidates, generateTDM, path = pathname)
................................................ .....................
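This solution applies to your first option (removing the period, or punctuation in general, and adding a space between the words):
If your input is as simple as the examples, you can try sub from base R: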
sub(".", " ", "transactions.Proceessed", fixed=TRUE)
#[1] "transactions Proceessed"
sub(".", " ", "trnsaction.It", fixed=TRUE)
#[1] "trnsaction It"
x <- c("transactions.Processed", "trnsaction.It")
sub(".", " ", x, fixed=TRUE)
#[1] "transactions Processed" "trnsaction It"
#this one replaces the first punctuation character of any kind (use gsub to replace all of them)
sub("[[:punct:]]", " ",x)
#[1] "transactions Processed" "trnsaction It"
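Note that sub only replaces the first match in each string, so text containing several punctuation marks needs gsub instead. A minimal sketch of the difference, using a made-up input vector y that is not from the question:

```r
# Hypothetical input with more than one punctuation mark per string
y <- c("transactions.Processed,twice!", "trnsaction.It")

# sub() replaces only the first punctuation character in each element
first_only <- sub("[[:punct:]]", " ", y)

# gsub() replaces every punctuation character in each element
all_punct <- gsub("[[:punct:]]", " ", y)

# Collapse runs of whitespace and trim the ends
cleaned <- trimws(gsub("\\s+", " ", all_punct))

print(first_only)
print(cleaned)
```

The whitespace clean-up plays the same role that stripWhitespace plays in a tm pipeline.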
The idea is the same for objects of class VCorpus or Corpus, but you have to use content_transformer to do it:
library(tm)
#You would have to switch to your actual corpus
x <- c("transactions.Processed", "trnsaction.It")
sub("[[:punct:]]", " ",x)
#[1] "transactions Processed" "trnsaction It"
xx <- VCorpus(VectorSource(x))
f <- content_transformer(function(x, pattern) sub(pattern, " ", x))
xx2 <- tm_map(xx, f, "[[:punct:]]")
Here you can see the full structure:
str(xx2)
# List of 2
# $ 1:List of 2
# ..$ content: chr "transactions Processed"
# ..$ meta :List of 7
# .. ..$ author : chr(0)
# .. ..$ datetimestamp: POSIXlt[1:1], format: "2015-09-29 09:24:42"
# .. ..$ description : chr(0)
# .. ..$ heading : chr(0)
# .. ..$ id : chr "1"
# .. ..$ language : chr "en"
# .. ..$ origin : chr(0)
# .. ..- attr(*, "class")= chr "TextDocumentMeta"
# ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
# $ 2:List of 2
# ..$ content: chr "trnsaction It"
# ..$ meta :List of 7
# .. ..$ author : chr(0)
# .. ..$ datetimestamp: POSIXlt[1:1], format: "2015-09-29 09:24:42"
# .. ..$ description : chr(0)
# .. ..$ heading : chr(0)
# .. ..$ id : chr "2"
# .. ..$ language : chr "en"
# .. ..$ origin : chr(0)
# .. ..- attr(*, "class")= chr "TextDocumentMeta"
# ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
# - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
Or just the content:
xx2[[1]][1]
#$content
#[1] "transactions Processed"
xx2[[2]][1]
#$content
#[1] "trnsaction It"
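For the second option in the question (keeping the period), one possible approach is to insert a space after any period that is glued to the next word and then replace the remaining punctuation with spaces. The sketch below rests on an assumption about what "keeping the period" should mean. The helper name keep_periods is hypothetical, and the tm step only runs if the package is installed:

```r
# Hypothetical helper: keep periods, replace all other punctuation with spaces
keep_periods <- function(x) {
  # Put a space after a period that is directly followed by a non-space character
  x <- gsub("\\.(?=\\S)", ". ", x, perl = TRUE)
  # Replace every punctuation character except the period with a space
  x <- gsub("(?!\\.)[[:punct:]]", " ", x, perl = TRUE)
  # Collapse repeated whitespace and trim the ends
  trimws(gsub("\\s+", " ", x))
}

x <- c("transactions.Processed", "trnsaction.It")
kept <- keep_periods(x)
print(kept)

# The same transformation applied to a corpus, if tm is available
if (requireNamespace("tm", quietly = TRUE)) {
  xx <- tm::VCorpus(tm::VectorSource(x))
  xx3 <- tm::tm_map(xx, tm::content_transformer(keep_periods))
  print(xx3[[1]]$content)
}
```

Be aware that this also splits decimal numbers such as 3.14 into "3. 14", so it may need adjusting for text that contains numbers or abbreviations.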