How to remove words that start with digits from tokens?
How do I remove words that start with digits from tokens in quanteda? Example words: 21st, 80s, 8th, 5k, but they could be completely different and I do not know them in advance.
I have a data frame containing English sentences. I convert it to a corpus using quanteda. Next, I convert the corpus to tokens and do some cleaning with remove_punct, remove_symbols, remove_numbers, and so on. However, remove_numbers does not remove words that start with digits. I would like to remove these words, but I do not know their exact form in advance — they could be, for example, 21st, 22nd, and so on.
library("quanteda")
data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)
corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))
Problems like this come down to finding a pattern. Here is a solution using gsub:
text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
         "You are welcome to redistribute it under 80s certain conditions.",
         "Type 'license()' or 21st 'licence()' for distribution details.",
         "R is a collaborative 6th project with many contributors.",
         "Type 'contributors()' for more information and",
         "'citation()' on how to cite R or R packages in publications.")
text1 <- gsub("[0-9]+[a-z]{2}", "", text)
#
# [1] "R is free software and 2k comes with ABSOLUTELY NO WARRANTY." "You are welcome to redistribute it under 80s certain conditions."
# [3] "Type 'license()' or 'licence()' for distribution details." "R is a collaborative project with many contributors."
# [5] "Type 'contributors()' for more information and" "'citation()' on how to cite R or R packages in publications."
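Note that `[a-z]{2}` matches exactly two letters after the digits, which is why "2k" and "80s" survive in the output above. A minimal sketch of a broader pattern (my own variant, not part of the original answer), assuming any run of letters after the digits should go:

```r
text <- c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
          "You are welcome to redistribute it under 80s certain conditions.")

# \b...\b anchors the match to whole words; [a-z]+ allows any number of
# trailing letters, so 2k, 80s, 21st, 6th are all removed
text2 <- gsub("\\b[0-9]+[a-z]+\\b", "", text)

# squeeze the leftover double spaces afterwards
text2 <- gsub("\\s+", " ", text2)
```

This operates on the raw text, so it must run before tokenization.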
For details, see the following question and cheat sheet:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
You simply have to remove them explicitly, since they are not handled by remove_numbers = TRUE. Just use a simple regular expression that looks for some digits before the characters. In the example below I look for sequences of one to five digits (i.e. (?<=\d{1,5})). You can adjust those two numbers to fine-tune your regex.
Here is an example using only quanteda, with an explicit tokens_remove() step added.
library("quanteda")
#> Package version: 2.0.0
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)
corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
toks = tokens_remove(toks, pattern = "(?<=\\d{1,5})\\w+", valuetype = "regex")
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))
Created on 2020-05-03 by the reprex package (v0.3.0)
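If the lookbehind feels fragile, a simpler sketch (my own variant, not from the answer above) is to match any token whose first character is a digit; `tokens_remove()` accepts such a pattern when `valuetype = "regex"`:

```r
library("quanteda")

toks <- tokens("R is free software and 2k comes with 21st century tools",
               remove_punct = TRUE)
# "^[0-9]" matches any token that begins with a digit (2k, 80s, 21st, ...),
# regardless of what follows the digits
toks <- tokens_remove(toks, pattern = "^[0-9]", valuetype = "regex")
as.character(toks)
```

Since remove_numbers = TRUE has already dropped purely numeric tokens in the pipeline above, this pattern only affects the mixed digit-letter words.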