R text mining - remove special characters and quotes
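I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems:
False full stops like "..." and periods in titles like "Mr." have to be dealt with.
There are definitely quotes in the text body data, and there will be "..." in them. I want to extract those quotes from the main body and save them in a vector. (There is some manipulation to be done with them too.)
IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it in R. When I view the text, quotes are just " and not \", contrary to the reproducible text.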
library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste0(path, "Text.docx"))
title <- a$doc_id
text <- a$text
Reproducible text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is that it also splits on the false full stops.
Solutions I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it does not replace [Miss. with [Miss
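To identify quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
But that's not working either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is: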
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I want the sentences separated (so I can count how many sentences there are in each paragraph).
And the quotes separated as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
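You may match all the current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word characters and \. matches a dot. Matches that are not among the names of the list (for example "children.") are left unchanged.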
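If installing gsubfn is not an option, a similar effect can be had in base R. The following is a minimal sketch rather than part of the approach above: the helper name strip_title_dots is hypothetical, and it assumes the titles contain no regex metacharacters.

```r
# Titles whose trailing dot should be dropped (same values as vec.rep).
titles <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# Hypothetical helper: remove the dot that directly follows a known title.
strip_title_dots <- function(x, titles) {
  # \b anchors at a word boundary, so "[Miss." is handled like "Miss.".
  pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")
  # \1 keeps the title itself and discards the dot.
  gsub(pattern, "\\1", x, perl = TRUE)
}

cleaned <- strip_title_dots("[Miss. Keyboard and Mr. K. are o.k.", titles)
print(cleaned)
```

Dots after anything that is not a listed title, such as "K." or "o.k.", are left alone, so genuine sentence ends survive.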
Next, if you only want to extract quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
The " matches a " and [^"]* matches 0 or more characters other than ".
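A small sketch of the quote extraction on a made-up string (sample_text is an illustrative value, not the asker's data). It also shows that \" is only how R prints a double quote inside a string, and why a pattern built on \S+ misses quotes that contain spaces.

```r
sample_text <- "some text \"quote\" some other text \"a longer quote\""

# The backslash belongs to the printed representation, not to the data.
quote_width <- nchar("\"")
has_backslash <- grepl("\\", sample_text, fixed = TRUE)

# [^"]* may span spaces, so multi-word quotes are found.
quotes <- regmatches(sample_text, gregexpr('"[^"]*"', sample_text))[[1]]

# \S+ cannot cross whitespace, so only the one-word quote matches.
one_word <- regmatches(sample_text,
                       gregexpr('"\\S+"', sample_text, perl = TRUE))[[1]]

# Drop the surrounding quote marks if only the quoted text is needed.
inner <- gsub('^"|"$', "", quotes)
print(inner)
```

Using cat(sample_text) instead of print(sample_text) displays the string without the escaping backslashes.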
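If you plan to match your sentences together with the quotes, you might consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
Details
\s* - 0+ whitespace characters
"[^"]*" - a ", then 0+ characters other than ", then a "
| - or
[^"?!.]+ - 1+ characters other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . characters
[^"[:alnum:]]* - 0+ characters other than alphanumerics and "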
R sample code:
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
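Since the stated goal is to count sentences per paragraph, the same pattern can be applied to a character vector holding one paragraph per element. This is a minimal sketch on made-up paragraphs: count_sentences is a hypothetical helper name, and the counts are only as good as the regex (for example, text that leads into a quote is matched as a separate piece).

```r
sentence_pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

# Hypothetical helper: number of regex matches (sentences or quotes)
# found in each element of a character vector.
count_sentences <- function(paragraphs) {
  x <- trimws(paragraphs)
  # Match and extract on the same trimmed vector so positions line up.
  lengths(regmatches(x, gregexpr(sentence_pattern, x)))
}

paragraphs <- c("One. Two! Three?",
                "\"A quote. With two parts.\" After it.")
counts <- count_sentences(paragraphs)
print(counts)
```

A quoted passage counts as a single item here even when it contains several full stops, which matches the expected vector in the question.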