Removing Special Characters in a Text File in R
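I am working with a text file in R, using the readLines function and regular expressions to extract words from it. The file marks special meaning with characters around words (for example, # signs before and after a word to show it is bolded, or @ signs before and after a word to show it should be italicized), and this is breaking my regular expressions.
Here is my R code so far. It removes all empty lines and then combines the text file into a single character vector: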
book <- readLines("/Users/Desktop/SAMPLE .txt", encoding = "UTF-8")
# remove all empty lines (backslashes must be doubled inside R string literals)
empty_lines <- grepl("^\\s*$", book)
book <- book[!empty_lines]
# combine book into one variable
xBook <- paste(book, collapse = "")
# remove extra white spaces for a single text of the entire book
updated <- trimws(gsub("\\s+", " ", xBook))
When I print updated, I see that the whole file is stored in the variable, but the special characters are still there:
updated
[1] "It is a truth universally acknowledged, that a #single# man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a @man@ may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, @that@ he is considered the rightful property of some one or other of #their# daughters."
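How can I remove every leading or trailing # or @ from the words in my updated variable?
The output I want is just plain text, with no markers indicating which words should be bold or italic: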
updated
[1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)
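To see what that gsub call does, here is a minimal sketch on a shortened sample string. The names x, cleaned, y and cleaned_wide are made up for illustration, and the second pattern is an assumption of mine rather than part of the answer above: it widens the character class so that marked words containing apostrophes or hyphens are also handled.

```r
# Shortened sample in the style of the text from the question (illustrative only).
x <- "a #single# man, such a @man@, @that@ he, #their# daughters"

# Match a marker, capture the letters between the pair, match the closing
# marker, and replace the whole match with just the captured word.
# "\\1" is the regex backreference \1 written as an R string literal.
cleaned <- gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)
print(cleaned)

# Assumed variant: also accept apostrophes and hyphens inside a marked word.
y <- "#well-known# and @don't@"
cleaned_wide <- gsub("[@#]([[:alpha:]'-]+)[@#]", "\\1", y)
print(cleaned_wide)
```

Because the pattern requires a marker on both sides of a word, a stray # or @ elsewhere in the text (for example in "item #3") is left untouched. If the markers never appear for any other purpose, the simpler gsub("[@#]", "", x) would likely work as well.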