Removing Special Characters in a Text File in R

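I am working with a text file in R, using the readLines function and regular expressions to extract words from it. The file wraps words in special characters to mark formatting (for example, # signs before and after a word to show it is bold, or @ signs before and after a word to show it should be italicized), and this is messing up my regular expressions.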

Here is my R code so far. It removes all empty lines and then combines the text file into a single character vector:

    book <- readLines("/Users/Desktop/SAMPLE .txt", encoding = "UTF-8")
    # remove all empty (or whitespace-only) lines
    empty_lines <- grepl("^\\s*$", book)
    book <- book[!empty_lines]
    # combine book into one variable
    xBook <- paste(book, collapse = "")
    # collapse runs of whitespace so the entire book is a single clean string
    updated <- trimws(gsub("\\s+", " ", xBook))

When I print updated, I see the whole file stored in the variable, but the special characters are still there:

    > updated
    [1] "It is a truth universally acknowledged, that a #single# man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a @man@ may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, @that@ he is considered the rightful property of some one or other of #their# daughters."

How can I remove every leading or trailing # or @ from the words in my updated variable?

The output I want is just plain text, with nothing marking which words should be bold or italic:

    > updated
    [1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."

    gsub("[@#]([a-zA-Z]+)[@#]", "\\1", x)
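To try the substitution without the original file, here is a minimal sketch on a made-up sample string (the names sample_text, plain_paired, and plain_all are hypothetical, not from the question). It compares two options: unwrapping only words enclosed in a matching pair of markers, and simply deleting every # and @. The first variant uses a backreference in the pattern, so it assumes perl = TRUE is acceptable.

```r
# Made-up sample with the same markup as in the question, plus a "#42"
# that is not formatting markup.
sample_text <- "issue #42: a #single# man, such a @man@"

# Option 1: unwrap words enclosed in a matching pair of markers.
# In the pattern, \\1 must match the same marker that opened the word;
# in the replacement, \\2 is the word itself.
plain_paired <- gsub("([@#])([[:alpha:]]+)\\1", "\\2", sample_text, perl = TRUE)

# Option 2: delete every # and @, wherever it appears.
plain_all <- gsub("[@#]", "", sample_text)

print(plain_paired)
print(plain_all)
```

Option 1 leaves a stray marker such as the one in "#42" alone, which may matter if the book contains # or @ outside of formatting; option 2 is simpler if those characters never carry other meaning.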