Text Mining in R: Creating a Corpus creates unusual text

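I am reading in individual text files with the code below. They read in fine, but a \t shows up at seemingly random places throughout the corpus.

Example: in the original text file: 5. If you are responding as an individual, ...... In the corpus: "5.\tIf you are responding as an individual,...

Or: Q1. What lessons can we learn from elsewhere...... becomes "Q1.\tWhat lessons can we learn from elsewhere.....

It looks like the tab characters are being translated into \t in the corpus.

Is there a way to fix this?

Thanks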

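# set path to the folder of text files
# (forward slashes: single backslashes in an R string are read as escape sequences and cause an error)
folder <- "C:/xxxxxx/Text files"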
folder
# lists all files in pathway 
list.files(path=folder)
# keep text files only (pattern is a regular expression, not a shell glob)
list.files(path = folder, pattern = "\\.txt$")

# store the file names in a vector
filelist <- list.files(path = folder, pattern = "\\.txt$")

# build the path to each file (the default sep leaves spaces around the "/")
paste(folder, "/", filelist)
# remove those spaces by setting sep to the empty string
filelist <- paste(folder, "/", filelist, sep = "")
filelist

# read in each text file - warnings are OK
a <- lapply(filelist, FUN = readLines)
# collapse each file's lines into a single element
corpus <- lapply(a, FUN = paste, collapse = " ")

gsub() is a handy function that replaces every instance of a pattern with a different string. For your case, this should help:

# read in each text file - warnings are OK
a <- lapply(filelist, FUN = readLines)
# collapse each file's lines into a single element
corpus <- lapply(a, FUN = paste, collapse = " ")
# replace every tab ('\t') with a single space so that "5.\tIf" becomes "5. If"
# (gsub returns a character vector with one string per file)
corpus <- gsub(pattern = '\\t', replacement = ' ', corpus)
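
It may also help to see that the \t is not extra text added by R. It is how R prints a tab character that is already in the file. The following is a minimal sketch using a made-up sample string (x and cleaned are hypothetical names, not part of the code above):

```r
# A made-up sample line containing a real tab character (hypothetical data)
x <- "5.\tIf you are responding as an individual"

print(x)        # shows the tab as the escape sequence \t inside the quoted string
writeLines(x)   # writes the actual tab character to the console
nchar("\t")     # the tab counts as a single character

# Replace each tab with a single space; fixed = TRUE matches the literal tab
# character instead of interpreting the pattern as a regular expression
cleaned <- gsub("\t", " ", x, fixed = TRUE)
print(cleaned)
```

If the files mix tabs with runs of spaces, a regular expression such as "[[:space:]]+" with replacement " " would collapse all of that whitespace in one pass, though whether that is wanted depends on the analysis.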