How to remove noise characters when exporting the elements of a list to separate files using R

I'm having trouble exporting the elements of a list that I created after parsing some text files in R.

This is my original file: https://my.pcloud.com/publink/show?code=XZDUm3ZjzHjKnBTdF8Bw4osq4eIFXDz0JF7

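What I then do is parse it, keep everything contained in <BODY> </BODY>, clean out the noise (strip symbols, lowercase the text, etc.) and put the result into a list (my original file consists of several different texts, which I need to split apart).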

I then export the elements of the list to separate text files, but along with my text I get some useless characters that I don't see in the R console.

Here is my code:

library(stats)
library(dplyr)
library(proxy)
library(stringr)
library(data.table)
library(proto) ## needed for next library
library(gsubfn) #read multiple times <BODY>
setwd("input_data")

# parse my input file
doc <- lapply( list.files(), readLines )

# parse files and keep text needed
docNew <- strapply(doc, "<BODY>(.*?)</BODY>", simplify = c)


# clean the text: strip punctuation, lowercase, collapse runs of whitespace
doc1 <- lapply(docNew, function(x) {
  text <- gsub("[[:punct:]]", "", x) %>% tolower()
  text <- gsub("\\s+", " ", text) %>% str_trim()
  return(text)
})

for (i in 1:5) {
  write.csv( doc1[[i]], file = paste0("output/",i, ".txt"))
}

The thing is, when I call doc1[[1]] in the console, I get

>     [1] "showers continue throughout the week in the bahia cocoa zone alleviating the rought since early january an improving prospects for
> the coming temporao although normal humiity levels have not been
> restore comissaria smith sai in its weekly review the ry perio means
> the temporao will be late this year arrivals for the week ene february
> 22 were 155221 bags of 60 kilos making a cumulative total for the
> season of 593 mln against 581 at the same stage last year again it
> seems that cocoa elivere earlier on consignment was inclue in the
> arrivals figures comissaria smith sai there is still some oubt as to
> how much ol crop cocoa is still available as harvesting has
> practically come to an en with total bahia crop estimates aroun 64 mln
> bags an sales staning at almost 62 mln there are a few hunre thousan
> bags still in the hans of farmers milemen exporters an processors
> there are oubts as to how much of this cocoa woul be fit for export as
> shippers are now experiencing ificulties in obtaining ... <truncated>

When I open the 1.txt file I created, it contains text like this:

"","x" "1","showers continue throughout the week in the bahia cocoa zone alleviating the rought since early january an improving prospects for the coming temporao although normal humiity levels have not been restore comissaria smith sai in its weekly review the ry perio means the temporao will be late this year arrivals for the week ene february 22 were 155221 bags of 60 kilos making a cumulative total for the season of 593 mln against 581 at the same stage last year again it seems that cocoa elivere earlier on consignment was inclue in the arrivals figures comissaria smith sai there is still some oubt as to how much ol crop cocoa is still available as harvesting has practically come to an en with total bahia crop estimates aroun 64 mln bags an sales staning at almost 62 mln there are a few hunre thousan bags still in the hans of farmers milemen exporters an processors there are oubts as to how much of this cocoa woul be fit for export as shippers are now experiencing ificulties in obtaining bahia superior certificates in view of the lower quality over recent weeks farmers have sol a goo part of their cocoa hel on consignment comissaria smith sai spot bean prices rose to 340 to 350 cruzaos per arroba of 15 kilos bean shippers were reluctant to offer nearby shipment an only limite sales were booke for march shipment at 1750 to 1780 lrs per tonne to ports to be name new crop sales were also light an all to open ports with junejuly going at 1850 an 1880 lrs an at 35 an 45 lrs uner new york july augsept at 1870 1875 an 1880 lrs per tonne fob routine sales of butter were mae marchapril sol at 4340 4345 an 4350 lrs aprilmay butter went at 227 times new york may junejuly at 4400 an 4415 lrs augsept at 4351 to 4450 lrs an at 227 an 228 times new york sept an octec at 4480 lrs an 227 times new york ec comissaria smith sai estinations were the us covertible currency areas uruguay an open ports cake sales were registere at 785 to 995 lrs for marchapril 785 lrs for may 753 lrs 
for aug an 039 times new york ec for octec buyers were the us argentina uruguay an convertible currency areas liquor sales were limite with marchapril selling at 2325 an 2380 lrs junejuly at 2375 lrs an at 125 times new york july augsept at 2400 lrs an at 125 times new york sept an octec at 125 times new york ec comissaria smith sai total bahia sales are currently estimate at 613 mln bags against the 198687 crop an 106 mln bags against the 198788 crop final figures for the perio to february 28 are expecte to be publishe by the brazilian cocoa trae commission after carnival which ens miay on february 27 reuter 3"

How can I get just the plain text, without the "","x" header, the "1", and the quotes around the text?

I just need something like this:

showers continue throughout the week in the bahia cocoa zone alleviating the rought...

I tried

for (i in 1:5) {
  write.csv( cat(doc1[[i]]), file = paste0("output/",i, ".txt"))
}

but that only prints

""

in my exported file (it seems to work in the R console).

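If you want to write strings to a file, the best option is cat. It doesn't need any other function to work. Although you can use capture.output or sink to write more complex output, the following seems sufficient for your needs.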

for (i in seq_along(doc1)) {
  cat(doc1[[i]], file = sprintf("output/file_%s.txt", i))
}
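
To see where the extra characters in the question come from, here is a minimal sketch. The sample string, the temporary directory, and the file names are made up for illustration and are not the asker's data. It shows that write.csv coerces the string to a data frame (hence the header and row name), that cat returns NULL (hence the near-empty file from write.csv(cat(...))), and that writeLines is another way to write the bare text.

```r
# Hypothetical stand-in for one element of doc1 (not the asker's real data)
txt <- "showers continue throughout the week in the bahia cocoa zone"

# Assumed scratch directory so the sketch does not touch "output/"
out_dir <- file.path(tempdir(), "output_demo")
dir.create(out_dir, showWarnings = FALSE)

# write.csv() turns its input into a data frame before writing, so the
# file gains a header line and a quoted row name in front of the text.
csv_file <- file.path(out_dir, "as_csv.txt")
write.csv(txt, file = csv_file)
csv_lines <- readLines(csv_file)

# cat() prints as a side effect and returns NULL invisibly, so
# write.csv(cat(txt), ...) is handed NULL instead of the text.
cat_value <- cat(txt, "\n")

# writeLines() writes each element of a character vector on its own
# line, with no quoting and no row names.
txt_file <- file.path(out_dir, "as_text.txt")
writeLines(txt, txt_file)
txt_lines <- readLines(txt_file)
```

One difference worth knowing: cat does not add a trailing newline unless you ask for one, while writeLines ends every element with a newline. If an element of doc1 holds several strings, cat joins them with spaces by default and writeLines puts each on its own line.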