Scrapy replace() or strip() br/ tags from data

Question

I am trying to make my scraped text data look cleaner by either removing the <br> tag or replacing it with an actual line break in the CSV:

<div>
  "This is an example."
   <br>
   "This is an example too."
</div>

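When I grab the text with XPath and apply strip(), i.e. response.xpath('//div//text()').extract().strip() (I am using an ItemLoader, so the actual code looks a little different, but it is essentially the same), the output looks like this: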

['This is an example text.',
'',
'This is an example too.'],

#data in csv file:
"This is an example text.,This is an example too."

Now I want to remove the <br> tag, or the comma altogether, so that the result looks like this: "This is an example text. This is an example too"

Alternatively, I would like to replace it with an actual line break:

"This is an example text. 
This is an example too."

I have already tried several .strip() calls, e.g. .strip(u'\u0027') to remove the quotation marks or .strip(u'\u00A0') to remove the whitespace, but nothing worked.

Can I actually do this with Scrapy? If so, any ideas? If not, do I have to do it afterwards with pandas?

Answer 1

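Use replace() to turn the comma at the end of a line into a plain line break. Note that extract() returns a list, which has no strip() or replace() of its own, so apply them to each string:

result = [t.strip().replace(",\n", "\n") for t in response.xpath('//div//text()').extract()]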
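The commas in the CSV most likely come from the field holding a list of text fragments rather than one string, so it may be simpler to join the fragments yourself. Below is a minimal sketch of that idea in plain Python; the helper name clean_fragments and the sample list are made up for illustration and are not part of Scrapy.

```python
def clean_fragments(fragments, sep=" "):
    """Strip each text fragment, drop the empty ones, and join the rest."""
    cleaned = (fragment.strip() for fragment in fragments)
    return sep.join(fragment for fragment in cleaned if fragment)


# Made-up sample of what //div//text() might return around a <br>.
sample = ["\n  This is an example.\n   ", "\n   ", "This is an example too.\n"]

single_line = clean_fragments(sample)          # fragments joined with a space
two_lines = clean_fragments(sample, sep="\n")  # fragments joined with a line break
```

In a spider this would be called as clean_fragments(response.xpath('//div//text()').getall(), sep="\n"). As far as I know the CSV exporter quotes fields that contain line breaks, so the second variant should still end up in a single cell.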

Answer 2

Try joining the extracted strings (the join has to wrap the extracted list, not the XPath expression):

''.join(response.xpath('//div//text()').extract())
