Scrapy is ignoring part of the text

Question

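I'm trying to use Scrapy to scrape text from websites and build a dataset of text together with some of its features. For every element that contains text, I save the text itself, the element type, and a few other things. It works fine for the most part, but it doesn't pick up the part of the text that comes after a nested element.

Example input: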

<p>
  First part of text
  <b>
    Nested text
  </b>
  Second part of text
</p>

Output (just an example; the actual output is saved as CSV):

text: First part of text, element: p
text: Nested text, element: b

Expected output (just an example; the actual output is saved as CSV):

text: First part of text, element: p
text: Nested text, element: b
text: Second part of text, element: p

The part of my code responsible for scraping the text:

for element in response.xpath('//*[normalize-space(text())]'):
    ...
    text_normalized = element.xpath('normalize-space(./text())').get()
    ...

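How can I get the second part of the text? An element may contain more than one nested element, so the text itself can be split into more than two parts.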

Answer 1

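If you use // together with the text() node test, it returns every text node as a list, which you can then combine with the .join method or pick apart with list indexing or slicing.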

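# normalize-space() returns a single string, so select the text nodes and strip each one in Python
text_normalized = [t.strip() for t in element.xpath('.//text()').getall()]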

Implementation in the Scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste
doc='''
<p>
  First part of text
  <b>
    Nested text
  </b>
  Second part of text
</p>
'''

## -- End pasted text --

In [3]: sel = Selector(text=doc)

In [4]: sel.xpath('//p//text()').getall()
Out[4]: 
['\n  First part of text\n  ',
 '\n    Nested text\n  ',
 '\n  Second part of text\n']

In [5]: sel.xpath('//p//text()').get()
Out[5]: '\n  First part of text\n  '

In [6]: p_text = sel.xpath('//p//text()').getall()[0]

In [7]: p_text
Out[7]: '\n  First part of text\n  '

In [8]: p_text = sel.xpath('//p//text()').getall()[0].strip()

In [9]: p_text
Out[9]: 'First part of text'

In [10]: b_text = sel.xpath('//p//text()').getall()[1].strip()

In [11]: b_text
Out[11]: 'Nested text'

In [12]: p_text1 = sel.xpath('//p//text()').getall()[2].strip()

In [13]: p_text1
Out[13]: 'Second part of text'
