XPath in Scrapy: result not as expected
I'm trying to get the value of a preceding tag. Here is what I'm doing.
Structure of the HTML page:
...
<tr class="destaque no-hover">
<td class="periodo" colspan="6">2020.1</td>
</tr>
<tr class="linhaPar">
<td>Text1</td>
<td align="center">01</td>
<td align="right">312h</td>
<td align="center">3T12</td>
</tr>
<tr class="linhaImpar">
<td>Text2</td>
<td align="center">01</td>
<td align="right">12h</td>
<td align="center">5M12</td>
</tr>
...
<tr class="destaque no-hover">
<td class="periodo" colspan="6">2016.1</td>
</tr>
<tr class="linhaPar">
<td>Text7</td>
<td align="center">01</td>
<td align="right">2h</td>
<td align="center">2N12</td>
</tr>
<tr class="linhaImpar">
<td>Text8</td>
<td align="center">01</td>
<td align="right">32h</td>
<td align="center">4T12</td>
</tr>
...
<tr class="destaque no-hover">
<td class="periodo" colspan="6">2014.2</td>
</tr>
<tr class="linhaPar">
<td>TextN-1</td>
<td align="center">01</td>
<td align="right">2h</td>
<td align="center">2N12</td>
</tr>
<tr class="linhaImpar">
<td>TextN</td>
<td align="center">01</td>
<td align="right">32h</td>
<td align="center">4T12</td>
</tr>
So I'm trying to get the information from each tr whose class is linhaPar or linhaImpar:
for i in response.xpath('//tr[@class="linhaPar" or @class="linhaImpar"]'):
    _aux = i.xpath('./td[1]')
But I also need the td[@class="periodo"] that each of those rows belongs to, and this is where I'm stuck with the XPath.
# I tried this, but it returns a list of every matching element, not only the closest one
_p = _aux.xpath('./preceding::tr[td[@class="periodo"]]')
# I also tried this, but it doesn't work
_p = _aux.xpath('./preceding::tr[td[@class="periodo"] and position()=1]')
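Solved
I probably wasn't clear enough when I asked this question. The number of tr rows grouped under each periodo varies. Every approach I tried returned either a list of possible matches or nothing at all. To solve it, I followed the suggestion to handle the periodo rows inside the loop over the XPath results: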
_p = ""
for i in response.xpath('//tr[@class="linhaPar" or @class="linhaImpar" or @class="destaque no-hover"]'):
    # Check whether this row is a periodo header
    if 'destaque no-hover' == i.xpath('./@class').get():
        _p = i.xpath('./td/text()').get()
        continue  # Move on to the next row
    # Any other row is a data row, and _p holds its periodo
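To try the idea of this loop without a running spider, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The <table> wrapper, the trimmed sample rows, and the names HTML and rows_with_periodo are assumptions made for the example. In a real spider the same loop would run over response.xpath(...) as shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question; the <table> wrapper is assumed.
HTML = """<table>
<tr class="destaque no-hover"><td class="periodo" colspan="6">2020.1</td></tr>
<tr class="linhaPar"><td>Text1</td><td align="center">01</td></tr>
<tr class="linhaImpar"><td>Text2</td><td align="center">01</td></tr>
<tr class="destaque no-hover"><td class="periodo" colspan="6">2016.1</td></tr>
<tr class="linhaPar"><td>Text7</td><td align="center">01</td></tr>
<tr class="linhaImpar"><td>Text8</td><td align="center">01</td></tr>
</table>"""


def rows_with_periodo(markup):
    """Yield (periodo, first cell text) for every data row, in document order."""
    current = None  # last periodo header seen so far
    for tr in ET.fromstring(markup).iter("tr"):
        first_td = tr.find("td")
        if first_td is None:
            continue  # skip rows without cells
        if first_td.get("class") == "periodo":
            current = first_td.text  # header row: remember it and move on
            continue
        yield current, first_td.text  # data row: pair it with its header


result = list(rows_with_periodo(HTML))
print(result)
```

Because the rows are visited in document order, each data row is paired with the most recent periodo header, however many rows a group contains.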
This XPath matches the periodo rows along with the data rows:
'//tr[@class="linhaPar" or @class="linhaImpar" or td[@class="periodo"]]'
Assuming you want to store it in _p (one periodo per tr context node):
['2020.1'], ['2020.1'], ['2020.1'], ['2020.1']
use:
./preceding::td[@class="periodo"][1]
Assuming you want to store it in _p (one periodo per group of data):
['2020.1'], [], ['2020.2'], []
use:
./preceding-sibling::tr[1]/td[1][@class="periodo"]
If you need to remove the empty elements from the resulting list, apply filter to it afterwards.
For the second case, as @Gilles Quenot mentioned, you can also change the context node to:
//tr[@class="linhaPar" or @class="linhaImpar" or @class="destaque no-hover"]
and fill your lists with:
_aux = ./td[1][not(@class="periodo")]
_p = ./td[1][@class="periodo"]
or:
_aux = ./td[1][not(starts-with(text(),"2020."))]
_p = ./td[1][starts-with(text(),"2020.")]