Python: lxml xpath to extract content
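The code below extracts the P/E from the Reuters link below. However, my approach is not robust: the page for another stock has two fewer rows, so the data shifts to a different position. How can I get around this problem? I would like to point directly at the P/E part of the page to extract the data, but I don't know how.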
link 1: http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL
link 2: http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL
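import requests
from lxml import html

page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')
treea = html.fromstring(page2.content)
# Collects the text of every <td> that has a class attribute
tree4 = treea.xpath('//td[@class]/text()')
# Fragile: index 37 shifts when the page has more or fewer rows
PE = tree4[37]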
This is the part I want the code to extract directly, so that changes elsewhere on the page do not affect the result.
<tr class="stripe">
<td>P/E Ratio (TTM)</td>
<td class="data">36.79</td>
<td class="data">25.99</td>
<td class="data">21.70</td>
</tr>
Find the first td by its text, then extract its sibling tds:
treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')
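To see why this does not depend on row position, here is a small offline sketch that runs the same expression against the row from the question. The `<table>` wrapper and the extra "Some other row" line are made up for illustration; only the P/E row comes from the question.

```python
from lxml import html

# Sample markup: the P/E row is copied from the question; the wrapper
# table and the first row are invented to show that position is irrelevant.
SAMPLE = """
<table>
  <tr><td>Some other row</td><td class="data">1.00</td></tr>
  <tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">36.79</td>
    <td class="data">25.99</td>
    <td class="data">21.70</td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)
# Select the label cell by its text, then take the text of the cells after it
values = tree.xpath('//td[contains(., "P/E Ratio")]/following-sibling::td/text()')
print(values)  # ['36.79', '25.99', '21.70']
```

Adding or removing rows above the P/E row leaves the result unchanged, because the label cell is matched by its text rather than by an index.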
It works for both pages:
In [8]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')
In [9]: treea = html.fromstring(page2.content)
In [10]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')
In [11]: print(tree4)
['36.79', '25.99', '21.41']
In [12]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL')
In [13]: treea = html.fromstring(page2.content)
In [14]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')
In [15]: print(tree4)
['--', '25.49', '17.30']
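Note that the second page returns '--' where a value is missing. If you need numbers rather than strings, something like the following could work. `parse_ratio` is a hypothetical helper name, and stripping thousands separators is my assumption about how larger values might be formatted.

```python
def parse_ratio(text):
    """Convert a cell such as '36.79' to float; return None for '--' or blanks."""
    # Assumption: values may contain thousands separators such as '1,234.50'
    cleaned = text.strip().replace(',', '')
    if cleaned in ('', '--'):
        return None
    return float(cleaned)


ratios = [parse_ratio(v) for v in ['--', '25.49', '17.30']]
print(ratios)  # [None, 25.49, 17.3]
```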