lxml XPath unable to find HTML elements
I am trying to use lxml to parse the web page below, but there seems to be something wrong with my XPath. I am not sure what I am doing wrong.
import requests
from lxml import html

web_content = requests.get(r"https://www.quandl.com/data/TSE").content
dataset_count = html.fromstring(web_content)
print(dataset_count.xpath(r'//*[@id="ember667"]/div[2]/main/section/section/section[2]/div[3]/div[2]/span[2]'))
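I am trying to get it to return the dataset count, 3908, but this XPath does not seem to work for me. Any ideas?
Also, I am hoping that if I pass another Quandl link to requests, I can use the same XPath to extract the dataset count. Is that possible?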
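The response that requests receives does not contain the number 3908, because that number is loaded dynamically through an additional request.
One way to solve this is to use a real browser and drive it with selenium. Here is a working example that uses the PhantomJS headless browser: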
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("https://www.quandl.com/data/TSE")
wait = WebDriverWait(driver, 10)
elm = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".database-statistics .column:nth-child(2) span:nth-child(2)")))
print(elm.text)
driver.close()
This prints 3,908.
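The empty result from the original lxml attempt can be reproduced offline. Below is a minimal sketch, assuming a made-up page (SAMPLE_PAGE is hypothetical, not Quandl's real markup) whose count is filled in by a script. lxml only parses the markup it is given and never runs the script, so the element stays empty and a browser-generated id such as "ember667" is not there at all.

```python
from lxml import html

# Hypothetical page: the count is written by JavaScript after load,
# similar in spirit to a page that fetches the number with an extra request.
SAMPLE_PAGE = """
<html><body>
  <span id="count"></span>
  <script>document.getElementById("count").textContent = "3,908";</script>
</body></html>
"""

doc = html.fromstring(SAMPLE_PAGE)

# lxml does not execute the <script>, so the span has no text.
static_text = doc.xpath('string(//span[@id="count"])')
print(repr(static_text))

# An XPath copied from the browser for an element that only exists
# after rendering matches nothing in the static markup.
print(doc.xpath('//*[@id="ember667"]'))
```

A quick check like this, run against the text that requests actually returned, shows whether an XPath has any chance of matching before a browser is brought in.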
It seems the dataset count is also available in a <noscript> element:
<div class='centered' id='main' role='main'>
<div id='content'>
<noscript>
<table>
<tbody>
<tr>
<td>Database Name</td>
<td>Tokyo Stock Exchange</td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td>Datasets</td>
<td>3908</td>
</tr>
<tr>
<td>Downloads</td>
<td>4067259</td>
</tr>
<tr>
...
So you could use something like this to get it:
>>> import requests
>>> import lxml.html
>>> r = requests.get('https://www.quandl.com/data/TSE')
>>> h = lxml.html.fromstring(r.text)
>>> h
<Element html at 0x7ffb5f6ed0a8>
>>> h.xpath('//noscript')
[<Element noscript at 0x7ffb5c16ac58>, <Element noscript at 0x7ffb5c16ac00>]
>>> h.xpath('string(//noscript//tr[td[1]="Datasets"]/td[2])')
'3908'
>>> h.xpath('string(//div[@id="content"]//noscript//tr[td[1]="Datasets"]/td[2])')
'3908'
>>> h.xpath('number(//div[@id="content"]//noscript//tr[td[1]="Datasets"]/td[2])')
3908.0
XPath explanation, as requested by the OP:
//div[@id="content"] <-- look for a <div> element with "id" attribute equal to "content"
//noscript <-- look for a <noscript> descendant
//tr[ <-- look for a <tr> descendant...
td[1]="Datasets" <-- ... whose 1st <td> child has the string value "Datasets"...
(this is true if the <td> contains only 1 text node "Datasets")
]
/td[2] <-- select the 2nd <td> of previous matching <tr> rows
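The expression explained above can be wrapped in a small helper and tried without a network call. This is only a sketch under assumptions: SAMPLE_PAGE is a hypothetical stand-in reduced to the <noscript> table from the answer, noscript_stat is a name made up for illustration, and lxml is assumed to parse the <noscript> content as ordinary elements, as the interactive session suggests.

```python
import lxml.html

# Hypothetical stand-in for the page, reduced to the <noscript> table.
SAMPLE_PAGE = """
<html><body>
<div class='centered' id='main' role='main'>
<div id='content'>
<noscript>
<table>
<tbody>
<tr><td>Database Name</td><td>Tokyo Stock Exchange</td></tr>
<tr><td>Datasets</td><td>3908</td></tr>
<tr><td>Downloads</td><td>4067259</td></tr>
</tbody>
</table>
</noscript>
</div>
</div>
</body></html>
"""


def noscript_stat(page_text, label):
    """Return the integer next to `label` in the <noscript> table, or None."""
    doc = lxml.html.fromstring(page_text)
    # $label is an XPath variable, so the label is not spliced into the string.
    value = doc.xpath(
        'string(//div[@id="content"]//noscript//tr[td[1]=$label]/td[2])',
        label=label,
    )
    value = value.strip().replace(",", "")
    # string() yields '' when no row matches, which is reported as None.
    return int(value) if value.isdigit() else None


print(noscript_stat(SAMPLE_PAGE, "Datasets"))
print(noscript_stat(SAMPLE_PAGE, "Downloads"))
print(noscript_stat(SAMPLE_PAGE, "Missing"))
```

Whether the same expression works for another Quandl link depends on that page embedding the same <noscript> table, which would need to be checked page by page. Passing r.text from requests.get to the helper is the intended use.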