Web scraping with lxml.html and XPath
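I'm trying to extract information from a website, but so far I only get a small part of it. I'm having trouble writing an XPath that returns more than just the first row of the table. I copied the XPaths from Chrome DevTools. How can I make them more general so that I get every row? Or does anyone know a smarter way to do this? My goal is to end up with a JSON file.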
import requests
import lxml.html
html = requests.get('http://volcano.oregonstate.edu/volcano_table')
doc = lxml.html.fromstring(html.content)
volcanoes = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[1]/a/text()')
country = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[2]/text()')
latitude = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[4]/text()')
longitude = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[5]/text()')
elevation = doc.xpath('//*[@id="content"]/div/div[2]/table/tbody/tr[1]/td[6]/text()')
output = []
for info in zip(volcanoes, country, latitude, longitude, elevation):
    resp = {}
    resp['volcanoes'] = info[0]
    resp['country'] = info[1]
    resp['latitude'] = info[2]
    resp['longitude'] = info[3]
    resp['elevation'] = info[4]
    output.append(resp)
print(output)
This is what the code currently returns:
[{'volcanoes': 'Abu', 'country': '\n Japan ', 'latitude': '\n 34.50 ', 'longitude': '\n 131.60 ', 'elevation': '\n 641 '}]
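A likely cause, sketched on a small inline table rather than the live page: the `tr[1]` step in each path selects only the first row, so every list holds a single item and `zip` yields a single record. Dropping the index lets the same step match every row. The `SAMPLE` markup below is a made-up stand-in for the real table, not a copy of it.

```python
import lxml.html

# Hypothetical two-row table standing in for the real page (an assumption).
SAMPLE = """
<html><body>
<table>
  <tbody>
    <tr><td><a href="#">Abu</a></td><td> Japan </td></tr>
    <tr><td><a href="#">Acamarachi</a></td><td> Chile </td></tr>
  </tbody>
</table>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)

# tr[1] pins the path to the first row only.
first_only = doc.xpath('//table/tbody/tr[1]/td[1]/a/text()')

# Without the index, the tr step matches every row.
all_rows = doc.xpath('//table/tbody/tr/td[1]/a/text()')

print(first_only)  # ['Abu']
print(all_rows)    # ['Abu', 'Acamarachi']
```

The same change applies to the other columns: keep `td[2]`, `td[4]` and so on, and remove only the row index.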
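The XPaths you defined are error-prone: they depend on the exact page layout, and the tr[1] step restricts them to the first row. I tried to improve them. The following should give you what you need.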
import json
import requests
from lxml.html import fromstring

res = requests.get('http://volcano.oregonstate.edu/volcano_table')
root = fromstring(res.text)

data = []
# Loop over every row of the table instead of a single hard-coded tr[1].
for item in root.xpath("//*[starts-with(@class,'views-table')]//tbody//tr"):
    # All text nodes of the row's cells; the indexes below are positions in this list.
    texts = item.xpath('.//td/text()')
    d = {}
    d['volcan'] = item.xpath('.//td/a/text()')[0].strip()
    d['country'] = texts[2].strip()
    d['lat'] = texts[4].strip()
    d['longitude'] = texts[5].strip()
    d['elevation'] = texts[6].strip()
    data.append(d)

print(json.dumps(data, indent=4))
The output looks like this:
[
    {
        "volcan": "Abu",
        "country": "Japan",
        "lat": "34.50",
        "longitude": "131.60",
        "elevation": "641"
    },
    {
        "volcan": "Acamarachi",
        "country": "Chile",
        "lat": "-23.30",
        "longitude": "-67.62",
        "elevation": "6046"
    },
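Indexing into the flat list returned by `.//td/text()` depends on how many whitespace text nodes each cell happens to contain. One alternative, offered as a sketch rather than a tested replacement for the code above, is to address each cell by its column position with `td[n]` and read it with `text_content()`. The column numbers follow the question's XPaths (1, 2, 4, 5, 6); the `SAMPLE` markup, its class name, the `COLUMNS` mapping, and the `parse_rows` helper are assumptions made for illustration.

```python
import json
import lxml.html

# Hypothetical markup imitating the table's column layout (an assumption):
# name, country, type, latitude, longitude, elevation.
SAMPLE = """
<html><body>
<table class="views-table cols-6">
  <tbody>
    <tr>
      <td><a href="#">Abu</a></td><td> Japan </td><td> (type) </td>
      <td> 34.50 </td><td> 131.60 </td><td> 641 </td>
    </tr>
    <tr>
      <td><a href="#">Acamarachi</a></td><td> Chile </td><td> (type) </td>
      <td> -23.30 </td><td> -67.62 </td><td> 6046 </td>
    </tr>
  </tbody>
</table>
</body></html>
"""

# Field name -> 1-based column position, taken from the question's XPaths.
COLUMNS = {"volcano": 1, "country": 2, "latitude": 4, "longitude": 5, "elevation": 6}


def parse_rows(root):
    """Return one dict per table row, keyed by the names in COLUMNS."""
    rows = []
    for tr in root.xpath("//table[starts-with(@class, 'views-table')]//tbody/tr"):
        record = {}
        for field, position in COLUMNS.items():
            cells = tr.xpath(f"./td[{position}]")
            # text_content() joins all text inside the cell, including link text.
            record[field] = cells[0].text_content().strip() if cells else None
        rows.append(record)
    return rows


root = lxml.html.fromstring(SAMPLE)
data = parse_rows(root)
print(json.dumps(data, indent=4))
```

To get the JSON file instead of printed text, the same `data` list can be passed to `json.dump(data, fh, indent=4)` with `fh` opened for writing. For the live page, `root` would come from `fromstring(res.text)` as in the code above.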