Extract 2 arguments from web page

Question

I want to extract 2 arguments (the title and href attributes) from the <a> tags of a Wikipedia page.

I want output like this, for example (https://en.wikipedia.org/wiki/Riddley_Walker):

Canterbury Cathedral  
/wiki/Canterbury_Cathedral

Code:

import lxml.html
import urllib.request

def extractplaces(hlink):
    connection = urllib.request.urlopen(hlink)
    places = {}

    dom = lxml.html.fromstring(connection.read())

    # this XPath selects only the title attribute of every <a> tag
    for name in dom.xpath('//a/@title'):
        print(name)

In this case I only get @title.

Answer 1

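You should select the a elements that have a title attribute (rather than selecting the title attribute directly), then use .attrib on each element to read whichever attributes you need. Example: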

for name in dom.xpath('//a[@title]'):
    print('title :',name.attrib['title'])
    print('href :',name.attrib['href'])
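
One caveat with the loop above: an a element that has a title but no href makes name.attrib['href'] raise KeyError. A minimal sketch that avoids this, assuming you only want links carrying both attributes, is to put both conditions in the XPath predicate. The function name extract_links and the inline sample HTML are made up for illustration; with a real page you would pass in the bytes read from the connection instead.

```python
import lxml.html

def extract_links(html_text):
    """Return (title, href) pairs for every <a> that has both attributes."""
    dom = lxml.html.fromstring(html_text)
    pairs = []
    # The predicate keeps only <a> elements carrying both title and href,
    # so neither lookup below can come back empty.
    for a in dom.xpath('//a[@title and @href]'):
        pairs.append((a.get('title'), a.get('href')))
    return pairs

# Hypothetical sample markup standing in for the downloaded page.
sample = '''
<html><body>
<p><a href="/wiki/Canterbury_Cathedral" title="Canterbury Cathedral">Canterbury</a>
<a title="No link here">x</a>
<a href="/wiki/Kent">Kent</a></p>
</body></html>
'''

for title, href in extract_links(sample):
    print(title)
    print(href)
```

If you would rather keep links that lack an href, a.get('href') returns None for them instead of raising, so you can keep the original '//a[@title]' expression and test for None yourself.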

