Problem Scraping Element & Child Text with lxml & etree
I'm trying to scrape a list from a Wikipedia page (e.g. https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Sk%C3%A1lholt) into a specific format. I'm having trouble matching each 'li' with its 'a href'.
For example, the ninth bullet on the page above has the text:
1238–1268: Sigvarður Þéttmarsson (Norweger)
with the HTML:
<li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson" title="Sigvarður Þéttmarsson">Sigvarður Þéttmarsson</a> (Norweger)</li>
I want to combine these into a dictionary entry:
'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'
[full text of the 'li' and its 'a' child]: [href of the 'a' child]
I know I can do this with lxml/etree, but I'm not entirely sure how. Some rearrangement of the below?
from lxml import etree

tree = etree.HTML(html)
bishops = tree.cssselect('li')           # all <li> elements
text = [li.text for li in bishops]       # only the text before the first child, not the full text
links = tree.cssselect('li a')
hrefs = [link.get('href') for link in links]
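For reference, a minimal sketch of how the whole dictionary could be built with lxml, assuming html already holds the page source (for example from requests or Selenium's driver.page_source); itertext() joins the text of the <li> and all of its children:

from lxml import etree

tree = etree.HTML(html)
bishops_with_links = {}
for li in tree.cssselect('li'):
    # itertext() walks the <li> and its children, so '1238–1268: ',
    # 'Sigvarður Þéttmarsson' and ' (Norweger)' are joined into one string
    full_text = ''.join(li.itertext()).strip()
    anchors = li.cssselect('a')
    # keep the raw href of the first <a>, or an empty string if there is none
    bishops_with_links[full_text] = anchors[0].get('href') if anchors else ''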
Update: I've since solved this with BeautifulSoup, as follows:
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

bishops_with_links = {}
bishops = soup.select('li')

for bishop in bishops:
    if bishop.findChildren('a'):
        bishops_with_links[bishop.text] = 'https://de.wikipedia.org' + bishop.a.get('href')
    else:
        bishops_with_links[bishop.text] = ''

return bishops_with_links
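A note on why this works: BeautifulSoup's .text concatenates the text of the element and all of its descendants, so the full '1238–1268: Sigvarður Þéttmarsson (Norweger)' string comes through as the key, while bishop.a.get('href') pulls the link from the child anchor. The return suggests the snippet sits inside a function; a hypothetical usage sketch, with get_bishops standing in for whatever that surrounding function is actually called:

bishops = get_bishops()
print(bishops['1238–1268: Sigvarður Þéttmarsson (Norweger)'])
# -> https://de.wikipedia.org/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson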