Parsing Hidden Elements in Static HTML Files using Python lxml
I have a set of static HTML files that I need to parse to extract some details. I am using the Python lxml module to get them. A sample from one of the static HTML files is shown below:
<div class="top">
<a data-bind="text">abc</a>
<span data-bind="visible:hotel.marca1!='' && hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
</span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
</span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
<span></span>
<span class="locality" data-bind="text: hotel.pob"></span>
</div>
</div>
<div class="top">
<a data-bind="text">dfg</a>
<span data-bind="visible:hotel.marca1!='' && hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
</span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
</span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
<span></span>
<span class="locality" data-bind="text: hotel.pob"></span>
</div>
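So here is the problem: I need to get the star rating from the visible span element with class 'star'. For example, in the first div[@class="top"] the visible span has a star rating of "4", while the second div[@class="top"] has no visible span[class=star] element, so it should return a star rating of "0".
However, since these elements are hidden, I am having trouble getting them, and also getting the script to return a '0' star rating for div elements in which all the span[@class='star'] elements are hidden.
Here is what I have tried so far: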
from lxml import html

tree = html.fromstring(page)
for sali in tree.xpath('//div[@class="top"]'):
    for x in sali.xpath('a'):
        for sal in sali.xpath('span[not(contains(@style,"display:none"))]'):
            print(x, sal.attrib['data-bind'])
But this code does not give me the result I want. What am I doing wrong?
Expected output:
abc 4
dfg 0
There are several ways to approach this. Here is one: get the "star" rating elements and return the index of the first "visible" one, falling back to 0 if none is found. We can use next() and enumerate() to do this:
from lxml.html import fromstring

def is_visible(element):
    """Naive implementation of the element visibility check."""
    return 'display: none;' not in element.attrib.get("style", "")

def get_rating(entry):
    # The star spans appear in order star1..star5, so the 1-based position
    # of the first visible span is the rating.
    rating_elements = entry.xpath(".//span[contains(@class, 'star')]")
    visible_rating = (index
                      for index, element in enumerate(rating_elements, start=1)
                      if is_visible(element))
    return next(visible_rating, 0)

root = fromstring(html)  # html is the markup string shown in the question
for sali in root.xpath('//div[@class="top"]'):
    for x in sali.xpath('a'):
        print(x.text, get_rating(sali))
This prints (under Python 3):
abc 4
dfg 0
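The is_visible() check above is deliberately naive: it only recognises the exact text 'display: none;', so a style written as "display:none" would count as visible. A slightly more tolerant sketch, assuming visibility is controlled purely by an inline style attribute (the helper name is_hidden_style is hypothetical and not part of lxml), could normalise the spacing with a regular expression:

```python
import re

# Matches "display:none" regardless of spacing or letter case.
_DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

def is_hidden_style(style):
    """Return True if an inline style string hides the element."""
    # Treat a missing style (None) the same as an empty one.
    return bool(_DISPLAY_NONE.search(style or ""))

for candidate in ("display: none;", "display:none", "color: red", None):
    print(repr(candidate), is_hidden_style(candidate))
```

With lxml this could be called as is_hidden_style(element.get("style")), since element.get() returns None when the attribute is missing. It still ignores stylesheets and hidden ancestors, so it remains an approximation.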
Note that class is a multi-valued attribute, so strictly speaking contains() is not the best tool for locating elements by class value:
- How can I match on an attribute that contains a certain string?
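To illustrate the point about contains(), here is a small sketch comparing it with the common XPath 1.0 idiom that pads the class list with spaces and looks for a whole token. The sample markup and the class name "starred" are invented for the comparison:

```python
from lxml import html

# Token-exact class test: pad the class list with spaces so that "star"
# matches only as a whole token, not as a substring of another class.
STAR_SPANS = (".//span[contains(concat(' ', normalize-space(@class), ' '),"
              " ' star ')]")

sample = html.fromstring("""
<div>
  <span class="star sprite-disponibilidad star4"></span>
  <span class="starred"></span>
  <span class="marca"></span>
</div>
""")

loose = sample.xpath(".//span[contains(@class, 'star')]")  # substring match
exact = sample.xpath(STAR_SPANS)                            # whole-token match
print(len(loose), len(exact))
```

The substring version also picks up the "starred" span, while the padded version matches only elements that carry "star" as a separate class.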
You can use lxml through BeautifulSoup. Someone more familiar with Python could probably tidy this up.
from bs4 import BeautifulSoup
html = '''
<div class="top">
<a data-bind="text">abc</a>
<span data-bind="visible:hotel.marca1!='' && hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
</span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
</span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
<span></span>
<span class="locality" data-bind="text: hotel.pob"></span>
</div>
</div>
<div class="top">
<a data-bind="text">dfg</a>
<span data-bind="visible:hotel.marca1!='' && hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
</span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
</span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
<span></span>
<span class="locality" data-bind="text: hotel.pob"></span>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
ratings = []
for item in soup.select("div.top"):
    hotel = item.select_one('a').text
    found = False
    for item2 in item.select("[data-bind*='visible:hotel.cat']"):
        try:
            style = item2['style']
        except KeyError:
            # No style attribute means this star span is not hidden.
            rating = item2['data-bind'].strip("visible:hotel.cat === ").strip("'")
            found = True
            break
    ratings.append([hotel + ' ' + rating if found else hotel + ' 0'])
print(ratings)
Output:
[['abc 4'], ['dfg 0']]
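One caveat about the snippet above: str.strip() treats its argument as a set of characters to remove from both ends, not as a prefix, so it yields the right digit here mainly because the quote characters stop the stripping. A sketch of a more explicit extraction with a regular expression, assuming the binding always has the form hotel.cat === 'N' (the function name rating_from_binding is made up for illustration):

```python
import re

# Captures the digits in a binding such as: visible:hotel.cat === '4'
_CAT_RE = re.compile(r"hotel\.cat\s*===\s*'(\d+)'")

def rating_from_binding(data_bind):
    """Return the star rating encoded in a data-bind value, or 0 if absent."""
    match = _CAT_RE.search(data_bind)
    return int(match.group(1)) if match else 0

# strip() removes any of the given characters from both ends, not a prefix,
# so here it also eats the trailing "t" of "cat".
print("hotel.cat".strip("hotel."))

print(rating_from_binding("visible:hotel.cat === '4'"))
print(rating_from_binding("visible:hotel.marca1===''"))
```

In the loop above, rating_from_binding(item2['data-bind']) could stand in for the two chained strip() calls; it returns an int rather than a string, so it would need str() before being concatenated.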