Python nested HTML tags with BeautifulSoup
I am trying to get all the href URLs from some nested HTML code:
...
<li class="dropdown">
<a href="#" class="dropdown-toggle wide-nav-link" data-toggle="dropdown">TEXT_1 <b class="caret"></b></a>
<ul class="dropdown-menu">
<li class="class_A"><a title="Title_1" href="http://www.customurl_1.com">Title_1</a></li>
<li class="class_B"><a title="Title_2" href="http://www.customurl_2.com">Title_2</a></li>
...
<li class="class_A"><a title="Title_X" href="http://www.customurl_X.com">Title_X</a></li>
</ul>
</li>
...
<li class="dropdown">
<a href="#" class="dropdown-toggle wide-nav-link" data-toggle="dropdown">TEXT_2 <b class="caret"></b></a>
<ul class="dropdown-menu">
<li class="class_A"><a title="Title_1" href="http://www.customurl_1.com">Title_1</a></li>
<li class="class_B"><a title="Title_2" href="http://www.customurl_2.com">Title_2</a></li>
...
<li class="class_A"><a title="Title_X" href="http://www.customurl_X.com">Title_X</a></li>
</ul>
</li>
...
In the original HTML code there are about 15 "li" blocks with class "dropdown",
but I only want the URLs from the block whose text = TEXT_1.
Is it possible to capture all these nested URLs
with BeautifulSoup?
Thanks for your help.
An example with lxml and XPath:
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
hrefs = tree.xpath('//li[@class="dropdown" and a[starts-with(.,"TEXT_1")]]/ul[@class="dropdown-menu"]/li/a/@href')
print(hrefs)
where html is a unicode string containing your HTML content. Result:
['http://www.customurl_1.com', 'http://www.customurl_2.com', 'http://www.customurl_X.com']
Note: I used the starts-with function to be more precise in the XPath query, but if TEXT_1 is not always at the beginning, you can use contains on the text node in the same way.
Query details:
// # anywhere in the domtree
li # a li tag with the following conditions:
[ # (opening condition bracket for li)
@class="dropdown" # li has a class attribute equal to "dropdown"
and # and
a # a child tag "a"
[ # (open a condition for "a")
starts-with(.,"TEXT_1") # that the text starts with "TEXT_1"
] # (close a condition for "a")
] # (close the condition for li)
/ # li's child (/ stands for immediate descendant)
ul[@class="dropdown-menu"] # "ul" with class equal to "dropdown-menu"
/li # "li" children of "ul"
/a # "a" children of "li"
/@href # href attributes children of "a"
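To illustrate the contains variant mentioned in the note above, here is a minimal self-contained sketch. The HTML snippet and the "Menu TEXT_1" toggle text are made up for the demo; contains matches TEXT_1 anywhere in the text, not only at the start:

```python
from io import StringIO
from lxml import etree

# Hypothetical snippet where TEXT_1 is NOT at the start of the toggle text
html = """<ul>
<li class="dropdown">
  <a href="#">Menu TEXT_1</a>
  <ul class="dropdown-menu">
    <li><a href="http://www.customurl_1.com">Title_1</a></li>
  </ul>
</li>
<li class="dropdown">
  <a href="#">Menu TEXT_2</a>
  <ul class="dropdown-menu">
    <li><a href="http://www.customurl_2.com">Title_2</a></li>
  </ul>
</li>
</ul>"""

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
# contains() replaces starts-with(); the rest of the query is unchanged
hrefs = tree.xpath('//li[@class="dropdown" and a[contains(.,"TEXT_1")]]'
                   '/ul[@class="dropdown-menu"]/li/a/@href')
print(hrefs)
```

Only the URLs under the TEXT_1 dropdown are returned; the TEXT_2 block is filtered out by the predicate.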
Though less elegant than XPath, you can always write the logic with plain Python iteration. BeautifulSoup allows passing a function as a filter to find_all, which handles complex cases like this one.
from bs4 import BeautifulSoup

html_doc = """<html>..."""
soup = BeautifulSoup(html_doc, 'html.parser')

def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) is not None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text().startswith('TEXT_1')

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-menu'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
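As a comparison (not from the original answer), the same filtering can be sketched with CSS selectors via select(). Since CSS selectors cannot match on text, the TEXT_1 check stays in Python; the HTML snippet below is made up for the demo:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two dropdown blocks
html_doc = """<ul>
<li class="dropdown">
  <a href="#">TEXT_1</a>
  <ul class="dropdown-menu">
    <li><a href="http://www.customurl_1.com">Title_1</a></li>
  </ul>
</li>
<li class="dropdown">
  <a href="#">TEXT_2</a>
  <ul class="dropdown-menu">
    <li><a href="http://www.customurl_2.com">Title_2</a></li>
  </ul>
</li>
</ul>"""

soup = BeautifulSoup(html_doc, 'html.parser')
hrefs = []
for li in soup.select('li.dropdown'):
    toggle = li.find('a')  # the first <a> inside the li is the dropdown toggle
    if toggle and toggle.get_text().startswith('TEXT_1'):
        # collect only anchors that actually carry an href attribute
        hrefs += [a['href'] for a in li.select('ul.dropdown-menu a[href]')]
print(hrefs)
```

This trades the explicit filter functions for selector strings, at the cost of one manual text check per block.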