BeautifulSoup - search by text inside a tag
Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
...     'a',
...     href="/customer-menu/1/accounts/1/update"
... )
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the docs, soup uses the regular expression's match function, not its search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
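To make the difference concrete, here is a small sketch using only the standard re module (the variable names are mine, not from the question). It shows why the anchored pattern fails on the leading newline, and that a plain search() for "Edit" would not need the wildcards or the flag at all:

```python
import re

text = "\n Edit\n"

# match() is anchored at position 0, and '.' does not match '\n' by default,
# so '.*' cannot get past the leading newline.
anchored = re.compile(".*Edit.*").match(text)

# With DOTALL, '.' also matches newlines, so the whole string is matched.
dotall = re.compile(".*Edit.*", flags=re.DOTALL).match(text)

# search() scans the string for the first position where the pattern matches.
searched = re.compile("Edit").search(text)

print(anchored)
print(dotall.group(0) == text)
print(searched.group(0))
```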
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!
Edit
My solution, based on geckon's answer: I implemented these helpers:
import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # Look for a matching text node anywhere below the element
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined
        raise ValueError(
            "Too many matches:\n" + "\n".join(str(m) for m in matches)
        )
    elif len(matches) == 0:
        return None
    else:
        return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
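The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First let's take a look at what the text="" argument of find() does.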
NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's take a look at what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
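That's exactly your case. Your <a> tag contains both a text node and an <i> tag. Therefore, when find tries to match against the tag's string it gets None, so nothing can match.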
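A quick way to see this for yourself is to compare .string on both variants of the markup. This is only a sketch: the shortened href and the explicit "html.parser" argument are my own choices, not part of the question.

```python
from bs4 import BeautifulSoup

# <a> with a single text child
plain = BeautifulSoup('<a href="/update">\nEdit\n</a>', "html.parser")
# <a> with a newline, an <i> tag and a text node as children
nested = BeautifulSoup(
    '<a href="/update">\n<i class="fa fa-edit"></i> Edit\n</a>',
    "html.parser",
)

plain_string = plain.a.string    # the only child, a NavigableString
nested_string = nested.a.string  # several children, so .string is None
nested_text = nested.a.get_text()  # text of all descendants, concatenated

print(repr(plain_string))
print(repr(nested_string))
print(repr(nested_text))
```

So the text is still there (get_text() sees it), it just is not reachable through .string, which is what the text/string argument is compared against.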
How to solve this?
Maybe there is a better solution, but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
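thelink = None  # stays None if no link contains "Edit"
for link in links:
    # Search the text nodes below the link, not the link's .string
    if link.find(text=re.compile("Edit")):
        thelink = link
        break
print(thelink)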
I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
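A variation on the same idea, in case it is useful: search for the text node first and then walk up to the enclosing link with find_parent(). This is a sketch of an alternative, not what the answer above does; it assumes BeautifulSoup 4.4.0 or later for the string argument, and the "html.parser" choice is mine.

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, find() returns the matching text node itself
text_node = soup.find(string=re.compile("Edit"))

# Walk up from the text node to the nearest <a> with the expected href
link = None
if text_node is not None:
    link = text_node.find_parent("a", href="/customer-menu/1/accounts/1/update")

print(link)
```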
You can pass .find a function that returns True if the text of the <a> tag contains "Edit"
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of the text attribute in your function, which gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
In one line, using a lambda
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
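If you need this for more than one tag name or search string, the predicate can be built by a small factory. The name tag_containing below is hypothetical (it is not part of BeautifulSoup), and the second link in the markup is made up for the demonstration; treat it as a sketch.

```python
from bs4 import BeautifulSoup


def tag_containing(name, needle):
    """Build a predicate for find()/find_all(): tag `name` whose text contains `needle`."""
    def predicate(tag):
        return tag.name == name and needle in tag.get_text()
    return predicate


html = """
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
<i class="fa fa-trash"></i> Delete
</a>
"""
soup = BeautifulSoup(html, "html.parser")

edit_link = soup.find(tag_containing("a", "Edit"))
missing = soup.find(tag_containing("a", "Archive"))

print(edit_link["href"])
print(missing)
```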
With soupsieve 2.1.0 and later you can use the :-soup-contains CSS pseudo-class selector to target a node's text. This replaces the deprecated form, :contains().
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
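The same selector should also cover the markup from the question, because :-soup-contains looks at the text of descendants too, and it can be combined with an attribute selector for the href. A sketch, assuming soupsieve 2.1.0 or later is installed; the second link is made up for the demonstration.

```python
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
<i class="fa fa-trash"></i> Delete
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector for the href plus :-soup-contains for the text
selector = 'a[href="/customer-menu/1/accounts/1/update"]:-soup-contains("Edit")'
link = soup.select_one(selector)

# select_one() returns None when nothing matches
no_link = soup.select_one('a:-soup-contains("Archive")')

print(link["href"])
print(no_link)
```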