BeautifulSoup - search by text inside a tag

Observe the following problem:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

>>> a2 = soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update"
    )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n')  # Returns None

pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n')  # Returns MatchObject
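
As a side note, here is a minimal sketch of the difference between match and search that the reasoning above depends on. It uses only the standard re module, and the variable names are illustrative:

```python
import re

text = '\n Edit\n'
pattern = re.compile('Edit')

anchored = pattern.match(text)   # only tries to match at position 0
scanned = pattern.search(text)   # scans the whole string for a match

print(anchored)         # None, because the string starts with '\n'
print(scanned.group())  # Edit
```

With search, neither the surrounding `.*` nor the DOTALL flag is needed to find "Edit" in a multi-line string.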

Alright. Looks good. Let's try it with soup:

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before joining, otherwise join() raises TypeError
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

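The problem is that your <a> tag with the <i> tag inside doesn't have the string attribute you expect it to have. First let's take a look at what the text="" argument of find() does.

NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it is called string.

From the docs: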

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

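This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None. When find() tries to match your pattern against that .string, it has nothing to match, and the search returns None.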
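
To make this concrete, here is a minimal sketch that compares .string on a link with and without a nested tag. The shortened href and the variable names are illustrative, and "html.parser" is chosen only so the example does not depend on an extra parser being installed:

```python
from bs4 import BeautifulSoup

plain = BeautifulSoup('<a href="/update">Edit</a>', "html.parser")
nested = BeautifulSoup(
    '<a href="/update"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
)

plain_string = plain.a.string      # the single NavigableString child
nested_string = nested.a.string    # None: <a> has two children (<i> and a string)
nested_text = nested.a.get_text()  # concatenates all descendant strings

print(repr(plain_string))
print(repr(nested_string))
print(repr(nested_text))
```

The text is still reachable through get_text(), which is why filtering on the tag's full text works where filtering on .string does not.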

How to solve this?

Maybe there is a better solution, but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

thelink = None  # stays None when no link contains "Edit"
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)

I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.

You can pass .find a function that returns True if the text of an <a> tag contains "Edit":

In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....: 

In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]: 
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>

EDIT:

You can use the .get_text() method instead of the .text property in your function, which gives the same result:

def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()

In one line, using a lambda:

soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
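
The variants above hard-code both the tag name and the word "Edit". Below is a sketch of a parameterized version. The helper name tag_with_text and the second link (with its /delete URL) are hypothetical and exist only to illustrate the idea:

```python
from bs4 import BeautifulSoup


def tag_with_text(name, needle):
    """Build a filter for find()/find_all(): a tag called `name` whose text contains `needle`."""
    def matcher(tag):
        return tag.name == name and needle in tag.get_text()
    return matcher


# The second link is made up for this example.
soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
    <i class="fa fa-trash"></i> Delete
</a>
""", "html.parser")

edit_link = soup.find(tag_with_text("a", "Edit"))
delete_link = soup.find(tag_with_text("a", "Delete"))
missing_link = soup.find(tag_with_text("a", "Archive"))  # no such link

print(edit_link["href"])
print(delete_link["href"])
print(missing_link)
```

Note that this is a plain substring test, so "Edit" would also match a link whose text is "Editor".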

With soupsieve 2.1.0 you can use the :-soup-contains CSS pseudo-class selector to target a node's text. This replaces the deprecated form, :contains().

from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
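
The example above uses the plain link. As a sketch, assuming soupsieve 2.1.0 or later is installed, the same selector can be applied to the markup from the question that has the nested <i> tag, since :-soup-contains also looks at text in descendants. Combining it with an attribute selector on href is an addition for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")

# Restrict by href and by contained text in a single CSS selector.
link = soup.select_one(
    'a[href="/customer-menu/1/accounts/1/update"]:-soup-contains("Edit")'
)
link_text = link.get_text(strip=True)
print(link_text)
```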