Find substring of keyword using BeautifulSoup

Question

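I am trying to find, from a list of URLs, the ones whose <td> tags contain a given string or substring, using BeautifulSoup. It works fine when the complete string is present but fails for a substring. Here is the code I have written so far: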

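import requests
from bs4 import BeautifulSoup

# urls and the_word are defined elsewhere
for url in urls:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # text=the_word only matches <td> tags whose whole text equals the_word
    words = soup.find_all("td", text=the_word)
    print(words)
    print(url)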

I don't know much about this. Can anyone guide me on how to search for substrings as well?

Answer 1

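There is no way to do this directly. The only way I can think of is to put all the text from the 'td' tags into a data structure such as a list or dictionary and test it there.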
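The approach described in this answer could be sketched roughly as follows: collect the text of every <td> into a list, then test each entry for the substring. This is only an illustration under a few assumptions: the sample HTML and the name `cell_texts` are made up, and the built-in 'html.parser' is used in place of 'lxml' purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, including an empty cell and a cell with nested tags.
html = '''
<td>the keyword is present in the text</td>
<td>the <b>keyword</b> inside nested markup</td>
<td></td>
<td>the word is not present in the text</td>'''

soup = BeautifulSoup(html, 'html.parser')
the_word = 'keyword'

# get_text() returns '' (never None) for an empty tag and also
# concatenates text found inside nested tags such as <b>.
cell_texts = [td.get_text() for td in soup.find_all('td')]

# Plain substring test on each collected string.
matches = [text for text in cell_texts if the_word in text]
print(matches)
# ['the keyword is present in the text', 'the keyword inside nested markup']
```

Because the test runs on plain strings, the empty cell needs no special handling, and cells with nested markup are included as well.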

Answer 2

You can use a custom function to check whether the word is present in the text.

from bs4 import BeautifulSoup

html = '''
<td>the keyword is present in the text</td>
<td>the keyword</td>
<td></td>
<td>the word is not present in the text</td>'''

soup = BeautifulSoup(html, 'lxml')
the_word = 'keyword'
# t is the tag's text, or None for a tag without text
tags = soup.find_all('td', text=lambda t: t and the_word in t)
print(tags)
# [<td>the keyword is present in the text</td>, <td>the keyword</td>]

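Normally, just the_word in t would be enough. However, if any <td> tag has no text at all, as in the example (<td></td>), then the_word in t raises TypeError: argument of type 'NoneType' is not iterable. That is why we first have to check that the text is not None, hence the function lambda t: t and the_word in t.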
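To see why the guard works, the predicate can be tried on its own, without BeautifulSoup. In this small sketch the names `contains` and `samples` are illustrative; None stands in for the text of an empty tag.

```python
the_word = 'keyword'

# Same predicate as in the answer above.
contains = lambda t: t and the_word in t

# None stands for a tag with no text; the others are ordinary strings.
samples = [None, '', 'the keyword', 'the word']

# "t and ..." short-circuits: when t is None or '', the right-hand side
# is never evaluated, so no TypeError can occur.
# bool() collapses each result to True/False for display.
results = [bool(contains(t)) for t in samples]
print(results)  # [False, False, True, False]
```

Note that the predicate returns t itself (None or '') when the guard fails, so its result is only meaningful as a truth value.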

If you are not comfortable with lambdas, you can use a simple function that is equivalent to the one above:

def contains_word(t):
    return t and 'keyword' in t

tags = soup.find_all('td', text=contains_word)
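Applied to the loop from the question, the same filter might look like the sketch below. The helper names (`page_has_word`, `matching_urls`, `fetch_with_requests`), the injectable `fetch` parameter, and the example.com pages are all hypothetical, introduced only so the idea can be tried offline; 'html.parser' is used instead of 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def page_has_word(html, the_word):
    """Return True if any <td> in html has text containing the_word."""
    soup = BeautifulSoup(html, 'html.parser')
    return bool(soup.find_all('td', text=lambda t: t and the_word in t))

def matching_urls(urls, the_word, fetch):
    """Return the URLs whose fetched HTML passes page_has_word."""
    return [url for url in urls if page_has_word(fetch(url), the_word)]

def fetch_with_requests(url):
    # Mirrors the request made in the question; imported lazily so the
    # rest of the sketch can be tried without network access.
    import requests
    return requests.get(url, allow_redirects=False).content

# Offline demonstration with a stand-in fetcher and made-up pages.
fake_pages = {
    'http://example.com/a': '<table><tr><td>the keyword is here</td></tr></table>',
    'http://example.com/b': '<table><tr><td>nothing relevant</td></tr></table>',
    'http://example.com/c': '<table><tr><td></td></tr></table>',
}
found = matching_urls(fake_pages, 'keyword', fake_pages.get)
print(found)  # ['http://example.com/a']
```

For real pages, something like matching_urls(urls, the_word, fetch_with_requests) would be the equivalent call, assuming requests is installed.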

Tags: python, lxml, beautifulsoup