Grabbing the main text content of an HTML tag without the <span> inside

Question

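I'm building a Python web scraper that goes through an eBay search results page (in this case 'Gaming laptops') and grabs the title of each item for sale. I'm using BeautifulSoup to first get the h1 tag where each title is stored, and then print it as text: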

    for item_name in soup.find_all('h1', {'class': 'it-ttl'}):
        print(item_name.text)

However, inside each h1 tag with the class 'it-ttl' there is also a span tag that contains some text:

<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>

My current program prints both the contents of the span tag and the item title.

Could someone explain how to get only the item title while ignoring the span tag that contains the "Details about" text? Thanks!

Answer 1

Simply remove the offending <span>:

item = """
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
"""
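from bs4 import BeautifulSoup as bs

soup = bs(item, 'lxml')  # needs the lxml package; 'html.parser' also works
target = soup.select_one('h1')
# decompose() removes the <span> from the tree in place,
# so only the title text is left inside the <h1>
target.select_one('span').decompose()
print(target.text.strip())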

Output:

Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
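If you would rather not modify the parsed tree, or some titles might have no <span> at all, one possible alternative is to collect only the text nodes that sit directly inside each <h1>. This is a minimal sketch, not part of the original answer: the helper name own_text and the second <h1> are made up for illustration, and it uses 'html.parser' so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
<h1 class="it-ttl">Title without a span</h1>
"""


def own_text(tag):
    """Hypothetical helper: join the text nodes that are direct children of `tag`."""
    # recursive=False limits the search to direct children, so text nested
    # inside the <span> is skipped; string=True selects text nodes only.
    parts = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in parts if p.strip())


soup = BeautifulSoup(html, "html.parser")
titles = [own_text(h1) for h1 in soup.find_all("h1", class_="it-ttl")]
for title in titles:
    print(title)
```

Unlike decompose(), this leaves the <span> in the document, so the same soup can still be queried for it later.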

Answer 2

Another solution.

from simplified_scrapy import SimplifiedDoc
html = '''
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
'''
doc = SimplifiedDoc(html)
item_names = doc.selects('h1.it-ttl').span.nextText()

print(item_names)

Result:

['Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…']

There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
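For comparison, the same "text right after the span" idea can be sketched in BeautifulSoup with next_sibling. This is only an illustration under the assumption that the title is the node immediately following the <span>, as in the sample markup; it falls back to the whole text of the <h1> otherwise.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.it-ttl")
span = h1.find("span")

if span is not None and span.next_sibling is not None:
    # next_sibling is the node directly after </span>; in this markup
    # it is the text node that holds the title.
    title = str(span.next_sibling).strip()
else:
    # No span (or nothing after it): use the full text of the <h1>.
    title = h1.get_text(strip=True)

print(title)
```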

python

beautifulsoup

web-crawler

html-parsing

web-scraping