Exclusion of span with BS4 - Python

Question

So I'm trying to exclude (rather than extract) the information contained in a span. Here is the HTML:

<li><span>Type:</span> Cardiac Ultrasound</li>

Here is my code:

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
        description_elements = description_el.find('span')
        for el in description_elements: 
            curr_el = {}
            key = el.replace(':', '')
            print(el)
            print(description_el.text.replace(' ', ''))

Here listing_soup is basically the whole page (the HTML in my example). When I run this, I get:

Type:
Type: CardiacUltrasound

As you can see, for some peculiar reason :P, the span is not affected by my replace() call, even though .text yields a str.

Edit: Sorry. My objective is to create a bunch of dictionaries, where the key is the span text and the value is whatever comes right after it.

Answer 1

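Note: be careful with "creating a bunch of dictionaries", because a dictionary cannot have duplicate keys. You can have a list of dictionaries, though, in which case a key repeated across dictionaries does not matter (it still matters within each individual dictionary).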
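To make the duplicate-key point concrete, here is a minimal sketch. The two-listing HTML and the names records and merged are made up for illustration; they are not from the question. It builds one dictionary per item_description block, then shows what happens if the same data is merged into a single dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two listings that both use the key "Type".
html = '''
<div class="item_description"><li><span>Type:</span> Cardiac Ultrasound</li></div>
<div class="item_description"><li><span>Type:</span> Vascular Ultrasound</li></div>'''

soup = BeautifulSoup(html, 'html.parser')

records = []  # one dictionary per listing
for block in soup.find_all(class_='item_description'):
    record = {}
    for li in block.find_all('li'):
        span = li.find('span')
        if span is None:
            continue  # no key in this <li>, skip it
        key = span.get_text().strip(' :')
        # Only the direct text nodes of <li>, so the span text is left out.
        value = ''.join(li.find_all(string=True, recursive=False)).strip()
        record[key] = value
    records.append(record)

# Merging everything into one dictionary keeps only the last "Type".
merged = {}
for record in records:
    merged.update(record)

print(records)
print(merged)
```

The list keeps both listings, while the merged dictionary silently drops the first value.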

Option 1:

Use .next_sibling (it is an attribute, not a method)

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':', '')
    v = description_el.find('span').next_sibling.strip()
    
    print(k)
    print(v)
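One caveat with Option 1: .next_sibling can be None (nothing follows the span) or another tag, and .strip() would then fail or not exist. The sketch below guards against both cases. The helper name value_after_span and the extra <li> rows are assumptions added for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

def value_after_span(li):
    """Return the stripped text right after the first <span> in li, or None."""
    span = li.find('span')
    if span is None:
        return None
    sibling = span.next_sibling
    # The sibling may be missing or may be another tag; only text is accepted.
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return None

# Hypothetical sample covering the normal case and two edge cases.
html = '''
<ul class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li>
<li><span>Empty:</span></li>
<li>No span here</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')
values = [value_after_span(li) for li in soup.find_all('li')]
print(values)
```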

Option 2:

Just take the text of description_el and .split(':') it. That gives you the 2 elements you want (if I'm reading your question correctly).

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    
    print(k)
    print(v)
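Two edge cases are worth checking with Option 2. If the text has no colon, split(':', 1) returns a single element, so descText[0] and descText[-1] are the same string and the key would equal the value. If the value itself contains a colon, the maxsplit of 1 is what keeps it intact. A small sketch using str.partition makes both cases explicit; the helper name split_key_value and the sample strings are assumptions for illustration.

```python
def split_key_value(text):
    """Split 'Key: value' on the first colon; return None if there is no colon."""
    key, sep, value = text.partition(':')
    if not sep:
        return None  # no colon found, so there is no key/value pair
    return key.strip(), value.strip()

pair = split_key_value('Type: Cardiac Ultrasound')
with_colon = split_key_value('Time: 10:30')   # colon inside the value
no_colon = split_key_value('NoKeyValue')

print(pair)
print(with_colon)
print(no_colon)
```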

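Option 3:

Get the <span> text. Remove the <span>. Then get the text that remains in the <li>. Since you said you don't want to extract, though, this may not be useful to you.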

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':','')
    description_el.find('span').extract()
    v = description_el.text.strip()
    
    print(k)
    print(v)

Output:

Type
Cardiac Ultrasound
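Option 3 modifies the parsed tree, because extract() removes the <span> from listing_soup. If the original tree has to stay intact, one possible variant is to run the same steps on a copy of each <li>. This is a sketch, not part of the answer above; it relies on copy.copy() producing a detached copy of a Beautiful Soup tag, and the names li_copy and pairs are illustrative.

```python
import copy
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

pairs = {}
for li in listing_soup.find(class_='item_description').find_all('li'):
    li_copy = copy.copy(li)       # detached copy; the original <li> is untouched
    span = li_copy.find('span')
    if span is None:
        continue
    k = span.get_text().replace(':', '').strip()
    span.extract()                # removes the span from the copy only
    pairs[k] = li_copy.get_text().strip()

# The span is still present in the original soup.
span_still_there = listing_soup.find('span') is not None

print(pairs)
print(span_still_there)
```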

Answer 2

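To extract the text of a tag excluding the text of its child tags, you can use the method from this answer. In general, you just need to iterate over the <li> tags and take the text from the tags that contain a child <span>.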

Code:

from bs4 import BeautifulSoup, NavigableString

html = """<html><body>
<li><span>Key1:</span> Value1</li>
<li><span>Key2:</span> Value2</li>
<li>NoKeyValue</li>
<li><span>Key3:</span> Value3</li>
<li><span>Key4:</span> Value4</li>
</body></html>"""

result = {}
for li in BeautifulSoup(html, "html.parser").find_all("li"):
    span = li.find("span")
    if span:
        result[span.text.strip(" :")] = \
            "".join(e for e in li if isinstance(e, NavigableString)).strip()

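The isinstance filter works because iterating over a tag yields its direct children, which are a mix of tags and text nodes. The short sketch below prints what a single <li> yields, so it is clear why only the text after the span survives. The one-line HTML and the variable kinds are chosen for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

li = BeautifulSoup('<li><span>Key1:</span> Value1</li>', 'html.parser').li

kinds = []
for child in li:  # iterating a tag walks its direct children
    kind = 'text' if isinstance(child, NavigableString) else 'tag'
    kinds.append((kind, str(child)))
    print(kind, repr(str(child)))

# Joining only the text children leaves out everything inside <span>.
value = ''.join(c for c in li if isinstance(c, NavigableString)).strip()
print(value)
```

The "Key1:" string lives inside the <span> tag, one level deeper, so it never appears as a direct text child of the <li>.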
