BeautifulSoup - Can't filter list results by punctuation

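I am trying to exclude question marks and colons from my results in Python, but they keep appearing in the final output. The results are filtered for 'None', but not for punctuation.

Any help would be appreciated.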

#Scrape BBC for Headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = list()

for i in tags:
    if i.string is not None:
        if i.string != ":":
            if i.string != "?":
                headlines.append(i.string)

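You are comparing the whole string against a single character, but what you want to know is whether the string contains that character. If you really want to do it that way, just use not in: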

if ':' not in i.string:
    if '?' not in i.string:
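
To make the difference concrete, here is a minimal sketch that runs both checks on a few made-up strings. The sample headlines and the two helper names are assumptions for illustration only; in the real script the strings come from i.string.

```python
# Hypothetical sample values standing in for the scraped i.string results.
samples = ["Budget: what changes?", "Storm hits coast", ":", None]

def keep_by_equality(text):
    # Original check: only rejects strings that are exactly ":" or "?".
    return text is not None and text != ":" and text != "?"

def keep_by_membership(text):
    # Corrected check: rejects any string that contains ":" or "?".
    return text is not None and ":" not in text and "?" not in text

by_equality = [s for s in samples if keep_by_equality(s)]
by_membership = [s for s in samples if keep_by_membership(s)]

print(by_equality)    # the headline with punctuation slips through
print(by_membership)  # only the headline without ":" or "?" remains
```

The equality check never fires on a real headline, because a headline is never just a single punctuation mark.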

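The downside of that approach is that every headline containing one of those characters is dropped entirely. I think it would be much better to clean the results in the loop by replacing the characters, like this: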

for i in tags:
    if i.string is not None:
        print(i.string.replace(':', '').replace('?', ''))

If you want to strip more characters, a regular expression is probably a better approach.
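
For a fixed set of single characters there is also a non-regex option, str.translate, which avoids chaining one replace call per character. This is a swapped-in alternative rather than the regex route mentioned above, and the sample headline is made up for illustration.

```python
# Hypothetical sample headline; the real strings would come from i.string.
headline = "Budget: what changes?"

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() deletes those characters.
remove_punct = str.maketrans("", "", ":?")

cleaned = headline.translate(remove_punct)
print(cleaned)
```

Adding another character to remove only means extending the ":?" string.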

Example

import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = list()

for i in tags:
    if i.string is not None:
        if ':' not in i.string:
            if '?' not in i.string:
                headlines.append(i.string)
headlines

Here is a regex formatting function that strips ? and : from a string:

import re

def hd_format(text):
    return re.sub(r"\?|\:", "", text)

You can add any other characters you want to exclude: separate them with | and escape special characters with \
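
If the list of characters grows, writing the alternation by hand gets error-prone. One possible variation, sketched below, builds a character class from a plain string and lets re.escape handle the escaping. The helper name make_cleaner and the sample text are assumptions for illustration, not part of the code above.

```python
import re

def make_cleaner(chars):
    # Hypothetical helper: returns a function that removes every character
    # listed in `chars` (which must be non-empty).
    # re.escape protects characters that are special in a pattern, e.g. ? or ].
    pattern = re.compile("[" + re.escape(chars) + "]")
    return lambda text: pattern.sub("", text)

clean = make_cleaner("?:!")
print(clean("Budget: what changes?!"))
```

The pattern is compiled once, so the returned function can be reused for every headline in the loop.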

Full code

import re

import requests
from bs4 import BeautifulSoup

#Scrape BBC for Headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = []

def hd_format(text):
    return re.sub(r"\?|\:", "", text)

for i in tags:
    if i.string is not None:
        headlines.append(hd_format(i.string))