BeautifulSoup - Can't filter list results by punctuation
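I'm trying to exclude question marks and colons from my results in Python, but they keep showing up in the final output. The results are filtered for None, but not for punctuation.
Any help would be appreciated.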
import requests
from bs4 import BeautifulSoup

# Scrape BBC for headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = list()
for i in tags:
    if i.string is not None:
        if i.string != ":":
            if i.string != "?":
                headlines.append(i.string)
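You are comparing the whole string to a single character, but what you want to know is whether the string contains that character. If you really want to filter that way, just use not in: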
if ':' not in i.string:
    if '?' not in i.string:
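To make the difference concrete, here is a minimal sketch that runs both tests over a few made-up strings. The sample headlines are assumptions for illustration only and are not taken from the BBC page.

```python
# Hypothetical sample headlines, used only for illustration.
samples = ["Budget: what changes?", "Rain expected tomorrow", ":"]

# `!=` only rejects a string that is exactly ":".
not_equal = [s for s in samples if s != ":"]

# `not in` rejects any string that contains ":" anywhere.
not_containing = [s for s in samples if ":" not in s]

print(not_equal)
print(not_containing)
```

With the equality test, the headline that merely contains a colon still gets through, which matches the behaviour described in the question.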
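The problem with that approach is that you will skip results entirely. I think it would be much better to clean the results inside the loop by replacing the characters, like this: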
for i in tags:
    if i.string is not None:
        print(i.string.replace(':', '').replace('?', ''))
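As a rough sketch of cleaning instead of skipping, the snippet below uses a plain list in place of the scraped tags. The list `raw_strings` is a hypothetical stand-in for the values of `i.string`, including a `None` entry like the ones the question already filters out.

```python
# Hypothetical stand-in for the values of `i.string`; None mimics a tag
# whose .string is None.
raw_strings = ["Budget: what changes?", None, "Rain expected tomorrow"]

# Keep every non-None headline and strip the unwanted characters from it.
cleaned = [
    s.replace(":", "").replace("?", "")
    for s in raw_strings
    if s is not None
]

print(cleaned)
```

Every headline with text survives; only the punctuation is dropped.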
If you want to remove more characters, a regular expression approach may be better.
Example
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = list()
for i in tags:
    if i.string is not None:
        # Keep only headlines that contain neither ':' nor '?'
        if ':' not in i.string:
            if '?' not in i.string:
                headlines.append(i.string)
print(headlines)
Here is a regex formatting function that removes ? and : from a string:
import re

def hd_format(text):
    # Remove every '?' and ':' from the text
    return re.sub(r"\?|\:", "", text)
You can add any other characters you want to exclude. Just separate them with | and escape special characters with \
Full code
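If the list of characters grows, escaping each one by hand gets error-prone. One possible variation, sketched below, builds the same kind of `|`-separated pattern automatically with `re.escape`. The helper name `make_cleaner` and the sample string are hypothetical and not part of the answer above.

```python
import re

def make_cleaner(chars):
    """Return a function that removes every character in `chars` from a string."""
    # re.escape makes each character literal, so '?' or '|' need no manual backslash.
    pattern = re.compile("|".join(re.escape(c) for c in chars))
    return lambda text: pattern.sub("", text)

# Hypothetical usage: strip '?', ':' and '!' from a made-up headline.
clean = make_cleaner("?:!")
result = clean("Budget: what changes?!")
print(result)
```

Adding another character then only means extending the string passed to `make_cleaner`.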
import re

import requests
from bs4 import BeautifulSoup

# Scrape BBC for headline text
url = 'https://www.bbc.co.uk/news'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
tags = soup.find_all(class_='gs-c-promo-heading__title')
#print(headlines)
headlines = []

def hd_format(text):
    # Remove every '?' and ':' from the text
    return re.sub(r"\?|\:", "", text)

for i in tags:
    if i.string is not None:
        headlines.append(hd_format(i.string))