How can I extract the text outside the <em> tag in BeautifulSoup?
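Can anyone help me extract the text after "From"? I want to extract the sender's name. It sits outside the em tag. I am using the Python BeautifulSoup package.
Here is the link to the web page: http://seclists.org/fulldisclosure/2016/Jan/0
I can extract the email title successfully because it is inside a <b> tag. The HTML page has no other div or class to select on.
This is the HTML of the page:
This is what I tried: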
import requests
from bs4 import BeautifulSoup

def title_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://seclists.org/fulldisclosure/2016/Jan/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # The email title is inside the first <b> tag
        for email_title in soup.find('b'):
            title = email_title.string
            print(title)
        # Iterating over the first <em> yields only its own children,
        # not the text that follows the tag
        for date_stamp in soup.em:
            date = date_stamp
            print(date)
        page += 1

title_spider(2)
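You want the next sibling of the em tag. If you only want the sender and date from specific em tags, you can combine that with a regular expression: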
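import re

import requests
from bs4 import BeautifulSoup

def title_spider(max_pages):
    for page in range(max_pages + 1):
        url = 'http://seclists.org/fulldisclosure/2016/Jan/{}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # The email title is inside the first <b> tag
        for email_title in soup.find('b'):
            title = email_title.string
            print(title)
        # Match only the <em> tags whose text contains "From" or "Date"
        # (string= is the current name of the older text= argument)
        for em in soup.find_all("em", string=re.compile("From|Date")):
            # The value is the text node that directly follows the tag
            print(em.text, em.next_sibling)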
This gives you:
In [5]: title_spider(2)
Alcatel Lucent Home Device Manager - Management Console Multiple XSS
From : Uğur Cihan KOÇ <u.cihan.koc () gmail com>
Date : Sun, 3 Jan 2016 13:20:53 +0200
Executable installers/self-extractors are vulnerable^WEVIL (case 17): Kaspersky Labs utilities
From : "Stefan Kanthak" <stefan.kanthak () nexgo de>
Date : Sun, 3 Jan 2016 16:12:50 +0100
Possible vulnerability in F5 BIG-IP LTM - Improper input validation of the HTTP version number of the HTTP reqest allows any payload size and conent to pass through
From : Eitan Caspi <eitanc () yahoo com>
Date : Sun, 3 Jan 2016 21:10:27 +0000 (UTC)
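To get only the sender's name rather than the whole "From" value, the sibling text can be trimmed further. The following is a minimal offline sketch, not the answer's own code: `SAMPLE_HTML` is hypothetical markup assumed to resemble the page's header, and `header_fields` and `sender_name` are illustrative helper names. It also assumes the name always precedes the `<address>` part, as in the output above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the message header on the page.
SAMPLE_HTML = (
    "<b>Example subject</b><br>"
    "<em>From</em>: Jane Doe &lt;jane.doe () example com&gt;<br>"
    "<em>Date</em>: Sun, 3 Jan 2016 13:20:53 +0200<br>"
)


def header_fields(html):
    """Map each matching <em> label to the text node that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for em in soup.find_all("em", string=re.compile(r"^(From|Date)$")):
        sibling = em.next_sibling
        # Skip the label if it is followed by another tag or by nothing.
        if isinstance(sibling, str):
            # Drop the leading ": " separator and surrounding whitespace.
            fields[em.get_text()] = sibling.lstrip(": ").strip()
    return fields


def sender_name(from_value):
    """Keep the part before '<address>' and strip optional quotes."""
    name = from_value.split("<", 1)[0]
    return name.strip().strip('"')


if __name__ == "__main__":
    fields = header_fields(SAMPLE_HTML)
    print(sender_name(fields["From"]))
    print(fields["Date"])
```

Checking that the sibling is a string guards against a label that is followed directly by another tag, in which case `next_sibling` would be a tag (or `None`) instead of text.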