How can I extract the text outside the <em> tag in BeautifulSoup?

Can anyone help me extract the text after "From"? I want to extract the sender's name, which sits outside the em tag. I am using the Python BeautifulSoup package.

Here is the link to the web page: http://seclists.org/fulldisclosure/2016/Jan/0

I was able to extract the email title successfully because it is inside a tag. The HTML page has no other divs or classes.

This is the HTML code of the page:

This is what I have tried:

import requests
from bs4 import BeautifulSoup

def title_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://seclists.org/fulldisclosure/2016/Jan/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for email_title in soup.find('b'):
            title = email_title.string
            print(title)

        for date_stamp in soup.em:
            date = date_stamp
            print(date)
        page += 1

title_spider(2)


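You want the next sibling of the em tag: the sender text is a bare text node that follows the tag rather than sitting inside it. If you only want the sender and date from specific em tags, you can combine this with a regular expression: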

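import re

import requests
from bs4 import BeautifulSoup

def title_spider(max_pages):
    for page in range(max_pages + 1):
        url = 'http://seclists.org/fulldisclosure/2016/Jan/{}'.format(page)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        # The email title is the text of the first <b> tag on the page.
        title_tag = soup.find('b')
        if title_tag is not None:
            print(title_tag.get_text())

        # Match only the <em> labels "From" and "Date"; the value is the
        # text node right after each tag, i.e. its next sibling.
        # (string= is the current name for the older text= argument.)
        for em in soup.find_all("em", string=re.compile("From|Date")):
            print(em.text, em.next_sibling)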

This gives you:

In [5]: title_spider(2)
Alcatel Lucent Home Device Manager - Management Console Multiple XSS
From : Uğur Cihan KOÇ <u.cihan.koc () gmail com>
Date : Sun, 3 Jan 2016 13:20:53 +0200
Executable installers/self-extractors are vulnerable^WEVIL  (case 17): Kaspersky Labs utilities
From : "Stefan Kanthak" <stefan.kanthak () nexgo de>
Date : Sun, 3 Jan 2016 16:12:50 +0100
Possible vulnerability in F5 BIG-IP LTM - Improper input validation of the HTTP version number of the HTTP reqest allows any payload size and conent to pass through
From : Eitan Caspi <eitanc () yahoo com>
Date : Sun, 3 Jan 2016 21:10:27 +0000 (UTC)
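
The sibling text printed above still carries the leading ": " and the obfuscated address, while the question asks for the sender's name only. Here is a minimal sketch of one way to split it, assuming the text always has the "Name <address>" shape seen in the output above. The helper `split_sender` and the pattern `SENDER_RE` are hypothetical names introduced for illustration; they are not part of BeautifulSoup.

```python
import re

# Hypothetical helper, not part of BeautifulSoup. The pattern assumes the
# sibling text looks like ': Name <user () host tld>', as in the output above.
SENDER_RE = re.compile(r'^\s*:?\s*(?P<name>.*?)\s*<(?P<addr>[^>]*)>\s*$')

def split_sender(sibling_text):
    """Return (name, address) from the text that follows the "From" <em> tag."""
    text = str(sibling_text)
    match = SENDER_RE.match(text)
    if match is None:
        # No angle-bracketed address: treat everything after the colon as the name.
        return text.strip().lstrip(':').strip(), None
    # Some senders quote their display name; drop the surrounding quotes.
    name = match.group('name').strip('"')
    return name, match.group('addr')
```

Inside the loop this could be called as `name, addr = split_sender(em.next_sibling)` whenever `em.text` is "From". The address stays in the obfuscated form the site serves it in.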