Cannot scrape protected email from website

Question

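I want to scrape emails from this website, but they are protected. The addresses are visible on the site, but when I scrape them I get an encoded form instead of the real address.

I tried scraping but got this result: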

<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8"><span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">[email protected]</span></a>

My code:

from bs4 import BeautifulSoup as bs
import requests
import re


# Assumed value: the original snippet used `header` without defining it.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
# The "$" of the jQuery call has to be escaped so it matches literally.
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
headlines = [item['href'] for item in soup.select('a.headlinelink')]

for head in headlines:
    response2 = requests.get(head, headers=headers)
    soup2 = bs(response2.content, 'html.parser')

    print(soup2.select("a"))

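I want the emails in the article body, for example: theramedhealthcorp@gmail.com. That email is from this page: https://www.accesswire.com/546295/Theramed-Provides-Update-on-New-Sales-Channel-for-Nevada-Facility. But the email is protected. How can I scrape it as text, like a real email address? Thanks.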

Answer 1

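I tried your code first, and I also got [email protected].

Then I realized the site is probably filling in that data with JavaScript.

You can use Selenium or any lightweight browser to get the job done.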
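For the Selenium route, a minimal sketch might look like the following. It assumes Chrome and a matching driver are installed, and that the page's scripts have replaced the placeholders by the time the page finishes loading. The names emails_in_text and rendered_emails are hypothetical helpers, not part of Selenium.

```python
import re

# A deliberately simple pattern for email-like strings; not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_in_text(text):
    """Return the unique email-like strings in text, in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


def rendered_emails(url):
    """Load url in a real browser so page scripts run, then scan the visible text."""
    # Imported here so emails_in_text stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumption: Chrome plus a matching driver
    try:
        driver.get(url)
        body_text = driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()
    return emails_in_text(body_text)
```

Calling rendered_emails on one of the article URLs should then return the addresses as plain strings, provided the scripts have run before the text is read.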

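I used the PyQt5 library to open the page, because it loads the page in a JavaScript-enabled browser. I then took the rendered source from it and ran the usual BeautifulSoup code.

Prerequisite install commands (if you are a Windows user):

Install PyQt5: pip install pyqt5

The PyQt5 Windows distribution does not include PyQtWebEngine, so it has to be installed separately:

pip install PyQtWebEngine

To learn how to render JavaScript-based pages with PyQt4, I watched SentDex's video here: https://www.youtube.com/watch?v=FSH77vnOGqU

But that video targets PyQt4. To move from PyQt4 to PyQt5, this Stack Overflow answer helped me: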

My code:

import requests
import re
from bs4 import BeautifulSoup as bs

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Client(QWebEnginePage):
    def __init__(self,url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)

        self.html=""
        self.loadFinished.connect(self.on_page_load)

        self.load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        # toHtml() is asynchronous: it returns None and hands the HTML to the callback.
        self.toHtml(self.Callable)

    def Callable(self,html_str):
        print("In Callable \n \t HTML_STR: ",len(html_str))
        self.html=html_str
        print("In Callable \n \t HTML_STR: ",len(self.html))
        self.app.quit()

url="https://www.accesswire.com/546227/InterRent-Announces-Voting-Results-from-the-2019-Annual-and-Special-Meeting"

client_response= Client(url)

soup = bs(client_response.html, 'html.parser')
# Take the last table on the page and print the text of each link in it.
table = soup.find_all('table')[-1]
for a in table.find_all('a'):
    print(a.text)

Output:

mmcgahan@interrentreit.com
bcutsey@interrentreit.com
cmillar@interrentreit.com
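
A browser may not be strictly necessary if the markup is Cloudflare's email obfuscation, which the /cdn-cgi/l/email-protection link and the data-cfemail attribute in the question suggest. Under that assumption, the hex string is commonly decoded by treating its first byte as an XOR key for the remaining bytes. The sketch below does this with the standard library only; decode_cfemail and protected_emails are hypothetical names, not part of any library.

```python
import re

# Matches the hex payload of a data-cfemail attribute in raw HTML.
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(encoded):
    """Decode a data-cfemail hex string.

    Assumed scheme: the first byte is an XOR key and every following
    byte is one character of the address XORed with that key.
    """
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def protected_emails(html):
    """Return the decoded address for every data-cfemail attribute in html."""
    return [decode_cfemail(value) for value in CFEMAIL_RE.findall(html)]


# The snippet from the question, as returned by requests.
snippet = (
    '<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8">'
    '<span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">'
    '[email protected]</span></a>'
)
print(protected_emails(snippet))  # ['rns@lseg.com']
```

This works on response2.text directly, so it avoids starting a browser for each headline, but it relies on the encoding scheme staying as assumed.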
