How to extract a table element from the HTML source of a web page using Selenium PhantomJS in Python 3?

Question

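I am working on a web crawler project that takes two dates as input (e.g. 2019-03-01 and 2019-03-05) and appends each day between those two dates to a base link (e.g. base link + date gives https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/2019-1-3). I want to extract the table with the class name "tablesaw-sortable" from the page source and save it to a text file or any similar file format.

I developed this code: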

from datetime import timedelta, date
from bs4 import BeautifulSoup
import urllib.request
from selenium import webdriver

class webcrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        return [str(date1 + timedelta(n)) for n in range(int ((self.end_date - self.st_date).days)+1)]

    def create_link(self, attachment):
        url = str(self.base_url) 
        url += attachment
        return url

    def open_link(self, link):
        driver = webdriver.PhantomJS()
        driver.get(link)
        html = driver.page_source
        return html

    def extract_table(self, html):
        soup = BeautifulSoup(html)
        print(soup.prettify())

    def output_to_csv(self):
        pass

date1 = date(2018, 3, 1)
date2 = date(2019, 3, 5)

test = webcrawler(st_date=date1, end_date=date2)
date_list = test.date_list()
link = test.create_link(date_list[0])
html = test.open_link(link)
test.extract_table(html)

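The problem is that it takes a very long time to get the page_source for a single link. I have already tried urllib.request, but the problem with that approach is that it sometimes fetches the HTML content without waiting for the table to load completely.

How can I speed this up, so that I extract only the table mentioned above and access its HTML source without waiting for the rest of the page? I just want to save the information in the table rows to a text file for each date.

Can anyone help me with this?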

Answer 1

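There are quite a few glaring errors in this code and in the way you are using these libraries. Let me try to fix them.

First, I don't see you using the urllib.request library. You can remove it, or, if you use it elsewhere in your code, I recommend the highly regarded requests module instead. If you only want to get the HTML source from a site, I would also recommend the requests library rather than selenium, since selenium is better suited to navigating a site and acting like a 'real' person.

You can use response = requests.get('https://your.url.here') and then response.text to get the returned HTML.

Next, I noticed that in the open_link() method you create a new instance of the PhantomJS class every time the method is called. This is really inefficient, because selenium uses a lot of resources (and takes a long time, depending on the driver you are using). It is probably a major reason your code runs slower than expected. You should reuse the driver instance as much as possible, since selenium was designed to be used that way. A good solution is to create the driver instance in the webcrawler.__init__() method.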

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.driver = webdriver.PhantomJS()
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        self.driver.get(link)
        html = self.driver.page_source
        return html

# Alternatively using the requests library

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html
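
The driver-reuse idea above can be sketched a little further so that the browser is also shut down once at the end. This is only a minimal sketch: PageFetcher and fetch_all are hypothetical names, and `driver` is assumed to be an already-started Selenium WebDriver (for example the webdriver.PhantomJS() instance from the snippet above). Nothing here is specific to PhantomJS.

```python
class PageFetcher:
    """Fetch page sources through one shared WebDriver instance."""

    def __init__(self, driver):
        # Assumption: `driver` is an already-started Selenium WebDriver.
        self.driver = driver

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Shut the browser down exactly once, even if a fetch raised.
        self.driver.quit()

    def fetch(self, url):
        self.driver.get(url)
        return self.driver.page_source


def fetch_all(driver, urls):
    """Return {url: html} for every URL, reusing the same driver."""
    with PageFetcher(driver) as fetcher:
        return {url: fetcher.fetch(url) for url in urls}
```

With selenium installed, this could be called as fetch_all(webdriver.PhantomJS(), links). If the table is filled in by JavaScript after the page loads, Selenium's WebDriverWait together with expected_conditions.presence_of_element_located can be used to block until the table element exists before reading page_source.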

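Side note: for class names you should use CamelCase rather than lowercase letters. This is only a suggestion, but the original creator of Python wrote PEP8 to define a common style guide for Python code. See the "Class Names" section of PEP8.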

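Another odd thing I noticed is that you are converting a string to... a string. You do this at url = str(self.base_url). It doesn't hurt anything, but it doesn't help either. I couldn't find any resources/links on it, but I suspect it costs the interpreter a little extra time. Since speed is a concern, I recommend just using url = self.base_url, because the base url is already a string.

I see that you are formatting and building the url by hand, but if you want more control and fewer errors, take a look at the furl library.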

def create_link(self, attachment):
    f = furl(self.base_url)

    # The '/=' operator means append to the end, docs: https://github.com/gruns/furl/blob/master/API.md#path
    f.path /= attachment

    # Cleanup and remove invalid characters in the url
    f.path.normalize()

    return f.url  # returns the url as a string
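
If you would rather not add a dependency, something similar can be done with the standard library. The sketch below swaps furl for urllib.parse, so it is not the furl approach itself. build_daily_url is a hypothetical helper name, and it assumes the attachment is a single path segment.

```python
from urllib.parse import quote, urljoin

BASE_URL = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'


def build_daily_url(base_url, attachment):
    """Append one path segment to base_url (hypothetical helper)."""
    # urljoin only appends when the base ends with '/', so enforce that.
    if not base_url.endswith('/'):
        base_url += '/'
    # Percent-encode anything that is not safe inside a path segment.
    return urljoin(base_url, quote(attachment, safe=''))


print(build_daily_url(BASE_URL, '2019-3-1'))
```

Unlike furl, this does no normalization of the existing path. It only guarantees that the new segment is encoded and separated by exactly one slash.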

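Another potential problem is that the extract_table() method doesn't extract anything; it simply formats the HTML in a human-readable way. I won't go into depth on this, but I suggest learning CSS selectors or XPath selectors so you can easily extract data from HTML.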
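
As a rough illustration of the CSS-selector route, here is a minimal sketch using BeautifulSoup's select_one. The sample markup is made up for the example and the real page may be structured differently. extract_table_rows is a hypothetical name, and the only detail taken from the question is the "tablesaw-sortable" class.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
<table class="tablesaw-sortable">
  <thead><tr><th>Time</th><th>Temperature</th></tr></thead>
  <tbody>
    <tr><td>12:00 AM</td><td>50 F</td></tr>
    <tr><td>12:30 AM</td><td>48 F</td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_table_rows(html, css_class='tablesaw-sortable'):
    """Return the table's rows as lists of cell strings, or [] if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    # CSS selector: the first <table> element carrying the given class.
    table = soup.select_one('table.' + css_class)
    if table is None:
        return []
    rows = []
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    return rows


print(extract_table_rows(SAMPLE_HTML))
```

Returning an empty list when the table is missing makes it easy to detect pages where the table had not been rendered yet.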

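In the date_list() method, you are using the date1 variable, which is not defined anywhere in the method or the class (it only works because a module-level date1 happens to exist when the script runs); it should be self.st_date. I would break up the list comprehension there and expand it over a few lines, so that it is easy to read and understand what it is trying to do.

Below is the complete, refactored, suggested code.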
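
To make the date logic easy to check on its own, it can be pulled out into a standalone function. This is just a sketch: date_strings and its zero_pad flag are hypothetical. The flag exists because the example URL in the question uses an unpadded date (2019-1-3), and I am only assuming the site accepts either form.

```python
from datetime import date, timedelta


def date_strings(start, end, zero_pad=True):
    """List every day from start to end (inclusive) as a string."""
    total_days = (end - start).days + 1
    result = []
    for offset in range(total_days):
        day = start + timedelta(days=offset)
        if zero_pad:
            result.append(day.isoformat())  # e.g. 2019-03-01
        else:
            # Unpadded form, matching the example URL in the question.
            result.append('{}-{}-{}'.format(day.year, day.month, day.day))
    return result


print(date_strings(date(2019, 3, 1), date(2019, 3, 3)))
print(date_strings(date(2019, 3, 1), date(2019, 3, 3), zero_pad=False))
```

If end is earlier than start, the range is empty and the function returns an empty list rather than raising.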

from datetime import timedelta, date
from bs4 import BeautifulSoup
import requests
from furl import furl

class WebCrawler():
    def __init__(self, st_date, end_date):
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)

        for i in range(total_days):
            # Named 'day' so it does not shadow the imported date class
            day = self.st_date + timedelta(days=i)
            dates.append(day.strftime('%Y-%m-%d'))

        return dates

    def create_link(self, attachment):
        f = furl(self.base_url)

        # The '/=' operator means append to the end, docs: https://github.com/gruns/furl/blob/master/API.md#path
        f.path /= attachment

        # Cleanup and remove invalid characters in the url
        f.path.normalize()        

        return f.url  # returns the url as a string

    def open_link(self, link):
        response = requests.get(link)
        html = response.text
        return html

    def extract_table(self, html):
        # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())

    def output_to_csv(self):
        pass

date1 = date(2018, 3, 1)
date2 = date(2019, 3, 5)

test = WebCrawler(st_date=date1, end_date=date2)
date_list = test.date_list()
link = test.create_link(date_list[0])
html = test.open_link(link)
test.extract_table(html)
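
The output_to_csv() method is left as a stub above. One possible way to fill it in, assuming the table has already been reduced to a list of rows (each row a list of cell strings), is the csv module from the standard library. save_rows_as_csv and the one-file-per-date naming are my own assumptions, not something from the question.

```python
import csv
import os


def save_rows_as_csv(rows, day_string, directory='.'):
    """Write one day's table rows to <directory>/<day_string>.csv."""
    path = os.path.join(directory, day_string + '.csv')
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        csv.writer(handle).writerows(rows)
    return path
```

Calling save_rows_as_csv(rows, '2019-03-01') would then write 2019-03-01.csv in the current directory, one file per date as asked in the question.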

urllib

python-3.x

selenium-webdriver