Scrapy 和 Selenium:从文本文件加载起始 URL 不起作用

Scrapy & Selenium: Load starturls from text file is not working

我已经阅读了有关我的问题的不同文章，但它仍然无效。基本上，我使用 Scrapy 和 Selenium 来抓取网站。此网站的 URL 当前保存在一个文本文件中。该文本文件只有一列，每一行包含一个 URL。

我仍然收到一条错误消息:selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string

这是我当前的代码:

class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']   

    def start_requests(self):
        # NOTE(review): `f` is opened but never used -- the comprehension
        # below re-opens the same file (and never closes that second handle).
        with open ("urls.txt", "rt") as f:
            start_urls = [l.strip() for l in open('urls.txt').readlines()]
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        # BUG: start_urls is a *list*, but WebDriver.get() expects a single
        # URL string -- this is the source of the InvalidArgumentException
        # ("'url' must be a string") quoted above.
        self.driver.get(start_urls)
        self.driver.find_element_by_id('currentTab').click()

[已更新]

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import re
import csv

class AlltipsSpider(Spider):
    """Visit each URL from urls.txt with Selenium-driven Chrome and scrape
    the feed posts from blogabet.com pages."""

    name = 'alltips'
    allowed_domains = ['blogabet.com']

    def start_requests(self):
        """Read start URLs (one per line) from urls.txt, open each in the
        browser, click through to the feed tab, and yield a Request whose
        response is parsed by :meth:`crawltips`."""
        # Create the driver exactly once -- the previous version instantiated
        # webdriver.Chrome twice, leaking an unused Chrome process.
        # Raw string keeps the backslashes in the Windows path from being
        # interpreted as escape sequences.
        self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')

        # One URL per line; skip blank lines so driver.get('') never happens.
        with open("urls.txt", "rt") as f:
            start_urls = [line.strip() for line in f if line.strip()]

        for url in start_urls:
            # driver.get() takes a single URL string, so iterate the list.
            self.driver.get(url)

            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 5 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            yield Request(self.driver.current_url, callback=self.crawltips)

    def crawltips(self, response):
        """Extract username and publish date from every feed post rendered
        on the current Selenium page."""
        # Parse the JS-rendered DOM from Selenium, not the raw `response`.
        sel = Selector(text=self.driver.page_source)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')

        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()

            yield {'Username': username,
                   'Publish date': publish_date
                   }

start_urls 是一个列表（list），而不是字符串（str），你需要对它进行迭代。另外，你也不需要把文件打开两次。

def start_requests(self):
    """Load one URL per line from urls.txt and visit each with Selenium."""
    with open("urls.txt", "rt") as f:
        start_urls = [line.strip() for line in f]

    # Raw string: without it, '\w' in the Windows path is an invalid escape
    # sequence (DeprecationWarning in modern Python).
    self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
    # driver.get() accepts a single URL string, so iterate the list.
    for url in start_urls:
        self.driver.get(url)
        self.driver.find_element_by_id('currentTab').click()