Python Scrapy: crawling the links but not scraping

Question

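I am new to the Scrapy framework and am trying to learn web scraping. I have a txt file containing links to a website's pages. I read those links into a list and store them in start_urls, but the parse function does not work: it does not scrape anything.

Here is the code: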

try:
    import scrapy
except ImportError:
    print("\nERROR IMPORTING THE NECESSARY LIBRARIES\n")

# File with all the links
crimefile = open('links.txt', 'r')
# Make a list with all the links
yourResult = [line for line in crimefile.readlines()]

class SpiderMan(scrapy.Spider):
    name = 'man spider'

    # Set start_urls to that list
    start_urls = yourResult

    def parse(self, response):
        SET_SELECTOR = '.c411Listing.jsResultsList'
        for man in response.css(SET_SELECTOR):
            name = '.c411ListedName a ::text'
            address = '.adr ::text'
            phone = '.c411Phone ::text'
            yield {
                'NAME': man.css(name).extract_first(),
                'ADDRESS': man.css(address).extract_first(),
                'PHONE': man.css(phone).extract_first(),
            }

In the output, every link is being crawled, but for some reason the parse function is not working.

What am I doing wrong in this simple code?

Answer 1

The problem is that your URLs end with "%0D%0A". If you paste one of the URLs from the Scrapy log into a browser, you will see a page that says:

"Postal code entered is of wrong format."

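"%0D%0A" is the percent-encoded form of the line break (carriage return plus line feed) at the end of each line in your URL file. readlines() keeps that line ending on every line it returns, so it ends up as part of each URL. Remove the line endings and you will be fine.

Easy fix: add a call to strip():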
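To see where "%0D%0A" comes from, here is a minimal sketch using the standard library's urllib.parse.quote. It only illustrates percent-encoding and is not Scrapy's own code path. The example.com URL and the variable names are made up for the illustration.

```python
from urllib.parse import quote

# A line as it might come out of a file with Windows (CRLF) line endings.
# The URL is a hypothetical placeholder, not one from the question.
raw_line = "http://example.com/search?q=abc\r\n"

# Percent-encode the line while leaving common URL punctuation alone.
encoded = quote(raw_line, safe=":/?=&")
print(encoded)

# Stripping the whitespace first leaves nothing to encode at the end.
cleaned = quote(raw_line.strip(), safe=":/?=&")
print(cleaned)
```

The first value ends in "%0D%0A" because "\r" is byte 0x0D and "\n" is byte 0x0A. The second is the URL you actually meant to request.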

yourResult = [line.strip() for line in crimefile.readlines()]
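If you want something a bit more defensive, the same idea can be wrapped in a small helper. This is only a sketch: load_urls is a hypothetical name, not part of Scrapy, and it assumes the file holds one URL per line. It closes the file when done and skips blank lines, which would otherwise become empty entries in start_urls.

```python
def load_urls(path):
    """Return the non-empty lines of a text file with surrounding whitespace removed."""
    urls = []
    # The with-block closes the file even if reading fails partway through.
    with open(path, "r") as handle:
        for line in handle:
            url = line.strip()  # removes "\r", "\n" and stray spaces
            if url:             # skip blank lines
                urls.append(url)
    return urls
```

In the spider you would then write start_urls = load_urls('links.txt') instead of building the list at module level from an open file handle.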
