Python Scrapy: Parse an extracted link with another function
I'm new to Scrapy and I'm scraping Yellow Pages as a learning exercise. Everything works, but I also want the email addresses. To get them, I need to visit the links extracted inside parse and parse each one with a separate parse_email function, but that doesn't work.
I mean, I tested the parse_email function and it works on its own, but not from inside the main parse function. I want parse_email to receive the page source behind each link, so I call it via a callback, but it only returns links like this: <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813>
It should return the email. For some reason the parse_email function isn't being run; it just returns the link without opening the page.
Here is my code with comments:
import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')

class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'

            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'
            # extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()
            # joining and making the complete url
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())

            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),
                # ONLY returning the link of the page, not calling the function
                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):
        # xpath for the email address in the nested page
        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'
        # returning the extracted email; the XPath WORKS, I CHECKED, but the function is not being called for some reason
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }
I don't know what I'm doing wrong.
You are yielding a dict with a Request inside it. Scrapy will not dispatch that request, because it doesn't know it is there (requests are not sent automatically after creation); you need to yield the actual Request.
For parse_email to "remember" which item each email belongs to, you need to pass the rest of the item data along with the request. You can do that with the meta argument.
Example:
In parse:
yield scrapy.Request(url, callback=self.parse_email, meta={'item': {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}})
In parse_email:
item = response.meta['item'] # The item this email belongs to
item['email'] = response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
return item