Trying to scrape URL links and names with Scrapy (Python)

Question

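I am trying to scrape the list of company names placed inside anchor tags, together with the link given in each anchor tag, from the directory page (the start URL in the spider below):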

<a content="https://www.adapt.io/company/a-a-technology-group" 
href="https://www.adapt.io/company/a-a-technology-group">A&amp;A Technology 
Group</a>

For example: company_name = A&A Technology Group,

source_url = https://www.adapt.io/company/a-a-technology-group

Can anyone tell me how to extract the URL and the company name?

import scrapy

class CompanySpider(scrapy.Spider):
    name = 'company'
    start_urls = [
        'https://www.adapt.io/directory/industry/telecommunications/A-1'
    ]

    def parse(self,response):
        all_div_company = response.css('div.DirectoryList_linkItemWrapper__3F2UE ')
        company_name = all_div_company.xpath('a/text').extract()
        source_url = all_div_company.xpath('a/@href').extract()

        yield{
            'company_name' : company_name,
            'source_url' : source_url
        }

Answer 1

I think you are trying to scrape all the companies one by one.

import scrapy

class CompanySpider(scrapy.Spider):
    name = 'company'
    start_urls = ['https://www.adapt.io/directory/industry/telecommunications/A-1']
    # Per-spider overrides: ignore robots.txt and turn off the telnet console
    custom_settings = {"TELNETCONSOLE_ENABLED": False, "ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Each matching div wraps one company link, so yield one item per div
        for company in response.xpath("//div[contains(@class,'DirectoryList_link')]"):
            yield {
                'company_name': company.xpath("./a/text()").get(),
                'source_url': company.xpath("./a/@href").get(),
            }

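I wrote the script for you. You have to extract the fields inside a for loop, one div at a time, so that each company name stays paired with its own URL; your version collects all names and all URLs into two separate lists and yields a single item. Note also that a/text needs parentheses, a/text(), to select the text node. Finally, the website does not allow scraping, so you have to set ROBOTSTXT_OBEY = False in the settings (done above through custom_settings).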
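
When ROBOTSTXT_OBEY is enabled, Scrapy downloads the site's robots.txt and drops requests that it disallows. How such a rule blocks the start URL can be illustrated with the standard library's urllib.robotparser. This is a rough sketch, not Scrapy's own code path: the robots.txt content, the user agent name company-bot, and the helper is_allowed are assumptions made up for the example, and the real file served by the site may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    start_url = "https://www.adapt.io/directory/industry/telecommunications/A-1"
    # Under the sample rules the directory page is disallowed
    print(is_allowed(SAMPLE_ROBOTS_TXT, "company-bot", start_url))
```

With rules like these, a spider that obeys robots.txt would never reach parse for the start URL, which is why the setting has to be switched off for the spider to yield anything.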
