Trying to scrape URL links and names with Scrapy (Python)
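I am trying to scrape, from here, the list of company names placed inside anchor tags, together with the link given in each anchor tag:
<a content="https://www.adapt.io/company/a-a-technology-group" href="https://www.adapt.io/company/a-a-technology-group">A&A Technology Group</a>
For example: company_name = A&A Technology Group,
source_url = https://www.adapt.io/company/a-a-technology-group
Can anyone tell me how to extract the URL and the company name?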
import scrapy

class CompanySpider(scrapy.Spider):
    name = 'company'
    start_urls = [
        'https://www.adapt.io/directory/industry/telecommunications/A-1'
    ]

    def parse(self, response):
        all_div_company = response.css('div.DirectoryList_linkItemWrapper__3F2UE ')
        company_name = all_div_company.xpath('a/text').extract()
        source_url = all_div_company.xpath('a/@href').extract()
        yield {
            'company_name': company_name,
            'source_url': source_url
        }
I think you are trying to scrape all the companies one by one.
import scrapy

class CompanySpider(scrapy.Spider):
    name = 'company'
    start_urls = ['https://www.adapt.io/directory/industry/telecommunications/A-1']
    # Per-spider settings: do not honor robots.txt, and disable the telnet console.
    custom_settings = {"TELNETCONSOLE_ENABLED": False, "ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Match each wrapper div by a class prefix, then read its anchor.
        for company in response.xpath("//div[contains(@class,'DirectoryList_link')]"):
            # One item per company; text() (with parentheses) selects the link text.
            yield {
                'company_name': company.xpath("./a/text()").get(),
                'source_url': company.xpath("./a/@href").get()
            }
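I wrote the script for you. You have to extract the fields inside a for loop, one company at a time. The site's robots.txt also disallows crawling, so you have to set ROBOTSTXT_OBEY = False in the settings.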
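To illustrate why the loop matters: calling .extract() on the whole selection, as in the question, gives two parallel lists, so the spider yields a single item holding every name and every URL. Below is a minimal sketch in plain Python, not part of the answer's spider, of how such lists could be paired into one record per company. The helper name pair_companies and the second company in the sample data are hypothetical, and the sketch assumes each wrapper div contains exactly one anchor with text.

```python
def pair_companies(names, urls):
    """Pair parallel lists of names and URLs into one dict per company.

    Hypothetical helper for illustration only. zip() stops at the shorter
    list, so a div with a missing name or href would silently misalign
    the pairs.
    """
    return [
        {"company_name": name.strip(), "source_url": url}
        for name, url in zip(names, urls)
    ]


# Sample data shaped like the output of .extract(); the second entry is made up.
names = ["A&A Technology Group", " Example Telecom "]
urls = [
    "https://www.adapt.io/company/a-a-technology-group",
    "https://www.adapt.io/company/example-telecom",
]

items = pair_companies(names, urls)
for item in items:
    print(item)
```

Iterating over the wrapper divs, as the answer does, avoids the alignment problem altogether, because the name and the URL are read from the same element.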