How do I get Python Scrapy to extract the domains of all external links from a web page?

Question

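I'd like the loop to check every link and output it only if it goes to an external domain. At the moment it outputs all links, internal and external. What have I messed up? (For testing, I've tweaked the code to work on a single page only, without crawling the rest of the site.)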

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class MySpider(CrawlSpider):
    name = 'crawlspider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/BBC_News']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title']=response.xpath('//title').extract_first()
        for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
            item['links']=response.xpath('//a/@href').extract()
        return item

Answer 1

The logic in your parse_item method doesn't look quite right:

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title']=response.xpath('//title').extract_first()
    for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
        item['links']=response.xpath('//a/@href').extract()
    return item

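You are looping over each link from the extractor, but then always setting item["links"] to exactly the same thing (all the links on the response page, from the XPath query). I expect you are trying to set item["links"] to the links returned by the LinkExtractor? If so, you should change the method to: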

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title'] = response.xpath('//title').extract_first()
    links = [link.url for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)]
    item['links'] = links
    return item
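
One caveat worth checking: LinkExtractor's deny argument takes regular expressions that are matched against the whole URL, whereas its deny_domains argument compares domains. The sketch below uses only the standard library to illustrate the difference; denied_by_regex, denied_by_domain and the archive.example URL are hypothetical names made up for the illustration, and the functions only approximate how Scrapy applies each option.

```python
import re
from urllib.parse import urlparse

ALLOWED = "en.wikipedia.org"


def denied_by_regex(url, pattern=ALLOWED):
    """Regex-style deny: the pattern may match anywhere in the URL."""
    return re.search(pattern, url) is not None


def denied_by_domain(url, domain=ALLOWED):
    """Domain-style deny: only the host is compared (exact or subdomain)."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


internal = "https://en.wikipedia.org/wiki/BBC"
# An external page whose query string happens to contain the internal domain.
external = "https://archive.example/save?url=https://en.wikipedia.org/wiki/BBC"

for url in (internal, external):
    print(url, denied_by_regex(url), denied_by_domain(url))
```

If that kind of false match matters for your pages, passing deny_domains=self.allowed_domains instead of deny=... may be the safer choice.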

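If you really only want the domains, you can use urlparse from urllib.parse to get the netloc of each link. You probably also want to remove duplicates by using a set. Your parse method would then become (the import is best placed at the top of your file):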

def parse_item(self, response):
    from urllib.parse import urlparse
    item = dict()
    item["url"] = response.url
    item["title"] = response.xpath("//title").extract_first()
    item["links"] = {
        urlparse(link.url).netloc
        for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)
    }
    return item
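
To check the domain-filtering logic without running a crawl, here is a minimal standalone sketch using only the standard library. The function name external_domains and the sample hrefs are assumptions made for illustration, not part of Scrapy. It uses hostname rather than netloc, so any port number is dropped, and it treats subdomains of an allowed domain as internal.

```python
from urllib.parse import urljoin, urlparse


def external_domains(page_url, hrefs, allowed_domains):
    """Return the set of hosts linked from page_url that are not allowed domains."""
    domains = set()
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        parsed = urlparse(urljoin(page_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        host = (parsed.hostname or "").lower()
        if not host:
            continue
        is_internal = any(
            host == d or host.endswith("." + d) for d in allowed_domains
        )
        if not is_internal:
            domains.add(host)
    return domains


# Hypothetical sample of href values, as response.xpath('//a/@href') might return.
hrefs = [
    "/wiki/BBC",
    "https://www.bbc.co.uk/news",
    "//www.bbc.co.uk/sport",
    "mailto:someone@example.com",
    "#cite_note-1",
    "https://en.wikipedia.org/wiki/Main_Page",
]
result = external_domains(
    "https://en.wikipedia.org/wiki/BBC_News", hrefs, ["en.wikipedia.org"]
)
print(sorted(result))
```

Inside the spider, the same function could be fed the href list from the response, with response.url as page_url.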

python

scrapy