How can I obtain a domain name with Scrapy?

Question

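I know that in an HTML page there is a JavaScript statement, var x = document.domain;, that gets the domain. How can I do the same in Scrapy so that I can obtain the domain name?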

Answer 1

You can extract the domain name from response.url:

from urlparse import urlparse

def parse(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print domain

Answer 2

For Python 3, two very small changes are needed: the 'from' import line and 'print'. alecxe's answer is suitable for Python 2.

Also, for Scrapy's CrawlSpider, rename 'parse' above to something else, because CrawlSpider itself uses 'parse'.

from urllib.parse import urlparse

def get_domain(self, response):
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print(domain)
    return domain

Then you can use it, as in the OP's example:

x = self.get_domain(response)  # call it from inside a spider callback
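To try the helper outside a running spider, here is a minimal sketch. It assumes only that the object passed in has a .url attribute, as a Scrapy response does. The fake_response object and its URL are made up for illustration and are not part of Scrapy.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def get_domain(response):
    """Return 'scheme://netloc/' for anything that has a .url attribute."""
    parsed_uri = urlparse(response.url)
    return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)


# Hypothetical stand-in for a Scrapy response; only .url is needed here.
fake_response = SimpleNamespace(url='https://docs.example.com/intro/page.html?x=1')
domain = get_domain(fake_response)
print(domain)  # https://docs.example.com/
```

Inside a spider the same function body works as a method; the stand-in object is only there so the snippet runs without Scrapy installed.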

Or, in my case, I wanted to pass the domain to the allow_domains argument of the LinkExtractor in a Rule of Scrapy's CrawlSpider. Phew. This restricts the crawl to that domain.

rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            strip=True,
            # One-element tuple; note the trailing comma.
            allow_domains=(domain,)
        ),
        follow=True,
        callback="someparser"
    )
]
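One caveat, as far as I know: allow_domains is matched against host names such as 'example.com', so a value that still carries the scheme and a trailing slash may not match anything. The sketch below derives a bare host name with the standard library only. The names bare_host and start_url are hypothetical and used purely for illustration.

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return only the host part of a URL, e.g. 'example.com'."""
    # .hostname drops the scheme, any port, and any user:password prefix,
    # and lower-cases the result.
    return urlparse(url).hostname


start_url = 'https://Example.com:8443/catalog/'  # hypothetical start URL
allow_domains = (bare_host(start_url),)  # trailing comma makes a tuple
print(allow_domains)  # ('example.com',)
```

The resulting tuple could then be passed as allow_domains=allow_domains in the LinkExtractor call.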

Answer 3

Try:

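_rl = response.url
# 'http://example.com/a' splits into ['http:', '', 'example.com', 'a'],
# so index 2 holds the host part.
url = _rl.split("/")[2]

print(url)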
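For comparison, here is a small sketch of how the split approach and urlparse behave on a URL with a port, and on a string without a scheme. The URLs are made up. A Scrapy response.url should normally be absolute, so the second case matters mostly for URLs taken from elsewhere.

```python
from urllib.parse import urlparse

url = 'https://example.com:8080/path/page.html'  # made-up URL

by_split = url.split('/')[2]
by_parse = urlparse(url).hostname
print(by_split)  # example.com:8080 (the port is kept)
print(by_parse)  # example.com

# Without a scheme the split yields fewer than three parts.
no_scheme = 'example.com/path'
parts = no_scheme.split('/')
print(len(parts))  # 2, so parts[2] would raise IndexError
```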
