Scrapy crawler to return only URL and Referrer when crawling
I am very new to Scrapy, having only discovered it yesterday, and I have only basic Python experience.
I have a set of subdomains (around 200) that I need to map, including every internal and external link.
I think it is just the output side of things that I don't understand.
Here is what I have so far.
import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # follow all links on the page
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
It outputs to the terminal like this:
DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Crawled (200) <GET http://www.example.com/aaa/A-content-page> (referer: http://www.example.com/)
DEBUG: Crawled (200) <GET http://aaa.example.com/bbb/something/> (referer: http://www.example.com/)
What I want is CSV or TSV output like this:
URL Referer
http://www.example.com/ None
http://www.example.com/aaa/A-content-page http://www.example.com/
http://aaa.example.com/bbb/something/ http://www.example.com/
http://aaa.example.com/bbb/another/ http://aaa.example.com/bbb/something/
Any help is appreciated, though I would prefer pointers to documentation over a direct solution.
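(Documentation pointers, for reference: the two relevant pieces are Request.headers, for reading the Referer, and feed exports, for writing CSV/JSON; see https://docs.scrapy.org/en/latest/topics/feed-exports.html. If parse yields plain dicts alongside new requests, exporting becomes a one-line command; the file name here is arbitrary:

scrapy crawl links -o links.csv
)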
This is the solution I came up with.
def parse(self, response):
    filename = "output.tsv"
    # NOTE: opening with 'w' truncates the file, so this header is
    # rewritten (and earlier rows are lost) on every page parsed
    f = open(filename, 'w')
    f.write("URL\tLink\tReferer\n")
    f.close()
    # follow all links
    for href in response.css('a::attr(href)'):
        yield response.follow(href, self.parse)
    with open(filename, 'a') as f:
        url = response.url
        links = response.css('a::attr(href)').getall()
        # the Referer header is bytes, or None for the start URL
        referer = response.request.headers.get('Referer')
        referer = referer.decode('utf-8') if referer else 'None'
        for item in links:
            f.write("{0}\t{1}\t{2}\n".format(url, item, referer))
This is not 100% correct, but it is a good start.
def parse(self, response):
    filename = "output.tsv"
    # follow all links
    for href in response.css('a::attr(href)'):
        yield response.follow(href, self.parse)
    with open(filename, 'a') as f:
        links = response.css('a::attr(href)').getall()
        # decode the bytes header so the file does not contain b'...'
        referer = response.request.headers.get('Referer')
        referer = referer.decode('utf-8') if referer else 'None'
        for item in links:
            f.write("{0}\t{1}\n".format(item, referer))
You can simply get both URLs in parse:
referer = response.request.headers.get('Referer')
original_url = response.url
yield {'referer': referer, 'url': original_url}
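Combined with the link-following loop from the question, the whole callback might look like this (a sketch; the .decode('utf-8') guard is an addition, since Scrapy header values are bytes and the first request has no Referer):

def parse(self, response):
    referer = response.request.headers.get('Referer')
    yield {
        'url': response.url,
        'referer': referer.decode('utf-8') if referer else None,
    }
    # keep crawling: follow every link on the page
    for href in response.css('a::attr(href)'):
        yield response.follow(href, self.parse)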
You can write the output to a file with:

scrapy crawl spider_name -o bettybarclay.json
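Two details worth noting (additions, not from the original answer): the file extension selects the export format, so a .csv name produces the CSV requested above, and since Scrapy 2.0 the -o flag appends to an existing file while -O overwrites it:

scrapy crawl links -o links.csv   # append; CSV format chosen by extension
scrapy crawl links -O links.csv   # overwrite (Scrapy >= 2.0)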