Why am I getting 400 Bad Requests with these proxy servers?
So I'm pretty new to networking and using proxy servers. I have a scraper that scrapes certain websites, but I realized I need to change my IP address and so on so that I don't get booted from the site. I found the following program on GitHub that I'd like to use:
https://github.com/aivarsk/scrapy-proxies
I've implemented everything as follows:
Spider:
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from backpage_scrape import items
#from toolz import first
#import ipdb
#from lxml import html
from datetime import datetime, timedelta
import os
import time  # required for the time.sleep() calls below
HOME = os.environ['HOMEPATH']
os.chdir(HOME + "/Desktop/GitHub/Rover/backpage_scrape/backpage_scrape/spiders/")
# Method that gets today's date
def backpage_date_today():
now = datetime.utcnow() - timedelta(hours=4)
weekdays = ['Mon. ','Tue. ','Wed. ','Thu. ','Fri. ','Sat. ','Sun. ']
months = ['Jan. ','Feb. ','Mar. ','Apr. ','May. ', 'Jun. ','Jul. ','Aug. ','Sep. ','Oct. ','Nov. ','Dec. ']
backpage_date = weekdays[now.weekday()] + months[now.month-1] + str(now.day)
return backpage_date
# Method that gets yesterday's date
def backpage_date_yesterday():
now = datetime.utcnow() - timedelta(days=1, hours=4)
weekdays = ['Mon. ','Tue. ','Wed. ','Thu. ','Fri. ','Sat. ','Sun. ']
months = ['Jan. ','Feb. ','Mar. ','Apr. ','May. ', 'Jun. ','Jul. ','Aug. ','Sep. ','Oct. ','Nov. ','Dec. ']
backpage_date = weekdays[now.weekday()] + months[now.month-1] + str(now.day)
return backpage_date
# Open file which contains input urls
with open("test_urls.txt","rU") as infile:
urls = [row.strip("\n") for row in infile]
class BackpageSpider(CrawlSpider):
name = 'backpage'
allowed_domains = ['backpage.com']
start_urls = urls
def parse(self,response):
if response.status < 600:
todays_links = []
backpage_date = backpage_date_today()
yesterday_date = backpage_date_yesterday()
if backpage_date in response.body:
# Get all URLs to iterate through
todays_links = response.xpath("//div[@class='date'][1]/following-sibling::div[@class='date'][1]/preceding-sibling::div[preceding-sibling::div[@class='date']][contains(@class, 'cat')]/a/@href").extract()
# timeOut = 0
for url in todays_links:
# Iterate through pages and scrape
# if timeOut == 10:
# time.sleep(600)
# timeOut = 0
# else:
# timeOut += 1
yield scrapy.Request(url,callback=self.parse_ad_into_content)
for url in set(response.xpath('//a[@class="pagination next"]/@href').extract()):
yield scrapy.Request(url,callback=self.parse)
else:
time.sleep(600)
yield scrapy.Request(response.url,callback=self.parse)
# Parse page
def parse_ad_into_content(self,response):
item = items.BackpageScrapeItem(url=response.url,
backpage_id=response.url.split('.')[0].split('/')[2].encode('utf-8'),
text = response.body,
posting_body= response.xpath("//div[@class='postingBody']").extract()[0].encode('utf-8'),
date = datetime.utcnow()-timedelta(hours=5),
posted_date = response.xpath("//div[@class='adInfo']/text()").extract()[0].encode('utf-8'),
posted_age = response.xpath("//p[@class='metaInfoDisplay']/text()").extract()[0].encode('utf-8'),
posted_title = response.xpath("//div[@id='postingTitle']//h1/text()").extract()[0].encode('utf-8')
)
return item
Portion of settings.py:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
# Fix path to this module
'backpage_scrape.randomproxy.RandomProxy': 100,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = 'C:/Users/LPrice/Desktop/GitHub/Rover/backpage_scrape/backpage_scrape/proxies.txt'
randomproxy.py is exactly the same as the one at the GitHub link above.
Proxies.txt:
https://6.hidemyass.com/ip-4
https://5.hidemyass.com/ip-1
https://4.hidemyass.com/ip-1
https://4.hidemyass.com/ip-2
https://4.hidemyass.com/ip-3
https://3.hidemyass.com/ip-1
https://3.hidemyass.com/ip-2
https://3.hidemyass.com/ip-3
https://2.hidemyass.com/ip-1
https://2.hidemyass.com/ip-2
https://2.hidemyass.com/ip-3
https://1.hidemyass.com/ip-1
https://1.hidemyass.com/ip-2
https://1.hidemyass.com/ip-3
https://1.hidemyass.com/ip-4
https://1.hidemyass.com/ip-5
https://1.hidemyass.com/ip-6
https://1.hidemyass.com/ip-7
https://1.hidemyass.com/ip-8
So, if you look at the top of the GitHub README, you'll see it says to "copy-paste into text file and reformat to http://host:port format." I'm not sure how to do that, or whether it's already in that format.
As I said, my errors are 400 Bad Requests. I'm not sure whether it's useful, but the console shows:
Retrying <GET http://sf.backpage.com/restOfURL> <failed 10 times>: 400 Bad Request
Should the proxy appear in the URL above, before the "sf.backpage.com" part?
Thanks a lot for your time... any help is much appreciated.
Edit: Also, I'm not sure where/how to insert the code snippet shown at the bottom of the GitHub README. Any advice on that would be helpful too.
The URLs in your proxies.txt are not actually proxies.
Go to http://proxylist.hidemyass.com/ and search for proxies that use the HTTP protocol. Take the IP Address and Port columns from the search results and write them to your proxies.txt file in http://IP Address:Port format.
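For reference, a valid proxies.txt would look something like this (the addresses and ports below are made up, not real proxies):
http://203.0.113.45:8080
http://198.51.100.17:3128
http://192.0.2.77:80
And here is a minimal sketch of a helper that converts lines copied from the proxy-list page into that format, assuming you pasted them as one "IP Port" pair per line into a file named raw_proxies.txt (a name I made up for this example):
# convert_proxies.py -- illustrative only; adjust the filenames to your setup
with open("raw_proxies.txt") as infile, open("proxies.txt", "w") as outfile:
    for line in infile:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        ip, port = parts[0], parts[1]
        # write in the http://host:port format the proxy middleware expects
        outfile.write("http://%s:%s\n" % (ip, port))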