How to integrate IP cycling into my webscraping program? I keep getting blocked from Crunchbase
I wrote a program that uses Beautiful Soup to pull funding information for a list of companies from Crunchbase and export it to a CSV file. I even spaced my requests 30 seconds apart, and the program worked fine until today. Now I can't send a single request without getting HTTPError: Forbidden.
I've been reading up on this, and it seems people write IP-cycling routines for exactly this situation: Crunchbase appears to be blocking my IP address, since I still get blocked even when I rotate my user agent. I've also tried a couple of free VPNs, and I'm still blocked.
import urllib.request
from bs4 import BeautifulSoup
import csv
import time
import random
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
headers = {'User-Agent': user_agent, }
def scraper(url):
    return_list = []
    try:
        request = urllib.request.Request(url, None, headers)
        response = urllib.request.urlopen(request)
    except Exception:
        # Request failed (e.g. HTTP 403/404): record placeholders for both columns
        return_list.append("No Crunchbase Page Found")
        return_list.append("No Crunchbase Page Found")
        print("Not found")
    else:
        data = response.read()
        soup = BeautifulSoup(data, "html.parser")
        try:
            funding_status = soup.find_all("span", class_="component--field-formatter field-type-enum ng-star-inserted")[1].text
            return_list.append(funding_status)
        except Exception:
            return_list.append("N/A")
        try:
            last_funding_type = soup.find("a", class_="cb-link component--field-formatter field-type-enum ng-star-inserted").text
            # Only keep values that look like a real funding round
            funding_prefixes = ("Series", "Venture", "Seed", "Pre", "Angel", "Private",
                                "Debt", "Convertible", "Grant", "Corporate", "Equity",
                                "Product", "Secondary", "Post", "Non", "Initial", "Funding")
            if last_funding_type.startswith(funding_prefixes):
                return_list.append(last_funding_type)
            else:
                return_list.append("N/A")
        except Exception:
            return_list.append("N/A")
    return return_list
user_input = input("CSV File Name (e.g: myfile.csv): ")
user_input2 = input("New CSV file name (e.g: newfile.csv): ")
print()

scrape_file = open(user_input, "r", newline='', encoding="utf-8")
row_count = sum(1 for row in csv.reader(scrape_file))
scrape_file.seek(0)  # rewind instead of opening the file a second time

new_file = open(user_input2, "w", newline='', encoding="utf-8")
writer = csv.writer(new_file)
writer.writerow(["Company Name", "Description", "Website", "Founded",
                 "Product Name", "Country", "Funding Status", "Last Funding Type"])

count = 0
for row in csv.reader(scrape_file):
    company_name = row[0]
    if company_name == "Company Name":  # skip the header row
        continue
    count += 1
    print("Scraping company {} of {}".format(count, row_count))
    # Normalise the company name into a Crunchbase URL slug
    company_name = company_name.replace(",", "")
    company_name = company_name.replace("'", "")
    company_name = company_name.replace("-", " ")
    company_name = company_name.replace(".", " ")
    join_name = "-".join(company_name.lower().split())
    company_url = "https://www.crunchbase.com/organization/" + join_name
    result = scraper(company_url)  # call once so each company triggers a single request
    writer.writerow([row[0], row[1], row[2], row[3], row[4], row[5], result[0], result[1]])
    time.sleep(random.randint(30, 40))

scrape_file.close()
new_file.close()
print("Done! You can now open your file %s." % user_input2)
If anyone could point me in the right direction on how to integrate IP cycling into this project so that it sends requests from different IP addresses, I would really appreciate it! I don't want to pay for private proxies, but I've seen people do this with public addresses. Thanks!
If you want to receive a response, you need a proxy of some kind: your own (such as squidproxy), a paid private proxy, or a public one (or a VPN, as you mentioned). There is no other way around it. You could spoof your IP in the packets you send out to some fake address, but then you would never receive the response. If you do want to use proxies, I recommend the excellent requests library, since it is the go-to tool for many people doing web scraping and it makes working with proxies very easy. For example:
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',  # this could be a public proxy address
    'https': 'http://10.10.1.10:1080',
}
requests.get("https://www.google.com", proxies=proxies)
If you want to cycle through a list of public proxies, just loop over them and handle the exceptions, like this:
import requests
import logging
proxies = [{
    'http': 'http://10.10.1.10:3128',  # this could be a public proxy address
    'https': 'http://10.10.1.10:1080',
}, ...]  # one dict per proxy you want to try

for proxy in proxies:
    try:
        requests.get("https://www.google.com", proxies=proxy)  # use the current proxy, not the whole list
        break
    except Exception as e:
        logging.exception(e)
        continue
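To tie this back to your original program, here is a minimal sketch (not a drop-in implementation) of how the scraper() function could be switched from urllib to requests with a rotating proxy list. The proxy addresses are placeholders and fetch() is just a helper name made up for this example; the Beautiful Soup parsing you already wrote would stay the same:
import logging
import random
import requests
from bs4 import BeautifulSoup

# Placeholder proxy list -- fill in addresses you have verified yourself
proxy_list = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

def fetch(url):
    # Try the proxies in a random order; return the page HTML, or None if every proxy fails
    for proxy in random.sample(proxy_list, len(proxy_list)):
        try:
            response = requests.get(url, headers=headers, proxies=proxy, timeout=15)
            response.raise_for_status()  # turns 403/404 responses into exceptions
            return response.text
        except Exception as e:
            logging.exception(e)  # log the failure and move on to the next proxy
    return None

def scraper(url):
    return_list = []
    html = fetch(url)
    if html is None:
        return_list.append("No Crunchbase Page Found")
        return_list.append("No Crunchbase Page Found")
        return return_list
    soup = BeautifulSoup(html, "html.parser")
    # ... keep your existing funding_status / last_funding_type parsing here ...
    return return_list
Picking the proxies in a random order spreads the load across the list instead of always hammering the first entry, and raise_for_status() makes a 403 look like any other failed proxy, so the loop simply moves on to the next one.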