How to add a list to another list in order to extract data through data scraping
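I'm trying to scrape some data from a website. However, the data I'm interested in is stored on individual landing pages whose URL changes depending on the company name.
I first created a loop that scrapes all the company names from the "front page" and assigns them to a list, url_list: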
url_list = []
for page in range(1, 76):  # 94 is max; though I suspect you might get blocked by host
    req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    for span in soup.find_all(id='span-1117-390'):
        url_list.append(span.text)
url_list = [e.replace(" ", "-") for e in url_list]
url_list = [a.replace("&", "") for a in url_list]
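After that, I tried to build another list by using url_list as a tag, so that each company name is substituted into the target URL. However, I get an empty list, so something is wrong with my code: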
companyList = []
def getCompanies(url_list):
    url = f'https://proteindirectory.com/company/[url_list]'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
    companyName = soup.find_all('section', {'class': ' ct-section', 'id': 'section-2-1850'})
    for item in company or companyName:
        companies = {
            'name': item.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text,
            'primaryFocus': item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text,
            'location': item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text,
            'founded': item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text,
            'website': item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text,
            'businessModel': item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text,
            'proteinCategory': item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text,
            'ingredients': item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text,
            'endProductApplication': item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text,
        }
        companyList.append(companies)
    return
getCompanies(url_list)
print(companyList)
Hope someone can help a newbie :-)
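https://proteindirectory.com/company/[url_list] is not a valid address on the site. Also, you should look for the actual href in the <a> tags rather than trying to hard-code a URL pattern from the span-1117-390 elements you are pulling.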
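To illustrate why that address is not valid, here is a minimal sketch of what the f-string in the question actually produces. The company names and the `slugify` helper are hypothetical stand-ins for illustration; `slugify` only mirrors the two `replace` calls from the question.

```python
# Hypothetical sample of names as they might be scraped from the listing page.
names = ["New Barn Organics", "Pop & Bottle"]

def slugify(name):
    """Mirror the question's two replace() calls (illustrative helper)."""
    return name.replace(" ", "-").replace("&", "")

# Square brackets are plain text inside an f-string, so nothing is substituted:
literal = f"https://proteindirectory.com/company/[names]"
print(literal)

# Curly braces interpolate, and a loop (here a comprehension) gives one URL per name:
urls = [f"https://proteindirectory.com/company/{slugify(n)}" for n in names]
for u in urls:
    print(u)
```

Even with the interpolation fixed, the guessed slug for "Pop & Bottle" contains a double hyphen and may not match the address the site really uses, which is why reading the href from the page is the safer choice.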
Next, you need to iterate over the list of URLs in a for loop, just as you do with the pages. I only went through the first 2 pages, but try this:
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# Collect the real company page URLs from the href of each listing link.
url_list = []
for page in range(1, 76):  # 94 is max; though I suspect you might get blocked by host
    print(page)
    req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    for a in soup.find_all('a', id='div_block-7-390', href=True):
        url_list.append(a['href'])

companyList = []

def getCompanies(url):
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
    for item in company:
        # find() returns None when a field is absent, so .text raises AttributeError.
        try:
            # The name sits outside the details block, so search the whole page.
            name = soup.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text
        except AttributeError:
            name = 'N/A'
        try:
            primaryFocus = item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text
        except AttributeError:
            primaryFocus = 'N/A'
        try:
            location = item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text
        except AttributeError:
            location = 'N/A'
        try:
            founded = item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text
        except AttributeError:
            founded = 'N/A'
        try:
            website = item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text
        except AttributeError:
            website = 'N/A'
        try:
            businessModel = item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text
        except AttributeError:
            businessModel = 'N/A'
        try:
            proteinCategory = item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text
        except AttributeError:
            proteinCategory = 'N/A'
        try:
            ingredients = item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text
        except AttributeError:
            ingredients = 'N/A'
        try:
            endProductApplication = item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text
        except AttributeError:
            endProductApplication = 'N/A'
        companies = {
            'name': name,
            'primaryFocus': primaryFocus,
            'location': location,
            'founded': founded,
            'website': website,
            'businessModel': businessModel,
            'proteinCategory': proteinCategory,
            'ingredients': ingredients,
            'endProductApplication': endProductApplication
        }
        companyList.append(companies)

# Visit every company page, then build the table.
for url in url_list:
    getCompanies(url)
print(companyList)
df = pd.DataFrame(companyList)
Output:
print(df.to_string())
name primaryFocus location founded website businessModel proteinCategory ingredients endProductApplication
0 New Barn Organics Food and beverages United States 2015 newbarnorganics.com End-consumer brands & products Plant-based Almond, Coconut Dairy, Milk
1 Plantstrong Food and beverages United States 2007 plantstrongfoods.com End-consumer brands & products Plant-based N/A Ready-to-eat meals & snacks
2 Pop & Bottle Food and beverages United States 2015 popandbottle.com End-consumer brands & products Plant-based Oat Dairy, Milk
3 Friedas Food and beverages United States 1962 friedas.com End-consumer brands & products Plant-based Soy Meat & fish, Sausage
4 Creations Foods Food and beverages United States 2019 creationsfoods.com End-consumer brands & products Plant-based N/A Ice-cream and desserts
5 Biocatalysts Ltd Food and beverages United Kingdom 1986 biocatalysts.com Ingredients & inputs Fermentation, Plant-based N/A N/A
6 Oterra Food and beverages Denmark oterra.com Ingredients & inputs Plant-based N/A N/A
7 Sydsel Africa Food and beverages Kenya 2015 sydselafrica.com Ingredients & inputs Plant-based Mushroom, Soy, Wheat, Yeast N/A
8 PhycoSystems Food and beverages Germany 2021 phycosystems.de Ingredients & inputs Plant-based Algae, Microalgae N/A
9 Meta Burger Food and beverages United States 2018 metaburger.com End-consumer brands & products Plant-based N/A Burger, Meat & fish
10 C-Merak Food and beverages Canada 2018 c-merak.ca Ingredients & inputs Plant-based Fava bean N/A
11 New Protein Global Food and beverages Canada newproteinglobal.com Ingredients & inputs Plant-based Soy N/A
12 Kagome Food and beverages United States 1989 kagomeusa.com Contract manufacturing, End-consumer brands & products Plant-based Sunflower Oils and fats
13 GK Foods Food and beverages United States 2020 gkfoods.co Contract manufacturing Plant-based N/A N/A
14 Global Food and Ingredients Inc. Food and beverages Canada 2018 gfiglobalfood.com Ingredients & inputs Plant-based Beans, Chickpea, Lentils, Pea N/A
15 CP Kelco Food and beverages United States 1929 cpkelco.com Ingredients & inputs Fermentation, Plant-based N/A N/A
16 Greenest Food and beverages India 2017 greenestfoods.com End-consumer brands & products Plant-based N/A Meat & fish
17 Montana Pure Protein Food and beverages United States 2020 montanapure.us Ingredients & inputs Plant-based Pulses N/A
18 Alghética Food and beverages Italy 2021 alghetica.com Ingredients & inputs Fermentation, Plant-based Algae N/A
19 Charoen Pokphand Foods Animal feed and pet food, Food and beverages Thailand cpfworldwide.com End-consumer brands & products, Ingredients & inputs Plant-based N/A Meat & fish
20 Dahmes Stainless, Inc. Food and beverages United States 1994 dahmes.com Infrastructure & equipment Plant-based N/A N/A
21 Shandong Wonderful Industrial Group Co., Ltd. Food and beverages China 2001 wandefugroup.com Ingredients & inputs Plant-based Soy N/A
22 Benson Hill Food and beverages United States 2012 bensonhill.com Ingredients & inputs Plant-based Pea, Soy N/A
23 Brookside Flavors & Ingredients Food and beverages United States 2015 brooksideflavors.com Ingredients & inputs Plant-based N/A N/A
24 Cereal Ingredients (CII) Food and beverages United States 1984 ciifoods.com Ingredients & inputs Plant-based Chickpea, Fava bean, Pea, Rice, Soy, Wheat N/A
25 Yantai T.Full Biotech Co. Ltd. Food and beverages China 2011 en.tfull.com Ingredients & inputs Plant-based Chickpea, Fava bean, Mung Bean, Pea N/A
26 Devigere biosolutions Pvt Ltd Food and beverages India 2020 devigerebiosolutions.in Ingredients & inputs Plant-based Pulses N/A
27 CHKP Foods Food and beverages Israel 2019 chkpfoods.com End-consumer brands & products Plant-based Chickpea Dairy, Yogurt
28 Ingredient Alliance Food and beverages United States 2017 linkedin.com Ingredients & inputs Plant-based N/A N/A
29 Yantai Shuangta Food co., LTD Food and beverages China 1992 shuangtafood.com Ingredients & inputs Plant-based Mushroom, Pea N/A
30 Shandong Jianyuan Bioengineering Co.,Ltd Food and beverages China 2003 jianyuangroup.com Ingredients & inputs Plant-based Pea N/A
31 Harvest B Food and beverages Australia 2020 harvestb.io Ingredients & inputs Plant-based N/A N/A
32 Ergo Bioscience Food and beverages Argentina 2020 ergofoods.com Ingredients & inputs Cultivated, Plant-based Carrots Dairy, Meat & fish
33 Living Jin Food and beverages United States 2016 livingjin.com End-consumer brands & products, Ingredients & inputs Plant-based Agar N/A
34 Vitmark Food and beverages Ukraine 1994 int.vitmark.com End-consumer brands & products Plant-based Almond, Oat, Rice Dairy, Milk
35 Its Veego Food and beverages Australia itsveego.com End-consumer brands & products Plant-based Coconut, Hemp, Pea Ready-to-eat meals & snacks
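The repeated try/except blocks in the code above all guard against the same thing: BeautifulSoup's `find` returns `None` when nothing matches, and `None` has no `.text`. One possible way to shorten this, sketched here under assumptions, is a small helper. `text_or_default` is a hypothetical name, and `SimpleNamespace` is only a stand-in for a found tag so that the sketch runs without network access.

```python
from types import SimpleNamespace

def text_or_default(node, default="N/A"):
    """Return the node's stripped text, or `default` if the lookup found nothing."""
    if node is None:
        return default
    return node.text.strip()

# Stand-ins for what item.find(...) would return: an object with .text, or None.
found = SimpleNamespace(text=" 2015 ")
missing = None

print(text_or_default(found))
print(text_or_default(missing))
```

With such a helper, each field could be written as a single line, for example `founded = text_or_default(item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}))`.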