How do I deal with empty list items while scraping web data?
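I'm trying to scrape data into a CSV file from a website that lists contact information for people in my industry. My code works well until I reach a page where one of the entries lacks a particular item.
For example:
I'm trying to collect
Name, Phone, Profile URL
If no phone number is listed, the page doesn't even contain a tag for that field, and my code fails with the error
"IndexError: list index out of range"
I'm pretty new to this, but so far I've managed to cobble things together from various YouTube tutorials and this site, which has saved me a ton of time on tasks that would otherwise take me days. I'd appreciate any help anyone is willing to offer.
I've tried different if/then statements that set the variable to "Empty" when it is empty.
Edit:
I updated the code. I switched to CSS selectors for more specificity and readability. I also added a try/except, which at least gets past the index error, but it doesn't solve the problem of data being stored in the wrong rows because each field returns a different number of items. Also, the site I'm scraping is now in the code.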
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()

    MAX_PAGE_NUM = 5
    MAX_PAGE_DIG = 2

    with open('results.csv', 'w') as f:
        f.write("Name, Number, URL \n")

    #Run Through Pages
    for i in range(1, MAX_PAGE_NUM + 1):
        page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
        website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
        driver.get(website)

        Name = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
        Number = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-phone.hidden-xs.hidden-xxs')
        URL = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')

        #Collect Data From Each Page
        num_page_items = len(Name)
        with open('results.csv', 'a') as f:
            for i in range(num_page_items):
                try:
                    f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
                    print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
                except IndexError:
                    f.write("Skip, Skip, Skip \n")
                    print("Number Missing")
                    continue

    driver.close()
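If any of the fields I'm trying to collect is missing from an individual listing, I just want the blank field filled in as "Empty" in the spreadsheet.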
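You can use try/except to handle this. The key change is to look up each field inside its own agent card, so a missing phone number affects only that row instead of shifting every later entry. I also opted to use Pandas and BeautifulSoup because I'm more familiar with them.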
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from bs4 import BeautifulSoup
    import pandas as pd

    driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')

    MAX_PAGE_NUM = 5
    MAX_PAGE_DIG = 2

    rows = []  # one [Name, Number, URL] entry per agent card

    #Run Through Pages
    for i in range(1, MAX_PAGE_NUM + 1):
        page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
        website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
        driver.get(website)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        agent_cards = soup.find_all('div', {'class':'agent-list-card clearfix'})

        for agent in agent_cards:
            # find() returns None when a tag is absent, so .text raises AttributeError
            try:
                Name = agent.find('div', {'itemprop':'name'}).text.strip().split('\n')[0]
            except AttributeError:
                Name = None
            try:
                Number = agent.find('div', {'itemprop':'telephone'}).text.strip()
            except AttributeError:
                Number = None
            # Indexing None with ['href'] raises TypeError when the card has no link
            try:
                URL = 'https://www.realtor.com/' + agent.find('a', href=True)['href']
            except TypeError:
                URL = None
            rows.append([Name, Number, URL])

        print('Processed page: %s' %i)

    driver.close()

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    results = pd.DataFrame(rows, columns=['Name','Number','URL'])
    results.to_csv('results.csv', index=False)
Output:
print (results)
Name ... URL
0 Nicole Enz ... https://www.realtor.com//realestateagents/nico...
1 Jennifer Worthington ... https://www.realtor.com//realestateagents/jenn...
2 Katherine Keener ... https://www.realtor.com//realestateagents/kath...
3 Erica Cook ... https://www.realtor.com//realestateagents/eric...
4 Jeff Thornton, Broker, Assoc Broker ... https://www.realtor.com//realestateagents/jeff...
5 Neal Sanford, Agent ... https://www.realtor.com//realestateagents/neal...
6 Sherree Zea ... https://www.realtor.com//realestateagents/sher...
7 Jennifer Cooper ... https://www.realtor.com//realestateagents/jenn...
8 Charlyn Cosgrove ... https://www.realtor.com//realestateagents/char...
9 Kathy Birchen & Chad Dutcher ... https://www.realtor.com//realestateagents/kath...
10 Nancy Petroff ... https://www.realtor.com//realestateagents/nanc...
11 The Angela Averill Team ... https://www.realtor.com//realestateagents/the-...
12 Christina Tamburino ... https://www.realtor.com//realestateagents/chri...
13 Rayce O'Connell ... https://www.realtor.com//realestateagents/rayc...
14 Stephanie Morey ... https://www.realtor.com//realestateagents/step...
15 Sean Gardner ... https://www.realtor.com//realestateagents/sean...
16 John Burg ... https://www.realtor.com//realestateagents/john...
17 Linda Ellsworth-Moore ... https://www.realtor.com//realestateagents/lind...
18 David Bueche ... https://www.realtor.com//realestateagents/davi...
19 David Ledebuhr ... https://www.realtor.com//realestateagents/davi...
20 Aaron Fox ... https://www.realtor.com//realestateagents/aaro...
21 Kristy Seibold ... https://www.realtor.com//realestateagents/kris...
22 Genia Beckman ... https://www.realtor.com//realestateagents/geni...
23 Angela Bolan ... https://www.realtor.com//realestateagents/ange...
24 Constance Benca ... https://www.realtor.com//realestateagents/cons...
25 Lisa Fata ... https://www.realtor.com//realestateagents/lisa...
26 Mike Dedman ... https://www.realtor.com//realestateagents/mike...
27 Jamie Masarik ... https://www.realtor.com//realestateagents/jami...
28 Amy Yaroch ... https://www.realtor.com//realestateagents/amy-...
29 Debbie McCarthy ... https://www.realtor.com//realestateagents/debb...
.. ... ... ...
70 Vickie Blattner ... https://www.realtor.com//realestateagents/vick...
71 Faith F Steller ... https://www.realtor.com//realestateagents/fait...
72 A. Jason Titus ... https://www.realtor.com//realestateagents/a.--...
73 Matt Bunn ... https://www.realtor.com//realestateagents/matt...
74 Joe Vitale ... https://www.realtor.com//realestateagents/joe-...
75 Reozom Real Estate ... https://www.realtor.com//realestateagents/reoz...
76 Shane Broyles ... https://www.realtor.com//realestateagents/shan...
77 Megan Doyle-Busque ... https://www.realtor.com//realestateagents/mega...
78 Linda Holmes ... https://www.realtor.com//realestateagents/lind...
79 Jeff Burke ... https://www.realtor.com//realestateagents/jeff...
80 Jim Convissor ... https://www.realtor.com//realestateagents/jim-...
81 Concetta D'Agostino ... https://www.realtor.com//realestateagents/conc...
82 Melanie McNamara ... https://www.realtor.com//realestateagents/mela...
83 Julie Adams ... https://www.realtor.com//realestateagents/juli...
84 Liz Horford ... https://www.realtor.com//realestateagents/liz-...
85 Miriam Olsen ... https://www.realtor.com//realestateagents/miri...
86 Wanda Williams ... https://www.realtor.com//realestateagents/wand...
87 Troy Seyfert ... https://www.realtor.com//realestateagents/troy...
88 Maggie Gerich ... https://www.realtor.com//realestateagents/magg...
89 Laura Farhat Bramson ... https://www.realtor.com//realestateagents/laur...
90 Peter MacIntyre ... https://www.realtor.com//realestateagents/pete...
91 Mark Jacobsen ... https://www.realtor.com//realestateagents/mark...
92 Deb Good ... https://www.realtor.com//realestateagents/deb-...
93 Mary Jane Vanderstow ... https://www.realtor.com//realestateagents/mary...
94 Ben Magsig ... https://www.realtor.com//realestateagents/ben-...
95 Brenna Chamberlain ... https://www.realtor.com//realestateagents/bren...
96 Deborah Cooper, CNS ... https://www.realtor.com//realestateagents/debo...
97 Huggler, Bashore & Brooks ... https://www.realtor.com//realestateagents/hugg...
98 Jodey Shepardson Custack ... https://www.realtor.com//realestateagents/jode...
99 Madaline Alspaugh-Young ... https://www.realtor.com//realestateagents/mada...
[100 rows x 3 columns]
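If you want the literal text "Empty" in the spreadsheet rather than a blank cell, one option is to substitute a placeholder at write time. Below is a minimal sketch using only the standard `csv` module. The `write_rows` helper, the `placeholder` parameter, and the sample rows (with example.com URLs and a made-up phone number) are all hypothetical stand-ins for the scraped data, not part of the code above. A side benefit is that `csv.writer` quotes names containing commas, so there should be no need to replace "," with "." as in the question.

```python
import csv
import io


def write_rows(rows, out, placeholder="Empty"):
    """Write (name, number, url) rows as CSV, replacing None with a placeholder.

    `write_rows` and `placeholder` are illustrative names, not an existing API.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Name", "Number", "URL"])
    for row in rows:
        # Swap missing values (None) for the placeholder text
        writer.writerow([placeholder if value is None else value for value in row])


# Made-up sample data standing in for scraped agent cards
rows = [
    ("Jeff Thornton, Broker, Assoc Broker", None, "https://example.com/agents/jeff"),
    ("Nicole Enz", "(555) 010-0000", "https://example.com/agents/nicole"),
]

buf = io.StringIO()  # in-memory file; use open('results.csv', 'w', newline='') for a real file
write_rows(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

If you stay with the DataFrame approach, passing `na_rep='Empty'` to `results.to_csv(...)` should give a similar result without a separate helper.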