Selenium web spider fails to scrape two table <td> tags in a row using Beautiful Soup
I'm trying to use Python, Selenium, and Beautiful Soup to scrape grades and class sizes for multiple schools and years from an ASP site. My end goal is to load that data into a pandas DataFrame for CSV export. At this point in my script,
cells = rows.find_all('td')
I get this error:
ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I'm not quite sure what I'm doing wrong, since the error message seems to suggest two opposite fixes for the same problem. Here's the script I'm running; the print() statements seem to show everything else works, and other answers on Stack Overflow haven't shed any light. Any help would be appreciated.
all_data = []
s = open("toronto_school_ids.txt", "r")
m = s.read().splitlines()
for i in range(0, len(m)):
    school_id = m[i]
    print("Beginning the search for all schools in the Toronto District School Board...")
    # ontario class size tracker website
    driver.get("https://www.app.edu.gov.on.ca/eng/cst/classSize2.asp?sch_no=" + school_id)
    print("Got the Ontario website...")
    s = open("years.txt", "r")
    m = s.read().splitlines()
    for i in range(0, len(m)):
        year = m[i]
        # selenium takes over
        dropdown = Select(driver.find_element_by_name("schYR"))
        dropdown.select_by_value(year)
        print("Got the year we wanted to search for...")
        # Now we can grab the search button and click it
        search_button = driver.find_element_by_id("frmYearsSubmit")
        search_button.click()
        print("Searching for said year...")
        time.sleep(5)
        # We can feed that into Beautiful Soup
        soup = BeautifulSoup(driver.page_source, "html.parser")
        print("The name of the school we are searching for is...")
        school_name = soup.find_all('h2')[0].get_text()
        table = soup.find('tbody')
        rows = table.find_all('tr')
        all_data = []
        cells = rows.find_all('td')
        print("Now to get grades and class sizes...")
        for cell in cells:
            grade = cell.find('td', {"style": "border:1px solid #000000; padding-left:3px"}).get_text(strip=True)
            students = cell.find('td', {"style": "border:1px solid #000000;"}).get_text(strip=True)
            all_data.append({'School ID number': school_id, 'School': school_name, 'Year': year, 'Grade': grade, 'Classroom size': students})
            print(grade)
            print(students)
Close your files once you've read the data out of them. You'll also want to navigate back to the URL after finishing each pass of the years loop. And you can write the table straight to a CSV right away by looping over the tr and td values and writing each row to the file.
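As for the AttributeError itself: `find_all('tr')` returns a ResultSet, which behaves like a list of Tags, so you can't call `find_all()` on it again; you have to call it on each row Tag inside a loop. A minimal Beautiful Soup-only sketch of that fix, using a small hypothetical HTML snippet in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the class-size table on the real page.
html = """
<table><tbody>
  <tr><td>Grade 1</td><td>20</td></tr>
  <tr><td>Grade 2</td><td>23</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find("tbody").find_all("tr")  # rows is a ResultSet (a list of Tags)
# rows.find_all("td")  # AttributeError: ResultSet object has no attribute 'find_all'

# Call find_all() on each individual row Tag instead:
data = []
for row in rows:
    cells = row.find_all("td")
    data.append([c.get_text(strip=True) for c in cells])

print(data)  # [['Grade 1', '20'], ['Grade 2', '23']]
```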
import csv

idsFile = open("toronto_school_ids.txt", "r")
ids = idsFile.read().splitlines()
idsFile.close()

yearsFile = open("years.txt", "r")
years = yearsFile.read().splitlines()
yearsFile.close()

print("Beginning the search for all schools in the Toronto District School Board...")
# ontario class size tracker website
with open('data.csv', 'w', newline='') as csvfile:
    for i in range(0, len(ids)):
        school_id = ids[i]
        driver.get("https://www.app.edu.gov.on.ca/eng/cst/classSize2.asp?sch_no=" + school_id)
        name = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#right_column > h2"))).text
        print(name)
        for j in range(0, len(years)):
            dropdown = Select(driver.find_element_by_name("schYR"))
            dropdown.select_by_value(years[j])
            print("Got the year we wanted to search for...")
            search_button = driver.find_element_by_id("frmYearsSubmit")
            search_button.click()
            print("Searching for said year...")
            table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "td > table")))
            wr = csv.writer(csvfile)
            for row in table.find_elements_by_css_selector('tr'):
                wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
            time.sleep(5)
            driver.back()
# The with block closes data.csv automatically; no explicit close() is needed.
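Since the stated end goal was a pandas DataFrame, the data.csv written above can be loaded afterwards. A small sketch, assuming two-column rows and supplying hypothetical column names (the scraped table rows carry no header of their own):

```python
import io

import pandas as pd

# Hypothetical CSV content shaped like the data.csv the loop writes.
csv_text = "Grade 1,20\nGrade 2,23\n"

# No header row in the file, so provide column names explicitly.
df = pd.read_csv(io.StringIO(csv_text), names=["Grade", "Classroom size"])
print(df)
```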
Output