BeautifulSoup, Selenium and Python, parsing by a tag
I am trying to parse data from this website:
https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010
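Specifically, I am trying to get the data under Criterion(ITC). The text I want is CC+ECT.
The HTML for the information I want appears to be
<a class="js-glossary" data-leg="CC+ECT">
I am new to web scraping. I tried the techniques taught in tutorials, but they did not work. I had heard of Selenium and tried that as well, but this code does not work either.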
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all("a", attrs= {"class":"js-glossary"})
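This code produces an empty list. I have also read that I can extract data by treating a soup tag as a dictionary, in this case:
data["data-leg"]
Am I on the right track, or way off?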
The text you are trying to get is generated dynamically by JavaScript. You need to wait for it to appear before you can read it:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
text = WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath('//div[.="criterion(itc)"]/following-sibling::div').text)
print(text)
# 'CC + ECT'
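On the dictionary-style lookup from the question: it does work, but on a single tag, not on the list that find_all returns, so data["data-leg"] would fail even once the list is non-empty. Below is a minimal sketch of that lookup with Beautiful Soup alone. SAMPLE_HTML is a hypothetical snippet modelled on the markup quoted in the question (the real page may differ), and glossary_codes is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the markup quoted in the question;
# the real page may be structured differently.
SAMPLE_HTML = (
    '<div class="lbl">criterion(itc)</div>'
    '<div><a class="js-glossary" data-leg="CC+ECT">CC + ECT</a></div>'
)


def glossary_codes(html):
    """Return the data-leg value of every <a class="js-glossary"> tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like ResultSet, so index each tag, not the list.
    links = soup.find_all("a", attrs={"class": "js-glossary"})
    return [link["data-leg"] for link in links if link.has_attr("data-leg")]


print(glossary_codes(SAMPLE_HTML))
```

With a live page, the same function could be given driver.page_source, but only after the wait above has succeeded, since the tags do not exist before the JavaScript has run.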
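It looks like you were close. If you are using Selenium, you may not even need Beautiful Soup. With Selenium, you need to use WebDriverWait to wait until the desired element is visible, and you can use the following solution:
Code block: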
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox(executable_path = r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='lbl' and text()='criterion(itc)']//following::div[1]/a"))).get_attribute("innerHTML"))
Console output:
CC + ECT