Python - Selenium - can't web-scrape specific text content from HTML
I am trying to web-scrape this part of the HTML:
<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh" data-gtm="companySearch__searchResult--76">
Schneider Electric Holding Germany GmbH
</a></td>
It comes from the site opened by driver.get in the code below. I am using this code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time
driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')
driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')
driver.find_element_by_id("cookiesNotificationConfirm").click();
company_name = driver.find_element_by_class_name('zebraTable__td zebraTable__td--companyName')
print(company_name)
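I have been trying for 4 hours and still can't get it to work. I tried different approaches such as XPath, link text, etc., but all I get is an empty company name, like "[]".
Does anyone know how Selenium can locate exactly this text, "Liebherr-Hausgeräte Ochsenhausen GmbH"?
Thanks a lot.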
The content you are looking for can be found in the page source under
<div data-company-search><div data-var-name="companyResults" data
so you don't need Selenium to get it. Just read the page with requests and use Beautiful Soup to find the data.
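As a rough illustration of parsing the names out of already-fetched HTML without a browser, here is a minimal sketch. To keep it free of third-party installs it uses the standard library's html.parser as a stand-in for Beautiful Soup; with Beautiful Soup the same lookup would be roughly soup.select("td.zebraTable__td--companyName a"). The names CompanyNameParser and extract_company_names are made up for this example, and it assumes the table markup shown in the question is present in the HTML you downloaded. If the names only appear inside the data attribute of the companyResults div, you would have to read that attribute instead.

```python
from html.parser import HTMLParser


class CompanyNameParser(HTMLParser):
    """Collect the link text inside <td> cells carrying the companyName class."""

    # Class taken from the HTML in the question.
    TARGET_CLASS = "zebraTable__td--companyName"

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_cell = False
        self._in_link = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.TARGET_CLASS in classes:
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._in_link = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # The link text is surrounded by newlines in the source, so strip it.
            self.names.append("".join(self._chunks).strip())
            self._in_link = False
        elif tag == "td":
            self._in_cell = False


def extract_company_names(html):
    """Return all company names found in the given HTML string."""
    parser = CompanyNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# The snippet from the question, used here instead of a live download.
sample = '''<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh" data-gtm="companySearch__searchResult--76">
Schneider Electric Holding Germany GmbH
</a></td>'''

print(extract_company_names(sample))
```

In real use you would pass the downloaded page text (for example the .text of a requests response) to extract_company_names instead of the sample string.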
To print the text Schneider Electric Holding Germany GmbH you have to induce WebDriverWait for the visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and the text attribute:
driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookiesNotificationConfirm"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.zebraTable.zebraTable--companies tr:nth-child(2)>td.zebraTable__td.zebraTable__td--companyName>a"))).text)
Using XPATH and get_attribute("innerHTML"):
driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='cookiesNotificationConfirm']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML"))
Console output:
Schneider Electric Holding Germany GmbH
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
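One likely reason the code in the question finds nothing is that it passes two space-separated class names to a class-name locator, which expects a single class. (It also prints the element object itself rather than its text.) A CSS selector can match several classes at once by chaining them with dots, which is what the CSS_SELECTOR variant above does. The helper below is only a sketch of that conversion; classes_to_css is a hypothetical name, not part of Selenium.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into one CSS selector."""
    # "a b" becomes ".a.b": an element that carries both classes.
    return tag + "".join("." + name for name in class_attr.split())


selector = classes_to_css("zebraTable__td zebraTable__td--companyName", tag="td")
print(selector)  # td.zebraTable__td.zebraTable__td--companyName
```

The resulting string can be handed to By.CSS_SELECTOR, for instance with " > a" appended to target the link inside the cell.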
Outro
Links to useful documentation:
- get_attribute() method: gets the given attribute or property of the element.
- text attribute: returns the text of the element.
- Difference between text and innerHTML using Selenium