Python - Selenium - can't webscrape specific text content from html

I am trying to scrape this part of the HTML:

<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh" data-gtm="companySearch__searchResult--76">
                        Schneider Electric Holding Germany GmbH
                    </a></td>

from this site:

https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4

using this code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time 

driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')

driver.find_element_by_id("cookiesNotificationConfirm").click();

company_name = driver.find_element_by_class_name('zebraTable__td zebraTable__td--companyName')

print(company_name)

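I have been trying for 4 hours and still cannot get it to work. I tried different approaches such as XPath, link text and so on, but all I get is an empty company name like "[]".

Does anyone know how Selenium can locate exactly this text, "Liebherr-Hausgeräte Ochsenhausen GmbH"?

Thanks a lot.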

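The content you are looking for can be found in the page source under

<div data-company-search><div data-var-name="companyResults" data

and it is part of the page source, so you do not need Selenium to get it. Just fetch the page with requests and use Beautiful Soup to find the data.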
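
To illustrate parsing without a browser, here is a minimal sketch that extracts company names from an HTML string. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without extra packages. In practice the string would come from something like requests.get(url).text. The sketch assumes the table cell shown in the question is present in the fetched HTML. If the names only appear inside the data attribute mentioned above, that attribute would have to be read instead. CompanyNameParser, extract_company_names and SAMPLE_HTML are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

# Sample markup copied from the question, wrapped in a minimal table (assumption).
SAMPLE_HTML = """
<table><tr>
<td class="zebraTable__td zebraTable__td--companyName"><a href="/unternehmen/8116602/schneider-electric-holding-germany-gmbh" data-gtm="companySearch__searchResult--76">
                        Schneider Electric Holding Germany GmbH
                    </a></td>
</tr></table>
"""


class CompanyNameParser(HTMLParser):
    """Collects the link text inside every company-name table cell."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_cell = False
        self._in_link = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "zebraTable__td--companyName" in classes:
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._in_link = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse the surrounding whitespace and line breaks.
            self.names.append(" ".join("".join(self._chunks).split()))
            self._in_link = False
        elif tag == "td":
            self._in_cell = False


def extract_company_names(html):
    parser = CompanyNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


print(extract_company_names(SAMPLE_HTML))
```

With Beautiful Soup installed, the same lookup could likely be written as a single CSS query on the parsed document. The hand-written parser is used here only to keep the example dependency-free.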

To print the text Schneider Electric Holding Germany GmbH you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR and the text attribute:

    driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookiesNotificationConfirm"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.zebraTable.zebraTable--companies tr:nth-child(2)>td.zebraTable__td.zebraTable__td--companyName>a"))).text)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=0&employeesTo=100000000&sortMethod=revenueDesc&p=4')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='cookiesNotificationConfirm']"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML"))
    
  • Console output:

    Schneider Electric Holding Germany GmbH
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
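
The class attribute in the question holds two space-separated class names, and the CSS selector above joins them with dots. A class-name lookup expects a single class name, which is a likely reason the original find_element_by_class_name call did not match. The following sketch shows how such a selector can be derived from a class attribute. css_from_class_attr is a hypothetical helper introduced for this example and is not part of Selenium.

```python
def css_from_class_attr(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class in `class_attr`."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


# Class attribute copied from the <td> in the question.
cell_selector = css_from_class_attr("td", "zebraTable__td zebraTable__td--companyName")
print(cell_selector)  # td.zebraTable__td.zebraTable__td--companyName

# Selector for the link inside the cell. With a live driver it could be passed
# to driver.find_elements(By.CSS_SELECTOR, link_selector) to collect every row.
link_selector = cell_selector + ">a"
```

Iterating over the returned elements and reading .text on each one would then give all company names on the page instead of a single row.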
    

Outro

Links to useful documentation:

  • The get_attribute() method gets the given attribute or property of the element.
  • The text attribute returns the text of the element.
  • Difference between text and innerHTML using Selenium