How to get a <b> tag without a class using Selenium

Question

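I am trying to get information about the product listed here. I am using Selenium in Google Colab. I have trouble accessing the text inside the b tag. Other attributes, such as the name, seller, and price, can be scraped without problems.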

Here is a snippet of the HTML.

<div class="css-1le9c0d pad-bottom">
    <img src="https://assets.tokopedia.net/assets-tokopedia-lite/v2/zeus/kratos/3ac8f50c.svg" alt="">
    <div>Dikirim dari 
      <b>Kota Depok</b>
    </div>
</div>

Here is my driver setup.

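# Imports used by the setup and the scraping code below
import time

import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Create a single driver; a second webdriver.Chrome(...) call would launch another browser
driver = webdriver.Chrome('chromedriver', options=options)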
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                                                                     'Chrome/85.0.4183.102 Safari/537.36'})

Here is the code I tried.

sample_link = 'https://www.tokopedia.com/naturashop27/bio-oil-original-penghilang-bekas-luka-strecth-mark-isi-125ml?whid=0'
driver.get(sample_link)
time.sleep(1.5)

try:
    product = driver.find_elements_by_tag_name('h1')[0].text
except:
    product = np.nan

try:
    shop_url = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='css-1n8curp']"))).get_attribute("href")
except:
    shop_url = np.nan

# ....

try:
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class,'pad-bottom')]//b")))
    loc = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").text
except:
    loc = np.nan

This is the output of the code above. As you can see, the text of the b tag comes back as nan instead of Kota Depok.

Bio Oil Original Penghilang Bekas Luka & Strecth Mark isi 125ml
https://www.tokopedia.com/naturashop27
nan

See the solutions below. The problems were as follows:

The elements had not fully loaded when they were scraped.
Calling driver.set_window_size(1124,850) made it work in Colab.
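
The two points above can be sketched together as one helper: set the window size, then keep polling until the element is displayed and has text. This is only an illustration, not the code that was actually used. The name wait_for_location and the timeout and poll parameters are made up for this sketch, and the loop is a hand-rolled stand-in that shows roughly what an explicit wait such as WebDriverWait does. It relies only on driver methods that Selenium provides (set_window_size, find_elements, is_displayed, text).

```python
import math
import time

# XPath taken from the question; "xpath" is the string value of By.XPATH
LOCATION_XPATH = "//div[contains(@class,'pad-bottom')]//b"


def wait_for_location(driver, timeout=20.0, poll=0.5):
    """Return the text of the <b> tag, or NaN if it never becomes visible.

    Hypothetical helper: polls the page instead of reading it once.
    """
    # Window size reported to work in headless Colab
    driver.set_window_size(1124, 850)

    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches
        for element in driver.find_elements("xpath", LOCATION_XPATH):
            if element.is_displayed() and element.text:
                return element.text
        if time.monotonic() >= deadline:
            return math.nan
        time.sleep(poll)
```

With a real driver this would be called as loc = wait_for_location(driver) after driver.get(sample_link). In real code WebDriverWait with EC.visibility_of_element_located is the usual choice; the loop here only makes the waiting explicit.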

Answer 1

You have a typo in your selector. Try using

//div[@class='css-1le9c0d pad-bottom']/div/b

instead of

/div[@class='css-1le9c0d pad-bottom']/div/b

You missed a slash.
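
To see why the extra slash matters, here is a small sketch that runs without a browser, using Python's standard xml.etree.ElementTree on the snippet from the question. Several things are assumptions made for the demo: the surrounding html/body/main wrapper is invented, the img tag is self-closed and its src shortened so the snippet parses as XML, and ElementTree supports only relative paths, so ".//" stands in for "//" (search all descendants) and "./" stands in for "/" (direct children only).

```python
import xml.etree.ElementTree as ET

# Snippet from the question inside an assumed page wrapper
PAGE = """
<html><body><main>
<div class="css-1le9c0d pad-bottom">
    <img src="3ac8f50c.svg" alt=""/>
    <div>Dikirim dari
      <b>Kota Depok</b>
    </div>
</div>
</main></body></html>
"""

root = ET.fromstring(PAGE.strip())

# Double slash: look for the div anywhere below the root
descendant_hits = [
    b.text for b in root.findall(".//div[@class='css-1le9c0d pad-bottom']/div/b")
]

# Single slash: look for the div only among the root's direct children
child_hits = [
    b.text for b in root.findall("./div[@class='css-1le9c0d pad-bottom']/div/b")
]

print(descendant_hits)  # ['Kota Depok']
print(child_hits)       # []
```

The single-slash form finds nothing because the div is nested several levels below the root element, which is the same reason /div[...] fails against a real page.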

Answer 2

You don't need to use the css-1le9c0d class name. The pad-bottom class name is unique enough.
Also, you don't need find_elements_by_xpath; find_element_by_xpath is enough.

text = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").text

UPD
Try using

WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class,'pad-bottom')]//b")))

instead of

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[contains(@class,'pad-bottom')]//b")))

UPD2
Or try this:

loc = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").get_attribute("textContent")

Or this:

loc = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").get_attribute("innerHTML")

instead of

loc = driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b").text
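
The reason these alternatives can help is that .text returns only the text Selenium considers rendered and visible, so it can be an empty string for an element that is hidden or not yet displayed, whereas textContent is read straight from the DOM. A minimal sketch of a fallback, where element_text is a hypothetical helper name and the argument is assumed to be a Selenium WebElement:

```python
def element_text(element):
    """Return the visible text, falling back to the raw DOM text.

    Hypothetical helper; `element` is assumed to behave like a
    Selenium WebElement (a .text property and get_attribute()).
    """
    visible = element.text
    if visible:
        return visible
    # textContent ignores visibility, so it still works for hidden elements
    raw = element.get_attribute("textContent")
    return raw.strip() if raw else ""
```

It could be used as loc = element_text(driver.find_element_by_xpath("//div[contains(@class,'pad-bottom')]//b")).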

Answer 3

You may want to try this:

The element is not in Selenium's viewport, so you need to scroll a little to make it work.

try:
    driver.execute_script("window.scrollTo(0, 100)")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'pad-bottom')]"))).text)
except:
    loc = np.nan

Output:

Dikirim dari Kota Depok

Process finished with exit code 0

I used this xpath: //div[contains(@class, 'pad-bottom')] and it prints Dikirim dari Kota Depok

If you use //div[contains(@class,'pad-bottom')]//b you will get Kota Depok

Update 1:

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.tokopedia.com/naturashop27/bio-oil-original-penghilang-bekas-luka-strecth-mark-isi-125ml?whid=0")
wait = WebDriverWait(driver, 10)

try:
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.TAG_NAME, "h1"))).text)
except:
    product = np.nan

try:
    shop_url = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@data-testid='llbPDPFooterShopName']"))).get_attribute("href")
    print(shop_url)
except:
    shop_url = np.nan

try:
    driver.execute_script("window.scrollTo(0, 100)")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'pad-bottom')]"))).text)
except:
    loc = np.nan

This gives me:

Bio Oil Original Penghilang Bekas Luka & Strecth Mark isi 125ml
https://www.tokopedia.com/naturashop27
Dikirim dari Kota Depok

Process finished with exit code 0
