如何从隐藏范围 class HTML 中抓取链接?

how to scrape links from hidden span class HTML?

我正在学习网络抓取,因为我从真实网站抓取真实世界的数据。 然而,直到现在我还没有 运行 遇到过这类问题。 通常可以通过右键单击网站部分然后单击检查选项来搜索想要的 HTML 源代码。我会马上跳到例子来解释这个问题。

从上图中,标有 span class 的红色本来是不存在的,但是当我将光标放在(甚至没有点击)用户名上时,会弹出一个该用户的小框,跨度 class 也出现了。我最终想要抓取的是 link 用户个人资料的地址,该地址嵌入在该范围内 class.I 我不确定,但如果我可以解析该范围 class,我想我可以尝试抓取 link 地址,但我一直无法解析隐藏的范围 class.

我没想到那么多,但我的代码当然给了我一个空列表,因为当我的光标不在用户名上时,那个范围 class 没有出现。但是我展示了我的代码来展示我所做的。

from bs4 import BeautifulSoup
from selenium import webdriver

#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)

#parse html
html =driver.page_source
soup=BeautifulSoup(html,"html.parser")

hidden=soup.find_all("span", class_="ui_overlay ui_popover arrow_left")
print (hidden)

是否有任何简单直观的方法来使用 selenium 解析隐藏的范围 class?如果我可以解析它,我可能会使用 'find' 函数为用户解析 link 地址,然后遍历所有用户以获取所有 link 地址。 谢谢。

=======================通过添加以下内容更新了问题================ ===
为了对我想要检索的内容添加一些更详细的解释,我想从下图中获取用红色箭头指向的 link。感谢您指出我需要更多解释。

==========================到目前为止更新的代码=============== ======

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)

profile=driver.find_element_by_xpath("//div[@class='mainContent']")
profile_pic=profile.find_element_by_xpath("//div[@class='ui_avatar large']")

ActionChains(driver).move_to_element(profile_pic).perform()
ActionChains(driver).move_to_element(profile_pic).click().perform()

#So far I could successfully hover over the first user. A few issues occur after this line.
#The error message says "type object 'By' has no attribute 'xpath'". I thought this would work since I searched on the internet how to enable this function.
waiting=wait(driver, 5).until(EC.element_to_be_clickable((By.xpath,('//span//a[contains(@href,"/Profile/")]'))))

#This gives me also a error message saying that "unable to locate the element".
#Some of the ways to code in Python and Java were different so I searched how to get the value of the xpath which contains "/Profile/" but gives me an error.
profile_box=driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
print (profile_box)


此外,在这种情况下,有什么方法可以遍历 xpath 吗?

我认为你可以使用 requests 库而不是 selenium。

当您将鼠标悬停在用户名上时,您将收到如下请求 URL。

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html')
print(html.status_code)

soup = BeautifulSoup(html.content, 'html.parser')

# Find all UID of username
# Split the string "UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293" into UID, SRC
# And recombine to Request URL
name = soup.find_all('div', class_="memberOverlayLink")
for i in name:
    print(i.get('id'))

# Use url to get profile link
response = requests.get('https://www.tripadvisor.com/MemberOverlay?Mode=owa&uid=805E0639C29797AEDE019E6F7DA9FF4E&c=&src=507403702&fus=false&partner=false&LsoId=&metaReferer=')
soup = BeautifulSoup(response.content, 'html.parser')
result = soup.find('a')
print(result.get('href'))

这是输出:

200
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
/Profile/JLERPercy

如果要使用selenium来获取弹框,

您可以使用 ActionChains 来执行 hover() 函数。

但我认为它比使用请求效率低。

from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).move_to_element(element).perform()

Python

下面的代码将提取 href value.Try 并告诉我它是怎么回事。

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Optional argument, if not specified will search path.
driver.implicitly_wait(15)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");

#finds all the comments or profile pics
profile_pic= driver.find_elements(By.XPATH,"//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']")

for i in profile_pic:
        #clicks all the profile pic one by one
        ActionChains(driver).move_to_element(i).perform()
        ActionChains(driver).move_to_element(i).click().perform()
        #print the href or link value
        profile_box=driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
        print (profile_box)

driver.quit()

Java 示例:

import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.interactions.Actions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class Selenium {

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "./lib/chromedriver");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");

        //finds all the comments or profiles
        List<WebElement> profile= driver.findElements(By.xpath("//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']"));

        for(int i=0;i<profile.size();i++)
        {
            //Hover on user profile photo
            Actions builder = new Actions(driver);
            builder.moveToElement(profile.get(i)).perform();
            builder.moveToElement(profile.get(i)).click().perform();
            //Wait for user details pop-up
            WebDriverWait wait = new WebDriverWait(driver, 10);
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//span//a[contains(@href,'/Profile/')]")));
            //Extract the href value
            String hrefvalue=driver.findElement(By.xpath("//span//a[contains(@href,'/Profile/')]")).getAttribute("href");
            //Print the extracted value
            System.out.println(hrefvalue);
        }
        //close the browser
        driver.quit(); 

    }

}

输出

https://www.tripadvisor.com/Profile/861kellyd
https://www.tripadvisor.com/Profile/JLERPercy
https://www.tripadvisor.com/Profile/rayn817
https://www.tripadvisor.com/Profile/grossla
https://www.tripadvisor.com/Profile/kapmem