How to get all comments in youtube with selenium?

The web page shows that there are 702 comments.
target youtube sample

I wrote a function get_total_youtube_comments(url); much of the code was copied from a project on GitHub.

def get_total_youtube_comments(url):
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    import time
    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options,executable_path='/usr/bin/chromedriver')
    wait = WebDriverWait(driver,60)
    driver.get(url)
    SCROLL_PAUSE_TIME = 2
    CYCLES = 7
    html = driver.find_element_by_tag_name('html')
    html.send_keys(Keys.PAGE_DOWN)   
    html.send_keys(Keys.PAGE_DOWN)   
    time.sleep(SCROLL_PAUSE_TIME * 3)
    for i in range(CYCLES):
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    all_comments = [elem.text for elem in comment_elems]
    return  all_comments

I try to parse all the comments on the sample page https://www.youtube.com/watch?v=N0lxfilGfak

url='https://www.youtube.com/watch?v=N0lxfilGfak'
list = get_total_youtube_comments(url)

It gets some comments, but only a small part of all the comments.

len(list)
60

60 is much less than 702. How can I get all the comments in YouTube with Selenium?

@supputuri, I can extract all the comments with your code.

comments_list = driver.find_elements_by_xpath("//*[@id='content-text']")
len(comments_list)
709
print(driver.find_element_by_xpath("//h2[@id='count']").text)
717 Comments
comments_list[-1].text
'mistake at 23:11 \nin NOT it should return false if x is true.'
comments_list[0].text
'Got a question on the topic? Please share it in the comment section below and our experts will answer it for you. For Edureka Python Course curriculum, Visit our Website:  Use code "YOUTUBE20" to get Flat 20% off on this training.'

Why is the number of comments 709 instead of the 717 shown on the page?

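You are getting a limited number of comments because YouTube loads comments as you keep scrolling down. There are around 394 more comments on that video. You first have to make sure all the comments are loaded, and then also expand every View Replies link, so that you reach the maximum comment count.

Note: I was able to get 700 comments using the lines of code below.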

# Assumes `driver` is a WebDriver that is already on the video page
# with the first batch of comments loaded.
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

LAST_COMMENT = "(//*[@id='content-text'])[last()]"
SPINNER = (By.CSS_SELECTOR, "div.active.style-scope.paper-spinner")

# get the last comment
lastEle = driver.find_element(By.XPATH, LAST_COMMENT)
# scroll to the last comment currently loaded
lastEle.location_once_scrolled_into_view
# wait until the comments loading is done
WebDriverWait(driver, 30).until(EC.invisibility_of_element(SPINNER))

# load all comments: keep scrolling for as long as a new last comment appears
while lastEle != driver.find_element(By.XPATH, LAST_COMMENT):
    lastEle = driver.find_element(By.XPATH, LAST_COMMENT)
    lastEle.location_once_scrolled_into_view
    time.sleep(2)
    WebDriverWait(driver, 30).until(EC.invisibility_of_element(SPINNER))

# open all replies
for reply in driver.find_elements(By.XPATH, "//*[@id='replies']//paper-button[@class='style-scope ytd-button-renderer'][contains(.,'View')]"):
    reply.location_once_scrolled_into_view
    driver.execute_script("arguments[0].click()", reply)
time.sleep(5)
WebDriverWait(driver, 30).until(EC.invisibility_of_element(SPINNER))
# print the total number of comments
print(len(driver.find_elements(By.XPATH, "//*[@id='content-text']")))

If you don't have to use Selenium, I would recommend looking at the Google/YouTube API.

https://developers.google.com/youtube/v3/getting-started

Example:

https://www.googleapis.com/youtube/v3/commentThreads?key=YourAPIKey&textFormat=plainText&part=snippet&videoId=N0lxfilGfak&maxResults=100

This gives you the first 100 results, along with a token that you can append to the next request to get the next 100 results.
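The token-based paging described above could be sketched in Python roughly as follows. This is a minimal sketch, not an official client: the function names fetch_page and iter_top_level_comments are made up for illustration, the API key is a placeholder you must supply, and the code only reads top-level comments (replies are not included in part=snippet).

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def fetch_page(params):
    """Send one GET request to the commentThreads endpoint and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_top_level_comments(video_id, api_key, fetch=fetch_page):
    """Yield the text of every top-level comment, following nextPageToken.

    `fetch` is injectable (an assumption of this sketch) so the paging
    logic can be exercised without network access.
    """
    params = {
        "key": api_key,
        "textFormat": "plainText",
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
    }
    while True:
        page = fetch(params)
        for item in page.get("items", []):
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        token = page.get("nextPageToken")
        if not token:
            break  # no token means this was the last page
        # Same query as before, plus the token for the next page
        params = dict(params, pageToken=token)
```

Calling list(iter_top_level_comments("N0lxfilGfak", "YourAPIKey")) would then collect the comments page by page, subject to the API quota of the key.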

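I'm not familiar with Python, but I'll walk you through the steps I would take to get all the comments. First, in your code, I think the main issue is

CYCLES = 7

With this, you scroll only 7 times, pausing 2 seconds after each scroll. Since you are already grabbing 60 comments successfully, fixing this condition should solve your issue.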
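One way to avoid a fixed CYCLES value is to keep scrolling until the number of loaded comments stops growing. The following is only a sketch under that assumption: scroll_until_stable and its parameters are hypothetical names, and the two callables stand in for the Selenium calls from the question.

```python
import time


def scroll_until_stable(count_loaded, scroll_once, pause=2.0,
                        max_idle_rounds=3, sleep=time.sleep):
    """Scroll until count_loaded() has not increased for max_idle_rounds rounds.

    count_loaded: callable returning how many comments are currently loaded.
    scroll_once:  callable that performs one scroll step.
    Returns the final number of loaded comments.
    """
    last_count = count_loaded()
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        scroll_once()
        sleep(pause)  # give the page time to load the next batch
        current = count_loaded()
        if current > last_count:
            last_count = current
            idle_rounds = 0  # progress was made, so reset the counter
        else:
            idle_rounds += 1
    return last_count


# Possible wiring with the names used in the question (not executed here):
#   html = driver.find_element(By.TAG_NAME, 'html')
#   scroll_until_stable(
#       lambda: len(driver.find_elements(By.XPATH, '//*[@id="content-text"]')),
#       lambda: html.send_keys(Keys.END))
```

Stopping on "no growth" rather than on a target number also copes with pages where fewer comments load than the header claims, as with the 709 versus 717 seen earlier in this thread.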

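I assume you have no trouble finding elements on the page using locators.

  1. Get the total comment count into a variable as an int. (In your case, say COMMENTS = 715.)

  2. Define another variable, VISIBLECOUNTS = 0.

  3. While COMMENTS > VISIBLECOUNTS, keep scrolling in a while loop.

  4. The code could look something like this (apologies for any syntax issues):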

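    # Python/Selenium sketch: scroll until every comment thread is loaded.
    # Assumes `driver` is already on the video page.
    import time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    COMMENTS = 715  # sample value only; use the real total comment count
    VISIBLECOUNTS = 0
    SCROLL_PAUSE_TIME = 2
    html = driver.find_element(By.TAG_NAME, 'html')

    while VISIBLECOUNTS < COMMENTS:
        html.send_keys(Keys.END)
        time.sleep(SCROLL_PAUSE_TIME)
        VISIBLECOUNTS = len(driver.find_elements(By.XPATH, '//ytm-comment-thread-renderer'))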
    

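    With this, you keep scrolling down until COMMENTS = VISIBLECOUNTS. Then you can grab all the comments, since they all share the same element attributes, such as ytm-comment-thread-renderer.

    Since I'm not familiar with Python, I'll add the commands to get the comment counts from JS. You can try them in your browser and convert them into Python commands.

Run the queries below in your browser console and check.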

To get the total comment count:
var comments = document.querySelector(".comment-section-header-text").innerText.split(" ")
// The text value is "Comments • 715"; split it by spaces and take the last value

Number(comments[comments.length - 1])
// Then convert the string "715" to an int; you just need to do the same in Python/Selenium

To get the count of currently loaded comments:
$x("//ytm-comment-thread-renderer").length

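Note: if the value is hard to extract, you can still use the Selenium JS executor and scroll with JS until all the comments are visible. But I guess it is not hard to do in Python, since the logic is the same.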
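A Python counterpart of the console queries might look like the sketch below. The function names parse_comment_count and read_total_comments are hypothetical, and the selector is the one from the JS snippet, so it is only expected to match on the page layout that snippet was written for. Instead of splitting on spaces, the parser searches for the first number, which lets it handle both "Comments • 715" and the "717 Comments" format printed earlier in this thread; treating commas as thousands separators is an assumption.

```python
import re


def parse_comment_count(header_text):
    """Extract the comment total from header text such as 'Comments • 715'."""
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        raise ValueError("no number found in %r" % header_text)
    # Assumption: commas are thousands separators, e.g. '1,234'
    return int(match.group(0).replace(",", ""))


def read_total_comments(driver):
    """Read the header text through the JS executor and parse the total.

    `driver` is expected to be a Selenium WebDriver on the video page.
    """
    text = driver.execute_script(
        'return document.querySelector(".comment-section-header-text").innerText;'
    )
    return parse_comment_count(text)
```

The returned int can serve as the COMMENTS value in the loop from step 4.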

Really sorry that I couldn't add the solution in Python, but I hope this helps. Cheers.

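A couple of things:

  • Content within the website https://www.youtube.com/ is dynamic, and so are the comments, which are rendered dynamically.
  • On the page https://www.youtube.com/watch?v=N0lxfilGfak, the comments are not rendered unless the user scrolls down and brings the relevant element into the viewport.
  • The comments are within:

    <!--css-build:shady-->

    where Polymer CSS Builder is used to apply Polymer's CSS Mixin shim and ShadyDOM scoping. So some runtime work is still done to convert CSS selectors under the default settings.

Considering the factors above, here is a solution to retrieve all the comments:

Code block: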

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

# Locator for the text of every rendered comment
COMMENT_XPATH = "//yt-formatted-string[@class='style-scope ytd-comment-renderer' and @id='content-text'][@slot='content']"

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# executable_path is the Selenium 3 style; Selenium 4 takes the driver path through a Service object instead
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=N0lxfilGfak')
# Scroll a little and bring the Subscribe button into view so the comment section starts rendering
driver.execute_script("return scrollBy(0, 400);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);", subscribe)
# Wait for the first batch of comments to become visible
WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, COMMENT_XPATH)))
comments = []
while True:
    try:
        driver.execute_script("window.scrollBy(0,800)")
        time.sleep(5)
        # Each pass re-reads every comment rendered so far, so keep only the latest snapshot
        comments = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, COMMENT_XPATH)))]
    except TimeoutException:
        # The wait timed out, which this loop treats as the signal to stop
        driver.quit()
        break
print(comments)

The first thing you need to do is scroll down the video page to load all the comments:

// Scroll down step by step until the bottom of the page is reached
$actualHeight = 0;
$nextHeight = 0;
while (true) {
    try {
        $nextHeight += 10;
        // Re-read the page height on every pass, since it grows as comments load
        $actualHeight = $this->driver->executeScript('return document.documentElement.scrollHeight;');

        if ($nextHeight >= ($actualHeight - 50)) break;
        $this->driver->executeScript("window.scrollTo(0, $nextHeight);");
        // Set the implicit wait to 10 seconds
        $this->driver->manage()->timeouts()->implicitlyWait(10);
    } catch (Exception $e) {
        break;
    }
}
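For readers following the Python code elsewhere in this thread, the same incremental scroll might look roughly like this. It is a minimal sketch, assuming driver is a Selenium WebDriver; the function name scroll_to_bottom and the step, pause and margin defaults are illustrative choices, not values taken from the PHP answer (which uses a 10-pixel step).

```python
import time


def scroll_to_bottom(driver, step=400, pause=0.5, margin=50, sleep=time.sleep):
    """Scroll down in fixed steps until within `margin` pixels of the page bottom.

    The page height is re-read on every pass because it grows as more
    comments are loaded. Returns the final scroll position in pixels.
    """
    position = 0
    while True:
        height = driver.execute_script(
            "return document.documentElement.scrollHeight;")
        if position >= height - margin:
            break  # close enough to the bottom and nothing new was loaded
        position += step
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        sleep(pause)  # let newly requested comments render
    return position
```

A larger step finishes sooner but gives the page fewer chances to load comments between checks, so the pause may need tuning.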