R: html_nodes duplicating content on site

I'm running into a strange problem where html_nodes duplicates the content of a website.

The basic code is as follows:

library(rvest)

# I bring in a sample URL with a lot of CSS and JavaScript
address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content %>%
    # I want to analyze the words on the page, so I bring in the body.
    html_nodes("body") %>%
    # I don't want JavaScript and CSS cluttering the analysis, so I remove them
    html_nodes(":not(script)") %>%
    html_nodes(":not(style)") %>%
    html_text()

html_nodes(":not(script)") effectively gets rid of the JavaScript clutter. For some reason, though, it also duplicates every line of text on the site, so my final output looks like this:

Network Status Network Status Privacy Policy Privacy Policy Terms of Use Terms of Use Do Not Sell My Personal Information Do Not Sell My Personal Information

I suspect this is just a syntax mistake on my part. Does anyone know how to fix it? Or is there a smarter way to get the same result?

Thanks in advance!

You could try something like this:

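library(rvest)

address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content_data <- 
  content %>%
  # Select text nodes whose parent element under <body> is neither <script> nor <style>
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE) %>%
  # Drop the empty strings left behind by whitespace-only text nodes
  .[. != ""]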
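
To see why the original pipeline produces duplicates, here is a minimal sketch on a made-up inline HTML snippet (the markup below is hypothetical and is not taken from speedtest.net). As far as I can tell, `:not(script)` matches every non-script descendant element, so both a container and the element nested inside it are returned, and `html_text()` then emits the nested text once per match. Selecting text nodes, as in the XPath above, should return each piece of text only once.

```r
library(rvest)

# Hypothetical miniature page: a link nested in a <div>, a script, and a paragraph
html <- paste0(
  "<html><body>",
  "<div><a href='#'>Privacy Policy</a></div>",
  "<script>var x = 1;</script>",
  "<p>Terms of Use</p>",
  "</body></html>"
)
page <- read_html(html)

# Element-based selection: the <div> and the <a> inside it both match,
# and html_text() returns the link text for each of them
by_element <- page %>%
  html_nodes("body") %>%
  html_nodes(":not(script)") %>%
  html_text()

# Text-node selection: each text node is returned once, and text nodes
# whose parent is <script> or <style> are skipped
by_text_node <- page %>%
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE)
by_text_node <- by_text_node[by_text_node != ""]

print(by_element)
print(by_text_node)
```

Comparing the two printed vectors shows the link text repeated in the first and listed once in the second, which matches the doubled phrases in the question's output.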