R: html_nodes duplicating content on site

Question

I'm running into a strange issue where html_nodes duplicates the content of a website.

The basic code is below:

library(rvest)

# I bring in a sample URL with a lot of CSS and JavaScript
address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content %>%
    # I want to analyze the words on the page, so I bring in the body.
    html_nodes("body") %>%
    # I don't want Javascript and CSS cluttering the analysis, so I remove them
    html_nodes(":not(script)") %>%
    html_nodes(":not(style)") %>%
    html_text

html_nodes(":not(script)") effectively removes the JavaScript clutter. However, for some reason it also duplicates every line of text on the site, so my final output looks like this:

Network Status Network Status Privacy Policy Privacy Policy Terms of Use Terms of Use Do Not Sell My Personal Information Do Not Sell My Personal Information

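I assume this is just a syntax mistake on my part. Does anyone know how to fix it? Or is there a smarter way to achieve the same result?

Thanks in advance!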

Answer 1

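The duplication comes from how the selector and html_text interact: html_nodes(":not(script)") matches every non-script element under body, including elements nested inside one another, and html_text returns all of the text beneath each matched element. A link inside a list item is therefore reported once for the link and again for the list item. One way around this is to select the text nodes themselves and skip script and style elements: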

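library(rvest)

address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content_data <- 
  content %>%
  # Take only the text nodes owned directly by elements other than script/style
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE) %>%
  # Drop the entries that were whitespace only
  .[. != ""]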
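
To see the effect in isolation, here is a minimal sketch on a small inline page. The HTML string is a made-up stand-in for the real site, and the names page, nested, and text_only are illustrative only; it assumes rvest is installed and is meant as a demonstration, not as the definitive way to clean a page.

```r
library(rvest)  # also provides the %>% pipe

# Hypothetical miniature page standing in for the real site.
page <- read_html('<html><body>
  <ul><li><a href="#">Privacy Policy</a></li></ul>
  <script>var x = 1;</script>
  <style>a { color: red; }</style>
</body></html>')

# Selector approach from the question: <ul>, <li> and <a> all match
# ":not(script)", and html_text returns the full text under each of them.
nested <- page %>%
  html_nodes("body") %>%
  html_nodes(":not(script)") %>%
  html_text(trim = TRUE)

# "Privacy Policy" appears once per matching ancestor (ul, li, a).
print(sum(nested == "Privacy Policy"))

# XPath approach from the answer: select text nodes only, so each piece
# of text is returned once, by the element that directly contains it.
text_only <- page %>%
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE) %>%
  .[. != ""]

print(text_only)
```

In this sketch the selector version reports the link text three times, while the XPath version returns it once. As a side note, rvest 1.0.0 and later offer html_elements as the newer name for html_nodes; the older name still works.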


r

web-scraping

rvest