R: html_nodes duplicating content on site
I'm running into a strange problem where html_nodes duplicates the content of a website.
The basic code is as follows:
library(rvest)

# I bring in a sample URL with a lot of CSS and Javascript
address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content %>%
  # I want to analyze the words on the page, so I bring in the body.
  html_nodes("body") %>%
  # I don't want Javascript and CSS cluttering the analysis, so I remove them
  html_nodes(":not(script)") %>%
  html_nodes(":not(style)") %>%
  html_text()
html_nodes(":not(script)") does get rid of the Javascript clutter. However, for some reason it also duplicates every line of text on the site, so my final output looks like this:
Network Status Network Status Privacy Policy Privacy Policy Terms of
Use Terms of Use Do Not Sell My Personal Information Do
Not Sell My Personal Information
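I suspect this is just a syntax mistake on my part. Does anyone know how to fix it? Or is there a smarter way to get the same result?
Thanks in advance!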
You could consider something like this:
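library(rvest)

address <- "https://www.speedtest.net/"
content <- read_html(URLencode(address))
content_data <-
  content %>%
  # Select only the text nodes that sit directly under non-script, non-style elements
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE) %>%
  # Drop the strings that were whitespace only
  .[. != ""]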
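As for why the original pipeline repeats text: html_nodes(":not(script)") returns every non-script element under the body, including elements nested inside one another, and html_text() on a parent also includes the text of its children. The same string therefore shows up once per nesting level. The sketch below tries to show this on a small made-up page; the HTML snippet and the names page, dup and uniq are assumptions for illustration, not taken from the real site.

```r
library(rvest)

# Hypothetical toy page standing in for the real site (illustration only)
page <- read_html('<html><body>
  <ul><li><a href="#">Network Status</a></li></ul>
  <script>var x = 1;</script>
</body></html>')

# Approach from the question: <ul>, <li> and <a> are all selected,
# and each one reports the same nested text
dup <- page %>%
  html_nodes("body") %>%
  html_nodes(":not(script)") %>%
  html_text(trim = TRUE)

# XPath approach: take text nodes only, so each string is picked up
# from the element that directly contains it
uniq <- page %>%
  html_nodes(xpath = "//body/descendant-or-self::*[not(name()='script' or name()='style')]/text()") %>%
  html_text(trim = TRUE) %>%
  .[. != ""]

print(dup)
print(uniq)
```

On this toy page, "Network Status" should appear in dup once for each of the three nested elements, while uniq should hold it a single time.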