Extract YouTube video description in R using rvest
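I am trying to extract a YouTube video description using rvest. I know it would be easier to just use the API, but the end goal is to become more familiar with rvest rather than just to get the video description. This is what I have done so far: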
library(rvest)
# defining website
page <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
# setting Xpath
Xp <- '/html/body/div[2]/div[4]/div/div[5]/div[2]/div[2]/div/div[2]/meta[2]'
# getting page
Website <- read_html(page)
# selecting the node at the Xpath
Description <- html_node(Website, xpath = Xp)
# printing description
html_attr(Description, name = "content")
While this does point to the video description, I do not get the full description but rather a string that is cut off after a few lines:
[1] "The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johson in his first major speech of the campaign said a..."
The expected output would be the full description:
"The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johnson in his first major speech of the campaign said a Conservative government would unite the country and "level up" the prospects for people with massive investment in health, better infrastructure, more police, and a green revolution. But he said the key issue to solve was Brexit. Meanwhile Labour vowed to outspend the Tories on the NHS in England.
Labour leader Jeremy Corbyn has also faced questions over his position on allowing a second referendum on Scottish independence. Today at the start of a two-day tour of Scotland, he said wouldn't allow one in the first term of a Labour government but later rowed back saying it wouldn't be a priority in the early years.
Sophie Raworth presents tonight's BBC News at Ten and unravels the day's events with the BBC's political editor Laura Kuenssberg, health editor Hugh Pym and Scotland editor Sarah Smith.
Please subscribe HERE: LINK"
Is there any way to get the full description with rvest?
Since you said you are focused on learning, I have added some explanation of how I got there after showing the code.
Reproducible code:
library(rvest)
library(magrittr)
url <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
url %>%
read_html %>%
html_nodes(xpath = "//*[@id = 'eow-description']") %>%
html_text
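Because the live page markup can change, the same selection logic can be tried on a small inline document. The following is a minimal sketch: the HTML string is a made-up stand-in for the downloaded page (not YouTube's real markup), and it assumes the meta tag holds a shortened copy while the element with the id eow-description holds the full text.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page (not YouTube's real markup)
html <- '<html><head>
  <meta name="description" content="The Conservatives and Labour have...">
</head><body>
  <p id="eow-description">The Conservatives and Labour have been outlining their main pitch to voters.</p>
</body></html>'

doc <- read_html(html)

# Shortened text stored in the content attribute of the meta tag
short_desc <- doc %>%
  html_nodes(xpath = "//meta[@name = 'description']") %>%
  html_attr("content")

# Full text stored in the element with the id 'eow-description'
full_desc <- doc %>%
  html_nodes(xpath = "//*[@id = 'eow-description']") %>%
  html_text()
```

Comparing short_desc and full_desc shows why reading the meta tag attribute and reading the element text can give different results for the same page.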
Explanation:
1. Locate the element
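There are several ways to approach this. A common first step is to right-click the target element in the browser and select "Inspect element". You will see the element that holds the description, which in the browser carries the id description.
Next, you can try to extract the data.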
url %>%
read_html %>%
html_nodes(xpath = "//*[@id = 'description']")
Unfortunately, this does not work in your case: the selector matches nothing in the downloaded document.
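A selector that matches nothing does not raise an error in rvest; it returns an empty node set, so the failure is easy to miss. A minimal sketch of checking for that, using a made-up inline document in place of the downloaded page:

```r
library(rvest)

# Hypothetical downloaded source: it has no element with the id 'description'
doc <- read_html('<html><body><p id="eow-description">Full text</p></body></html>')

nodes <- html_nodes(doc, xpath = "//*[@id = 'description']")

# An empty node set means the selector did not match the downloaded document
selector_matched <- length(nodes) > 0
```

Here selector_matched is FALSE, which is the signal to look at what was actually downloaded rather than at what the browser shows.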
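2. Make sure you have the right source
So you have to make sure that your target data is in the document you loaded. You can check this in the network activity of your browser, or, if you prefer to check within R, I wrote a small function for that: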
showHtmlPage <- function(doc){
tmp <- tempfile(fileext = ".html")
doc %>% toString %>% writeLines(con = tmp)
tmp %>% browseURL(browser = rstudioapi::viewer)
}
Usage:
url %>% read_html %>% showHtmlPage
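You will see that your target data is actually in the document you downloaded, so you can stick with rvest. Next, you have to find the XPath (or CSS selector), ...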
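If opening a viewer is not convenient, the same check can be done as a plain text search over the serialized document. This is only a sketch: containsText is a hypothetical helper name, and the inline HTML stands in for the downloaded page.

```r
library(rvest)

# Hypothetical helper: TRUE if `needle` occurs anywhere in the serialized document
containsText <- function(doc, needle) {
  grepl(needle, as.character(doc), fixed = TRUE)
}

# Made-up stand-in for the downloaded page
doc <- read_html('<html><body><p id="eow-description">The Conservatives and Labour have been outlining</p></body></html>')

found <- containsText(doc, "The Conservatives and ")
not_found <- containsText(doc, "text that is not on the page")
```

With a real page the call would look like url %>% read_html %>% containsText("The Conservatives and "). A TRUE result suggests the data is in the downloaded source and rvest alone should be enough.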
3. Find the target tag in the downloaded document
You can search for tags that contain the text you are looking for:
doc <- url %>% read_html
doc %>% html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")
The output will be:
{xml_nodeset (1)}
[1] <p id="eow-description" class="">The Conservatives and Labour have ....
You see that you are looking for a tag with the id eow-description.
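Step 3 and the final selection can be tried together on a small inline document. The sketch below uses a made-up HTML string modelled on the output shown above; it also shows a CSS selector (#eow-description) as an alternative to the XPath id test, which is an addition and not part of the code in this answer.

```r
library(rvest)

# Made-up stand-in for the downloaded page, modelled on the output above
doc <- read_html('<html><body><p id="eow-description" class="">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>')

# Step 3: search for a tag by a fragment of its text, then read its id
hit <- html_nodes(doc, xpath = "//*[contains(text(), 'The Conservatives and ')]")
found_id <- html_attr(hit, "id")

# Select by that id, once with a CSS selector and once with XPath
by_css <- doc %>% html_nodes("#eow-description") %>% html_text()
by_xpath <- doc %>% html_nodes(xpath = "//*[@id = 'eow-description']") %>% html_text()
```

Both selections return the same text here, so the choice between CSS and XPath is mostly a matter of preference for a simple id lookup.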