rvest: scraping data and image titles
I am trying to scrape data along with the title of an image.
library(rvest)
library(tidyverse)  # map_df (purrr), str_trim and str_split (stringr)
# One URL per results page
commits <- paste0("https://247sports.com/Season/2021-Football/Commits/?Page=", 1:38)
test_data <- map_df(commits, ~.x %>% read_html %>%
html_nodes(".ri-page__star-and-score .score , .position , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 3, byrow = TRUE) %>%
as.data.frame)
This gives me the four data fields I need. But I also want the image title, which is located in:
<img alt="South Carolina" class="jsonly" src="https://s3media.247sports.com/Uploads/Assets/627/649/4649627.png?fit=bounds&crop=50:50,offset-y0.50&width=50&height=50" title="South Carolina" style="opacity: 1;">
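So I also want to extract alt or title, both of which give me the name (in this example, South Carolina). I know how to use html_attr to extract just the title, but I don't know how to combine everything so that there are five fields in total.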
You can get the title with:
library(rvest)
library(tidyverse)
test <- 'https://247sports.com/Season/2021-Football/Commits/?Page=1'
test_data$title <- test %>%
read_html %>%
html_nodes('div.status img') %>%
html_attr('title') %>%
.[c(TRUE, FALSE)]
test_data
# V1 V2 V3 V4 title
#1 Kyle Ecker Committed: 8/5/2020 NA OT San Diego
#2 Josh Bertholotte Committed: 8/5/2020 NA OLB Hawaii
#3 Antario Brown Committed: 8/5/2020 0.8516 RB South Carolina
#4 J'Kalon Carter Committed: 8/5/2020 NA WR Illinois State
#5 Stephon Dubose Committed: 8/5/2020 NA OG Old Dominion
#...
I used .[c(TRUE, FALSE)] to select alternate values, because html_attr('title') returns each value twice.
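The indexing trick works because R recycles a short logical vector across the whole subscript. Here is a minimal sketch in base R; the `titles` vector is made-up data standing in for what html_attr('title') returns when every logo is matched twice, not output from the real page.

```r
# Hypothetical stand-in for the duplicated output of html_attr('title')
titles <- c("San Diego", "San Diego",
            "Hawaii", "Hawaii",
            "South Carolina", "South Carolina")

# c(TRUE, FALSE) is recycled to the length of `titles`,
# so elements 1, 3, 5, ... are kept and 2, 4, 6, ... are dropped
unique_titles <- titles[c(TRUE, FALSE)]
print(unique_titles)

# Equivalent, more explicit form of the same selection
same_titles <- titles[seq(1, length(titles), by = 2)]
```

Unlike unique(), this keeps legitimate repeats, for example when two consecutive players commit to the same school.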
To integrate this into the map function, we can do:
# commits is the vector of page URLs
all_data <- map_df(commits, ~{
  # Read each page once and reuse it for both extractions
  webpage <- .x %>% read_html
  df1 <- webpage %>%
    html_nodes(".ri-page__star-and-score .score ,.position ,.ri-page__name-link") %>%
    html_text() %>%
    str_trim %>%
    str_split(" ") %>%
    matrix(ncol = 3, byrow = TRUE) %>%
    as.data.frame
  # Each title appears twice, so keep every other value
  df1$title <- webpage %>%
    html_nodes('div.status img') %>%
    html_attr('title') %>%
    .[c(TRUE, FALSE)]
  df1
})
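Assigning with df1$title <- ... stops with an error if a page yields a different number of titles than rows, which would abort the whole map_df run. One possible guard is sketched below in base R. The helper name `add_titles` and the sample data are hypothetical, and padding with NA only avoids the error; it does not guarantee that titles line up with the right players.

```r
# Hypothetical helper: attach titles to a data frame, tolerating a length mismatch
add_titles <- function(df, titles) {
  if (length(titles) != nrow(df)) {
    warning("Got ", length(titles), " titles for ", nrow(df), " rows")
    # Setting the length pads with NA or truncates to match the row count
    length(titles) <- nrow(df)
  }
  df$title <- titles
  df
}

# Made-up example: three players but only two titles scraped
players <- data.frame(V1 = c("Player A", "Player B", "Player C"))
out <- suppressWarnings(add_titles(players, c("San Diego", "Hawaii")))
print(out)
```

Inside the map_df call, the last assignment could then be written as df1 <- add_titles(df1, titles), with the warning pointing to pages that need a closer look.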