Scrape and Loop with rvest
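I have looked at several answers to similar questions on SO, but none of them seem to work for me.
I have a list of URLs, and I want to grab the table from each URL and append it to a master data frame.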
## get all urls into one list
page <- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
  urls[[i]] <- url
}
### loop over the urls and get the table from each page
table <- data.frame()
for (j in urls) {
  tbl <- urls[j] %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}
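The first part works as expected and gives me the list of URLs I want to scrape. The second loop fails with the following error:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
Any suggestions on how to fix this error and loop the 3 tables into a single data frame? I appreciate any tips or pointers.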
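Here is your problem:
for (j in urls) {
  tbl <- urls[j] %>%
When you write j in urls, the values of j are not integers; they are the URLs themselves. urls[j] therefore does not select an element by position, and because single brackets always return a list, read_html() receives a list instead of a URL string.
Try: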
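To make the point above concrete, here is a small offline sketch. The example.com URLs are hypothetical stand-ins and nothing is downloaded; it only shows what the loop variable holds and how single and double brackets differ on a list.

```r
# Hypothetical stand-in URLs; nothing is downloaded here.
urls <- list("https://example.com/a", "https://example.com/b")

# Iterating over the list yields the elements themselves, not positions.
seen <- character()
for (j in urls) {
  seen <- c(seen, class(j))
}
print(seen)

# Single brackets keep the list wrapper; double brackets extract the element.
wrapped <- urls[1]
extracted <- urls[[1]]
print(class(wrapped))
print(class(extracted))

# Indexing by a URL string looks for an element with that *name*. The list is
# unnamed, so the result is a one-element list holding NULL.
by_name <- urls["https://example.com/a"]
print(class(by_name))
```

This is why the original loop ends up passing a list to read_html(), which matches the class reported in the error message.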
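table <- list()  # collect the tables in a list; assigning into an empty data.frame() fails on row count
for (j in 1:length(urls)) {
  tbl <- urls[[j]] %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}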
You can also use seq_along():
for (j in seq_along(urls))
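One reason to prefer seq_along() over 1:length() is the empty case. The sketch below uses a hypothetical empty list to show how many times each loop body would run.

```r
empty <- list()

# 1:length(empty) is 1:0, i.e. the vector c(1, 0), so the loop body runs twice.
count_colon <- 0
for (j in 1:length(empty)) {
  count_colon <- count_colon + 1
}

# seq_along(empty) is integer(0), so the loop body never runs.
count_seq <- 0
for (j in seq_along(empty)) {
  count_seq <- count_seq + 1
}

print(count_colon)
print(count_seq)
```

With a non-empty list such as the three URLs in the question, both forms give the same indices.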
Try this:
library(tidyverse)
library(rvest)
page <- (0:2)
urls <- list()
for (i in seq_along(page)) {
  url <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
  urls[[i]] <- url
}
### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {
  # tbl[[j]] stores the table scraped from the j-th url as an element of the tbl list;
  # the for loop advances j itself, so no manual j <- j + 1 is needed
  tbl[[j]] <- urls[[j]] %>%
    read_html() %>%
    html_node("table") %>%
    html_table()
}
# convert list to data frame
tbl <- do.call(rbind, tbl)
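The table[[j]] <- tbl at the end of the for loop in the original code is not needed, because here each scraped table is assigned directly as an element of the tbl list: tbl[[j]] <- urls[[j]] %>% read_html() %>% html_node("table") %>% html_table()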
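As a rough illustration of the final do.call(rbind, ...) step, here is an offline sketch. The three small data frames and their player and goals columns are made up for the example and stand in for the scraped pages.

```r
# Hypothetical per-page tables standing in for the scraped results.
pages <- list(
  data.frame(player = c("A", "B"), goals = c(3, 1)),
  data.frame(player = "C", goals = 2),
  data.frame(player = c("D", "E"), goals = c(0, 4))
)

# do.call() passes every list element to rbind() as a separate argument,
# stacking the rows. The data frames must share the same column names.
combined <- do.call(rbind, pages)

print(combined)
```

If the scraped pages ever differ in their columns, rbind() will stop with an error, so it is worth checking names() on each element first.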