R: Creating a dataframe with scraped data from RSelenium
I'm scraping some information from Google Books (doing research on NHL teams), and I'm using RSelenium to get started:
library(tidyverse)
library(RSelenium) # using Docker
library(rvest)
library(httr)
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate("https://books.google.com/")
books <- remDr$findElement(using = "css", "[name = 'q']")
books$sendKeysToElement(list("NHL teams", key = "enter"))
bookElem <- remDr$findElements(using = "xpath",
                               "//h3[@class = 'LC20lb']//parent::a")
links <- sapply(bookElem, function(bookElem){
  bookElem$getElementAttribute("href")
})
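The code above navigates to the right page and searches for "NHL teams". Note, however, that some of these books have a "preview" page, and to get to the actual details (title, author, etc.) you have to click through to "About this book":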
for(link in links) {
  remDr$navigate(link)
  # If statement to get past book previews
  if (str_detect(link, "frontcover")) {
    # Finding elements for "About this book"
    link2 <- remDr$findElements(using = 'xpath',
                                '//a[@id="sidebar-atb-link" and span[.="About this book"]]')
    # Extracting the href of each "About this book" link
    link2_about <- sapply(link2, function(link2){
      link2$getElementAttribute('href')
    })
    duh <- map(link2_about, read_html)
    # NHL book title, author
    nhl_title <- duh %>%
      map(html_nodes, '#bookinfo > h1 > span.fn > span') %>%
      map_chr(html_text) %>%
      print()
    author1 <- duh %>%
      map(html_nodes, '#bookinfo div:nth-child(1) span') %>%
      map_chr(html_text) %>%
      print()
    test_df <- cbind(nhl_title, author1) # ONLY binds the last book/author
    print(test_df)
  } else {
    print("lol you thought this would work?") # haven't built this part out yet
  }
}
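My use of map prints out the individual titles/authors, but I can't figure out how to put them into a data frame. Every time I use tibble() or map_dfr() I get an error. The for loop above lists the titles and authors but doesn't put anything together. How can I bind all of this into a single data frame?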
The answer turned out to be simple. I just had to create an empty list above the for loop and then add to it inside the loop. For example,
blank_list <- list()
for(link in links) {
  ....
  blank_list[[link]] <- tibble(nhl_title, author1)
  wow <- bind_rows(blank_list)
  print(wow)
}
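To make the pattern concrete without a running Selenium session, here is a minimal sketch that uses made-up stand-in data in place of the scraped values. The `scraped` list, its URLs, and the names `results` and `books_df` are all hypothetical. It also binds once after the loop instead of on every iteration, which gives the same final data frame with less repeated work.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-ins for what would be scraped from each link
scraped <- list(
  "https://example.com/book1" = list(title = "Book A", author = "Author A"),
  "https://example.com/book2" = list(title = "Book B", author = "Author B")
)

results <- list()  # empty list created before the loop
for (link in names(scraped)) {
  nhl_title <- scraped[[link]]$title
  author1   <- scraped[[link]]$author
  # one single-row tibble per link, stored under the link as its name
  results[[link]] <- tibble(nhl_title, author1)
}

# Bind once, after the loop; .id keeps the list names as a "link" column
books_df <- bind_rows(results, .id = "link")
print(books_df)
```

Keeping the link as an id column is optional, but it makes it easier to trace each row back to the page it came from.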
Don't use do.call() or the other options; bind_rows() is faster than the alternatives.
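For comparison, here is a small sketch of the two approaches on hypothetical data (the `pieces` list and its values are made up). It does not measure speed. It only shows that both produce the same rows when the columns match, and that bind_rows() additionally fills missing columns with NA when they don't.

```r
library(tibble)
library(dplyr)

# Hypothetical per-book tibbles, as a loop might have collected them
pieces <- list(
  tibble(nhl_title = "Book A", author1 = "Author A"),
  tibble(nhl_title = "Book B", author1 = "Author B"),
  tibble(nhl_title = "Book C", author1 = "Author C")
)

via_bind_rows <- bind_rows(pieces)        # dplyr approach
via_rbind     <- do.call(rbind, pieces)   # base R approach

# Same titles and authors either way
same_titles  <- identical(via_bind_rows$nhl_title, via_rbind$nhl_title)
same_authors <- identical(via_bind_rows$author1, via_rbind$author1)

# bind_rows() tolerates a piece with a missing column and fills it with NA
ragged <- bind_rows(pieces[[1]], tibble(nhl_title = "Book D"))
print(ragged)
```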