How to loop over Zillow/Realtor URLs using rvest to pull links
I am currently using rvest to try to extract all links from the following URL: https://www.zillow.com/browse/homes/fl/miami-dade-county/. The code below does what I want for a single URL.
# load packages
library(tidyverse)
library(rvest)
library(xml2)
library(stringi)
library(dplyr)
library(purrr)
library(stringr)

webpage <- "https://www.zillow.com/browse/homes/fl/miami-dade-county/"
webpage <- read_html(webpage)
url_ <- webpage %>%
  html_nodes("a") %>%
  html_attr("href")
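I am trying to replicate this (with a for loop or lapply) for a data frame (called newurl) full of URLs of the same form (on zillow.com, but each one ending in a different county). I have tried both looping and lapply, but I get a different error each time. I have included my most recent error and code attempt below. I am looking for advice on what to use to meet my needs, or on how to edit my existing code. Thanks.
I have tried many different versions, but my most recent code is below.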
bind_rows(lapply(newurl, function(x) {
  data.frame(newurl=x, toc_entry=toc <- read_html(newurl[1]) %>%
    html_nodes("a") %>%
    html_attr("href"))
})) -> toc_entries
This produces the following error:
Error in UseMethod("read_xml") :
  no applicable method for 'read_xml' applied to an object of class "data.frame"
You can try this:
library(tidyverse)
library(rvest)
library(xml2)

start_page <- "https://www.zillow.com/browse/homes/fl/miami-dade-county/"

fl_urls <- start_page %>%
  read_html() %>%
  html_nodes("section a") %>%
  html_attr("href") %>%              # gives you the links to all ZIP codes
  xml2::url_absolute(start_page)     # convert to full URLs

# Turn a node set of <a> elements into a tibble of link text and absolute URL
# (tibble() replaces the deprecated data_frame())
name_url <- function(x, base = start_page) {
  tibble(
    name = html_text(x, trim = FALSE),
    url  = html_attr(x, "href") %>%
      xml2::url_absolute(base)
  )
}

fl_urls %>%
  head(2) %>%   # only loop over the first two entries of fl_urls; remove to loop over all
  map(read_html) %>%
  map(html_nodes, "section a") %>%
  map_df(name_url)
Which results in:
# A tibble: 51 x 2
   name                                 url
   <chr>                                <chr>
 1 1000 Hardee Rd - 1191 S Alhambra Cir https://www.zillow.com/browse/homes/fl/miami-d…
 2 1195 S Alhambra Cir - 1250 S Alhamb… https://www.zillow.com/browse/homes/fl/miami-d…
 3 1250 S Alhambra Cir APT 17 - 1410 C… https://www.zillow.com/browse/homes/fl/miami-d…
 4 1410 Mantua Ave - 1508 Zoreta Ave    https://www.zillow.com/browse/homes/fl/miami-d…
 5 1509 Delgado Ave - 1540 Zuleta Ave   https://www.zillow.com/browse/homes/fl/miami-d…
 6 1541 Cecilia Ave - 3760 Bird Rd UNI… https://www.zillow.com/browse/homes/fl/miami-d…
 7 3760 Bird Rd UNIT 513 - 4100 Salzed… https://www.zillow.com/browse/homes/fl/miami-d…
 8 4100 Salzedo St APT 604 - 420 Vitto… https://www.zillow.com/browse/homes/fl/miami-d…
 9 4210 Anderson Rd - 4420 Anderson Rd  https://www.zillow.com/browse/homes/fl/miami-d…
10 4420 Monserrate St - 4900 Suarez St  https://www.zillow.com/browse/homes/fl/miami-d…
# … with 41 more rows
Hope this helps in understanding the logic of the loop.
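As for the error in the question: read_html(newurl[1]) fails because single-bracket indexing on a data frame returns another data frame, not a character string. A minimal sketch of the lapply version that iterates over a character column instead is shown here. The columns county and page and the helper get_links() are hypothetical names chosen for illustration, and inline HTML strings stand in for the real county URLs so that the sketch runs without network access.

```r
library(rvest)   # html_nodes(), html_attr(), re-exported read_html() and %>%
library(dplyr)   # tibble(), bind_rows()

# Hypothetical stand-in for `newurl`. With real data, `page` would hold the
# county URLs; inline HTML is used here only so the sketch runs offline.
newurl <- tibble(
  county = c("x", "y"),
  page = c(
    '<html><body><a href="/a/">A</a><a href="/b/">B</a></body></html>',
    '<html><body><a href="/c/">C</a></body></html>'
  )
)

# Extract every href from one page (a URL or a literal HTML string)
get_links <- function(page) {
  page %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
}

# Loop over rows and pass read_html() a single string each time:
# newurl$page[i] is a character value, whereas newurl[1] is a data frame.
toc_entries <- bind_rows(lapply(seq_len(nrow(newurl)), function(i) {
  tibble(county = newurl$county[i], toc_entry = get_links(newurl$page[i]))
}))
```

tibble() is used rather than data.frame() so that a page with no links contributes zero rows instead of raising an error about differing row counts.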