rvest scraping data with different length

Question

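As a practice project, I am trying to scrape property listing data from a website. (I only want to practice my web scraping skills and have no intention of using the scraped data any further.) However, some properties have no price available, so the name and price vectors end up with different lengths, and combining them into a data frame fails with an error.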

Here is the scraping code:

library(tidyverse)
library(rvest)

web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")

# One name per listing
community_name <- web_page %>% 
  html_nodes(".items-name") %>% 
  html_text()

length(community_name)

# Only listings that actually show a price match ".price"
listed_price <- web_page %>% 
  html_nodes(".price") %>% 
  html_text()

length(listed_price)

# Fails when the two vectors have different lengths
property_data <- data.frame(
  name=community_name,
  price=listed_price
)

How can I identify the properties that have no listed price, and fill the price variable with NA where there is no value?

Answer 1

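Inspecting the page shows that the price element has the class price when a value is present and price-txt when it is not. One solution, therefore, is to pass an XPath expression to html_nodes() and match every class that starts with "price":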

listed_price <- web_page %>% 
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>% 
  html_text()

length(listed_price)
[1] 60
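
Note that the XPath match above returns the placeholder text for listings without a price rather than NA. If a real NA is wanted, one possible approach is to select each listing's container first and then call html_node() (singular) on that node set: it returns one result per container, and html_text() gives NA where nothing matched. The following is only a minimal sketch on made-up markup; the container class item-mod and the sample values are assumptions for illustration and have not been checked against the actual page.

```r
library(rvest)

# Hypothetical markup imitating the listing page (assumption, not the real
# page source): each listing has its own container, and the second one
# carries "price-txt" instead of "price".
html <- '<div>
<div class="item-mod"><span class="items-name">Community A</span><p class="price">12000</p></div>
<div class="item-mod"><span class="items-name">Community B</span><p class="price-txt">Price pending</p></div>
<div class="item-mod"><span class="items-name">Community C</span><p class="price">9500</p></div>
</div>'

page <- read_html(html)

# One node per listing; ".item-mod" is an assumed container class
items <- html_nodes(page, ".item-mod")

# html_node() keeps one entry per container, so both columns stay aligned;
# containers without a ".price" element yield NA after html_text()
property_data <- data.frame(
  name  = items %>% html_node(".items-name") %>% html_text(),
  price = items %>% html_node(".price") %>% html_text(),
  stringsAsFactors = FALSE
)

print(property_data)
```

Because both columns are derived from the same set of containers, they always have the same length, and the listings without a price can be found with is.na(property_data$price).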
