Web Scraping with Rvest: set missing entries to NA

Question

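I'm an absolute beginner in R, and I've been trying to scrape shoe prices from this Sprinter Sports page. The end goal is an automatically generated dataset that loads, every day, (i) the original price and (ii) the discounted price of the shoes I'm interested in.

The problem is that, of the 24 shoes currently on sale, only 16 have both an "original" and a "discounted" price. The remaining 8 have no "discounted" price because they are not being sold at a discount. Since the "original" column has 24 observations and the "discounted" column only 16, I can't combine them into one dataset.

How can I load the shoes that have no discount so that their "discounted" column is set to NA? My code is below. Thanks!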

library(rvest)
library(tibble)

# Today's date as yymmdd, e.g. "210719"
date_today <- substring(gsub("-", "", Sys.Date()), 3)

page_sp_merrel <- read_html("https://www.sprintersports.com/pt/sapatilhas-merrell-homem?page=1&per_page=50")

# Each selector is matched against the whole page, so a price element
# that is missing from some cards yields a shorter vector.
price_old_sp_merrel <- page_sp_merrel %>%
  html_nodes(".product-card__info-price-old") %>%
  html_text()

price_new_sp_merrel <- page_sp_merrel %>%
  html_nodes(".product-card__info-price-actual") %>%
  html_text()

product_name_sp_merrel <- page_sp_merrel %>%
  html_nodes(".col-md-3 .product-card__info-name") %>%
  html_text()

# The columns have different lengths, so they cannot be combined here
sp_merrel_df <- tibble(
  price_old = price_old_sp_merrel,
  price_new = price_new_sp_merrel,
  product_name = product_name_sp_merrel,
  date = date_today
)

Answer 1

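This can be done as follows. The main difference from your approach is that I loop over the product cards and extract the required information from each card directly into a data frame. If an element does not exist on a card, html_node() returns a missing node for it and html_text() gives NA automatically: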

library(rvest)

# Today's date as yymmdd, e.g. "210719"
date_today <- substring(gsub("-", "", Sys.Date()), 3)

page_sp_merrel <- read_html("https://www.sprintersports.com/pt/sapatilhas-merrell-homem?page=1&per_page=50")

sp_merrel_df <- page_sp_merrel %>% 
  # One node per product card
  html_nodes(".product-card__info-data") %>% 
  # Build a one-row data frame per card and row-bind them
  purrr::map_df(function(x) {
    data.frame(
      product_name = html_node(x, ".product-card__info-name") %>% html_text(),
      # NA when the card has no old price
      price_old = html_node(x, ".product-card__info-price-old") %>% html_text(),
      price_new = html_node(x, ".product-card__info-price-actual") %>% html_text(),
      date = date_today
    )
  })

head(sp_merrel_df)
#>                  product_name price_old price_new   date
#> 1          Merrell Riverbed 3   69,99 €   59,99 € 210719
#> 2 Sapatilhas Montanha Merrell      <NA>  114,99 € 210719
#> 3      Merrell Moab Adventure      <NA>   99,99 € 210719
#> 4          Merrel Moab 2 Vent   99,99 €   79,99 € 210719
#> 5          Merrell Alverstone      <NA>   79,99 € 210719
#> 6           Merrell Chameleon      <NA>  129,99 € 210719
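
To see why working card by card produces the NA entries, here is a minimal offline sketch. The HTML snippet, its class names (.card, .name, .old, .new) and the variable names are made up for illustration and are not taken from the Sprinter Sports page. It contrasts html_nodes() on the whole page, which silently skips cards without a match, with html_node() on the set of cards, which returns exactly one result per card.

```r
library(rvest)

# Hypothetical two-card page: the second card has no old price.
page <- read_html('
  <div class="card">
    <span class="name">Shoe A</span>
    <span class="old">69,99</span>
    <span class="new">59,99</span>
  </div>
  <div class="card">
    <span class="name">Shoe B</span>
    <span class="new">114,99</span>
  </div>
')

cards <- html_nodes(page, ".card")

# Page-level match: only elements that exist are returned (length 1).
old_all <- page %>% html_nodes(".old") %>% html_text()

# Card-level match: one result per card, NA where the element is absent.
old_per_card <- cards %>% html_node(".old") %>% html_text()

length(old_all)
length(old_per_card)
is.na(old_per_card)
```

Because html_node() accepts a whole node set, this sketch gets aligned columns without an explicit loop. The answer's purrr::map_df() version reaches the same result one card at a time.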
