Web scraping using R: I want to extract some table-like data from a website

Question

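I'm having some trouble scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands

More precisely, I want to extract the brands listed on the right-hand side.

My idea so far: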

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>%
  html_text()

But this does not return the expected information. I would really appreciate some help here. Thanks!

Answer 1

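The data is pulled dynamically from a script tag (the one with id __NEXT_DATA__). You can extract the content of that script tag and parse it as JSON. Subset the returned list to just the items of interest, then extract the brand names: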

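library(rvest)
library(jsonlite)
library(stringr)
library(purrr)  # provides map()

# Read the page and parse the JSON embedded in the __NEXT_DATA__ script tag
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json()

# Keep only the entries whose key starts with "Brand:"
data <- data$props$pageProps$apolloState
mask <- map(names(data), str_detect, '^Brand:') %>% unlist()
data <- subset(data, mask)

# Pull the name field out of each remaining entry
brands <- lapply(data, function(x){x$name})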
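
Since the live page may change or be unavailable, the same steps can be tried offline on a small inline document. This is only a sketch: the HTML string, the keys and the brand names below are invented for illustration and are not taken from the real site, and startsWith() and vapply() are swapped in for str_detect() and lapply() so that only rvest and jsonlite are needed.

```r
library(rvest)
library(jsonlite)

# Hypothetical miniature page; keys and values are made up for this example
html <- '<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"apolloState":{
  "Brand:1":{"name":"Acme"},
  "Brand:2":{"name":"Globex"},
  "Supplier:9":{"name":"Example Supplier"}
}}}}
</script>
</body></html>'

# Extract the script tag's text and parse it as JSON (nested named lists)
state <- read_html(html) %>%
  html_node("#__NEXT_DATA__") %>%
  html_text() %>%
  parse_json()

# Drill down to the apolloState list and keep keys starting with "Brand:"
state <- state$props$pageProps$apolloState
is_brand <- startsWith(names(state), "Brand:")

# Collect the name field of each brand entry into a character vector
brands <- vapply(state[is_brand], function(x) x$name, character(1))
brands <- unname(brands)
```

Here brands should hold only the two invented brand names, with the "Supplier:9" entry filtered out by the key prefix.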

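I find the version with intermediate assignments easier to read, but the same steps can also be written as a single pipeline, for example: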

library(rvest)
library(jsonlite)
library(stringr)

# Same steps as one pipeline; inside {} the dot refers to the left-hand side
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json() %>% 
  {.$props$pageProps$apolloState} %>% 
  subset(., {str_detect(names(.), 'Brand:')}) %>% 
  lapply(. , function(x){x$name})

I read in a comment that wrapping a call in {} makes the pipe treat it as an expression rather than as a function call.
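
As a small illustration of that brace behaviour with the magrittr pipe (the %>% that rvest re-exports), here is a minimal sketch. The list page and its contents are hypothetical stand-ins for the parsed JSON, not real data.

```r
library(magrittr)

# Hypothetical nested list standing in for the parsed JSON
page <- list(props = list(pageProps = list(apolloState = list(a = 1, b = 2))))

# Braces make the right-hand side an expression evaluated with `.` bound to
# the left-hand side, so `$` indexing can be used directly in a pipeline
state <- page %>% {.$props$pageProps$apolloState}

# The same idea with `.` used more than once in a single expression
avg <- c(2, 4, 6) %>% {sum(.) / length(.)}
```

Without the braces, the pipe would try to pass the left-hand side as the first argument of a function call, which is not what is wanted when the step is just indexing into a list.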
