Scrape image URL from website using R

Question

I am trying to get image URLs from a web page using 'rvest' in R, without success. Here is the code:

library(rvest)
library(magrittr)

imageURL <- read_html("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue") %>%
    html_nodes(css = "img") %>%
    html_attr("src")

The same code works for "https://en.wikipedia.org/wiki/Lady_Jane_Grey".

I don't know what I'm doing wrong.

Answer 1

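Open https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue in your web browser, right-click, and select "view source" or similar. Then search the source for img tags. You won't find anything corresponding to the images you are interested in. Why? Because the page does not contain the images; it contains JavaScript that generates the page containing the images. The rvest package does not evaluate JavaScript; it works directly with the source you see when you click "view source" in your browser.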
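To make the point concrete, here is a minimal sketch using a made-up page (the HTML string and the example.com URL are assumptions for illustration, not the actual ajio.com source). The image only comes into existence when a browser runs the script, so rvest finds nothing:

```r
library(rvest)

# Hypothetical page source: the image is created by JavaScript at load time,
# so the static HTML contains no img element at all.
static_source <- '
<html><body>
  <div id="gallery"></div>
  <script>
    var img = document.createElement("img");
    img.src = "https://example.com/cap.jpg";
    document.getElementById("gallery").appendChild(img);
  </script>
</body></html>'

image_urls <- read_html(static_source) %>%
    html_nodes(css = "img") %>%
    html_attr("src")

length(image_urls)  # 0: rvest parses the markup but never runs the script
```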

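The bottom line is that this page will be difficult to handle with rvest alone. Your best option is probably to use a browser driver instead, such as RSelenium.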
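A rough sketch of what the browser-driver route might look like follows. It assumes RSelenium is installed together with a local Firefox and a matching driver; the function name `get_rendered_image_urls` and the fixed wait time are hypothetical choices, and a real page may need a smarter wait than `Sys.sleep()`:

```r
# Sketch only: requires the RSelenium package, a local browser, and its driver.
# get_rendered_image_urls is a hypothetical helper, not part of any package.
get_rendered_image_urls <- function(url, wait_seconds = 5) {
    driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
    # Always shut the browser and the Selenium server down, even on error.
    on.exit({
        driver$client$close()
        driver$server$stop()
    })

    client <- driver$client
    client$navigate(url)
    Sys.sleep(wait_seconds)  # crude wait for the JavaScript to build the page

    # The page source now reflects the rendered DOM, which rvest can parse.
    rendered_html <- client$getPageSource()[[1]]
    page <- rvest::read_html(rendered_html)
    rvest::html_attr(rvest::html_nodes(page, "img"), "src")
}
```

Calling `get_rendered_image_urls("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue")` should then return the `src` attributes of the rendered page, provided the browser setup works on your machine.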

Answer 2

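As Ista correctly points out, this is a tricky problem. However, one alternative to a full JavaScript solution is to parse the JSON that feeds those scripts.

A simple search through the page's HTML source shows that the image URLs are stored as JSON inside the script node that begins with the string "window.__PRELOADED_STATE__ =".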

library(tidyverse)
library(rvest)
library(jsonlite)

obj <- read_html("https://www.ajio.com/ajio-twill-snapback-cap/p/460022581_royalblue")

extracted_json <- obj %>%
    html_nodes(xpath = '//script') %>%
    .[10] %>% ## The relevant content is in the 10th script node
    html_text(trim = TRUE) %>%
    gsub('^window\\.__PRELOADED_STATE__ = |;$', '', .) ## strip the assignment prefix and the trailing semicolon to leave plain JSON

object_json <- fromJSON(extracted_json, simplifyDataFrame = TRUE)
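Picking the script node by position breaks as soon as the site adds or removes a script. A possible variation, sketched here on a small made-up page (the HTML string and the example.com URLs are assumptions, and the real JSON is far larger), is to select the node by its content instead:

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the real page source.
page <- read_html('
<html><head>
  <script>var unrelated = 1;</script>
  <script>window.__PRELOADED_STATE__ = {"product":{"productDetails":{"images":[{"url":"https://example.com/a.jpg"},{"url":"https://example.com/b.jpg"}]}}};</script>
</head><body></body></html>')

# Select the script whose text mentions the variable, wherever it sits.
script_text <- page %>%
    html_nodes(xpath = "//script[contains(text(), '__PRELOADED_STATE__')]") %>%
    html_text(trim = TRUE)

# Strip the JavaScript assignment and the trailing semicolon to leave plain JSON.
json_text <- sub("^window\\.__PRELOADED_STATE__\\s*=\\s*", "", script_text)
json_text <- sub(";\\s*$", "", json_text)

state <- fromJSON(json_text)
image_urls <- state$product$productDetails$images$url
image_urls
```

This still assumes that exactly one script mentions the variable and that its body is a single JSON object followed by a semicolon.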

We print object_json and look for a cluster of .jpg strings...

object_json

We find one such cluster at "$product$productDetails$images", which happens to be a data frame rather than a simple list.
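Instead of scanning the printed output by eye, the search can be done in code. The sketch below uses a small hypothetical list in place of the real object_json (its contents and the example.com URLs are made up): flatten the structure with `unlist()` and keep the entries that end in ".jpg". The names of the hits show the path to follow.

```r
# Hypothetical nested structure standing in for the parsed JSON.
state <- list(
    product = list(
        name = "cap",
        productDetails = list(
            images = data.frame(
                url = c("https://example.com/a.jpg", "https://example.com/b.jpg"),
                stringsAsFactors = FALSE
            )
        )
    ),
    other = list(logo = "https://example.com/logo.png")
)

# unlist() flattens the nesting into a named character vector; the names
# join each level with ".", which reveals where every value lives.
flat <- unlist(state)
jpg_hits <- flat[grepl("\\.jpg$", flat)]

names(jpg_hits)     # paths to the matching entries
unname(jpg_hits)    # the URLs themselves
```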

DF <- object_json$product$productDetails$images %>% as_tibble() ## as_data_frame() is deprecated in favour of as_tibble()
unique(DF$url)
