How to automatically download multiple images with broken links in R?

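The goal here is to download a bunch of images, but some of the image URLs are broken. What I'd like to do is modify the code with a simple next statement, so that if a link returns anything other than status code 200 the code skips to the next URL (or skips to the next one if the link returns a 404). I'm not sure how to write that in vectorized code, and when I try to write it as a for loop I can't figure out how to initialize a vector of type "picture" to write into inside the loop. So right now I'm reading the function's source, trying to find where the error is raised and where to put a next statement or something like it, in case a next statement can't be placed in some form of vectorized code:

Simple vectorized code: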

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)
image_content <- image_read(image_urls)

The opaque function code (where is the error raised? It is just a bunch of calls that read different kinds of images):

function (path, density = NULL, depth = NULL, strip = FALSE, 
    coalesce = TRUE, defines = NULL) 
{
    if (is.numeric(density)) 
        density <- paste0(density, "x", density)
    density <- as.character(density)
    depth <- as.integer(depth)
    
    #doesn't seem relevant: https://rdrr.io/cran/magick/src/R/defines.R
    defines <- validate_defines(defines)
    
    #test whether the object is an instance of an S4 class and a function to test inheritance relationships between object and class -- seems relevant maybe?
    image <- if (isS4(path) && methods::is(path, "Image"))
      {
        #bioconductor class
        convert_EBImage(path)
    }
    else if (inherits(path, "nativeRaster") || (is.matrix(path) && 
        is.integer(path))) {
        image_read_nativeraster(path)
    }
    else if (inherits(path, "cimg")) {
        image_read_cimg((path))
    }
    else if (grDevices::is.raster(path)) {
        image_read_raster2(path)
    }
    else if (is.matrix(path) && is.character(path)) {
        image_read_raster2(grDevices::as.raster(path))
    }
    else if (is.array(path)) {
        image_readbitmap(path)
    }
    else if (is.raw(path)) {
        magick_image_readbin(path, density, depth, strip, defines)
    }
    else if (is.character(path) && all(nchar(path))) {
        path <- vapply(path, replace_url, character(1))
        path <- if (is_windows()) {
            enc2utf8(path)
        }
        else {
            enc2native(path)
        }
        magick_image_readpath(path, density, depth, strip, defines)
    }
    else {
        stop("path must be URL, filename or raw vector")
    }
    if (is.character(path) && !isTRUE(magick_config()$rsvg)) {
        if (any(grepl("\\.svg$", tolower(path))) || any(grepl("svg|mvg", 
            tolower(image_info(image)$format)))) {
            warning("ImageMagick was built without librsvg which causes poor qualty of SVG rendering.\nFor better results use image_read_svg() which uses the rsvg package.", 
                call. = FALSE)
        }
    }
    if (isTRUE(coalesce) && length(image) > 1 && identical("GIF", 
        toupper(image_info(image)$format[1]))) {
        return(image_coalesce(image))
    }
    return(image)
}

When a link is broken, this returns: Error in download_url(path): Failed to download 'link' (HTTP 404)

A possible loop version?

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)

image_content <- c() #doesn't work, nor does NULL 
#nor does setting to typeof image_content <- image_url[1]

for(i in 1:length(image_urls)){
  image_content[i] = image_read(image_urls[i])
    if(grepl('404', download_path(url), fixed = TRUE) == T)
    next
}

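But again, I can't initialize the vector, and in any case I don't know whether the loop would break before it ever reaches the if statement.

Maybe I should use another library... or just another language?

Here is some sample data: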

data <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

You can use the try() function:

image_urls <- data

image_content <- lapply(seq_along(image_urls), function(i) try(image_read(image_urls[i])))

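This stores your images in a list. Using

image_content[[1]]

gives you access to the first image. If a URL fails with an error like

Error in curl::curl_fetch_memory(url) : 
Could not resolve host: img-s-msn-com.net simpleError in curl::curl_fetch_memory(url)

the error is captured instead of stopping the run (that list element holds a "try-error" object rather than an image), and lapply moves on to the next URL.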
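
The for loop from the question can be made to work in the same spirit. The following is a minimal sketch, not code from the original answer: the helper name read_images_loop and its reader argument (defaulting to magick::image_read) are hypothetical and introduced only for illustration. It pre-allocates a list with vector("list", n), which can hold image objects where c() or NULL could not, and uses tryCatch() with next to skip any URL that fails.

```r
# Hypothetical helper (not part of magick): read URLs one by one,
# skipping any that raise an error.
read_images_loop <- function(urls, reader = magick::image_read) {
  # A list can hold arbitrary objects, one slot per URL; all slots start as NULL
  images <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # tryCatch() turns any error (HTTP 404, unresolved host, ...) into NULL
    img <- tryCatch(reader(urls[i]), error = function(e) NULL)
    # Skip failures; assigning NULL with [[<- would delete the slot instead
    if (is.null(img)) next
    images[[i]] <- img
  }
  images
}
```

With the sample data, read_images_loop(data) should leave a NULL in the slot of any URL that could not be read. Filter(Negate(is.null), images) drops those empty slots if only the successful reads are needed.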

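Another option is to use purrr::safely to create a "safe" version of image_read, which returns a result and an error for each URL.

The results can be extracted from the list with something like purrr::map(y, `[[`, 'result').
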
# two working links and one broken
urls <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

# create 'safe' function
image_read_safe <- purrr::safely(magick::image_read)

# apply 'safe' function
y <- purrr::map(urls, image_read_safe)

y
#> [[1]]
#> [[1]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    488       sRGB FALSE    39743   96x96
#> 
#> [[1]]$error
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    533       sRGB FALSE    53910   96x96
#> 
#> [[2]]$error
#> NULL
#> 
#> 
#> [[3]]
#> [[3]]$result
#> NULL
#> 
#> [[3]]$error
#> <simpleError in curl::curl_fetch_memory(url): Could not resolve host: img-s-msn-com.net>

Created on 2021-09-10 by the reprex package (v2.0.0)
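
To get from the safely() output to only the images that loaded, the NULL results can be dropped. This is a minimal sketch that assumes purrr is installed; the helper name split_safe_results is hypothetical, and it accepts any list shaped like y above (one element per URL, each with result and error entries).

```r
library(purrr)

# Hypothetical helper: split a safely()-style list into successes and errors.
split_safe_results <- function(y) {
  results <- map(y, "result")   # NULL where the call failed
  errors  <- map(y, "error")    # NULL where the call succeeded
  list(
    images = compact(results),  # compact() drops the NULL elements
    errors = compact(errors),
    failed = which(map_lgl(results, is.null))  # positions of the failed URLs
  )
}
```

Calling out <- split_safe_results(y) gives the successful reads in out$images and the positions of the broken URLs in out$failed. If a single image stack is wanted, magick::image_join(out$images) should combine the list, assuming every element is a magick image.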