How to automatically download multiple images with broken links in R?
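The goal here is to download a batch of images, but some of the image URLs are broken. What I'd like to do is modify the code with a simple next statement, so that if a link returns anything other than status code 200 it skips to the next URL (or, if a link returns a 404, skips to the next one). I'm not sure how to write that in vectorized code, and when I try to write it as a for loop, I can't figure out how to initialize a vector of "picture" type to write into inside the loop. So now I'm reading the function's source, trying to find where the error is raised and where to put a next statement or something similar, in case a next statement can't be used in some form of vectorized code.
Simple vectorized code: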
library(magick)
library(rsvg)
image_urls <- na.omit(articles$url_to_image)
image_content <- image_read(image_urls)
Opaque function code (where is the error raised? It is just a series of calls that read different kinds of images):
function (path, density = NULL, depth = NULL, strip = FALSE,
coalesce = TRUE, defines = NULL)
{
if (is.numeric(density))
density <- paste0(density, "x", density)
density <- as.character(density)
depth <- as.integer(depth)
#doesn't seem relevant: https://rdrr.io/cran/magick/src/R/defines.R
defines <- validate_defines(defines)
#test whether the object is an instance of an S4 class and a function to test inheritance relationships between object and class -- seems relevant maybe?
image <- if (isS4(path) && methods::is(path, "Image"))
{
#bioconductor class
convert_EBImage(path)
}
else if (inherits(path, "nativeRaster") || (is.matrix(path) &&
is.integer(path))) {
image_read_nativeraster(path)
}
else if (inherits(path, "cimg")) {
image_read_cimg((path))
}
else if (grDevices::is.raster(path)) {
image_read_raster2(path)
}
else if (is.matrix(path) && is.character(path)) {
image_read_raster2(grDevices::as.raster(path))
}
else if (is.array(path)) {
image_readbitmap(path)
}
else if (is.raw(path)) {
magick_image_readbin(path, density, depth, strip, defines)
}
else if (is.character(path) && all(nchar(path))) {
path <- vapply(path, replace_url, character(1))
path <- if (is_windows()) {
enc2utf8(path)
}
else {
enc2native(path)
}
magick_image_readpath(path, density, depth, strip, defines)
}
else {
stop("path must be URL, filename or raw vector")
}
if (is.character(path) && !isTRUE(magick_config()$rsvg)) {
if (any(grepl("\\.svg$", tolower(path))) || any(grepl("svg|mvg",
tolower(image_info(image)$format)))) {
warning("ImageMagick was built without librsvg which causes poor qualty of SVG rendering.\nFor better results use image_read_svg() which uses the rsvg package.",
call. = FALSE)
}
}
if (isTRUE(coalesce) && length(image) > 1 && identical("GIF",
toupper(image_info(image)$format[1]))) {
return(image_coalesce(image))
}
return(image)
}
When a link is broken, it returns an error from download_url(path):
failed to download "link" (HTTP 404)
Possible loop code?
library(magick)
library(rsvg)
image_urls <- na.omit(articles$url_to_image)
image_content <- c() #doesn't work, nor does NULL
#nor does setting to typeof image_content <- image_url[1]
for(i in 1:length(image_urls)){
image_content[i] = image_read(image_urls[i])
if(grepl('404', download_path(url), fixed = TRUE) == T)
next
}
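But again, I can't initialize it, and I don't know whether the loop would break before it even reaches the if statement.
Maybe I should use another library... or just another language?
Here is some sample data: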
data <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f",
"https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f",
"https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")
You can use the try function:
image_urls <- data
image_content <- lapply(seq_along(image_urls), function(i) try(image_read(image_urls[i])))
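This stores your images in a list. Using
image_content[[1]]
gives you access to the first image. If an error such as
Error in curl::curl_fetch_memory(url) :
Could not resolve host: img-s-msn-com.net
occurs, try() catches it and stores a "try-error" object in that position, and lapply moves on to the next URL.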
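Because the failed downloads stay in the list as "try-error" objects, you will usually want to separate them from the real images afterwards. A minimal sketch of one way to do that follows. The helper name read_keep_ok and its reader argument are hypothetical and not part of magick; reader defaults to magick::image_read and is only there so another reading function can be swapped in.

```r
# Hypothetical helper (not part of magick): read each URL with try(),
# then split the successes from the "try-error" results.
read_keep_ok <- function(urls, reader = magick::image_read) {
  # silent = TRUE suppresses the printed error message for each failure
  results <- lapply(urls, function(u) try(reader(u), silent = TRUE))
  # TRUE for every element where the reader raised an error
  failed <- vapply(results, inherits, logical(1), what = "try-error")
  list(images = results[!failed], failed_urls = urls[failed])
}
```

Calling out <- read_keep_ok(image_urls) would then give the successfully read images in out$images and the URLs that could not be read in out$failed_urls.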
Another option is to use purrr::safely to create a "safe" version of image_read that returns a result and an error for each URL. The results can be extracted from the list with something like purrr::map(y, `[[`, 'result').
# two working links and one broken
urls <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f",
"https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f",
"https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")
# create 'safe' function
image_read_safe <- purrr::safely(magick::image_read)
# apply 'safe' function
y <- purrr::map(urls, image_read_safe)
y
#> [[1]]
#> [[1]]$result
#> format width height colorspace matte filesize density
#> 1 JPEG 799 488 sRGB FALSE 39743 96x96
#>
#> [[1]]$error
#> NULL
#>
#>
#> [[2]]
#> [[2]]$result
#> format width height colorspace matte filesize density
#> 1 JPEG 799 533 sRGB FALSE 53910 96x96
#>
#> [[2]]$error
#> NULL
#>
#>
#> [[3]]
#> [[3]]$result
#> NULL
#>
#> [[3]]$error
#> <simpleError in curl::curl_fetch_memory(url): Could not resolve host: img-s-msn-com.net>
Created on 2021-09-10 by the reprex package (v2.0.0)
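If you would rather keep the for loop and next from the question, that can work too. In the original loop, the error from image_read stops execution before the if statement is ever reached, so the error has to be caught first. The sketch below preallocates a plain list with vector("list", n) instead of trying to create an image-typed vector. The function name read_with_next and its reader argument are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: loop over the URLs, skip any that fail with `next`,
# and return only the images that were read successfully.
read_with_next <- function(urls, reader = magick::image_read) {
  images <- vector("list", length(urls))  # preallocate a plain list
  names(images) <- urls
  for (i in seq_along(urls)) {
    # tryCatch turns any error (404, unresolved host, ...) into NULL
    img <- tryCatch(reader(urls[i]), error = function(e) NULL)
    if (is.null(img)) next                # broken link: leave the slot empty
    images[[i]] <- img
  }
  Filter(Negate(is.null), images)         # drop the skipped slots
}
```

With magick installed, images <- read_with_next(urls) should return a named list of the readable images, which could then be combined with magick::image_join(images) if a single image object is needed.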