Check if a remote file exists in bash
I am downloading files with this script:
parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
Would it be possible to not download the files, just check them on the remote side, and if a file exists create a dummy file instead of downloading?
Something like:
if wget --spider "$url" 2>/dev/null; then
#touch img.file
fi
should be possible, but I don't know how to combine this piece of code with GNU Parallel.
Edit:
Based on Ole's answer I wrote this piece of code:
#!/bin/bash
do_url() {
url=""
wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
#get filename from $url
url2=${url##*/}
wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url
parallel --progress -a urls.txt do_url {}
It works, but it fails for some files. I can't find any consistency in why it works for some files and fails for others; maybe it has something to do with the last filename. The second wget reaches the current URL, but the touch command after it simply doesn't create the expected files. The first wget always (correctly) downloads the main image without _001.jpg, _002.jpg and so on.
Example urls.txt:
http://host.com/092401.jpg (works fine, _001.jpg .. _005.jpg get downloaded)
http://host.com/HT11019.jpg (doesn't work, only the main image gets downloaded)
You could send a command over ssh to see if the remote file exists, and cat it if it does:
ssh your_host 'test -e "somefile" && cat "somefile"' > somefile
You could also try scp, which supports glob expressions and recursion.
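For example (a sketch; your_host and the remote path are placeholders):
# Quote the glob so it is expanded on the remote side, not by the local shell:
scp 'your_host:/remote/path/images/*.jpg' ./images/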
Just loop over the names?
for uname in "${url%.jpg}"_{001..005}.jpg
do
  if wget --spider "$uname" 2>/dev/null; then
    touch ./images/"${uname##*/}"
  fi
done
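To combine this with GNU Parallel as in your original script, one option is to wrap the loop in an exported function (a sketch; check_url is a made-up name, and the main URL is checked alongside the _001.._005 variants):
check_url() {
  url="$1"
  for uname in "$url" "${url%.jpg}"_{001..005}.jpg; do
    if wget --spider "$uname" 2>/dev/null; then
      touch ./images/"${uname##*/}"
    fi
  done
}
export -f check_url
parallel --progress -j16 check_url {} :::: urls.txt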
You can use curl instead to check whether the URLs you are parsing exist, without downloading any file:
if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
fi
Explanation:
--fail will make the exit status nonzero on a failed request.
--head will avoid downloading the file contents.
--silent will avoid status or errors from being emitted by the check itself.
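For reference, these options have the short forms -f, -I (for --head) and -s, so the same check can be written more tersely:
if curl -sfI "$url" >/dev/null; then
    touch ./images/"${url##*/}"
fi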
To address the "looping" issue, you can do:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
  if curl --head --silent --fail "$url" > /dev/null; then
    touch ./images/"${url##*/}"
  fi
done
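If you also want these checks parallelized as in your original script, one way (a sketch; {1.}{2} glues the URL stem to each suffix and {1/.}{2} builds the local file name) could be:
parallel -j16 'curl --head --silent --fail {1.}{2} >/dev/null && touch ./images/{1/.}{2}' \
    :::: urls.txt ::: .jpg _{001..005}.jpg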
From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather about how to do proper looping in a shell script.
Here is a simple solution for that:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
  if wget -q --method=HEAD "$url"; then
    touch ./images/"${url##*/}"
  fi
done
What this does is invoke Wget with the --method=HEAD option. With a HEAD request, the server will simply report whether the file exists or not, without returning any data.
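You can see this reflected in the exit status (example.com used as a stand-in here; wget exits with 8 when the server issues an error response such as 404):
$ wget -q --method=HEAD http://example.com/; echo $?
0
$ wget -q --method=HEAD http://example.com/no-such-file.jpg; echo $?
8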
Of course, with a large data set this is pretty inefficient: you create a new connection to the server for every file you try. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2 you can test all of these in parallel, and use the new --stats-site option to get a list of all the files and the specific return code the server provided. For example:
$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:

  http://example.com:
    Status    No. of docs
       404              3
        http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
        http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
        http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200              1
        http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)
You can even have this data printed as CSV or JSON for easier parsing.
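For example (a sketch based on the [FORMAT:]FILE argument the wget2 manual documents for --stats-site; check that your wget2 version supports the csv format):
$ wget2 --spider --progress=none -q --stats-site=csv:stats.csv example.com/{,1,2,3}
# stats.csv then contains one row per URL with its HTTP status code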
It is hard to understand what it really is that you want to accomplish. Let me try to rephrase your question.
I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g. http://example.com/dira/foo.jpg becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.
If the URL exists I want an empty file created.
(Version 1): I want the empty file created in a similar directory structure in the dir images. This is needed because some of the images have the same name, but in different dirs.
So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.
So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images, named after the entry in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg
#!/bin/bash
do_url() {
  url="$1"
  # Version 1:
  # If you want to keep the folder structure from the server (similar to wget -m):
  wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$3"
  # Version 3:
  # If all the images have unique names when _###.jpg is removed and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url
parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
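To see what the four arguments passed to do_url are, here is how the replacement strings expand for one input line, taking {2} = _001.jpg:
# For the line http://example.com/dira/foo.jpg in urls.txt:
#   {1.}{2}  -> http://example.com/dira/foo_001.jpg  ($1: the URL to test)
#   {1//}    -> http://example.com/dira              ($2: dir part, used by Version 1)
#   {1/.}{2} -> foo_001.jpg                          ($3: file name, used by Version 2)
#   {1/}     -> foo.jpg                              ($4: name from urls.txt, used by Version 3)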
GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100%, you can run more jobs in parallel:
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
You can also "unroll" the loop. This will save 5 overheads per URL:
do_url() {
  url="$1"
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
}
export -f do_url
parallel -j0 do_url {.} :::: urls.txt
Finally, you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround