Parallel wget download files does not exit properly
I am trying to download files from a file (test.txt) that contains links (more than 15,000 of them).
I have this script:
#!/bin/bash
function download {
    FILE=$1
    while read line; do
        url=$line
        wget -nc -P ./images/ $url
        #downloading images which are not in the test.txt,
        #by guessing names: 12345_001.jpg, 12345_002.jpg .. 12345_005.jpg etc.
        wget -nc -P ./images/ ${url%.jpg}_{001..005}.jpg
    done < $FILE
}
#test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split
#read the split files and pass each one to the download function
for f in ./temp/split*; do
    download $f &
done
test.txt:
http://xy.com/12345.jpg
http://xy.com/33442.jpg
...
I split the file into parts and background (download $f &) the wget processes so the script can move on to the next split file of links.
The script works, but it does not exit at the end; I have to press Enter when it finishes. If I remove the & from the line download $f &, it works, but I lose the parallel downloads.
Edit:
I have since found that this is not the best way to parallelize wget downloads; it would be better to use GNU Parallel.
The script is exiting, but the background wget processes keep producing output after the script exits, and that output is printed after the shell prompt. That is why you need to press Enter to get another prompt.
Use wget's -q option to turn off its output.
while read line; do
    url=$line
    wget -ncq -P ./images/ "$url"
    wget -ncq -P ./images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"
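A side note, not part of the original answer: if you additionally want the script itself to block until every background download has finished (instead of returning to the prompt while jobs are still running), bash's built-in wait can be added after the loop that launches them, for example:

for f in ./temp/split*; do
    download "$f" &
done
wait    # returns once all backgrounded download jobs have exited
echo "all downloads finished"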
- Please read the wget man page/help (a short sketch using -i follows the option listing below).
Logging and input file:
-i, --input-file=FILE download URLs found in local or external FILE.
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print lots of debugging information.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --no-verbose turn off verboseness, without being quiet.
--report-speed=TYPE Output bandwidth as TYPE. TYPE can be bits.
-i, --input-file=FILE download URLs found in local or external FILE.
-F, --force-html treat input file as HTML.
-B, --base=URL resolves HTML input-file links (-i -F)
relative to URL.
--config=FILE Specify config file to use.
下载:
-nc, --no-clobber 跳过会下载到的下载
现有文件(覆盖它们)。
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
--retry-connrefused retry even if connection is refused.
-O, --output-document=FILE write documents to FILE.
-nc, --no-clobber skip downloads that would download to
existing files (overwriting them).
-c, --continue resume getting a partially-downloaded file.
--progress=TYPE select progress gauge type.
-N, --timestamping don't re-retrieve files unless newer than
local.
--no-use-server-timestamps don't set the local file's timestamp by
the one on the server.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set all timeout values to SECONDS.
--dns-timeout=SECS set the DNS lookup timeout to SECS.
--connect-timeout=SECS set the connect timeout to SECS.
--read-timeout=SECS set the read timeout to SECS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval.
--random-wait wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
--no-proxy explicitly turn off proxy.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
--limit-rate=RATE limit download rate to RATE.
--no-dns-cache disable caching DNS lookups.
--restrict-file-names=OS restrict chars in file names to ones OS allows.
--ignore-case ignore case when matching files/directories.
-4, --inet4-only connect only to IPv4 addresses.
-6, --inet6-only connect only to IPv6 addresses.
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none.
--user=USER set both ftp and http user to USER.
--password=PASS set both ftp and http password to PASS.
--ask-password prompt for passwords.
--no-iri turn off IRI support.
--local-encoding=ENC use ENC as the local encoding for IRIs.
--remote-encoding=ENC use ENC as the default remote encoding.
--unlink remove file before clobber.
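My reading of the option that comment points at (this is my own sketch, not the commenter's words): with -i, wget reads the URL list itself, so the plain-URL part of the while read loop can be replaced by a single wget call per split file:

# download every URL listed in one split file (one URL per line)
wget -ncq -P ./images/ -i "$f"

The guessed _001.._005 names would still need the per-URL loop, since those URLs are not in the input file.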
@Barmar's answer is correct. However, I would like to propose a different, more efficient solution: you could consider using Wget2.
Wget2 is the next major version of GNU Wget. It comes with many new features, including multi-threaded downloads. So with GNU Wget2, all you need to do is pass the --max-threads option and select the number of parallel threads you want to spawn.
You can compile it from the git repository very easily. Packages also exist for Arch Linux in the AUR and in Debian.
Edit: Full disclosure: I am one of the maintainers of GNU Wget and GNU Wget2.
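A minimal sketch of that suggestion (my own example; it assumes wget2 accepts the same -i, -P and -nc style options as wget, which may vary by version):

# let wget2 parallelize internally instead of splitting the URL list yourself
wget2 --max-threads=32 -nc -P ./images/ -i ./temp/test.txt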
May I recommend GNU Parallel?
parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq -P ./images/ {.}_{001..005}.jpg'
I am just guessing at what your input file URLs.txt looks like; I assume it is something like:
http://somesite.com/image1.jpg
http://someothersite.com/someotherimage.jpg
Or, using your own function-based approach:
#!/bin/bash
# define and export a function for "parallel" to call
doit(){
    wget -ncq -P ./images/ "$1"
    wget -ncq -P ./images/ "$2"_{001..005}.jpg
}
export -f doit
parallel --dry-run -j32 -a URLs.txt doit {} {.}
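Roughly how the replacement strings behave (my summary, not part of the original answer): for an input line such as http://xy.com/12345.jpg, GNU Parallel replaces {} with the whole line and {.} with the line minus its extension, so each job runs something like:

wget -ncq -P ./images/ http://xy.com/12345.jpg
wget -ncq -P ./images/ http://xy.com/12345_{001..005}.jpg

Drop --dry-run once the printed commands look right, and adjust -j32 to the number of simultaneous downloads you want.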