Escape Characters On wget --content-disposition Filenaming

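There are many questions about Content-Disposition, but none of them match my problem. I hope someone here can help me.

I want to download a lot of files with wget, and I use the --content-disposition option to get sensible file names. Unfortunately, when the file name contains special characters such as \ | / : ? " * < >, the downloaded file's name is mangled.

For example, say the file I want to download is named Bussiness Insider: How to Start Your Business. Note the special character : in the file name. When I run the script, wget does download the file, but the result is named just Bussiness Insider, has zero size, and has no extension.

I tried --restrict-file-names=windows and the other available options, such as -O with basename, but still no luck.

Here is the script: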

wget --content-disposition --referer="$url" "$dl"

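First, try --restrict-file-names=nocontrol.

If that doesn't work, this is what worked for me: --restrict-file-names=unix (since I'm on a Linux box, or using Bash/Cygwin on Windows).

You may need --restrict-file-names=windows.

As you can see, it now downloads the file with the special characters in its name: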

$ wget --restrict-file-names=unix --content-disposition --referer="$url" "$dl"
$ ls -l
total 17740
-rw-r--r-- 1 giga group 18163514 May 10  2014 iPhone: The Missing Manual, 4th Edition.pdf
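
The mode values can also be combined as a comma-separated list, as the man page excerpt further down explains. Here is a minimal sketch of a wrapper that does so. The function name fetch_named and the example URLs are hypothetical names of mine, and windows,ascii is just one possible combination, not necessarily the right one for your system:

```shell
# Hypothetical wrapper (not part of wget).
# $1 is the referer page, $2 is the download URL.
fetch_named() {
    # Quote both arguments so characters such as ? and & in the URLs
    # are not touched by the shell before wget sees them.
    wget --restrict-file-names=windows,ascii \
         --content-disposition \
         --referer="$1" "$2"
}

# Usage (placeholder URLs):
# fetch_named 'https://example.com/page' 'https://example.com/file?id=1'
```

Swap windows,ascii for unix or nocontrol to compare how the saved name changes.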

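The wget man page says this about the option:

       --restrict-file-names=modes
       Change which characters found in remote URLs must be escaped during generation of local file names.  Characters that are restricted by this option are escaped, i.e. replaced with %HH,
       where HH is the hexadecimal number that corresponds to the restricted character.  This option may also be used to force all alphabetical cases to be either lower- or uppercase.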

       By default, Wget escapes the characters that are not valid or safe as part of file names on your operating system, as well as control characters that are typically unprintable.  This
       option is useful for changing these defaults, perhaps because you are downloading to a non-native partition, or because you want to disable escaping of the control characters, or you
       want to further restrict characters to only those in the ASCII range of values.

       The modes are a comma-separated set of text values. The acceptable values are unix, windows, nocontrol, ascii, lowercase, and uppercase. The values unix and windows are mutually
       exclusive (one will override the other), as are lowercase and uppercase. Those last are special cases, as they do not change the set of characters that would be escaped, but rather
       force local file paths to be converted either to lower- or uppercase.

       When "unix" is specified, Wget escapes the character / and the control characters in the ranges 0--31 and 128--159.  This is the default on Unix-like operating systems.

       When "windows" is given, Wget escapes the characters \, |, /, :, ?, ", *, <, >, and the control characters in the ranges 0--31 and 128--159.  In addition to this, Wget in Windows
       mode uses + instead of : to separate host and port in local file names, and uses @ instead of ? to separate the query portion of the file name from the rest.  Therefore, a URL that
       would be saved as www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as www.xemacs.org+4300/search.pl@input=blah in Windows mode.  This mode is the default on
       Windows.

       **If you specify nocontrol, then the escaping of the control characters is also switched off. This option may make sense when you are downloading URLs whose names contain UTF-8
       characters, on a system which can save and display filenames in UTF-8 (some possible byte values used in UTF-8 byte sequences fall in the range of values designated by Wget as
       "controls").**

       The ascii mode is used to specify that any bytes whose values are outside the range of ASCII characters (that is, greater than 127) shall be escaped. This can be useful when saving
       filenames whose encoding does not match the one used locally.
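
To make the %HH replacement concrete, here is a rough sketch that mimics it with sed for the printable characters the excerpt lists under "windows" mode. escape_windows_name is a hypothetical helper of mine, not something wget provides. It ignores control characters and the + and @ substitutions, and I am assuming uppercase hex digits:

```shell
# Hypothetical helper (not part of wget): percent-escape the printable
# characters that the man page lists for "windows" mode.
escape_windows_name() {
    printf '%s\n' "$1" | sed \
        -e 's/\\/%5C/g' \
        -e 's/|/%7C/g' \
        -e 's|/|%2F|g' \
        -e 's/:/%3A/g' \
        -e 's/?/%3F/g' \
        -e 's/"/%22/g' \
        -e 's/\*/%2A/g' \
        -e 's/</%3C/g' \
        -e 's/>/%3E/g'
}

escape_windows_name 'Bussiness Insider: How to Start'
# prints: Bussiness Insider%3A How to Start
```

This only illustrates the substitution rule. It is not a claim about how wget treats names taken from a Content-Disposition header.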