Pipe curl to awk to download and unzip files

Question

I want to download all the files linked from this part of an HTML page:

    <td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>

The download link for the first file is https://foo.bar/data/24765/dd, and since it is a zip file, I would also like to unzip it.

My script looks like this:

#!/bin/bash
curl -s "https://foo.bar/path/to/page" > data.html

gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt

for f in $(cat data.txt); do 
    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip
done

Is there a more elegant way to write this script? I would like to avoid saving the html, txt and zip files.

Answer 1

The bsdtar command can extract archives from stdin, which lets you do this:

curl -s "https://foo.bar/$f" | bsdtar -xf-

And of course you can pipe the first curl command directly into awk:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt

In fact, you can also pipe the output of that pipeline straight into a loop, which removes data.txt as well:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
while read -r archive; do
    curl -s "https://foo.bar/$archive" | bsdtar -xf-
done
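
Before pointing the loop at the real site, it can help to dry-run the link-extraction stage on its own. The following is only a sketch under a few assumptions: the function name `extract_links` is made up for illustration, the sample rows are the ones from the question, and `sed -E` is swapped in for gawk so the snippet also runs where gawk is not installed. Like the gawk version, it reports at most one link per input line.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the relative path
# of each download link (one per line). This uses sed -E instead of
# gawk's three-argument match(); the regular expression is the same.
extract_links() {
    sed -nE 's|.*href="/(data/[0-9]{5}/dd)".*|\1|p'
}

# Sample rows copied from the question, fed in through a here-document
# so that no network access is needed.
links=$(extract_links <<'EOF'
<td><a class="xm" name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a class="xm" name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
<td><a class="xm" name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>
EOF
)

printf '%s\n' "$links"
```

Once the printed paths look right, the here-document can be replaced by the curl call and the result piped into the while loop shown above.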

Answer 2

I'd like to avoid saving(...)zip files.

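Many Linux command-line tools accept - to mean "use standard input" wherever a file name is expected. After a cursory search, it seems that some versions of unzip do not support this (see How to redirect output of wget as input to unzip? on unix.stackexchange.com), whilst others, like the one described by freebsd.org, do: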

If specified filename is "-", then data is read from stdin.

So if you are using a version that supports it, then this

curl -s "https://foo.bar/$f" > data.zip
unzip data.zip

can be improved to

curl -s "https://foo.bar/$f" | unzip -

If it does not, but you still want to use unzip, then according to the answer on unix.stackexchange.com, prefixing unzip with busybox will fix it:

curl -s "https://foo.bar/$f" | busybox unzip -
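
Since stdin support differs between tools, one option is to choose the extractor at run time. The sketch below is an assumption-laden illustration rather than a tested recipe for every system: `extract_stdin` is a hypothetical name, it assumes that an installed busybox includes the unzip applet, and the last branch gives up on streaming and spools to a temporary file for unzip builds that cannot read an archive from stdin.

```shell
#!/bin/sh
# Define extract_stdin (hypothetical name) using whichever tool on this
# machine can unpack a zip archive arriving on standard input.
if command -v bsdtar >/dev/null 2>&1; then
    # bsdtar reads the archive from stdin when the file name is "-".
    extract_stdin() { bsdtar -xf -; }
elif command -v busybox >/dev/null 2>&1; then
    # Assumes this busybox build ships the unzip applet.
    extract_stdin() { busybox unzip -; }
else
    # Fallback for unzip versions without stdin support:
    # save the stream to a temporary file, extract it, then clean up.
    extract_stdin() {
        tmp=$(mktemp) || return 1
        cat > "$tmp" && unzip -q "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    }
fi

# Intended use inside the download loop:
#   curl -s "https://foo.bar/$f" | extract_stdin
```

Only the fallback branch touches the disk, and it removes its temporary file again, so the working directory ends up with the extracted files only.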
