Cutting HTML in bash

Question

I am currently trying to cut an HTML file down to a certain phrase, or to the text between two phrases.

<p>unneeded text and top of webpage</p>
    <h2><span style="font-size&#58;18px;">text1</span></h2><pre><b>text2&#58;</b>
admin (you)
    password&#58; password1
adminline2
    password&#58; password2
adminline3
    password&#58; password3
adminline4
    password&#58; password4

<b>Authorized Users&#58;</b>
userline
userline2
userline3
</pre><h2><span style="font-size&#58;18px;">text3</span</h2><ul><li>
more unneeded text and bottem of the web page</ul></li>

Using the Bash terminal, I am trying to cut off the top and bottom of this webpage's HTML to get:

<h2><span style="font-size&#58;18px;">text1</span></h2><pre><b>text2&#58;</b>
    admin (you)
        password&#58; password1
    adminline2
        password&#58; password2
    adminline3
        password&#58; password3
    adminline4
        password&#58; password4

    <b>Authorized Users&#58;</b>
    userline
    userline2
    userline3
    </pre><h2><span style="font-size&#58;18px;">text3</span</h2>

I have tried using cut, but it only accepts a single-character delimiter. I also tried using awk to trim the top, like this:

STARTHTML='<h2><span style="font-size&#58;18px;">text1</span></h2><pre><b>text2&#58;</b>'
awk 'BEGIN {FS="$STARTHTML";}{print }' ~/Desktop/input.txt

But the output is just a bunch of blank lines.

How can I use bash to cut a .txt or .html file of a webpage down to these specific lines?

Answer 1

Based on your desired output, can you check whether this works:

sed -n '/<h2>/,/<\/pre>/p' file_name

Explanation:

Since you need the lines between a start pattern (<h2>) and an end pattern (</pre>), I plugged those two patterns into the sed range syntax:

sed -n '/start_pattern_here/,/end_pattern_here/p' file_name


-n     : Suppress automatic printing of pattern space
p      : Print the current pattern space
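
To make the range behaviour concrete, here is a minimal sketch on a shortened stand-in for the page in the question. The file name sample.html, the simplified markup, and the function name cut_section are all assumptions made for illustration. Note that a sed range prints whole lines, so the closing line would still carry the trailing <ul><li>; the extra s command below is one possible way to strip it, assuming the unwanted part always starts at <ul>.

```shell
#!/usr/bin/env bash
# Sketch only: sample.html and cut_section are hypothetical names.

# Build a shortened stand-in for the page from the question.
cat > sample.html <<'EOF'
<p>unneeded text and top of webpage</p>
    <h2><span>text1</span></h2><pre><b>text2:</b>
admin (you)
    password: password1

<b>Authorized Users:</b>
userline
</pre><h2><span>text3</span></h2><ul><li>
more unneeded text and bottom of the web page</ul></li>
EOF

# Print from the first line matching <h2> through the next line matching </pre>.
# On that closing line only, delete everything from <ul> onward before printing.
cut_section() {
    sed -n '/<h2>/,/<\/pre>/{
        /<\/pre>/s/<ul>.*//
        p
    }' "$1"
}

cut_section sample.html
```

Because the patterns are matched per line, this only works when the start and end markers sit on different lines, as they do in the sample. For anything less regular, a real HTML parser is likely a safer choice than sed.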


html

bash

ubuntu

parsing

ubuntu-16.04