Get content between HTML tags using grep
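I have an HTML file that I am trying to extract data from. The site is https://www.tv2.no/nyheter, and I am trying to get all the news articles from it.
I fetch the page with wget -O news.html https://www.tv2.no/nyheter
This creates a local file for me.
I then try to get all the articles that have the class article--nyheter. I tried running this command:
tr '\n' ' ' < news.html | grep -E "^<article class="article-nyheter">.*$"
But I get no results. The HTML structure looks like this: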
<body>
<div>
<article class="article column large-4 small-12">
hello
</article>
</div>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
Example output, since both of the following articles contain the class name article--nyheter:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
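I have to use grep, sed, curl, or awk for this; I cannot use any other parser.
So my expected output is every article tag that has the specific class, along with everything inside those article tags.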
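Assumptions:
- there is a valid reason for not using an HTML-centric tool to parse out the desired sections
- the input is formatted exactly as in the question; otherwise the proposed sed solution may not work
- extract the <article> ... </article> pairs whose article class entry contains the string article--nyheter
- OP's expected output lists the two article--nyheter sections in reverse order; for now I assume this is some sort of typo and that there is no requirement to reorder the two sections
One sed idea that uses a range to extract the desired data: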
sed -n '/<article class.*article--nyheter/,/<\/article>/p' news.html
This generates:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src=""
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
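If the input data is not formatted as shown in the question (e.g., it lacks carriage returns/linefeeds), this sed solution will likely not work; a more robust parser would need to be built (e.g., via awk) ...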
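As a rough illustration of the awk route mentioned above, here is a minimal sketch that does not depend on line breaks: it reads the whole file into one string and cuts out each <article ...> ... </article> span whose opening tag contains article--nyheter. The function name extract_articles and the file sample.html are hypothetical, the sample markup is a shortened stand-in for the real page, and the sketch assumes article elements are never nested and that the class name appears literally inside the opening tag.

```shell
#!/bin/sh
# Hypothetical sample input: markup like the question's, squeezed onto one line.
printf '%s\n' '<body><div><article class="article column">hello</article></div><article class="article column article--nyheter"><a href="/nyheter/1/">one</a></article><article class="article--nyheter"><a href="/nyheter/2/">two</a></article></body>' > sample.html

# extract_articles FILE
# Prints every <article ...>...</article> block whose opening tag
# contains the class name held in the awk variable cls.
extract_articles() {
    awk -v cls='article--nyheter' '
        { buf = buf $0 "\n" }                      # slurp the whole file
        END {
            closer = "</article>"
            while ((start = index(buf, "<article")) > 0) {
                buf = substr(buf, start)           # drop text before the tag
                end = index(buf, closer)
                if (end == 0) break                # unterminated block: stop
                block = substr(buf, 1, end + length(closer) - 1)
                opentag = substr(block, 1, index(block, ">"))
                if (index(opentag, cls) > 0)       # class only in opening tag
                    print block
                buf = substr(buf, end + length(closer))
            }
        }
    ' "$1"
}

extract_articles sample.html
```

Because the class check looks only at the opening tag, the plain article containing hello is skipped. This is still string matching rather than real HTML parsing, so unusual markup (nested articles, a > inside an attribute value) would break it.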