Extract a string with regular expression grouping on the terminal

Question

I have a text file containing some HTML, like this:

<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: Hierarchical Clustering</a> (6:33)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
<li><a href="https://www.youtube.com/watch?v=6l9V1sINzhE&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Bradley Efron</a> (12:08)</li>
<li><a href="https://www.youtube.com/watch?v=79tR7BvYE6w&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Jerome Friedman</a> (10:29)</li>
<li><a href="https://www.youtube.com/watch?v=MEMGOlJxxz0&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interviews with statistics graduate students</a> (7:44)</li>

I extract the links with grep -oP "https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)" list > links, where list is the HTML file. I also need to extract the name of each video, i.e. I need another list like this:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

The problem is that I also have tags like <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>, so I can't simply match on the a tag. I need something like pattern grouping, where I can use $1 for the first matched group, $2 for the second, and so on, as in https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)/[SOME INFORMATION ON URL HERE]/([A-Za-z0-9-_]+). How can I do this on the terminal (Bash)?

Answer 1

You can use a non-greedy regular expression like this:

>([^<]+?)</a>

Or, more precisely, you can use look-arounds:

(?<=>)([^<]+?)(?=</a>)

Result:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
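
The answer gives only the pattern, so here is a minimal sketch of how it might be run from Bash. It assumes GNU grep built with PCRE support (the -P option); the temporary file and the variable names sample and titles are made up for illustration, and the two input lines are copied from the question.

```shell
# Hypothetical sample file holding two lines of the HTML from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# -o prints only the matched text; -P enables Perl-compatible regexes,
# which the look-behind (?<=>) and the look-ahead (?=</a>) require.
titles=$(grep -oP '(?<=>)[^<]+?(?=</a>)' "$sample")
printf '%s\n' "$titles"

rm -f "$sample"
```

Because the look-arounds are zero-width, the > and </a> delimiters are checked but never become part of the printed match.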

Answer 2

You can do the following:

grep -oP "(?<=\">).*(?=</a)" your_file

This prints:

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

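Since there is no easy way to make grep print only a captured group, I use look-ahead and look-behind assertions so that only the wanted part is printed.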
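
If real capture groups are wanted, as the question asks, sed is one alternative; this is not part of the answer above, just a sketch under a few assumptions. sed calls the groups \1 and \2 rather than $1 and $2. The sample file, the variable names, and the " => " separator are made up for illustration.

```shell
# Hypothetical sample: two YouTube entries plus the non-YouTube link from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# -n suppresses automatic printing, -E selects extended regexes.
# Group 1 captures the video URL, group 2 the link text; the trailing p
# prints only the lines where the substitution succeeded.
pairs=$(sed -nE 's#.*<a href="(https://www\.youtube\.com/watch\?v=[A-Za-z0-9_-]+)[^"]*">([^<]+)</a>.*#\2 => \1#p' "$sample")
printf '%s\n' "$pairs"

rm -f "$sample"
```

The non-YouTube line never matches the pattern, so it is skipped without any extra filtering.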

Answer 3

You can use \K to discard everything matched before the part you actually want:

grep -oP "a href=\"[^>]+>\K[^<]+" file

Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students

Or, assuming "> does not appear anywhere else:

grep -oP "\">\K[^<]+" file
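
The question mentions other a tags that should not be picked up. One possible way to handle that with \K is to spell out the YouTube URL prefix before it; the sketch below assumes GNU grep with -P, and the sample file and variable names are hypothetical.

```shell
# Hypothetical sample: two YouTube entries plus the non-YouTube link from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# Everything before \K must match but is dropped from the reported match,
# so only the link text of YouTube watch links is printed.
yt_titles=$(grep -oP 'href="https://www\.youtube\.com/watch[^"]*">\K[^<]+' "$sample")
printf '%s\n' "$yt_titles"

rm -f "$sample"
```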

Answer 4

A portable awk solution:

awk -F '<a href[^>]*>|</a>' '{print $2}' file.html
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
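
To see why the second field holds the title, and how the same idea might skip non-YouTube links, here is a small self-contained sketch. The /youtube\.com\/watch/ line filter is an addition that is not in the answer, and the sample file and variable names are hypothetical.

```shell
# Hypothetical sample: two YouTube entries plus the non-YouTube link from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# The field separator is a regex matching either the opening <a href...> tag
# or the closing </a>, so each line splits into:
#   $1 = text before the link, $2 = link text, $3 = text after the link.
# The leading pattern restricts output to lines with a YouTube watch URL.
awk_titles=$(awk -F '<a href[^>]*>|</a>' '/youtube\.com\/watch/ {print $2}' "$sample")
printf '%s\n' "$awk_titles"

rm -f "$sample"
```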
