How to remove `<a href="file://a>`keep this text`</a>` using sed or perl?

Question

How can I use sed or perl to remove every <a href="file://???">keep this text</a> wrapper while keeping its text, without removing any other <a></a> pair or stray </a>?
Input:

    <p><a class="a" href="file://any" id="b">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

Desired output:

    <p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

I have the regex below, but it is too greedy and removes every </a>:

gsed -E -i 's/<a*href="file:[^>]*>(.+?)<\/a>/>/g' file.xhtml

Answer 1

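One way:

sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g'

<a[^>]*?href="file://[^>]*> matches <a, then any number of non-> characters (written as non-greedy), then href="file://, then any number of non-> characters, then >
([^<]*) matches and captures any number of non-< characters
</a> matches the closing </a>

Everything matched is replaced with the capture in \1, and the trailing g applies the substitution to every occurrence on each line.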

Example:

$ cat data
<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

$ sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g' < data
<p>keep this text, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
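To see why the negated character classes matter, here is a small sketch that runs a greedy `(.+)` capture and the bounded `([^<]*)` capture on the question's input line. The variable names `line`, `greedy`, and `bounded` are made up for this illustration, and the sketch uses a plain `[^>]*` rather than `*?`, since lazy quantifiers are not something every sed supports.

```shell
#!/bin/sh
# Input line taken from the question.
line='<p><a class="a" href="file://any" id="b">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>'

# Greedy capture: (.+) runs on to the LAST </a> on the line, so the
# wrong closing tag is removed and the http link loses its </a>.
greedy=$(printf '%s\n' "$line" | sed -E 's,<a[^>]*href="file://[^>]*>(.+)</a>,\1,g')

# Bounded capture: ([^<]*) cannot cross a '<', so the match has to end
# at the first </a> after the file: link.
bounded=$(printf '%s\n' "$line" | sed -E 's,<a[^>]*href="file://[^>]*>([^<]*)</a>,\1,g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded form only works when the link text itself contains no nested tags; an anchor wrapping, say, a <b> element would be left untouched.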

Answer 2

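Assumptions:

OP does not have access to an HTML-aware tool
the goal is to strip the <a href="file:...">...some_text...</a> wrapper, leaving only ...some_text...
this applies only to file: entries
the input data has no line break/feed in the middle of a file: entry

Sample data with several file: entries interspersed with some other (nonsense) entries: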

$ cat sample.html
<p><a href="https:/google.com">some text</a><a href="file://any" >keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p><a href="file://anyother" >keep this text,too</a>, last test</p>

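One sed idea for removing the wrappers from all file: entries:

sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g' "${infile}"

NOTE: some of the [^..] entries may be overkill, but the key objective is to short-circuit sed's default greedy matching ...

This leaves: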

<p><a href="https:/google.com">some text</a>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>keep this text,too, last test</p>

Answer 3

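To handle <a> tags whose content spans multiple lines, how about a perl solution:

perl -0777 -i -pe 's#<a.+?href="?file.+?>(.+?)</a>#$1#gs' file.xhtml

The -0777 option tells perl to slurp the whole file at once.
The -i option enables in-place editing.
The s switch at the end of the s operator makes a dot match any character, including a newline.
The regex .+? is the non-greedy version of .+, which gives the shortest match.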

Tags: regex, macos, bash, grep, sed