Regex: split and concatenate path base and pattern with filename, deleting the part of the path between them

Question

I have a URL like this:

a) <a href=\"http://example.com/path-pattern-to-match/subPath/onemoreSubpath/arbitrary-number-of-subpaths/someArticle1\">

or:

b) <a href=\"http://example.com/path-pattern-to-match/someArticle2\">

I need to take the path pattern together with its base URL and the opening of the <a> tag, and join it to its someArticle. Everything in between needs to be removed.

Case 'b' stays unchanged. Case 'a' needs to become:

<a href=\"http://example.com/path-pattern-to-match/someArticle1\">

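Please answer with a regular expression; that is what I need. Perl or bash scripts are fine if they are well explained. Other solutions may be interesting, but please avoid suggesting some programming module or function to parse this just to say that regex is not the best solution, without offering any real solution.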

PS: I need to parse a non-multiline file. someArticle is variable.

Answer 1

If your regex engine supports lookbehind, use

(?<=<a href=\"http:\/\/example\.com\/path-pattern-to-match\/)(?:[^\/]+\/)*([^\/>"]*)(?=\">)

Explanation

(?<=<a href=\"http:\/\/example\.com\/path-pattern-to-match\/) - a fixed-width lookbehind that ensures we are right after...

<a href=\"http://example.com/path-pattern-to-match/

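With a replacement string that refers back to capture group 1 (written $1 or \1, depending on the tool), you remove the subpaths and keep only the "someArticle" part.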
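As a rough illustration of the substitution described above, here is a minimal Python sketch. It is not part of the original answer: the function name strip_subpaths and the variables case_a and case_b are made up for the example, and it assumes the input contains plain double quotes (<a href="...">) rather than backslash-escaped ones. The pattern is the one from the answer, with the escaping that Python does not need removed.

```python
import re

# The answer's pattern, split into commented pieces.
# Assumption: quotes in the input are plain ", not \".
PATTERN = re.compile(
    r'(?<=<a href="http://example\.com/path-pattern-to-match/)'  # fixed-width lookbehind: prefix to keep
    r'(?:[^/]+/)*'   # any number of intermediate sub-paths, each ending in "/"
    r'([^/>"]*)'     # group 1: the final path segment (someArticle)
    r'(?=">)'        # lookahead: the closing quote and ">" of the tag
)


def strip_subpaths(text: str) -> str:
    """Replace every match with group 1, dropping the sub-paths in between."""
    return PATTERN.sub(r'\1', text)


case_a = ('<a href="http://example.com/path-pattern-to-match/subPath/'
          'onemoreSubpath/arbitrary-number-of-subpaths/someArticle1">')
case_b = '<a href="http://example.com/path-pattern-to-match/someArticle2">'

print(strip_subpaths(case_a))
# <a href="http://example.com/path-pattern-to-match/someArticle1">
print(strip_subpaths(case_b))
# <a href="http://example.com/path-pattern-to-match/someArticle2">
```

Because the lookbehind and lookahead consume no characters, only the sub-paths and the article name are part of the match, so case 'b' is replaced by itself and comes out unchanged. The substitution is applied to every occurrence, which should suit a file where all links sit on a single line.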
