RegEx for matching URLs in Python

Question

I have this sample string:

line = '[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end'

I need to extract the path (without the slashes) that comes right before "marker needle". The following lists all paths:

print re.findall('https://www\.myurl\.com/(.+?)/', line)
# ['test1', 'test2', 'test3']

However, when I change it to find only the path I want (the one before "marker needle"), it gives a strange output:

print re.findall('https://www\.myurl\.com/(.+?)/ marker needle', line)
# ['test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3']

My expected output:

test3

I tried the same thing with re.search, but the result is the same.

Answer 1

This expression has three capturing groups, and the second one holds the output we want:

(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)

If you wish, this tool can help you modify or change the expression.
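To see why the pattern from the question returns everything from test1 onward, it may help to run both versions side by side. The sketch below is only an illustration: the `[^/\s]+` class is an assumed alternative to `[A-Za-z0-9-]+`, and the names `lazy` and `strict` are made up for the example. A lazy quantifier does not move the start of the match. The engine still begins at the leftmost URL and keeps extending until "/ marker needle" can follow. Forbidding "/" and whitespace inside the path is what forces the match onto the last URL.

```python
import re

line = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Lazy dot: the match still begins at the first URL, then grows until
# "/ marker needle" can finally follow, so it swallows the other URLs.
lazy = re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line)

# Restricted class (an assumption for this sketch): the path may not
# contain "/" or whitespace, so only the URL directly before the marker
# can satisfy the whole pattern.
strict = re.findall(r'https://www\.myurl\.com/([^/\s]+)/ marker needle', line)

print(lazy)
print(strict)
```

Here `strict` should hold only test3, while `lazy` reproduces the long string from the question.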

RegEx Descriptive Graph

jex.im visualizes regular expressions.

Python Test

# -*- coding: UTF-8 -*-
import re

string = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end"
# The dots in the host name are escaped so they match a literal "."
expression = r'(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
match = re.search(expression, string)
if match:
    # group(2) is the path that sits directly before the marker
    print('YAAAY! "' + match.group(2) + '" is a match')
else:
    print('Sorry! No matches!')

Output

YAAAY! "test3" is a match

Performance Test

This JavaScript snippet reports the runtime of a for loop that applies the expression `repeat` times.

const repeat = 10;
const start = Date.now();
let match;

// Run exactly `repeat` iterations
for (let i = 0; i < repeat; i++) {
  const regex = /(.*)(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)(.*)/gm;
  const str = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end";
  // Keep only the third group, the path right before the marker
  const subst = `$3`;

  match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test.");
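
Since the question is about Python, a rough equivalent of this timing loop can be sketched with the standard `timeit` module. This is only a sketch under assumptions: the iteration count of 100000 is arbitrary, the `extract` helper is a hypothetical name, and the pattern is compiled once up front so that the measurement mostly reflects matching rather than compilation.

```python
import re
import timeit

line = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Compile once so the timing reflects matching rather than compilation.
pattern = re.compile(r'https://www\.myurl\.com/([A-Za-z0-9-]+)/\smarker needle')


def extract():
    """Return the path directly before the marker, or None if absent."""
    match = pattern.search(line)
    return match.group(1) if match else None


repeat = 100000  # assumed iteration count, chosen only for illustration
elapsed = timeit.timeit(extract, number=repeat)

print(extract())
print("%.3f s for %d searches" % (elapsed, repeat))
```

The absolute numbers depend on the machine and Python version, so they are mainly useful for comparing one pattern against another on the same setup.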
