RegEx for matching URLs in Python

I have this sample string:

line = '[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end'

I need to extract the path (without the slashes) that comes right before "marker needle". The following lists all the paths:

print(re.findall(r'https://www\.myurl\.com/(.+?)/', line))
# ['test1', 'test2', 'test3']

However, when I change it to find only the path I want (the one before "marker needle"), it gives a strange output:

print(re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line))
# ['test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3']

My expected output:

test3

I tried the same with re.search, but the result is the same.

This expression has three capturing groups, and the second one holds our desired output:

(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)
If you wish, this tool can help us modify or change the expression.

RegEx Descriptive Graph

jex.im visualizes regular expressions.

Python Test

# -*- coding: UTF-8 -*-
import re

string = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end"
# Group 2 captures the path segment that sits directly before "/ marker needle".
expression = r'(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match")
else:
    print('Sorry! No matches!')

Output

YAAAY! "test3" is a match 
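
For what it's worth, the strange output in the question comes from how the engine searches: it starts at the leftmost URL, and the lazy (.+?) simply keeps growing until "/ marker needle" finally follows. A rough sketch of the difference, assuming path segments never contain a slash or whitespace (the names lazy and strict are only for illustration):

```python
import re

line = ("[text] something - https://www.myurl.com/test1/ lorem ipsum "
        "https://www.myurl.com/test2/ - https://www.myurl.com/test3/ "
        "marker needle - some more text at the end")

# Lazy `.+?` still begins at the first URL and expands until the tail
# "/ marker needle" matches, so it swallows the earlier URLs.
lazy = re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line)

# `[^/\s]+` cannot cross a slash or whitespace, so only the segment
# directly in front of "/ marker needle" can match.
strict = re.findall(r'https://www\.myurl\.com/([^/\s]+)/ marker needle', line)

print(lazy)
print(strict)  # ['test3']
```

The character class [A-Za-z0-9-]+ used above works for the same reason: it cannot run past the slash that ends a path segment.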

Performance Test

This JavaScript snippet reports the runtime of a for loop that applies the expression 1 million times.

const repeat = 1000000;
const start = Date.now();
let match;

for (let i = repeat; i > 0; i--) {
 const regex = /(.*)(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)(.*)/gm;
 const str = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end";
 // Replace the whole match with capture group 3, the path segment.
 const subst = `$3`;

 match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test.");
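
If you would rather time it in Python itself, something along these lines might do. It is only a sketch: the extract helper and the repeat count of 100,000 are my own choices, not part of the test above, and the numbers will vary by machine.

```python
import re
import timeit

text = ("[text] something - https://www.myurl.com/test1/ lorem ipsum "
        "https://www.myurl.com/test2/ - https://www.myurl.com/test3/ "
        "marker needle - some more text at the end")

# Compile once so the pattern is not re-parsed on every call.
pattern = re.compile(
    r'(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
)

def extract():
    """Hypothetical helper: return the path before 'marker needle', or None."""
    match = pattern.search(text)
    return match.group(2) if match else None

repeat = 100_000  # arbitrary choice for this sketch
elapsed = timeit.timeit(extract, number=repeat)
print(f'"{extract()}" extracted; {elapsed:.3f} s for {repeat} runs')
```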