RegEx for matching URLs in Python
I have this sample string:
line = '[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end'
I need to extract the path (without the slashes) that comes right before "marker needle". The following lists all paths:
print(re.findall(r'https://www\.myurl\.com/(.+?)/', line))
# ['test1', 'test2', 'test3']
However, when I change it to find only the path I want (the one before "marker needle"), it gives a strange output:
print(re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line))
# ['test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3']
My expected output:
test3
I tried the same with re.search, but the result was the same.
This expression has three capturing groups, where the second one has our desired output:
(https:\/\/www.myurl.com\/)([A-Za-z0-9-]+)(\/\smarker needle)
If you wish, this tool can help us modify/change the expression.
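To see why the lazy `(.+?)` in the question still returns the long string: the regex engine scans left to right and commits to the first `https://www.myurl.com/` it finds. Laziness only makes the group grow as little as needed from that starting point, and nothing stops it from crossing slashes and spaces until `/ marker needle` finally matches. Restricting the group to characters that cannot leave the path segment, as `[A-Za-z0-9-]+` does above, forces the match to start at the last URL. A minimal sketch of that difference follows; the `[^/\s]+` class is an assumption for illustration, a slightly looser alternative to the class used in the expression above.

```python
import re

line = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Lazy "anything": matching starts at the first URL and the group grows
# until "/ marker needle" is reached, so it spans all three URLs.
lazy = re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line)

# Restricted class: the group cannot contain "/" or whitespace, so only
# the URL directly in front of the marker can match.
strict = re.findall(r'https://www\.myurl\.com/([^/\s]+)/ marker needle', line)

print(lazy)
print(strict)  # ['test3']
```

The restricted class also works with re.search, since the first (and only) position where the whole pattern can match is the last URL.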
RegEx Descriptive Graph
jex.im visualizes regular expressions.
Python Test
# -*- coding: UTF-8 -*-
import re

string = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end"
expression = r'(https:\/\/www.myurl.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
match = re.search(expression, string)

# Group 2 holds the path segment directly before "marker needle".
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match ")
else:
    print(' Sorry! No matches!')
Output
YAAAY! "test3" is a match
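If only the path is needed, the first and third groups do not have to be captured. Below is a sketch of a small helper under that assumption. The function name `path_before_marker`, its `marker` and `base` parameters, the named group `path`, and the use of `re.escape` for the literal parts are illustrative choices, not part of the expression above.

```python
import re


def path_before_marker(text, marker="marker needle", base="https://www.myurl.com/"):
    """Return the path segment of the URL directly before `marker`, or None."""
    pattern = re.compile(
        re.escape(base)               # literal URL prefix, dots escaped
        + r"(?P<path>[A-Za-z0-9-]+)"  # the only capturing group: the path
        + r"/\s"                      # trailing slash and one whitespace
        + re.escape(marker)           # literal marker text
    )
    match = pattern.search(text)
    return match.group("path") if match else None


sample = ("[text] something - https://www.myurl.com/test1/ lorem ipsum "
          "https://www.myurl.com/test2/ - https://www.myurl.com/test3/ "
          "marker needle - some more text at the end")

print(path_before_marker(sample))            # test3
print(path_before_marker("no url in here"))  # None
```

Returning None instead of raising keeps the no-match case explicit for the caller, mirroring the if/else in the test above.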
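Performance Test
This JavaScript snippet reports the runtime of a for loop that applies the expression repeatedly. The loop count is set by repeat (10 here); raise it to 1000000 for a one-million-iteration benchmark.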
const repeat = 10;
const start = Date.now();
let match;

for (let i = 0; i < repeat; i++) {
    const regex = /(.*)(https:\/\/www.myurl.com\/)([A-Za-z0-9-]+)(\/\smarker needle)(.*)/gm;
    const str = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end";
    // Replace the whole match with group 3, the path before "marker needle".
    const subst = `$3`;
    match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. ");
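Since the question is about Python, the same kind of timing can be sketched with the standard-library `timeit` module. The loop count of 10 mirrors `repeat` above, and the variable names are assumptions for illustration. Absolute timings depend on the machine, so no expected runtime is shown.

```python
import re
import timeit

text = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Compile once so the loop measures matching, not pattern compilation.
pattern = re.compile(r'(https://www\.myurl\.com/)([A-Za-z0-9-]+)(/\smarker needle)')
repeat = 10

# timeit.timeit calls the function `number` times and returns total seconds.
elapsed = timeit.timeit(lambda: pattern.search(text), number=repeat)

result = pattern.search(text).group(2)
print('"' + result + '" is a match')
print(str(elapsed) + " seconds for " + str(repeat) + " searches")
```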