Add a backslash to all non-word characters in a string
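I want to turn each string in a list (url_key) into a regex so that I can check whether the elements of another list contain one of those patterns. To do that, I need to add a backslash before every non-word character in each element of url_key, using Python.
My code sample: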
import re

sentences = ["Disallow DCCP sockets due to such NFC-3456",
             "Check at http://www.n.io/search?query=title++sub/file.html",
             "Specifies the hash algorithm on them"]
url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']  # there are thousands of elements
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']  # there are hundreds of elements

pattern = url_key + add_key
mykey = re.compile('(?:% s)' % '|'.join(pattern))

for item in sentences:
    if mykey.search(item):
        print(item, ' --> Keyword is found')
    else:
        print(item, ' --> Keyword is not Found')
But this code gives me an error:
error Traceback (most recent call last)
<ipython-input-80-5348ee9c65ec> in <module>()
8
9 pattern = url_key + add_key
---> 10 mykey = re.compile('(?:% s)' % '|'.join(pattern))
11
12 for item in sentences:
~/anaconda3/lib/python3.6/re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():
~/anaconda3/lib/python3.6/re.py in _compile(pattern, flags)
299 if not sre_compile.isstring(pattern):
300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
303 if len(_cache) >= _MAXCACHE:
~/anaconda3/lib/python3.6/sre_compile.py in compile(p, flags)
560 if isstring(p):
561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
563 else:
564 pattern = None
~/anaconda3/lib/python3.6/sre_parse.py in parse(str, flags, pattern)
853
854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
856 except Verbose:
857 # the VERBOSE flag was switched on inside the pattern. to be
~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break
~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
763 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
764 not (del_flags & SRE_FLAG_VERBOSE))
--> 765 p = _parse_sub(source, state, sub_verbose, nested + 1)
766 if not source.match(")"):
767 raise source.error("missing ), unterminated subpattern",
~/anaconda3/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break
~/anaconda3/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
617 if item[0][0] in _REPEATCODES:
618 raise source.error("multiple repeat",
--> 619 source.tell() - here + len(this))
620 if sourcematch("?"):
621 subpattern[-1] = (MIN_REPEAT, (min, max, item))
error: multiple repeat at position 31
Expected result:
Disallow DCCP sockets due to such NFC-3456 --> Keyword is found
Check at http://www.n.io/search?query=title++sub/file.html --> Keyword is found
Specifies the hash algorithm on them --> Keyword is not found
Any help would be appreciated. Thanks.
You should use raw strings:
result = re.sub(r'(\W)', r'\\\1', mystring)
Or escape the backslashes:
result = re.sub('(\\W)', '\\\\\\1', mystring)
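Your main problem is that string escapes take effect before the regex replacement escapes. Switching to a raw string (to suppress string escapes) and escaping the backslash (because \\ is itself a replacement escape) fixes this:
>>> print(re.sub(r'(\W)', r'\\\1', '?:n.io/search?query=title++sub'))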
\?\:n\.io\/search\?query\=title\+\+sub
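To make the two layers of escaping concrete, here is a small sketch. The helper name backslash_non_word is hypothetical, chosen only for illustration. It checks that the raw and non-raw spellings of the replacement are the same four-character string, then applies it:

```python
import re

raw = r'\\\1'        # raw string: all three backslashes reach re.sub unchanged
escaped = '\\\\\\1'  # ordinary string: each pair of backslashes collapses to one

# Both spellings are the same four characters: backslash, backslash, backslash, 1.
# re.sub then reads "\\" as a literal backslash and "\1" as group 1.
assert raw == escaped
assert len(raw) == 4


def backslash_non_word(text):
    """Prefix every non-word character in text with a backslash (hypothetical helper)."""
    return re.sub(r'(\W)', raw, text)


print(backslash_non_word('title++sub'))  # prints title\+\+sub
```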
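Note that you may not need escaping this broad. If you only want to escape the characters that are special in a regex, re.escape will do it for you:
>>> print(re.escape('?:n.io/search?query=title++sub'))
\?:n\.io/search\?query=title\+\+sub
This adds no unnecessary escapes (ones that are not needed to strip a character of its special regex meaning). The output above is what Python 3.7 and later produce; earlier versions, including the 3.6 shown in the traceback, escape every character other than ASCII letters, digits, and '_', which is harmless but noisier.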
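Putting this together with the code from the question, one possible sketch looks like the following. It assumes the entries in url_key are meant to be matched literally while the entries in add_key are already regexes, so only url_key is escaped. The function name build_matcher is hypothetical.

```python
import re

sentences = ["Disallow DCCP sockets due to such NFC-3456",
             "Check at http://www.n.io/search?query=title++sub/file.html",
             "Specifies the hash algorithm on them"]
url_key = ['www.n.io/search?query=title++sub', 'someweb.org/dirs.io']  # literal text
add_key = ['NFC-[0-9]{4}', 'CEZ-[0-9a-z]{4,8}']                        # already regexes


def build_matcher(literals, regexes):
    """Compile one alternation: literals are escaped, regexes are used as-is."""
    parts = [re.escape(s) for s in literals] + list(regexes)
    return re.compile('(?:%s)' % '|'.join(parts))


mykey = build_matcher(url_key, add_key)
results = [bool(mykey.search(item)) for item in sentences]

for item, found in zip(sentences, results):
    if found:
        print(item, ' --> Keyword is found')
    else:
        print(item, ' --> Keyword is not found')
```

Escaping add_key as well would turn character classes such as [0-9]{4} into literal text, which is why the two lists are treated differently here.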