Why does NSRegularExpression not match when the word follows a leading quote?
Here is the whole sample that should match:
import Foundation

let input = "L’iPhone XR serait un topselling (des prévisions de vente en hausse de 50% avant même sa sortie)"
// Raw string literal, so the regex backslashes reach NSRegularExpression unchanged.
let pattern = #"\b(iPhones?(\s*(se|X((s(\s*Max)?)|r)?|\d(s|c)?(\s*(Plus|Pro))?))?)\b"#
let regex: NSRegularExpression
do {
    regex = try NSRegularExpression(pattern: pattern, options: [.caseInsensitive, .useUnicodeWordBoundaries])
} catch {
    fatalError("pattern ”\(pattern)” has an issue. \(error.localizedDescription)")
}
// NSRange counts UTF-16 code units, so build it from the string instead of input.count.
let range = NSRange(input.startIndex..., in: input)
let matches = regex.matches(in: input, range: range)
Currently the regex does not capture any group. I expect it to capture "iPhone XR" as the first group.
Here is a test playground: https://regex101.com/r/aHcyPQ/2
.useUnicodeWordBoundaries enables the UREGEX_UWORD option:
Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
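The Unicode UAX 29 document describes these word boundaries in detail and provides some nice illustrations.
The ’ character is classified as a MidLetter character:
MidLetter: any of the following: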
U+0027 (') APOSTROPHE
U+00B7 (·) MIDDLE DOT
U+05F4 (״) HEBREW PUNCTUATION GERSHAYIM
U+2019 (’) RIGHT SINGLE QUOTATION MARK (curly apostrophe)
U+2027 (‧) HYPHENATION POINT
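UAX 29 does not place a word boundary between two letters separated by a single MidLetter character (rules WB6 and WB7). Therefore, there is no Unicode word boundary anywhere between L and i in L’iPhone, and the leading \b cannot match in front of iPhone. Remove .useUnicodeWordBoundaries.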
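As a minimal sketch of the fix, the snippet below runs the same pattern with and without .useUnicodeWordBoundaries. The helper name firstGroups(in:options:) is hypothetical and only serves this illustration. It assumes the pattern is written as a raw string literal and that the range is built from the string itself.

```swift
import Foundation

let input = "L’iPhone XR serait un topselling (des prévisions de vente en hausse de 50% avant même sa sortie)"
let pattern = #"\b(iPhones?(\s*(se|X((s(\s*Max)?)|r)?|\d(s|c)?(\s*(Plus|Pro))?))?)\b"#

// Hypothetical helper: returns the text of capture group 1 for every match.
func firstGroups(in text: String, options: NSRegularExpression.Options) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return []
    }
    // NSRange is measured in UTF-16 code units, so derive it from the string.
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match in
        // Group 1 is the outermost capture group of the pattern.
        Range(match.range(at: 1), in: text).map { String(text[$0]) }
    }
}

// Default \b: ’ is a non-word character, so a boundary exists before "iPhone".
print(firstGroups(in: input, options: [.caseInsensitive]))

// UAX 29 \b: ’ is a MidLetter, so "L’iPhone" is treated as a single word.
print(firstGroups(in: input, options: [.caseInsensitive, .useUnicodeWordBoundaries]))
```

With the default word boundaries the first call should return "iPhone XR" as group 1, which is the result the question expects.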