Tokenize words separated by non-word characters, except single quotes

Question

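I am trying to implement the following method: parse the input into "word tokens", that is, sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use a regex, but I can't get my code to work correctly: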

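import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> wordTokenize(String input) {
    // Backslashes must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("\\b(?:(?<=')[^']*(?=')|\\w+)\\b");
    Matcher matcher = pattern.matcher(input);
    List<String> ans = new ArrayList<>();
    while (matcher.find()) {
        ans.add(matcher.group());
    }
    return ans;
}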

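My regex does not recognize that a quote beginning in the middle of a word, with no space before it, does not start a new word. Examples:

Input: this-string 'has only three tokens' // works
Input: "this*string'has only two@tokens'"
Expected: [this, stringhas only two@tokens]
Actual: [this, string, has only two@tokens]
Input: "one'two''three' '' four 'twenty-one'"
Expected: [onetwothree, four, twenty-one]
Actual: [one, two, three, four, twenty-one]

How can I fix the handling of spaces?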

Answer 1

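You want to match one or more occurrences of either a word char or a substring between the closest pair of straight single apostrophes, and then remove all those apostrophes from the tokens.

Use the following regex, and call .replace("'", "") on each match: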

(?:\w|'[^']*')+

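See the regex demo. Details:

(?: - start of a non-capturing group
- \w - a word char
- | - or
- ' - a single quote
- [^']* - any 0+ chars other than a single quote
- ' - a single quote
)+ - end of the group, repeated 1 or more times.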
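To make the breakdown above concrete, here is a minimal sketch that prints the raw matches before any quotes are stripped. The class name RawMatchSketch and the helper rawMatches are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RawMatchSketch {
    // Same pattern as in the answer; the backslash is doubled for the Java string literal.
    private static final Pattern TOKEN = Pattern.compile("(?:\\w|'[^']*')+");

    // Hypothetical helper: returns the matches exactly as the regex finds them, quotes included.
    static List<String> rawMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The quoted part is glued to the preceding word chars because both
        // alternatives sit inside the same repeated group.
        System.out.println(rawMatches("this*string'has only two@tokens'"));
        // prints [this, string'has only two@tokens']
    }
}
```

The second match still contains the quote characters, which is why the answer removes them from each match afterwards.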

See the Java demo:

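import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String s = "this*string'has only two@tokens'"; // => [this, stringhas only two@tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
// The backslash is doubled so that the regex engine receives \w
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()) {
    // Strip the quote characters from each match
    tokens.add(matcher.group(0).replace("'", ""));
}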
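For reference, the same approach could be wrapped in the method signature from the question. This is a sketch rather than a definitive implementation; the enclosing class name WordTokenizer is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokenizer {
    // Compiled once and reused; the flag makes \w Unicode-aware.
    private static final Pattern TOKEN =
            Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);

    public static List<String> wordTokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // Each match may contain quoted segments; drop the quote characters.
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokenize("this-string 'has only three tokens'"));
        // prints [this, string, has only three tokens]
        System.out.println(wordTokenize("this*string'has only two@tokens'"));
        // prints [this, stringhas only two@tokens]
    }
}
```

Note that an unpaired quote is not matched by either alternative, so with this sketch an input such as don't yields the two tokens don and t.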

Note that Pattern.UNICODE_CHARACTER_CLASS is added so that \w matches all Unicode letters and digits.
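The effect of the flag can be seen with a small comparison. This is an illustrative sketch: the class UnicodeFlagSketch, the helper tokenize, and the sample input are all assumptions that do not appear in the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeFlagSketch {
    // Hypothetical helper: tokenizes with the answer's pattern under the given flags.
    static List<String> tokenize(String input, int flags) {
        Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", flags);
        Matcher matcher = pattern.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "naïve café", written with escapes to avoid source-encoding issues.
        String s = "na\u00EFve caf\u00E9";

        // Without the flag, \w only covers [a-zA-Z_0-9], so accented letters split tokens.
        System.out.println(tokenize(s, 0));
        // prints [na, ve, caf]

        // With the flag, accented letters count as word chars.
        System.out.println(tokenize(s, Pattern.UNICODE_CHARACTER_CLASS));
        // prints [naïve, café]
    }
}
```

If the input is known to be ASCII only, the flag makes no difference to the result and could be left out.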

java

regex

tokenize