Java string tokenization: Split on pattern and retain pattern
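My question is a Scala (Java) variant of this query on Python.
In particular, I have the string
val myStr = "Shall we meet at, let's say, 8:45 AM?"
and I would like to tokenize it while retaining the delimiters (all of them except whitespace). If my delimiters were only single characters, e.g. '.', ':', '?' and so on, I could do: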
val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")
which yields
[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]
However, I also want a time string matching \d+:\d+ to act as a delimiter, and I still want to retain it. So what I want is
[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
Notes:
- Adding the alternative (?=(\d+:\d+)) to the split expression does not help.
- Outside of a time string, : is a delimiter in its own right.
How can I achieve this?
I suggest matching all of your tokens instead of splitting the string, because that gives you more control over what you get:
\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+
See the regex demo.
The alternatives are ordered from the most specific pattern to the most generic one.
Details
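- \b\d{1,2}:\d{2}\b - 1 to 2 digits, a :, then 2 digits, enclosed in word boundaries
- | - or
- [,.;:?]+ - 1 or more of the characters , . ; : ?
- | - or
- (?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more characters that are neither a delimiter nor whitespace ([^\s,.;:?]), none of which is the starting point of a time string.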
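To see what the negative lookahead in the last alternative buys, it can help to compare the pattern with a variant that drops it. The following is only a sketch: the class name LookaheadDemo, the WITHOUT_GUARD variant, and the input "around ~8:45" are made up for illustration and are not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    public static final String TIME = "\\b\\d{1,2}:\\d{2}\\b";
    public static final String PUNCT = "[,.;:?]+";

    // Pattern from the answer: the generic alternative stops right before a time string.
    public static final Pattern WITH_GUARD = Pattern.compile(
        TIME + "|" + PUNCT + "|(?:(?!" + TIME + ")[^\\s,.;:?])+");

    // Hypothetical variant without the lookahead, for comparison only.
    public static final Pattern WITHOUT_GUARD = Pattern.compile(
        TIME + "|" + PUNCT + "|[^\\s,.;:?]+");

    // Collects every match of the pattern, in order of appearance.
    public static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "around ~8:45";
        System.out.println(tokens(WITH_GUARD, input));    // [around, ~, 8:45]
        System.out.println(tokens(WITHOUT_GUARD, input)); // [around, ~8, :, 45]
    }
}
```

Without the lookahead, the generic alternative swallows the 8 together with the preceding ~, so the time alternative never gets a chance to match at that position.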
Consider this snippet:
val str = "Shall we meet at, let's say, 8:45 AM?"
val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
rx.findAllIn(str).foreach(println)
Output:
Shall
we
meet
at
,
let's
say
,
8:45
AM
?
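Since the title asks about Java, the same pattern can presumably be used through java.util.regex directly. A minimal sketch follows; the names TimeAwareTokenizer and tokenize are illustrative and do not come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeAwareTokenizer {
    // Alternatives ordered from most specific to most generic:
    // time strings, runs of punctuation, then everything else.
    private static final Pattern TOKEN = Pattern.compile(
        "\\b\\d{1,2}:\\d{2}\\b"
        + "|[,.;:?]+"
        + "|(?:(?!\\b\\d{1,2}:\\d{2}\\b)[^\\s,.;:?])+");

    // Returns every token matched by TOKEN; whitespace is skipped
    // because no alternative can match it.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(tokenize("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```

Note that backslashes have to be doubled inside ordinary Java string literals, unlike in the triple-quoted Scala string above.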
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * StringPatternTokenizer is similar to java.util.StringTokenizer,
 * but it uses a regex string as the token separator.
 * See the #testCase method for detailed usage.
 */
public class StringPatternTokenizer {
    private final Pattern pattern;

    public StringPatternTokenizer(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /**
     * Walks through str and calls the visitor for the text between separators
     * (matcher == null) and for each separator match (str == null), in order,
     * until the visitor returns Result.STOP.
     */
    public void getTokens(String str, NextToken nextToken) {
        Matcher matcher = pattern.matcher(str);
        int index = 0;
        Result result = null;
        while (matcher.find()) {
            // text between the previous separator and this one
            if (matcher.start() > index) {
                result = nextToken.visit(null, str.substring(index, matcher.start()));
            }
            // the separator itself
            if (result != Result.STOP) {
                index = matcher.end();
                result = nextToken.visit(matcher, null);
            }
            if (result == Result.STOP) {
                return;
            }
        }
        // trailing text after the last separator
        if (index < str.length()) {
            nextToken.visit(null, str.substring(index));
        }
    }

    enum Result {
        CONTINUE,
        STOP,
    }

    public interface NextToken {
        Result visit(Matcher matcher, String str);
    }

    /***********************************/
    /***** test cases FOR IT ***********/
    /***********************************/
    public void testCase() {
        // As a test, visit each part produced by the tokenizer,
        // replace the variable parts with the given values,
        // and collect the resulting target string as output.
        String strSource = "My name is {{NAME}}, nice to meet you.";
        String strTarget = "My name is TokenTst, nice to meet you.";
        // separator pattern: a variable name in double curly brackets
        String variableRegex = "\\{\\{([A-Za-z]+)\\}\\}";
        // variable values
        org.json.JSONObject data = new org.json.JSONObject(
            java.util.Collections.singletonMap("NAME", "TokenTst")
        );
        StringBuilder sb = new StringBuilder();
        new StringPatternTokenizer(variableRegex)
            .getTokens(strSource, (matcher, str) -> {
                sb.append(matcher == null ? str
                    : data.optString(matcher.group(1), ""));
                return StringPatternTokenizer.Result.CONTINUE;
            });
        // check that the result is as expected
        org.junit.Assert.assertEquals(strTarget, sb.toString());
    }
}
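The same loop shape (emit the text before a separator, then the separator itself) might also be applied to the sentence from the question. The sketch below is a self-contained simplification rather than the class above: RetainingSplitter, split, and addWords are hypothetical names, and the separator pattern used in main is an assumption that combines the time pattern with single punctuation characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RetainingSplitter {
    // Splits input on separator matches and keeps the separators as tokens.
    // Text between separators is further split on whitespace.
    public static List<String> split(String input, Pattern separator) {
        List<String> tokens = new ArrayList<>();
        Matcher m = separator.matcher(input);
        int index = 0;
        while (m.find()) {
            addWords(tokens, input.substring(index, m.start()));
            tokens.add(m.group());
            index = m.end();
        }
        addWords(tokens, input.substring(index));
        return tokens;
    }

    // Adds the whitespace-separated words of gap, skipping empty pieces.
    private static void addWords(List<String> tokens, String gap) {
        for (String word : gap.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed separator: a time string first, otherwise one punctuation character.
        Pattern separator = Pattern.compile("\\b\\d{1,2}:\\d{2}\\b|[,.;:?]");
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(split("Shall we meet at, let's say, 8:45 AM?", separator));
    }
}
```

Listing the time alternative before the punctuation class matters here for readability only; at a position where a time string starts, the punctuation class cannot match anyway.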