Java regex to split a sentence into words, keeping each value and its metric as a single word
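I am trying to split a sentence into a set of words. What I am looking for is to keep each number together with its metric (unit) when chunking.
E.g. (made-up):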
document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.
What I need is this set of words:
the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs
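What I have tried:
// StringUtils is assumed to be org.apache.commons.lang3.StringUtils
List<String> bagOfWords = new ArrayList<>();
String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
for (String word : words) {
    // Replace any dot that is not followed by a digit (keeps "1.67", drops a trailing ".")
    bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
}
System.out.println("NEW 2 :: " + bagOfWords);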
Let us assume that a word containing a digit is followed by another word that does not contain a digit. Then the code is:
private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";
// ...
// Backslashes must be doubled inside a Java string literal
Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
Matcher matcher = pattern.matcher(DOC);
List<String> words = new ArrayList<>();
while (matcher.find()) {
words.add(matcher.group());
}
for (String word : words) {
System.out.println(word);
}
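Explanation:
- \b matches a word boundary.
- \S is a non-whitespace character, so a word can contain anything, such as dots or commas.
- (...)? is the optional first part. If present, it captures the word that contains a digit, i.e. some characters (\S*), then a digit (\d), then some more characters (\S*), followed by the whitespace (\s+) before the next word.
- The second word is simple: at least one non-whitespace character. That is why \S is followed by + rather than *.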
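To make the explanation above concrete, here is a minimal, self-contained sketch that wraps the same pattern in a helper method. The class name `MetricTokenizer`, the method name `tokenize`, and the sample sentence are made up for illustration; only the regex comes from the answer. It assumes Java 9+ only for the convenience of running it as a single file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetricTokenizer {
    // Optional "word containing a digit + whitespace", then one ordinary word.
    private static final Pattern TOKEN =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    // Hypothetical helper: returns every match of TOKEN, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "it is currently 40 degrees, take 1.67 gpm every 3 hrs";
        // prints [it, is, currently, 40 degrees, take, 1.67 gpm, every, 3 hrs]
        System.out.println(tokenize(sample));
    }
}
```

Because the trailing \b has to sit next to a word character, punctuation at the end of a word (the comma after "degrees") is left out of the match. Note that this approach pairs a number with whatever word follows it, whether or not that word is really a unit.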
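Your question is a bit broad, but here is a hack that should work for most sentences of this form.
First, create a list of keywords for your units, e.g. hrs, tablet, gpm ...
Once you have that, it is easy to pick out what you need.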
String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
// Drop the final full stop so that "hrs." is recognised as the keyword "hrs"
if (document.endsWith(".")) {
    document = document.substring(0, document.length() - 1);
}
System.out.println(document);
String[] splitted = document.split(" ");
List<String> keywords = new ArrayList<>();
keywords.add("degrees");
keywords.add("percent");
keywords.add("gpm");
keywords.add("tablet");
keywords.add("hrs");
List<String> words = new ArrayList<>();
for (String s : splitted) {
    // Skips only a standalone comma; a comma attached to a word (e.g. "temperature,") is kept
    if (!s.equals(",")) {
        if (keywords.contains(s) && words.size() != 0) {
            // s is a unit keyword: append it to the last item in the list
            int lastIndex = words.size() - 1;
            words.set(lastIndex, words.get(lastIndex) + " " + s);
        } else {
            words.add(s);
        }
    }
}
for (String s : words) {
    System.out.println(s);
}
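A possible variation on the keyword idea, sketched under a few assumptions that are not in the answer above: punctuation is stripped from both ends of every token, and a unit is merged only when the previous token contains a digit and has not already been merged. The names `UnitMerger`, `mergeUnits`, and `UNITS` are hypothetical, and `Set.of` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class UnitMerger {
    // Hypothetical unit list; extend it for your own data.
    private static final Set<String> UNITS =
            Set.of("degrees", "percent", "gpm", "tablet", "hrs");

    static List<String> mergeUnits(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Strip punctuation at either end ("," or "."), keep inner "." and "-".
            String token = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (token.isEmpty()) {
                continue;
            }
            int last = out.size() - 1;
            // The previous entry must contain a digit and must not be merged already.
            boolean previousIsValue = last >= 0
                    && out.get(last).matches(".*\\d.*")
                    && !out.get(last).contains(" ");
            if (UNITS.contains(token) && previousIsValue) {
                out.set(last, out.get(last) + " " + token);
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [it, is, currently, 40 degrees, take, 1.67 gpm, every, 3 hrs]
        System.out.println(mergeUnits("it is currently 40 degrees, take 1.67 gpm every 3 hrs."));
    }
}
```

With the digit check in place, a phrase such as "ten degrees" stays as two separate words, which may or may not be what you want.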