Java regex to split a sentence into words, keeping a value and its metric as a single word

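I am trying to split a sentence into a set of words. What I am looking for is to also take the metric (unit) into account when chunking numbers.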

E.g. (made-up):
 document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.

What I need is this set of words:

the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs

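What I have tried:

    List<String> bagOfWords = new ArrayList<>();
    // Replace everything except letters, digits, '_', '.' and '-' with a space, then split on spaces
    String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
    for (String word : words) {
        // Replace any '.' that is not followed by a digit, so decimals such as 1.67 survive
        bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
    }
    System.out.println("NEW 2 :: " + bagOfWords.toString());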

Let's assume that a word containing a digit is followed by another word that does not contain a digit. Then the code is:

    private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";

   // ...

    // Optional leading token that contains a digit, followed by one ordinary token
    Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
    Matcher matcher = pattern.matcher(DOC);
    List<String> words = new ArrayList<>();
    while (matcher.find()) {
        words.add(matcher.group());
    }
    for (String word : words) {
        System.out.println(word);
    }

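Explanation:

  • \b matches a word boundary.
  • \S is any non-whitespace character, so a word may contain anything, such as dots or commas.
  • (...)? is the optional first part. If present, it captures the word that contains a digit, i.e. some characters (\S*), then a digit (\d), then some more characters (\S*), followed by the whitespace that separates it from the next word.
  • The second word is simple: at least one non-whitespace character, which is why there is a + after the S rather than a *.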
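
To see the pattern described above on a shorter input, here is a minimal, self-contained sketch. The class name MetricTokenizer, the method tokenize and the sample sentence are made up for illustration; only the regular expression itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only to make the snippet runnable.
class MetricTokenizer {
    // Optional "token containing a digit + whitespace", then one ordinary token.
    private static final Pattern TOKEN =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    // Collects every match of TOKEN in the given text, in order.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample sentence, shorter than the one in the question.
        String sample = "it is currently 40 degrees, take 1.67 gpm every 3 hrs";
        for (String word : tokenize(sample)) {
            System.out.println(word);
        }
    }
}
```

For the sample sentence this should print it, is, currently, 40 degrees, take, 1.67 gpm, every and 3 hrs, one per line. Keep in mind that the pattern relies on the assumption stated before the code: whatever word follows a number is glued to it, so "in 2019 we" would yield "2019 we".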

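Your question is a bit broad, but here is a hack that should work for most sentences of this format.

First, you can create a list containing the keywords for your units, e.g. hrs, tablet, gpm ... Once you have that, you can easily pick out what you need.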

    // Requires java.util.ArrayList and java.util.List
    String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
    // Drop the trailing full stop so the last unit ("hrs") can match a keyword
    if (document.endsWith(".")) {
        document = document.substring(0, document.length() - 1);
    }
    System.out.println(document);
    String[] splitted = document.split(" ");

    // Unit keywords that should be attached to the preceding token
    List<String> keywords = new ArrayList<>();
    keywords.add("degrees");
    keywords.add("percent");
    keywords.add("gpm");
    keywords.add("tablet");
    keywords.add("hrs");

    List<String> words = new ArrayList<>();

    for (String s : splitted) {
        // Skip tokens that are a lone comma; a comma attached to a word (e.g. "temperature,") is kept as is
        if (!s.equals(",")) {
            if (keywords.contains(s) && !words.isEmpty()) {
                // s is a unit keyword: append it to the last item in the list
                int lastIndex = words.size() - 1;
                words.set(lastIndex, words.get(lastIndex) + " " + s);
            } else {
                words.add(s);
            }
        }
    }
    for (String s : words) {
        System.out.println(s);
    }
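
The same keyword idea can be sketched as a reusable method. This is only an illustration under a few assumptions of mine: the class name UnitMerger and the method mergeUnits are hypothetical, the unit list is the one from the snippet above, and trailing commas and full stops are stripped from every token (the original snippet only removes the final full stop).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original answer.
class UnitMerger {
    // Unit keywords taken from the snippet above; extend as needed.
    private static final Set<String> UNITS =
            new HashSet<>(Arrays.asList("degrees", "percent", "gpm", "tablet", "hrs"));

    // Splits on whitespace and appends each unit keyword to the token before it.
    static List<String> mergeUnits(String sentence) {
        List<String> words = new ArrayList<>();
        for (String raw : sentence.trim().split("\\s+")) {
            // Assumption: strip trailing commas and full stops so "hrs." still matches "hrs".
            String token = raw.replaceAll("[,.]+$", "");
            if (token.isEmpty()) {
                continue;
            }
            if (UNITS.contains(token) && !words.isEmpty()) {
                int last = words.size() - 1;
                words.set(last, words.get(last) + " " + token);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(mergeUnits("it is currently 40 degrees, take 1.67 gpm every 3 hrs."));
    }
}
```

Like the snippet above, this attaches a unit keyword to whatever token precedes it, whether or not that token is a number, so "the tablet" would also be merged. Checking that the previous token contains a digit before merging would be one way to tighten it.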