How to find specific words in the first sentence of a text using regular expressions in Java?

Question

I need your help with a problem. I am new to regular expressions in Java and I don't know how to use them properly.

For example, I have this simple text:

In downtown Las Vegas, John spent a lot of time on The Strip, which is a 2.5 mile stretch of shopping, entertainment venues, luxury hotels, and fine dining experiences. This is probably the most commonly visited tourist area in the city. The Strip at night time looks especially beautiful. All of the buildings light up with bright, neon, eye-catching signs to attract visitors attention.

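The task is to find and return all the words of the first sentence that do not appear in any other sentence. This should be done using regular expressions.

The word in is not suitable, because it appears in the second sentence: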

"This is probably the most commonly visited tourist area ---in--- the city."

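The word downtown is suitable, because none of the following sentences contains it. Las, Vegas, John, spent, a, lot, of are suitable as well.

The word time is not, because it appears in the third sentence: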

"The Strip at night ---time--- looks especially beautiful."

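And so on for every word in the first sentence.

Some rules

Sentences are separated by a dot (.).
Articles (a, the) count as words too.
The search is case-insensitive: John and john are the same word.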

Answer 1

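You can achieve your goal as follows:

Split the text with a regular expression that matches only the dots used as full stops, not the dots used as decimal separators.
Create a Set containing all the words of the first sentence, extracted with a second regular expression.
Iterate over each remaining sentence and parse it with the second regular expression to check whether your Set contains words that also appear in those sentences. If it does, simply remove the reappearing word.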
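As a minimal sketch of the first step alone, the snippet below splits a short sample on every dot that is not followed by a digit, so that 2.5 stays in one piece. The class name SplitDemo, the helper splitSentences and the simplified pattern \.(?!\d) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {

    // Hypothetical helper: splits on a dot unless a digit follows it (assumed pattern)
    static List<String> splitSentences(String text) {
        return Arrays.asList(text.split("\\.(?!\\d)"));
    }

    public static void main(String[] args) {
        List<String> sentences =
                splitSentences("The Strip is a 2.5 mile stretch. It looks beautiful.");
        // One sentence per line; the leading space of later sentences is kept
        sentences.forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Note that String.split drops trailing empty strings, so the final dot does not produce an extra empty sentence.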

Here is a simple implementation of the logic above:

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        String text = "In downtown Las Vegas, John spent a lot of time on The Strip, \n" +
                "which is a 2.5 mile stretch of shopping, entertainment venues, \n" +
                "luxury hotels, and fine dining experiences. This is probably the most \n" +
                "commonly visited tourist area in the city. The Strip at night time looks \n" +
                "especially beautiful. All of the buildings light up with bright, neon, \n" +
                "eye-catching signs to attract visitors attention.";

        //Splitting by the dot character except when a digit follows it (decimal separator)
        String[] sentences = text.split("\\.(?!\\d)");

        //Set containing all the words encountered within the first sentence
        Set<String> setWordsFirstSentence = new HashSet<>();

        //The regex identifies decimal numbers (like 2.5) and words
        Pattern patternWords = Pattern.compile("\\d+\\.\\d+|\\w+");
        Matcher matchSentence = patternWords.matcher(sentences[0]);

        //Adding each word of the first sentence to the set
        while (matchSentence.find()) {
            setWordsFirstSentence.add(matchSentence.group());
        }

        //Skipping the first sentence and removing from the set every word which appears in the following sentences
        for (int i = 1; i < sentences.length; i++) {
            matchSentence = patternWords.matcher(sentences[i]);
            while (matchSentence.find()) {
                //You could remove the word with a lambda passed to the removeIf method, which needs an effectively final copy of the match...
//                String tempWord = matchSentence.group();
//                setWordsFirstSentence.removeIf(word -> word.equalsIgnoreCase(tempWord));

                //... or use an Iterator
                String repeatedWord = matchSentence.group();
                for (Iterator<String> it = setWordsFirstSentence.iterator(); it.hasNext();) {
                    if (it.next().equalsIgnoreCase(repeatedWord)) {
                        it.remove();
                    }
                }
            }
        }

        //Printing the set's words
        setWordsFirstSentence.forEach(System.out::println);
    }
}

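As noted in the code, the removal can also be done by passing a lambda expression to the Set's removeIf method, which removes an element only when the predicate evaluates to true. However, this means declaring a String variable on every iteration so that it is effectively final, which is arguably not the cleanest approach.

Alternatively, you can use an Iterator (the approach I prefer) to walk through your Set and remove every word that equals, ignoring case, one of the words matched in the following sentences.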
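Because the search is case-insensitive, another option is to lower-case every word before storing it, after which a plain Set.removeAll is enough and no inner loop over the set is needed. The following is only a sketch under that assumption: the class UniqueFirstSentenceWords, the helpers wordsOf and uniqueWords, the shortened sample text and the simplified split pattern are all hypothetical names and choices.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueFirstSentenceWords {

    // Matches decimal numbers such as 2.5, or plain words
    private static final Pattern WORD = Pattern.compile("\\d+\\.\\d+|\\w+");

    // Hypothetical helper: lower-cased words of one sentence, in order of appearance
    static Set<String> wordsOf(String sentence) {
        Set<String> words = new LinkedHashSet<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    // Words of the first sentence that occur in no later sentence
    static Set<String> uniqueWords(String text) {
        // Assumed pattern: a dot ends a sentence unless a digit follows it
        String[] sentences = text.split("\\.(?!\\d)");
        if (sentences.length == 0) {
            return new LinkedHashSet<>();
        }
        Set<String> result = wordsOf(sentences[0]);
        for (int i = 1; i < sentences.length; i++) {
            result.removeAll(wordsOf(sentences[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "John spent time on The Strip, a 2.5 mile stretch. "
                + "The Strip at night time looks beautiful.";
        // prints [john, spent, on, a, 2.5, mile, stretch]
        System.out.println(uniqueWords(text));
    }
}
```

The trade-off is that the result is reported in lower case; if the original capitalisation matters, the Iterator version with equalsIgnoreCase keeps it.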

java

regex