How to count words in a Map via Stream

I am working with a List<String> that holds a large text. The text looks like this:

List<String> lines = Arrays.asList("The first line", "The second line", "Some words can repeat", "The first the second"); //etc

I need to count the words in it. The output should be:

first - 2
line - 2
second - 2
repeat - 1
some - 1
words - 1

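Words shorter than 4 characters should be skipped, which is why "the" and "can" do not appear in the output. This is only a small example; in the real task I should also skip a word if it is rare, meaning it has fewer than 20 occurrences. The map should then be sorted alphabetically by key. Only streams may be used, with no "if", "while", or "for" constructs.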

What I have implemented:

Map<String, Integer> wordCount = Stream.of(lines)
                .flatMap(Collection::stream)
                // split each line on punctuation, spaces, digits and quote marks
                .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
                .filter(str -> (str.length() >= 4))
                .collect(Collectors.toMap(
                        i -> i.toLowerCase(),
                        i -> 1,
                        (a, b) -> java.lang.Integer.sum(a, b))
                );

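wordCount now holds a map of the words and their occurrence counts. But how can I skip the rare words? Should I create a new stream? If so, how do I get at the map's values? I tried this, but it is not correct: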

 String result = Stream.of(wordCount)
         .filter(i -> (Map.Entry::getValue > 10));

My computation should return a String:

"word" - number of entries

Thanks!

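You cannot exclude words that fall below the rarity threshold until you have computed the frequency counts.

Here is how I would approach it.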

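  • Compute the frequencies (I chose to do this slightly differently from you).
  • Then stream the map's entrySet and filter out the entries below a given frequency.
  • Then rebuild the map as a TreeMap so the words are sorted in lexical order.

List<String> list = Arrays.asList(....);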

int wordRarity = 10; // words must occur more often than this to be kept
int wordLength = 4; // minimum word length to accept

// Entry below is java.util.Map.Entry
Map<String, Long> map = list.stream()
        .flatMap(str -> Arrays.stream(
                str.split("\\p{Punct}|\\s+|[0-9]|…|«|»|“|„")))
        .filter(str -> str.length() >= wordLength)
        .collect(Collectors.groupingBy(String::toLowerCase,
                Collectors.counting()))
        // here is where the rare words are filtered out.
        .entrySet().stream().filter(e -> e.getValue() > wordRarity)
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue,
                (a, b) -> a, TreeMap::new));

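Note that the (a,b)->a lambda is a merge function for handling duplicate keys; it is never invoked here, because the keys of an entrySet are already unique. Unfortunately, Collectors.toMap cannot be given a map supplier without also specifying a merge function.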
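If the placeholder merge function is a concern, one possible alternative is to collect straight into a TreeMap and then drop the rare entries. The sketch below is only an illustration under a few assumptions: the class name SortedWordCount and the helper count are hypothetical, the delimiter regex is simplified to punctuation, whitespace and digits, and removeIf is assumed to be acceptable under the "no if" rule since it is a method call rather than a language construct.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SortedWordCount {

    // Hypothetical helper: counts words of at least minLength characters and
    // keeps only those that occur more than minCount times, sorted by key.
    static Map<String, Long> count(List<String> lines, int minLength, long minCount) {
        Map<String, Long> counts = lines.stream()
                // Simplified delimiter set (an assumption, not the original regex).
                .flatMap(line -> Arrays.stream(line.split("\\p{Punct}|\\s+|[0-9]")))
                .filter(word -> word.length() >= minLength)
                // The three-argument groupingBy accepts the map supplier directly,
                // so no merge function has to be supplied.
                .collect(Collectors.groupingBy(String::toLowerCase,
                        TreeMap::new, Collectors.counting()));
        // Remove the rare words in place; the values() view writes through to the map.
        counts.values().removeIf(c -> c <= minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The first line", "The second line",
                "Some words can repeat", "The first the second");
        System.out.println(count(lines, 4, 1)); // {first=2, line=2, second=2}
    }
}
```

The trade-off is that this version mutates the collected map instead of building a second one, which is shorter but no longer a single stream expression.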

The simplest way to print it is:

map.entrySet().forEach(e -> System.out.printf("%s - %s%n",
                e.getKey(), e.getValue()));

Given the stream you already have:

List<String> lines = Arrays.asList(
        "For the rabbit, it was a bad day.",
        "An Antillean rabbit is very abundant.",
        "She put the rabbit back in the cage and closed the door securely, then ran away.",
        "The rabbit tired of her inquisition and hopped away a few steps.",
        "The Dean took the rabbit and went out of the house and away."
);

Map<String, Integer> wordCounts = Stream.of(lines)
        .flatMap(Collection::stream)
        .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
        .filter(str -> (str.length() >= 4))
        .collect(Collectors.toMap(
                String::toLowerCase,
                i -> 1,
                Integer::sum)
        );

System.out.println("Original:" + wordCounts);

Original output:

Original:{dean=1, took=1, door=1, very=1, went=1, away=3, antillean=1, abundant=1, tired=1, back=1, then=1, house=1, steps=1, hopped=1, inquisition=1, cage=1, securely=1, rabbit=5, closed=1}

You can do this:

String results = wordCounts.entrySet()
        .stream()
        .filter(wordToCount -> wordToCount.getValue() > 2) // 2 is rare
        .sorted(Map.Entry.comparingByKey())
        .map(wordCount -> wordCount.getKey() + " - " + wordCount.getValue())
        .collect(Collectors.joining(", "));

System.out.println(results);

Filtered output:

away - 3, rabbit - 5
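
Since the question asks for a computation that returns a String, the two steps above could be wrapped in one method. This is a minimal sketch, not a definitive implementation: the class FrequentWords, the method summarize and its parameters minLength and rareLimit are hypothetical names, and the delimiter regex is simplified to punctuation, spaces and digits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FrequentWords {

    // Hypothetical helper combining both steps: count the words, then keep
    // only those seen more than rareLimit times, sorted by key.
    static String summarize(List<String> lines, int minLength, int rareLimit) {
        Map<String, Integer> counts = lines.stream()
                // Simplified delimiter set (an assumption, not the original regex).
                .flatMap(line -> Arrays.stream(line.split("\\p{Punct}| |[0-9]")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.toMap(String::toLowerCase, w -> 1, Integer::sum));

        // The counts are only complete after the first collect, so the
        // rarity filter has to run on a second stream over the entries.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > rareLimit)
                .sorted(Map.Entry.comparingByKey())
                .map(e -> e.getKey() + " - " + e.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The first line", "The second line",
                "Some words can repeat", "The first the second");
        System.out.println(summarize(lines, 4, 1)); // first - 2, line - 2, second - 2
    }
}
```

Passing the thresholds as parameters makes it easy to switch from the small example to the real limit of 20 mentioned in the question.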