How do I count word occurrences in a CSV file?

Question

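I have a CSV file that I need to read, and I need to display the number of occurrences of each word. The application should only count words that contain more than one letter and are not alphanumeric, and the words should also be converted to lowercase.

This is what I have so far. I'm stuck and don't know where to go from here.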

public static void countWordNumber() throws IOException, CsvException {

    String pathFile1 = "src/main/resources/Documents/Example.csv";

        CSVReader reader = new CSVReaderBuilder(new FileReader(pathFile1)).withSkipLines(1).build();

        Map<String, Integer> frequency = new HashMap<>();
        String[] line;


        while ((line = reader.readNext()) != null) {
            String words = line[1];

            words = words.replaceAll("\\p{Punct}", " ").trim();
            words = words.replaceAll("\\s{2}", " ");
            words = words.toLowerCase();

            if (frequency.containsKey(words)) {
                frequency.put(words, frequency.get(words) + 1);
            } else {
                frequency.put(words, 0);
            }


        }


    }

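I'm trying to read index 1 of the array returned for each CSV row, i.e. line[1], which is where the document text is.

I've replaced all punctuation with spaces and trimmed the result; where there are two or more consecutive spaces I replace them with one, and I convert everything to lowercase.

The output I want to achieve is: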

Title of document: XXXX

Word: is, Value: 3

Edit: Here is a sample of my input file.

title,text,date
exampleTitle,This is is is an example example, April 2022

Answer 1

You can use a regular expression match to verify that words meets your criteria before adding it to your HashMap, like this:

if (words.matches("[a-z]{2,}"))

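[a-z] specifies lowercase alphabetic characters only
{2,} specifies "a minimum of 2 occurrences, with no maximum"

However, given that you're converting punctuation to spaces, it sounds like you may have multiple words in line[1]. If you want to collect counts for multiple words across multiple lines, you probably want to split words on the space character, like so: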

for (String word : words.split(" ")) {
  if (word.matches("[a-z]{2,}")) {
    // Then use your code for checking if frequency contains the term,
    //   but use `word` instead of `words`
  }
}
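
A minimal, self-contained sketch of the approach above, assuming the text column has already been extracted into a string. The class name WordFilterDemo and the method countWords are made up for illustration, and Map.merge is swapped in for the containsKey/put pair:

```java
import java.util.Map;
import java.util.TreeMap;

class WordFilterDemo {

    // Counts lowercase words of two or more letters in one text cell.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> frequency = new TreeMap<>();
        // Replace punctuation with spaces, lowercase, then split on runs of whitespace.
        String cleaned = text.replaceAll("\\p{Punct}", " ").toLowerCase().trim();
        for (String word : cleaned.split("\\s+")) {
            // Keep only words made of at least two letters a-z.
            if (word.matches("[a-z]{2,}")) {
                frequency.merge(word, 1, Integer::sum);
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        // Prints {an=1, example=2, is=3, this=1}
        System.out.println(countWords("This is is is an example example"));
    }
}
```

Splitting on "\\s+" rather than a single space avoids empty strings when several spaces appear in a row, although the regex filter would reject those anyway.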

Answer 2

Your solution doesn't look bad. But for the initialization I would replace

frequency.put(words, 0);

with

frequency.put(words, 1);
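
To see why the initial value matters, here is a small sketch (the class and method names are hypothetical) that counts the same words with either starting value. Starting at 0 reports every word one short:

```java
import java.util.HashMap;
import java.util.Map;

class InitialCountDemo {

    // Counts words, storing `firstValue` the first time a word is seen.
    static Map<String, Integer> count(String[] words, int firstValue) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String word : words) {
            if (frequency.containsKey(word)) {
                frequency.put(word, frequency.get(word) + 1);
            } else {
                frequency.put(word, firstValue);
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        String[] words = {"is", "is", "is", "example", "example"};
        System.out.println(count(words, 0).get("is")); // 2, one too few
        System.out.println(count(words, 1).get("is")); // 3
    }
}
```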

Because I don't have your input file, I created a dummy one, and it works fine.

    Map<String, Integer> frequency = new HashMap<>();
    List<String> csvSimulation = new ArrayList<String>();
    csvSimulation.add("test");
    csvSimulation.add( "marvin");
    csvSimulation.add("aaaaa");
    csvSimulation.add("nothing");
    csvSimulation.add("test");
    csvSimulation.add("test");
    csvSimulation.add("aaaaa");
    csvSimulation.add("stackoverflow");
    csvSimulation.add("test");
    csvSimulation.add("bread");

    Iterator<String> iterator = csvSimulation.iterator();


    while(iterator.hasNext()){
        String words = iterator.next();
        words = words.toLowerCase();
        if (frequency.containsKey(words)) {
            frequency.put(words, frequency.get(words) + 1);
        } else {
            frequency.put(words, 1);
        }

    }

    System.out.println(frequency);

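Are you sure that accessing line[1] while iterating is correct? Reading the input correctly looks like the real problem to me. Without seeing your CSV file I can't help you any further.

Edit:

With the CSV data you provided, adjusting your code like this will solve your problem: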

.....
.....
    while ((line = reader.readNext()) != null) {
        String words = line[1];

        words = words.replaceAll("\\p{Punct}", " ").trim();
        // Collapse any run of whitespace so split(" ") yields no empty strings.
        words = words.replaceAll("\\s+", " ");
        words = words.toLowerCase();
        String[] singleWords = words.split(" ");
        
        for(int i = 0 ; i < singleWords.length; i++) {
            String currentWord = singleWords[i];
            if (frequency.containsKey(currentWord)) {
                frequency.put(currentWord, frequency.get(currentWord) + 1);
            } else {
                frequency.put(currentWord, 1);
            }   
        }


    }
    
    System.out.println("Word: is, Value: " + frequency.get("is"));
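
As a rough sketch of how the requested per-document output could be produced, the following assumes each row has already been parsed into a String[] (for example by reader.readNext()), with the title at index 0 and the text at index 1. The class and method names are invented for illustration, and a fresh map is used per row so counts are not mixed between documents:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowReportDemo {

    // Builds the report for one parsed CSV row: row[0] is the title, row[1] the text.
    static String report(String[] row) {
        Map<String, Integer> frequency = new LinkedHashMap<>();
        String words = row[1].replaceAll("\\p{Punct}", " ").toLowerCase().trim();
        for (String word : words.split("\\s+")) {
            if (word.length() > 1) { // skip one-letter words and empty strings
                frequency.merge(word, 1, Integer::sum);
            }
        }
        StringBuilder out = new StringBuilder("Title of document: " + row[0] + "\n\n");
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            out.append("Word: ").append(e.getKey())
               .append(", Value: ").append(e.getValue()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] row = {"exampleTitle", "This is is is an example example", " April 2022"};
        System.out.print(report(row));
    }
}
```

The length check here only drops one-letter words; it does not reject alphanumeric tokens, so a stricter filter would be needed to meet that part of the requirement.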

Answer 3

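Just another twist on things:

Since it has been established (by the OP) that the CSV file consists of Title, Text, Date data, it can be assumed that each data line of that file is delimited with the typical comma (,) and that each line of the CSV data file (except the header line) may contain a different title.

However, the desired output established (by the OP) is: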

Title of document: exampleTitle

Word: is, Value: 3 
Word: example, Value: 2

Let's change this output to make things a little more pleasing to the eye:

-------------------------------
Title of document: exampleTitle
-------------------------------
Word           Value
===================== 
an             1
is             3 
example        2
this           1 
=====================

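Based on this information, because every file data line contains a Title, it seems logical to process and store only the word occurrences found in column 2 of that data line. As each word is processed, we need to keep track of where it came from so that we know which Title it belongs to, rather than doing a simple occurrence count over all the column 2 words in all rows (file lines). This of course means that the KEY of the Map, which is a word, must also carry the origin of that word. That is not a big deal, but a little more thought is needed when it comes time to pull the relevant data out of the Map in order to display it properly in the console window, or to use it for some other purpose in the application. What we can do is use a List of Maps, for example: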

List<Map<String, Integer>> mapsList = new ArrayList<>();

By doing this we can place each processed file line into a Map and then add that Map to a List named mapsList.
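
A small sketch of this structure, under the assumption that each key joins the origin and the word with the ":|:" separator. The class name, the helper mapForLine and the second sample title are hypothetical and only serve the illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapsListDemo {

    static final String SEP = ":|:"; // separator between origin (title) and word

    // Builds one map for one data line; every key is "<origin>:|:<word>".
    static Map<String, Integer> mapForLine(String origin, String text) {
        Map<String, Integer> map = new TreeMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            map.merge(origin + SEP + word, 1, Integer::sum);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> mapsList = new ArrayList<>();
        mapsList.add(mapForLine("exampleTitle", "This is is is an example example"));
        mapsList.add(mapForLine("otherTitle", "one two two")); // made-up second row

        for (Map<String, Integer> map : mapsList) {
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                // '|' is a regex metacharacter, so it is escaped when splitting.
                String[] keyParts = entry.getKey().split(":\\|:");
                System.out.println(keyParts[0] + " -> " + keyParts[1] + " = " + entry.getValue());
            }
        }
    }
}
```

A nested Map<String, Map<String, Integer>> keyed by title would avoid the composite key altogether; the version here simply mirrors the List of Maps idea described above.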

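The sample of the file contents that was provided is fairly minimal. Still, it does help in the sense that it confirms that yes, there is a header line in the CSV data file and that the typical comma is used as the delimiter in the file... and that's about it. So more questions come to mind: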

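Can multiple file data lines contain the same Title (origin)?
- If "yes", what do you want to do with the word occurrences from the other lines?
  - Do you want to add them to the first established Title?
  - Or do you want to start a new, modified Title? (it must be a different title name in this case)
Can the second column (the text column) contain comma delimiters? Commas are very common in text of any reasonable length.
- If "yes", will the text in column 2 be enclosed in quotation marks?
Roughly how long can the text in column 2 get? (just curious - it's actually irrelevant).
Will there be different CSV files to get word occurrences from that contain more than three columns?
Will there ever be a time when word occurrences need to be derived from more than one column on any CSV file data line?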
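
The question about commas inside quoted text can be illustrated with a short sketch. It assumes a comma delimiter and balanced quotation marks on the line, and it splits only on commas that sit outside quotes by using a lookahead. The class and method names are invented for this example:

```java
import java.util.Arrays;

class QuotedSplitDemo {

    // Splits on commas that are followed by an even number of quotation marks,
    // i.e. commas that are not inside a quoted field.
    static String[] splitCsvLine(String line) {
        return line.split("\\s*,\\s*(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        String line = "exampleTitle,\"This is, well, an example\", April 2022";
        // Three columns; the quoted text keeps its commas (and its quotes).
        System.out.println(Arrays.toString(splitCsvLine(line)));
    }
}
```

This does not cover quoted fields that span several lines, so a dedicated CSV parser is likely the safer choice for real data.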

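The method provided below (in the runnable application), named getWordOccurrencesFromCSV(), is flexible enough to cover basically all of the questions above. It also makes use of two helper methods, named getWordOccurrences() and combineSameOrigins(), to get the job done. Although the combineSameOrigins() method is designed specifically for getWordOccurrencesFromCSV(), these helper methods can be used on their own in other situations if desired. The startApp() method gets the ball rolling and displays the generated List of Maps in the console window.

Here is the runnable code (be sure to read all the comments in the code):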

package so_demo_hybridize;

public class SO_Demo_hybridize {

    private final java.util.Scanner userInput = new java.util.Scanner(System.in);
    
    
    public static void main(String[] args) {
        // Started this way to avoid the need for statics.
        new SO_Demo_hybridize().startApp(args);
    }

    private void startApp(String[] args) {
        String ls = System.lineSeparator();
        String filePath = "DemoCSV.csv"; //"DemoCSV.csv";
        
        /* Retrieve the necessary data from the supplied CSV 
           file and place it into a List of Maps:
        
           getWordOccurrencesFromCSV() Parameters Info:
           --------------------------------------------
           filePath:  Path and file name of the CSV file.
           ","     :  The delimiter used in the CSV file.
           1       :  The literal data column which will hold the Origin String.
           2       :  The literal data column which will hold text of words to get occurrences from.
           1       :  The minimum number of occurrences needed to save.   */
        java.util.List<java.util.Map<String, Integer>> mapsList
                = getWordOccurrencesFromCSV(filePath, ",", 1, new int[]{2}, 1);

        /* Display what is desired from the gathered file data now 
           contained within the Maps held within the 'mapsList' List.
           Now that you have this List of Maps, you can do whatever 
           and display whatever you like with the data.           */
        System.out.println("Word Occurrences In CSV File" + ls 
                         + "============================" + ls);
        for (java.util.Map<String, Integer> maps : mapsList) {
            String mapTitle = "";
            int cnt = 0;
            for (java.util.Map.Entry<String, Integer> entry : maps.entrySet()) {
                /* Because the Origin is attached to the Map Key (a word) 
                   we need to split it off. Note the special delimiter. */
                String[] keyParts = entry.getKey().split(":\\|:");
                String wordOrigin = keyParts[0];
                String word = keyParts[1];
                if (mapTitle.isEmpty()) {
                    mapTitle = "Title of document: " + wordOrigin;
                    // The Title underline...
                    String underLine = String.join("", java.util.Collections.nCopies(mapTitle.length(), "-"));
                    System.out.println(underLine + ls + mapTitle + ls + underLine);
                    // Display a Header and underline
                    String mapHeader = "Words          Values" + ls
                            + "=====================";
                    System.out.println(mapHeader);
                    cnt++;
                }
                System.out.println(String.format("%-15s%-6s", word, entry.getValue()));
            }
            if (cnt > 0) {
                // The underline for the Word Occurrences table displayed.
                System.out.println("=====================");
                System.out.println();
            }
        }
    }

    /**
     * Retrieves and returns a List of Word Occurrences Maps ({@code Map<String,
     * Integer>}) from each CSV data line from the specified column. Each
     * word is considered the KEY and the number of Occurrences of that word
     * would be VALUE. Each KEY in each Map is also prefixed with the Origin 
     * String of that word delimited with ":|:". READ THIS DOCUMENT IN FULL!
     *
     * @param csvFilePath           (String) The full path and file name of the
     *                              CSV file to process.<br>
     *
     * @param csvDelimiter          (String) The delimiter used within the CSV
     *                              File. This can be any single character
     *                              string including the whitespace. Although
     *                              not mandatory, adding whitespaces to your
     *                              CSV Delimiter argument should be
     *                              discouraged.<br>
     *
     * @param originFromColumn      (Integer) The CSV File line data column
     *                              literal number where the Origin for the
     *                              evaluated occurrences will be related to. By
     *                              <b><i>literal</i></b> we mean the actual 
     *                              column number, <u>not</u> the column Index 
     *                              Value. Whatever literal column number is 
     *                              supplied, the data within that column should 
     *                              be Unique to all other lines within the CSV 
     *                              data file. In most CSV files, records in that 
     *                              file (each line) usually contains one column 
     *                              that contains a Unique ID of some sort which 
     *                              identifies that line as a separate record. It 
     *                              would be this column which would make a good 
     *                              candidate to use as the <b><i>origin</i></b> 
     *                              for the <i>Word Occurrences</i> about to be 
     *                              generated from the line (record) column 
     *                              specified from the argument supplied to the 
     *                              'occurrencesFromColumn' parameter.<br><br>
     *
     *                              If null is supplied to this parameter then 
     *                              Column one (1) (index 0) of every data line 
     *                              will be assumed to contain the Origin data 
     *                              string.<br>
     *
     * @param occurrencesFromColumn (int[] Array) Because this method can gather
     *                              Word Occurrences from 1 <b><i>or
     *                              more</i></b> columns within any given CSV
     *                              file data line, the literal column number(s)
     *                              must be supplied within an <b>int[]</b> array. 
     *                              The objective of this method is to obviously 
     *                              collect Word Occurrences contained within 
     *                              text that is mounted within at least one 
     *                              column of any CSV file data line, therefore, 
     *                              the literal column number(s) from where the 
     *                              Words to process are located need to be supplied. 
     *                              By <b><i>literal</i></b> we mean the actual 
     *                              column number, <u>not</u> the column Index 
     *                              Value. The desired column number (or column 
     *                              numbers) can be supplied to this parameter 
     *                              in this fashion: <b><i>new int[]{3}</i></b> 
     *                              OR <b><i>new int[]{3,5,6}</i></b>.<br><br> 
     *                              All words processed, regardless of what 
     *                              columns they come from, will all fall under 
     *                              the same Origin String.<br><br>
     *
     *                              Null <b><u>can not</u></b> be supplied as an 
     *                              argument to this parameter.<br>
     *
     * @param minWordCountRequired  (Integer) If any integer value less than 2
     *                              is supplied then all words within the
     *                              supplied Input String will be placed into
     *                              the map regardless of how many there are. If
     *                              however you only want words where there are
     *                              two (2) or more within the Input String then
     *                              supply 2 as an argument to this parameter.
     *                              If you only want words where there are three
     *                              (3) or more within the Input String then
     *                              supply 3 as an argument to this parameter...
     *                              and so on.<br><br>
     *
     *                              If null is supplied to this parameter then a 
     *                              default of one (1) will be assumed.<br>
     *
     * @param options               (Optional - Two parameters both boolean):<pre>
     *
     *      noDuplicateOrigins - Optional - Default is true. By default, duplicate
     *                           Origins are not permitted. This means that no two
     *                           Maps within the List can contain the same Origin
     *                           for word occurrences. Obviously, it is possible
     *                           for a CSV file to contain data lines which holds
     *                           duplicate Origins but in this case the words in
     *                           the duplicate Origin are added to the Map within
     *                           the List which already contains that Origin and
     *                           if any word in the new duplicate Origin Map is
     *                           found to be in the already stored Map with the
     *                           original Origin then the occurrences count for
     *                           the word in the new Map is added to the same word
     *                           occurrences count of the already stored Map.
     *
     *                           If boolean <b>false</b> is optionally supplied to this
     *                           parameter then a duplicate Map with the duplicate
     *                           Origin is added to the List of Maps.
     *
     *                           Null can be supplied to this optional parameter.
     *                           You can not just supply a blank comma.
     *
     *      noNumerics         - Optional - Default is false. By default, this
     *                           method considers numbers (either integer or
     *                           floating point) as words therefore this
     *                           parameter would be considered as always
     *                           false. If however you don't want the
     *                           occurrences of numbers to be placed into the
     *                           returned Map then you can optionally supply
     *                           an argument of boolean true here.
     *
     *                           If null or a boolean value is supplied to this
     *                           optional parameter then null or a boolean value
     *                           <u>must</u> be supplied to the noDuplicateOrigins
     *                           parameter.</pre>
     *
     * @return ({@code List<Map<String, Integer>>})
     */
    @SuppressWarnings("CallToPrintStackTrace")
    public java.util.List<java.util.Map<String, Integer>> getWordOccurrencesFromCSV(
            String csvFilePath, String csvDelimiter, Integer originFromColumn,
            int[] occurrencesFromColumn, Integer minWordCountRequired, Boolean... options) {
        String ls = System.lineSeparator();

        // Handle invalid arguments to this method...
        // The null/empty check comes first so File is never built from a null path.
        if (csvFilePath == null || csvFilePath.isEmpty()) {
            throw new IllegalArgumentException(ls + "getWordOccurrencesFromCSV() "
                    + "Method Error! The csvFilePath parameter can not be supplied "
                    + "null or a null string!" + ls);
        }
        else if (!new java.io.File(csvFilePath).exists()) {
            throw new IllegalArgumentException(ls + "getWordOccurrencesFromCSV() "
                    + "Method Error! The file indicated below can not be found!" + ls
                    + csvFilePath + ls);
        }
        else if (csvDelimiter == null || csvDelimiter.isEmpty()) {
            throw new IllegalArgumentException(ls + "getWordOccurrencesFromCSV() "
                    + "Method Error! The csvDelimiter parameter can not be supplied "
                    + "null or a null string!" + ls);
        }
        if (originFromColumn == null || originFromColumn < 1) {
            originFromColumn = 1;
        }
        for (int i = 0; i < occurrencesFromColumn.length; i++) {
            if (occurrencesFromColumn[i] == originFromColumn) {
                throw new IllegalArgumentException(ls + "getWordOccurrencesFromCSV() "
                        + "Method Error! The 'occurrencesFromColumn' argument ("
                        + occurrencesFromColumn[i] + ")" + ls + "can not be the same column "
                        + "as the 'originFromColumn' argument (" + originFromColumn
                        + ")!" + ls);
            }
            else if (occurrencesFromColumn[i] < 1) {
                throw new IllegalArgumentException(ls + "getWordOccurrencesFromCSV() "
                        + "Method Error! The argument for the occurrencesFromColumn "
                        + "parameter can not be less than 1!" + ls);
            }
        }
        if (minWordCountRequired == null || minWordCountRequired < 2) {
            minWordCountRequired = 1;
        }
        final int minWrdCnt = minWordCountRequired;

        // Take care of the Optional Parameters
        boolean noDuplicateOrigins = true;
        boolean noNumerics = false;

        if (options != null && options.length > 0) {
            if (options[0] != null) {
                noDuplicateOrigins = options[0];
            }
            // Check the length first so a single option does not index out of bounds.
            if (options.length >= 2 && options[1] != null) {
                noNumerics = options[1];
            }
        }

        java.util.List<java.util.Map<String, Integer>> mapsList = new java.util.ArrayList<>();

        // 'Try With Resources' is used here to auto-close file and free resources.
        try (java.util.Scanner reader = new java.util.Scanner(new java.io.FileReader(csvFilePath))) {
            String line = reader.nextLine(); // Skip the Header line (first line in file).
            String origin = "", date = "";

            java.util.Map<String, Integer> map;

            while (reader.hasNextLine()) {
                line = reader.nextLine().trim();
                // Skip blank lines (if any)
                if (line.isEmpty()) {
                    continue;
                }

                // Get columnar data from data line
                // If there are no quotation marks in data line.
                String regex = "\\s*\\" + csvDelimiter + "\\s*";
                /* If there are quotation marks in data line and they are 
                   actually balanced. If they're not balanced then we obviously
                   use the regular expression above. The regex below ignores
                   the supplied delimiter contained between quotation marks. */
                if (line.contains("\"") && line.replaceAll("[^\"]", "").length() % 2 == 0) {
                     regex = "\\s*\\" + csvDelimiter + "\\s*(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
                }   
                String[] csvColumnStrings = line.split(regex);
                // Acquire the Origin String
                origin = csvColumnStrings[originFromColumn - 1];

                // Get the Word Occurrences from the provided column number(s)...
                for (int i = 0; i < occurrencesFromColumn.length; i++) {
                    /* Acquire the String to get Word Occurrences from 
                   and remove any punctuation characters from it. */
                    line = csvColumnStrings[occurrencesFromColumn[i] - 1];
                    line = line.replaceAll("\\p{Punct}", "").replaceAll("\\s+", " ").toLowerCase();

                    // Get Word Occurrences...
                    map = getWordOccurrences(origin, line, noNumerics);

                    /* Has same Origin been processed before? 
                   If so, do we add this one to the original?   */
                    if (noDuplicateOrigins || i > 0) {
                        if (combineSameOrigins(mapsList, map) > 0) {
                            continue;
                        }
                    }
                    mapsList.add(map);
                }
            }
        }
        catch (java.io.FileNotFoundException ex) {
            ex.printStackTrace();
        }

        /* Remove any words from all the Maps within the List that 
           does not meet our occurrences minimum count argument 
           supplied to the 'minWordCountRequired' parameter. 
           (Java8+ needed)  */
        for (java.util.Map<String, Integer> mapInList : mapsList) {
            mapInList.entrySet().removeIf(e -> e.getValue() < minWrdCnt);
        }

        return mapsList;  // Return the generated List of Maps
    }

    /**
     * This method will go through all Maps contained within the supplied List of 
     * Maps and see if  the Origin String within the supplied Map already exists 
     * within a Listed Map. If it does then those words within the Supplied Map 
     * are added to the Map within the List if they don't already exist there.
     * If any words from the Supplied Map does exist within the Listed Map then 
     * only the count values from those words are summed to the words within the 
     * Listed Map.<br>
     * 
     * @param list ({@code List of Map<String, Integer>}) The List Interface which 
     * contains all the Maps of Word Occurrences or different Origins.<br>
     * 
     * @param suppliedMap ({@code Map<String, Integer> Map}) The Map to check 
     * against all Maps contained within the List of Maps for duplicate Origins.
     * 
     * @return (int) The number of words added to any Map contained within the 
     * List which contains the same Origin.
     */
    public int combineSameOrigins(java.util.List<java.util.Map<String, Integer>> list,
            java.util.Map<String, Integer> suppliedMap) {
        int wrdCnt = 0;
        String newOrigin = suppliedMap.keySet().stream().findFirst().get().split(":\\|:")[0].trim();
        String originInListedMap;
        for (java.util.Map<String, Integer> mapInList : list) {
            originInListedMap = mapInList.keySet().stream().findFirst().get().split(":\\|:")[0].trim();
            if (originInListedMap.equals(newOrigin)) {
                wrdCnt++;
                for (java.util.Map.Entry<String, Integer> suppliedMapEntry : suppliedMap.entrySet()) {
                    String key = suppliedMapEntry.getKey();
                    int value = suppliedMapEntry.getValue();
                    boolean haveIt = false;
                    for (java.util.Map.Entry<String, Integer> mapInListEntry : mapInList.entrySet()) {
                        if (mapInListEntry.getKey().equals(key)) {
                            haveIt = true;
                            mapInListEntry.setValue(mapInListEntry.getValue() + value);
                            break;
                        }
                    }
                    if (!haveIt) {
                        mapInList.put(key, value);
                    }
                }
            }
        }
        return wrdCnt;
    }

    /**
     * Find the Duplicate Words In a String And Count the Number Of Occurrences
     * for each of those words. This method will fill and return a Map of all
     * the words within the supplied string (as Key) and the number of
     * occurrences for each word (as Value).<br><br>
     * <p>
     * <b>Example to read the returned Map and display in console:</b><pre>
     * {@code for (java.util.Map.Entry<String, Integer> entry : map.entrySet()) {
     *      System.out.println(String.format("%-12s%-4s", entry.getKey(), entry.getValue()));
     *  } }</pre>
     *
     * @param origin      (String) The UNIQUE origin String of what the word is
     *                    related to. This can be anything as long as this same
     *                    origin string is applied to all the related words of
     *                    the same Input String. A unique ID of some sort or a
     *                    title string of some kind would work fine for
     *                    this.<br>
     *
     * @param inputString (String) The string to process for word
     *                    occurrences.<br>
     *
     * @param noNumerics  (Optional - Default - false) By default, this method
     *                    considers numbers (either integer or floating point)
     *                    as words therefore this parameter would be considered
     *                    as always false. If however you don't want the
     *                    occurrences of numbers to be placed into the returned
     *                    Map then you can optionally supply an argument of
     *                    boolean true here.<br>
     *
     * @return ({@code java.util.Map<String, Integer>}) Consisting of individual
     *         words found within the supplied String as KEY and the number of
     *         occurrences as VALUE.
     */
    public static java.util.Map<String, Integer> getWordOccurrences(String origin,
            String inputString, Boolean... noNumerics) {
        boolean allowNumbersAsWords = true;
        if (noNumerics.length > 0) {
            if (noNumerics[0] != null) {
                allowNumbersAsWords = !noNumerics[0];
            }
        }

        // Use this to have Words in Ascending order.
        java.util.TreeMap<String, Integer> map = new java.util.TreeMap<>();

        // Use this to have Words in the order of Insertion (when they're added).
        //java.util.LinkedHashMap<String, Integer> map = new java.util.LinkedHashMap<>();
        // Use this to have Words in a 'who cares' order.
        //java.util.Map<String, Integer> map = new java.util.HashMap<>();
        String[] words = inputString.replaceAll("[^A-Za-z0-9' ]", "")
                .replaceAll("\\s+", " ").trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (!allowNumbersAsWords && w.matches("-?\\d+(\\.\\d+)?")) {
                continue;
            }
            if (map.containsKey(origin + ":|:" + w)) {
                int cnt = map.get(origin + ":|:" + w);
                map.put(origin + ":|:" + w, ++cnt);
            }
            else {
                map.put(origin + ":|:" + w, 1);
            }
        }
        return map;
    }
}
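
As a side note, the summing step that combineSameOrigins() performs with nested loops could probably be written more compactly with Map.merge. The sketch below is an alternative under that assumption, not the code above. The class name, the mergeInto helper and the sample keys are made up for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

class MergeCountsDemo {

    // Adds every count from `source` into `target`, summing counts for shared keys.
    static void mergeInto(Map<String, Integer> target, Map<String, Integer> source) {
        source.forEach((key, value) -> target.merge(key, value, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> stored = new TreeMap<>();
        stored.put("exampleTitle:|:is", 3);
        stored.put("exampleTitle:|:example", 2);

        // Hypothetical second data line with the same origin.
        Map<String, Integer> duplicate = new TreeMap<>();
        duplicate.put("exampleTitle:|:is", 1);
        duplicate.put("exampleTitle:|:new", 1);

        mergeInto(stored, duplicate);
        System.out.println(stored); // "is" now counts 4 and "new" was added with 1
    }
}
```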

java

csv