Reading from a file and counting the words in the file

Question

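I'm working through a list of project ideas, and my goal is to complete all of them; hopefully by then I'll be fairly good at C#. I wrote a program that counts the number of words in a given file. It works, but there is a bug in it.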

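How it works:

The program prompts for a file name, and the user enters the file path or name.
The file contents are then run through the regular expression "[a-zA-Z]+" to split the words into an array.
The length of the array is then counted.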

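The only trouble I'm having is that if the text contains an ' (apostrophe), the word gets split into two words. For example, if I read from a file containing: this is a test of my program and now I'm going to test it again, to see what happens... it outputs 20 when it should output 19, because it splits I'm into two words. Is there a way to make the regex account for proper grammar usage, or is there a way to do this without regex?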

Source:

using System;
using System.IO;
using Reg = System.Text.RegularExpressions.Regex;

namespace count
{
    class CountWordsInString
    {
        static string Count(string list)
        {
            string[] arrStr = Reg.Split(list, "[a-zA-Z]+");
            int length = arrStr.Length - 1;

            return length.ToString();
        }

        static void Main(string[] args)
        {
            Console.Write("Enter file path: ");
            var file = Console.ReadLine();

            var info = File.ReadAllText(file);

            Console.WriteLine(Count(info));
        }
    }
}

Answer 1

One way to do this is to match anything that is not whitespace (spaces, tabs, and so on). This can be done with a negated character class, like so:

[^\s]+

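The ^ means the character class matches anything except the characters listed in it. Of course, this assumes that your definition of a "word" is a run of characters delimited by whitespace.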

Try it out here.
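
As a rough sketch of how that pattern could be dropped into the asker's program: the class name WhitespaceWordCount and the helper Count are made up for illustration, and the text is hard-coded rather than read from a file. It uses Regex.Matches to count the matches directly instead of splitting.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original question.
public static class WhitespaceWordCount
{
    // Counts maximal runs of non-whitespace characters.
    public static int Count(string text)
    {
        return Regex.Matches(text, @"[^\s]+").Count;
    }

    public static void Main()
    {
        string sample = "this is a test of my program and now I'm going to test it again, to see what happens...";
        Console.WriteLine(Count(sample)); // prints 19
    }
}
```

Keep in mind that under this definition any free-standing punctuation, such as a dash with spaces on both sides, is counted as a word too.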

Answer 2

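In my opinion, you don't need regex if all you want is to count words. Regex is a large library, and it can consume a lot of resources if you are not careful about how you use it.

The Split function is a better choice: load the text into a variable and apply the Split method like this: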

string[] separators = { " ", "\r\n", "\n" };
string value = "the string that will be word counted";
string[] words = value.Split(separators, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(words.Length); // arrays expose Length, not Count
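
The separator list above does not include tabs. One possible variation, sketched here under the assumption that any whitespace should separate words, passes a null char[] to Split, which makes it treat every whitespace character as a delimiter. The class name SplitWordCount and the helper Count are illustrative only.

```csharp
using System;

// Hypothetical helper class, not part of the original answer.
public static class SplitWordCount
{
    public static int Count(string text)
    {
        // A null separator array tells Split to use all whitespace characters as delimiters.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return words.Length;
    }

    public static void Main()
    {
        // Tabs, a Windows line break, and a double space all act as separators.
        Console.WriteLine(Count("I'm\ttesting tabs\r\nand  newlines")); // prints 5
    }
}
```

Because nothing splits on the apostrophe, I'm stays a single word here.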

Answer 3

If you want "words" to include optional apostrophes, you could use the regex

[A-Za-z]+('[A-Za-z]+)*

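This will match words containing apostrophes, as long as each apostrophe is surrounded by letters. So it will match fo'c's'le (one word, according to the Ubuntu dictionary), but it will not match a''b or 'Twas as a whole: it picks out a and b separately, and Twas without its leading apostrophe. For word counting, leading and trailing apostrophes make no difference, since 'Twas counts as one word either way. But if you want to do something with the word itself, such as spell-checking it, then you'll need a more sophisticated approach that handles 'Twas correctly while still extracting the word Go from

"Start running when I say 'Go!'," he said.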

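To check how the apostrophe pattern behaves on the inputs discussed here, a small test program along these lines might help. The class name ApostropheWordCount and the helper Count are hypothetical, and the strings are hard-coded instead of being read from a file.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class ApostropheWordCount
{
    // Counts runs of letters, allowing apostrophes that sit between letters.
    public static int Count(string text)
    {
        return Regex.Matches(text, @"[A-Za-z]+('[A-Za-z]+)*").Count;
    }

    public static void Main()
    {
        string sample = "this is a test of my program and now I'm going to test it again, to see what happens...";
        Console.WriteLine(Count(sample));       // prints 19
        Console.WriteLine(Count("fo'c's'le"));  // prints 1
        Console.WriteLine(Count("a''b"));       // prints 2
        Console.WriteLine(Count("\"Start running when I say 'Go!',\" he said.")); // prints 8
    }
}
```

Since the pattern only accepts ASCII letters, digits and accented characters are not counted as parts of words.
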
Answer 4

using System.IO; // File.ReadAllText
using System.Text.RegularExpressions; // Regex

#region Return the count of words in a file
public int wordamount(string filename)
{
    // Try the apostrophe form first: with \w+ listed first, "I'm" would be matched as "I" and "m".
    return Regex.Matches(File.ReadAllText(filename), @"\w+'\w+|\w+").Count;
}
#endregion

c#

regex

grammar