单词快速计数

Word Fast Counting

我正在为我的 ASP.NET 服务器实现字数统计功能,我想知道这样做最快的方法是什么,因为我不确定使用简单的

text.AsParallel().Count(Char.IsWhiteSpace);

是最快的方法。由于此功能可能会在相对较长的文本墙上使用很多,所以我希望它尽可能快,即使这意味着使用不安全的方法。

编辑:使用 Rufus L 的代码以及我自己的不安全方法进行一些基准测试:

public static unsafe int CountWords(string s)
    {
        int count = 0;
        fixed (char* ps = s)
        {
            int len = s.Length;
            char* pc = ps;
            while (len-- > 0)
            {
                if (char.IsWhiteSpace(*pc++))
                {
                    count++;
                }
            }
        }
        return count;
    }

Split(null): 681979 words in 415867 ticks.

Count(WhiteSpace): 681978 words in 147860 ticks.

AsParallel: 681978 words in 401077 ticks.

Unsafe: 681978 words in 98139 ticks.

我仍然愿意接受任何更好的想法:)

编辑 2:

重写了函数,也处理了多个空格:

public static unsafe int CountWords(string s)
    {
        int count = 0;
        fixed (char* ps = s)
        {
            int len = s.Length;
            bool inWord = false;
            char* pc = ps;
            while (len-- > 0)
            {
                if (char.IsWhiteSpace(*pc++))
                {
                    if (!inWord)
                    {
                        inWord = true;
                    }
                }
                else
                {
                    if (inWord)
                    {
                        inWord = false;
                        count++;
                    }
                }
                if (len == 0)
                {
                    if (inWord)
                    {
                        count++;
                    }
                }
            }
        }
        return count;
    }

Split(null): 681979 words in 517055 ticks.

Count(WhiteSpace): 681978 words in 148952 ticks.

AsParallel: 681978 words in 410289 ticks.

Unsafe: 660000 words in 114833 ticks.

根据我的测试,这要快得多,快了 4 倍(但请参阅下面的更新以获得不同的结果):

wordCount = text.Split(null).Length;

这是测试,如果您想尝试一下。请注意,由于任务切换的成本,添加 AsParallel() 会减慢我机器上的进程:

public static void Main()
{
    var text = File.ReadAllText("d:\public\temp\temp.txt");
    int wordCount;
    var sw = new Stopwatch();

    sw.Start();
    wordCount = text.Split(null).Length;
    sw.Stop();

    Console.WriteLine("Split(null): {0} words in {1} ticks.", wordCount, 
        sw.ElapsedTicks);

    sw.Restart();
    wordCount = text.Count(Char.IsWhiteSpace);
    sw.Stop();

    Console.WriteLine("Count(WhiteSpace): {0} words in {1} ticks.", wordCount, 
        sw.ElapsedTicks);

    sw.Restart();
    wordCount = text.AsParallel().Count(Char.IsWhiteSpace);
    sw.Stop();

    Console.WriteLine("AsParallel: {0} words in {1} ticks.", wordCount, 
        sw.ElapsedTicks);
}

输出:

Split(null): 964 words in 629 ticks.

Count(WhiteSpace): 963 words in 2377 ticks.

AsParallel: 963 words in 208983 ticks.

更新

在使字符串更长(OP 提到了 1000 个单词中的 100 个)之后,结果变得更加相似,并且 Count(WhiteSpace) 方法变得比 Split(null) 方法更快:

代码更改:

var text = File.ReadAllText("d:\public\temp\temp.txt");
var textToSearch = new StringBuilder();
for (int i = 0; i < 500; i++) textToSearch.Append(text);
text = textToSearch.ToString();

输出:

Split(null): 481501 words in 185135 ticks.

Count(WhiteSpace): 481500 words in 101373 ticks.

AsParallel: 481500 words in 336117 ticks.

回答您的问题,方法 AsParallel() 非常快。但存在更多选择,例如:

使用正则表达式:

string input = "text text text text";
string pattern = "(-)";

string[] substrings = Regex.Split(input, pattern);    // Split on hyphens
Console.WriteLine("Words: {0}", substrings.count());

但我重申,AsParallel() 方法非常快。您可以进行概念验证,以找出哪个更好。在代码的开头和末尾放置一个秒表 (),并将 AsParallel 运行时 () 与正则表达式进行比较,因此会有更多 'exact' 的答案。

更新

使用Parallel.For:

static void Main(string[] args)
        {
            string text = @"text text text text text text text text text text ";
            int count = 0;

            Console.WriteLine("Generating words, wait...");
            Parallel.For(0, 100000, i =>
            {
                text += @"text text text text text text text text text text ";
            });


            var sw = new Stopwatch();
            sw.Start();
            Parallel.For(0, text.Length, i =>
            {
                if (char.IsWhiteSpace(text[i]))
                    count++;
            });
            sw.Stop();

            Console.WriteLine("Words: {0} in {1} ticks", count, sw.ElapsedTicks);
            Console.ReadLine();
        }

结果:

PS: Note that the Parallel.For used is not managed

经过一些基准测试,以下 unsage 代码在 500 多个单词的情况下产生最快的结果:

public static unsafe int CountWords(string s)
{
    int count = 0;
    fixed (char* ps = s)
    {
        int len = s.Length;
        bool inWord = false;
        char* pc = ps;
        while (len-- > 0)
        {
            if (char.IsWhiteSpace(*pc++))
            {
                if (!inWord)
                {
                    inWord = true;
                }
            }
            else
            {
                if (inWord)
                {
                    inWord = false;
                    count++;
                }
            }
            if (len == 0)
            {
                if (inWord)
                {
                    count++;
                }
            }
        }
    }
    return count;
}