单词快速计数
Word Fast Counting
我正在为我的 ASP.NET 服务器实现字数统计功能,我想知道这样做最快的方法是什么,因为我不确定使用简单的
text.AsParallel().Count(Char.IsWhiteSpace);
是最快的方法。由于此功能可能会在相对较长的文本墙上使用很多,所以我希望它尽可能快,即使这意味着使用不安全的方法。
编辑:使用 Rufus L 的代码以及我自己的不安全方法进行一些基准测试:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
count++;
}
}
}
return count;
}
Split(null): 681979 words in 415867 ticks.
Count(WhiteSpace): 681978 words in 147860 ticks.
AsParallel: 681978 words in 401077 ticks.
Unsafe: 681978 words in 98139 ticks.
我仍然愿意接受任何更好的想法:)
编辑 2:
重写了函数,也处理了多个空格:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
bool inWord = false;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
if (!inWord)
{
inWord = true;
}
}
else
{
if (inWord)
{
inWord = false;
count++;
}
}
if (len == 0)
{
if (inWord)
{
count++;
}
}
}
}
return count;
}
Split(null): 681979 words in 517055 ticks.
Count(WhiteSpace): 681978 words in 148952 ticks.
AsParallel: 681978 words in 410289 ticks.
Unsafe: 660000 words in 114833 ticks.
根据我的测试,这要快得多,快了 4 倍(但请参阅下面的更新以获得不同的结果):
wordCount = text.Split(null).Length;
这是测试,如果您想尝试一下。请注意,由于任务切换的成本,添加 AsParallel()
会减慢我机器上的进程:
public static void Main()
{
var text = File.ReadAllText("d:\public\temp\temp.txt");
int wordCount;
var sw = new Stopwatch();
sw.Start();
wordCount = text.Split(null).Length;
sw.Stop();
Console.WriteLine("Split(null): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("Count(WhiteSpace): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.AsParallel().Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("AsParallel: {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
}
输出:
Split(null): 964 words in 629 ticks.
Count(WhiteSpace): 963 words in 2377 ticks.
AsParallel: 963 words in 208983 ticks.
更新
在使字符串更长(OP 提到了 1000 个单词中的 100 个)之后,结果变得更加相似,并且 Count(WhiteSpace)
方法变得比 Split(null)
方法更快:
代码更改:
var text = File.ReadAllText("d:\public\temp\temp.txt");
var textToSearch = new StringBuilder();
for (int i = 0; i < 500; i++) textToSearch.Append(text);
text = textToSearch.ToString();
输出:
Split(null): 481501 words in 185135 ticks.
Count(WhiteSpace): 481500 words in 101373 ticks.
AsParallel: 481500 words in 336117 ticks.
回答您的问题,方法 AsParallel() 非常快。但存在更多选择,例如:
使用正则表达式:
string input = "text text text text";
string pattern = "(-)";
string[] substrings = Regex.Split(input, pattern); // Split on hyphens
Console.WriteLine("Words: {0}", substrings.count());
但我重申,AsParallel() 方法非常快。您可以进行概念验证,以找出哪个更好。在代码的开头和末尾放置一个秒表 (),并将 AsParallel 运行时 () 与正则表达式进行比较,因此会有更多 'exact' 的答案。
更新
使用Parallel.For:
static void Main(string[] args)
{
string text = @"text text text text text text text text text text ";
int count = 0;
Console.WriteLine("Generating words, wait...");
Parallel.For(0, 100000, i =>
{
text += @"text text text text text text text text text text ";
});
var sw = new Stopwatch();
sw.Start();
Parallel.For(0, text.Length, i =>
{
if (char.IsWhiteSpace(text[i]))
count++;
});
sw.Stop();
Console.WriteLine("Words: {0} in {1} ticks", count, sw.ElapsedTicks);
Console.ReadLine();
}
结果:
PS: Note that the Parallel.For used is not managed
经过一些基准测试,以下 unsage 代码在 500 多个单词的情况下产生最快的结果:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
bool inWord = false;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
if (!inWord)
{
inWord = true;
}
}
else
{
if (inWord)
{
inWord = false;
count++;
}
}
if (len == 0)
{
if (inWord)
{
count++;
}
}
}
}
return count;
}
我正在为我的 ASP.NET 服务器实现字数统计功能,我想知道这样做最快的方法是什么,因为我不确定使用简单的
text.AsParallel().Count(Char.IsWhiteSpace);
是最快的方法。由于此功能可能会在相对较长的文本墙上使用很多,所以我希望它尽可能快,即使这意味着使用不安全的方法。
编辑:使用 Rufus L 的代码以及我自己的不安全方法进行一些基准测试:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
count++;
}
}
}
return count;
}
Split(null): 681979 words in 415867 ticks.
Count(WhiteSpace): 681978 words in 147860 ticks.
AsParallel: 681978 words in 401077 ticks.
Unsafe: 681978 words in 98139 ticks.
我仍然愿意接受任何更好的想法:)
编辑 2:
重写了函数,也处理了多个空格:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
bool inWord = false;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
if (!inWord)
{
inWord = true;
}
}
else
{
if (inWord)
{
inWord = false;
count++;
}
}
if (len == 0)
{
if (inWord)
{
count++;
}
}
}
}
return count;
}
Split(null): 681979 words in 517055 ticks.
Count(WhiteSpace): 681978 words in 148952 ticks.
AsParallel: 681978 words in 410289 ticks.
Unsafe: 660000 words in 114833 ticks.
根据我的测试,这要快得多,快了 4 倍(但请参阅下面的更新以获得不同的结果):
wordCount = text.Split(null).Length;
这是测试,如果您想尝试一下。请注意,由于任务切换的成本,添加 AsParallel()
会减慢我机器上的进程:
public static void Main()
{
var text = File.ReadAllText("d:\public\temp\temp.txt");
int wordCount;
var sw = new Stopwatch();
sw.Start();
wordCount = text.Split(null).Length;
sw.Stop();
Console.WriteLine("Split(null): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("Count(WhiteSpace): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.AsParallel().Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("AsParallel: {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
}
输出:
Split(null): 964 words in 629 ticks.
Count(WhiteSpace): 963 words in 2377 ticks.
AsParallel: 963 words in 208983 ticks.
更新
在使字符串更长(OP 提到了 1000 个单词中的 100 个)之后,结果变得更加相似,并且 Count(WhiteSpace)
方法变得比 Split(null)
方法更快:
代码更改:
var text = File.ReadAllText("d:\public\temp\temp.txt");
var textToSearch = new StringBuilder();
for (int i = 0; i < 500; i++) textToSearch.Append(text);
text = textToSearch.ToString();
输出:
Split(null): 481501 words in 185135 ticks.
Count(WhiteSpace): 481500 words in 101373 ticks.
AsParallel: 481500 words in 336117 ticks.
回答您的问题,方法 AsParallel() 非常快。但存在更多选择,例如:
使用正则表达式:
string input = "text text text text";
string pattern = "(-)";
string[] substrings = Regex.Split(input, pattern); // Split on hyphens
Console.WriteLine("Words: {0}", substrings.count());
但我重申,AsParallel() 方法非常快。您可以进行概念验证,以找出哪个更好。在代码的开头和末尾放置一个秒表 (),并将 AsParallel 运行时 () 与正则表达式进行比较,因此会有更多 'exact' 的答案。
更新
使用Parallel.For:
static void Main(string[] args)
{
string text = @"text text text text text text text text text text ";
int count = 0;
Console.WriteLine("Generating words, wait...");
Parallel.For(0, 100000, i =>
{
text += @"text text text text text text text text text text ";
});
var sw = new Stopwatch();
sw.Start();
Parallel.For(0, text.Length, i =>
{
if (char.IsWhiteSpace(text[i]))
count++;
});
sw.Stop();
Console.WriteLine("Words: {0} in {1} ticks", count, sw.ElapsedTicks);
Console.ReadLine();
}
结果:
PS: Note that the Parallel.For used is not managed
经过一些基准测试,以下 unsage 代码在 500 多个单词的情况下产生最快的结果:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
bool inWord = false;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
if (!inWord)
{
inWord = true;
}
}
else
{
if (inWord)
{
inWord = false;
count++;
}
}
if (len == 0)
{
if (inWord)
{
count++;
}
}
}
}
return count;
}