Strange behaviour of TParallel.For default ThreadPool
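I am trying out the parallel programming features of Delphi XE7 Update 1.
I created a simple TParallel.For loop that basically performs some bogus operations to pass the time.
I launched the program on a 36 vCPU AWS instance (c4.8xlarge) to see what the gain from parallel programming could be.
When I first launch the program and execute the TParallel.For loop, I see a significant gain (although admittedly a lot less than I expected with 36 vCPUs):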
Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms
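If I do not close the program and run the pass again on the 36 vCPU machine shortly afterwards (e.g. immediately, or some 10-20 seconds later), the parallel pass gets a lot worse: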
Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms
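If I do not close the program and wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I get the same results as when I first launched the program (a 10x improvement in response time).
The very first pass after launching the program is always faster on the 36 vCPU machine, so this effect seems to happen only from the second call to TParallel.For in the program onwards.
This is the sample code I am running: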
unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;
begin
  sw := TStopWatch.Create;
  maxItems := 5000;

  // Build a random reference string of lowercase letters.
  Randomize;
  SetLength(referenceStr, 120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, maxItems,
      procedure (Value: Integer)
      var
        index: integer;
        found: integer;
      begin
        // Count locally, then publish once per iteration.
        found := 0;
        for index := 1 to length(referenceStr) do begin
          if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
            inc(found);
          end;
        end;
        TInterlocked.Add(matches, found);
      end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to maxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.
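Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool, but I do not see the point of using parallel programming if I have to wait minutes between requests for them to be served faster.
Am I missing anything about how a TParallel.For loop is supposed to be used?
Please note that I cannot reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). In that case I always get a slight improvement, and I do not get a worse result on subsequent calls to TParallel.For made shortly afterwards.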
Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms
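So this effect seems to occur when many cores are available (36), which is a pity, since the whole point of parallel programming is to benefit from many cores. I wonder whether this is a library bug triggered by the high core count, or by the fact that the core count is not a power of 2 in this case.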
UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:
- 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (which makes it unusable for production)
- 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (which makes it unusable for production)
- 16 vCPUs (c3.4xlarge). You have to wait for less than a second. It could be usable if load is low but response time is still important
- 8 vCPUs (c3.2xlarge). It seems to work normally
- 4 vCPUs (c3.xlarge). It seems to work normally
- 2 vCPUs (m3.large). It seems to work normally
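For reference, the alternative to letting the library choose the pool would look roughly like the sketch below: the program creates its own TThreadPool and passes it to the TParallel.For overload that accepts a pool. This is only a sketch. The program name, RunPass and the three-pass loop are made up for illustration, and it has not been run on any of the instances above, so whether it avoids the slowdown in XE7 Update 1 is an open question.

```delphi
program DedicatedPoolSketch;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.SyncObjs,
  System.Threading;

const
  MaxItems = 5000;
  DataSize = 120000;

// Hypothetical helper: one timed parallel pass on an explicitly supplied pool.
// ReferenceStr is passed by value so the anonymous method can capture it.
procedure RunPass(Pool: TThreadPool; ReferenceStr: string);
var
  Matches: Integer;
  SW: TStopwatch;
begin
  Matches := 0;
  SW := TStopwatch.StartNew;
  TParallel.For(1, MaxItems,
    procedure(Value: Integer)
    var
      Index, Found: Integer;
    begin
      Found := 0;
      for Index := 1 to Length(ReferenceStr) do
        if (Value mod 26) + Ord('a') = Ord(ReferenceStr[Index]) then
          Inc(Found);
      TInterlocked.Add(Matches, Found);
    end,
    Pool); // explicit pool instead of the library default
  Writeln('Parallel matches: ', Matches, ' in ', SW.ElapsedMilliseconds, 'ms');
end;

var
  Pool: TThreadPool;
  Data: string;
  I, Pass: Integer;
begin
  Randomize;
  SetLength(Data, DataSize);
  for I := 1 to DataSize do
    Data[I] := Chr(Ord('a') + Random(26));

  Pool := TThreadPool.Create; // owned by this program, not shared
  try
    // Back-to-back passes, which is where the default pool degrades.
    for Pass := 1 to 3 do
      RunPass(Pool, Data);
  finally
    Pool.Free;
  end;
end.
```

If the later passes stay fast with a private pool, that would point at how the default pool manages its workers rather than at the loop body.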
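I created two test programs, based on yours, to compare System.Threading and OTL. I built them with XE7 Update 1 and OTL r1397. The OTL source I used corresponds to release 3.04. I built with the 32-bit Windows compiler, using release build options.
My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad-core processors. That is 8 processors in total, but the system reports 16 because of hyper-threading. Experience tells me that hyper-threading is just a marketing gimmick, and I have never seen scaling beyond a factor of 8 on this machine.
Now for the two programs, which are almost identical.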
System.Threading
program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  // Repeat forever so that back-to-back passes can be compared.
  while True do
    DoTest;
end.
OTL
program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

// Drain the calling thread's message queue so that OTL can clean up
// terminated tasks (see the EDIT note at the end).
procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(msg, 0, 0, 0, PM_REMOVE) and (msg.Message <> WM_QUIT) do begin
    TranslateMessage(msg);
    DispatchMessage(msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.
Now the output.
System.Threading output
Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....
OTL output
Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....
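I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial one.
Conclusion
The code is extremely simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.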
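One way to probe that conclusion further would be to count how many distinct threads actually execute iterations in each pass. The sketch below is not part of the tests above. CountWorkerThreads, SpinCount and the busy loop are made up for illustration, and no particular output is claimed.

```delphi
program WorkerThreadCountSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes,
  System.SyncObjs,
  System.Generics.Collections,
  System.Threading;

const
  MaxItems = 5000;
  SpinCount = 100000; // arbitrary amount of busy work per iteration

// Hypothetical helper: runs one TParallel.For pass and returns the number
// of distinct threads that executed at least one iteration.
function CountWorkerThreads: Integer;
var
  Seen: TDictionary<TThreadID, Integer>;
  Lock: TCriticalSection;
begin
  Seen := TDictionary<TThreadID, Integer>.Create;
  Lock := TCriticalSection.Create;
  try
    TParallel.For(1, MaxItems,
      procedure(Value: Integer)
      var
        K, Dummy: Integer;
      begin
        // Busy work outside the lock, so iterations can overlap.
        Dummy := 0;
        for K := 1 to SpinCount do
          Dummy := (Dummy + K + Value) and $FFFF;

        // Record the ID of the thread that ran this iteration.
        Lock.Enter;
        try
          Seen.AddOrSetValue(TThread.CurrentThread.ThreadID, Dummy);
        finally
          Lock.Leave;
        end;
      end);
    Result := Seen.Count;
  finally
    Lock.Free;
    Seen.Free;
  end;
end;

var
  Pass: Integer;
begin
  for Pass := 1 to 5 do
    Writeln('Pass ', Pass, ': ', CountWorkerThreads, ' distinct thread(s)');
end.
```

If later passes report far fewer threads than the first one, that would be consistent with the pool not keeping its workers available between calls. It would not by itself show where in the library the problem lies.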
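There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helper, earlier versions of System.IOUtils, FireMonkey. The list goes on.
It seems clear that quality is a big problem at Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library, where bugs can lie dormant and be exposed only in specific hardware/software configurations. The experience with TMonitor leads me to believe that Embarcadero do not have sufficient expertise to produce high-quality, correct threading code.
My advice is that you should not use System.Threading in its current form. Until it can be seen to have sufficient quality and correctness, it should be avoided. I suggest that you use OTL.
EDIT: The original OTL version of the program had a live memory leak, which occurred because of an ugly implementation detail. Parallel.For creates tasks using the .Unobserved modifier. As a result, those tasks are destroyed only when some internal message window receives a 'task has terminated' message. This window is created in the same thread as the Parallel.For caller, i.e. in the main thread in this case. As the main thread was not processing messages, tasks were never destroyed, and memory consumption (plus other resources) just piled up. It is possible that this is why that program hung after some time.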