TParallel.For 默认线程池的奇怪行为

Question

我正在试用 Delphi XE7 Update 1 的并行编程功能。

我创建了一个简单的 TParallel.For 循环，基本上是通过一些伪造的操作来打发时间。

我在 AWS 实例 (c4.8xlarge) 的 36 vCPU 上启动了该程序，以尝试查看并行编程的收益。

当我第一次启动程序并执行 TParallel.For 循环时，我看到了显着的收益（虽然承认比我预期的 36 个 vCPU 少很多）：

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

如果我不关闭程序并且运行不久之后（例如，立即或大约 10-20 秒后）在 36 vCPU 机器上再次通过，并行通过会恶化很多：

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

如果我不关闭程序并等待几分钟（不是几秒钟，而是几分钟）然后再次运行通过，我将再次获得第一次获得的结果启动程序（响应时间提高 10 倍）。

启动程序后的第一遍在 36 个 vCPU 的机器上总是更快，所以这种效果似乎只在程序中第二次调用 TParallel.For 时才会发生。

这是我运行ning:

的示例代码

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

 begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr,120000); for i := 1 to 120000 do referenceStr[i] := Chr(Ord('a') + Random(26)); 

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

这是按设计工作的吗？我发现这篇文章 (http://delphiaball.co.uk/tag/parallel-programming/) 建议我让库决定线程池，但如果我必须在请求之间等待几分钟以便更快地处理请求，我不明白使用并行编程的意义.

我是否遗漏了有关如何使用 TParallel.For 循环的任何信息？

请注意，我无法在 AWS m3.large 实例（根据 AWS 的 2 个 vCPU）上重现此内容。在那种情况下，我总是得到轻微的改善，并且在随后不久的 TParallel.For 的后续调用中我没有得到更差的结果。

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

因此，当有许多可用内核 (36) 时，似乎会出现这种效果，这很遗憾，因为并行编程的全部意义在于从许多内核中受益。我想知道这是否是一个库错误，因为内核数太多，或者在这种情况下内核数不是 2 的幂。

UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:

36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)

32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)

16 vCPUs (c3.4xlarge). You have to wait sub second times. It could be usable if load is low but response time still important

8 vCPUs (c3.2xlarge). It seems to work normally

4 vCPUs (c3.xlarge). It seems to work normally

2 vCPUs (m3.large). It seems to work normally

Answer 1

我根据你的创建了两个测试程序来比较 System.Threading 和 OTL。我使用 XE7 update 1 和 OTL r1397 构建。我使用的 OTL 源对应于 3.04 版。我使用 32 位 Windows 编译器构建，使用发布构建选项。

我的测试机器是双 Intel Xeon E5530 运行 Windows 7 x64。该系统有两个四核处理器。总共有 8 个处理器，但由于超线程，系统说有 16 个。经验告诉我，超线程只是营销噱头，我从未见过在这台机器上扩展超过 8 倍。

现在是两个几乎相同的程序。

System.Threading

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

现在是输出。

System.Threading输出

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL输出

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

我离开OTL版本运行很久了，模式一直没变。并行版本总是比串行版本快 7 倍左右。

结论

代码非常简单。唯一可以得出的合理结论是 System.Threading 的实现有缺陷。

有许多与新 System.Threading 库相关的错误报告。所有的迹象都表明它的质量很差。 Embarcadero 在发布不合标准的库代码方面有着长期的记录。我在想 TMonitor，XE3 字符串助手，早期版本的 System.IOUtils，FireMonkey。清单还在继续。

很明显质量是 Embarcadero 的一个大问题。发布的代码很明显没有经过充分测试，如果有的话。这对于线程库来说尤其麻烦，因为在该库中，错误可能处于休眠状态并且仅在特定 hardware/software 配置中暴露。 TMonitor 的经验让我相信 Embarcadero 没有足够的专业知识来生成高质量、正确的线程代码。

我的建议是您不应该使用当前形式的 System.Threading。在可以看到它具有足够的质量和正确性之前，应该避免使用它。我建议你使用OTL。

编辑：该程序的原始 OTL 版本有一个实时内存泄漏，这是由于一个丑陋的实现细节而发生的。 Parallel.For 使用 .Unobserved 修饰符创建任务。这导致所述任务仅在某些内部消息 window 收到 'task has terminated' 消息时才被销毁。这个 window 是在与 Parallel.For 调用者相同的线程中创建的 - 即在本例中是在主线程中。由于主线程没有处理消息，任务永远不会被破坏，内存消耗（加上其他资源）只会堆积起来。有可能是因为那个程序在一段时间后挂了。

TParallel.For 默认线程池的奇怪行为

Strange behaviour of TParallel.For default ThreadPool

delphi

parallel-processing

rtl-ppl