Unmanaged to managed interop performance in x86 and x64

Question

In my tests, I'm seeing the performance cost of unmanaged to managed interop double when compiling for x64 instead of x86. What is causing this slowdown?

I'm testing release builds, not running under the debugger. The loop is 100,000,000 iterations.

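In x86 I'm measuring an average of 8ns per interop call, which seems to match what I've seen elsewhere. Unity's interop for x86 is 8.2ns. A Microsoft article and Hans Passant both mention 7ns. 8ns is 28 clock cycles on my machine, which seems at least reasonable, though I do wonder if it's possible to go faster.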

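In x64 I'm measuring an average of 17ns per interop call. I can't find anyone mentioning a difference between x86 and x64, or even stating which one they mean when giving times. Unity's interop for x64 clocks in at around 5.9ns.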

Regular function calls (including calls into an unmanaged C++ DLL) cost an average of 1.3ns. This doesn't change significantly between x86 and x64.

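Below is the minimal C++/CLI code I use to measure this, though I'm seeing the same numbers in my actual project, which consists of a native C++ project calling into the managed side of a C++/CLI DLL.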

#pragma managed
void
ManagedUpdate()
{
}


#pragma unmanaged
#include <wtypes.h>
#include <cstdint>
#include <cwchar>

struct ProfileSample
{
    static uint64_t frequency;
    uint64_t startTick;
    const wchar_t* name;   // points at a string literal, so it must be const
    int count;

    ProfileSample(const wchar_t* name_, int count_)
    {
        name = name_;
        count = count_;

        LARGE_INTEGER win32_startTick;
        QueryPerformanceCounter(&win32_startTick);
        startTick = win32_startTick.QuadPart;
    }

    ~ProfileSample()
    {
        LARGE_INTEGER win32_endTick;
        QueryPerformanceCounter(&win32_endTick);
        uint64_t endTick = win32_endTick.QuadPart;

        uint64_t deltaTicks = endTick - startTick;
        double nanoseconds = (double) deltaTicks / (double) frequency * 1000000000.0 / count;

        wchar_t buffer[128];
        swprintf(buffer, _countof(buffer), L"%s - %.4f ns\n", name, nanoseconds);
        OutputDebugStringW(buffer);

        if (!IsDebuggerPresent())
            MessageBoxW(nullptr, buffer, nullptr, 0);
    }
};

uint64_t ProfileSample::frequency = 0;

int CALLBACK
WinMain(HINSTANCE, HINSTANCE, PSTR, INT)
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    ProfileSample::frequency = frequency.QuadPart;

    //Warm stuff up
    for ( size_t i = 0; i < 100; i++ )
        ManagedUpdate();

    const int num = 100000000;
    {
        ProfileSample p(L"ManagedUpdate", num);

        for ( size_t i = 0; i < num; i++ )
            ManagedUpdate();
    }

    return 0;
}
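For readers who want to repeat the measurement without the Win32 timer, here is a minimal sketch of the same "time N calls, divide by N" approach in standard C++ with `std::chrono`. The names `EmptyCallee` and `MeasureAverageNanoseconds` are hypothetical and not part of the code above. Like the original, the figure it produces includes loop and indirect-call overhead, so treat it as an upper bound on the call cost.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the function being measured.
void EmptyCallee() {}

// Average cost in nanoseconds of one call through `fn`, measured over `count` calls.
// The pointer is volatile so the compiler cannot inline or drop the calls.
double MeasureAverageNanoseconds(void (*fn)(), std::int64_t count)
{
    assert(count > 0);
    void (*volatile target)() = fn;

    // Warm up, mirroring the 100 warm-up calls in the original test.
    for (int i = 0; i < 100; ++i)
        target();

    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < count; ++i)
        target();
    const auto end = std::chrono::steady_clock::now();

    // Convert the elapsed time to nanoseconds as a double, then average.
    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(count);
}
```

Calling `MeasureAverageNanoseconds(&EmptyCallee, 100000000)` would time a plain native call. To time the interop path you would pass a pointer that actually crosses into managed code.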

1) Why does x64 interop cost 17ns when x86 interop costs 8ns?

2) Is 8ns the fastest I can reasonably expect to go?

Edit 1

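Additional information: CPU i7-4770k @ 3.5 GHz
The test case is a single C++/CLI project in VS2017.
Default Release configuration
Full optimization /O2
I've randomly played with settings like Favor Size or Speed, Omit Frame Pointers, Enable C++ Exceptions, and Security Check, and none of them appear to change the x86/x64 discrepancy.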
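As a quick check on the "8ns is 28 clock cycles" figure, the conversion is just nanoseconds multiplied by the clock rate in GHz. The sketch below assumes the core runs at the nominal 3.5 GHz, which ignores turbo boost and power-state changes. `NanosecondsToCycles` is a hypothetical helper name.

```cpp
// Convert a duration in nanoseconds to CPU clock cycles.
// A clock of f GHz completes f cycles per nanosecond.
// Assumption: the core runs at a fixed nominal frequency (no turbo, no throttling).
constexpr double NanosecondsToCycles(double nanoseconds, double clockGHz)
{
    return nanoseconds * clockGHz;
}
```

Under that assumption, `NanosecondsToCycles(8.0, 3.5)` gives 28 cycles for the x86 interop call, `NanosecondsToCycles(17.0, 3.5)` gives 59.5 cycles for x64, and the 1.3ns plain call works out to roughly 4.5 cycles.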

Edit 2

I've stepped through the disassembly (which I'm not very familiar with at this point).

In x86 it seems to be something along the lines of

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
jmp     _IJWNOADThunkJumpTarget@0

In x64 I see

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
        //Some jumping around that quickly leads to IJWNOADThunk::MakeCall:
call    IJWNOADThunk::FindThunkTarget
        //MakeCall uses the result from FindThunkTarget to jump into UMThunkStub:

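FindThunkTarget is fairly heavy, and it looks like most of the time is being spent there. So my working theory is that in x86 the thunk target is known and execution can more or less jump straight to it. But in x64 the thunk target is not known, and a search process takes place to find it before execution can jump to it. I wonder why that is?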
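To make the working theory concrete, here is a toy analogy in standard C++. It is not the CLR's implementation. `DirectThunk`, `LookupThunk`, `ManagedTarget`, and the table-based search are all hypothetical stand-ins, used only to contrast "jump to a known address" with "search for the address on every call".

```cpp
#include <cstdint>
#include <unordered_map>

using Target = int (*)();

// Hypothetical stand-in for the managed function being called.
int ManagedTarget() { return 42; }

// x86-like path (as theorized): the target address is already known,
// so the thunk simply forwards the call.
struct DirectThunk
{
    Target target;
    int Call() const { return target(); }
};

// x64-like path (as theorized): the thunk searches for its target on
// every call before it can forward to it.
struct LookupThunk
{
    const std::unordered_map<std::uint32_t, Target>* table; // hypothetical registry
    std::uint32_t key;

    int Call() const
    {
        // Per-call search, loosely analogous to the FindThunkTarget step.
        const auto it = table->find(key);
        return it != table->end() ? it->second() : -1; // -1: no target registered
    }
};
```

Both thunks reach the same function, but the second pays for a lookup on each call. If the theory holds, that extra work is where the difference between 8ns and 17ns would come from.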

Answer 1

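I have no recollection of ever giving a perf guarantee for code like this. 7 nanoseconds is the kind of perf you can expect from C++ Interop code, that is, managed code calling native code. This is the other way around, native code calling managed code, aka "reverse pinvoke".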

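You are definitely getting the slow flavor of this kind of interop. As far as I can see, the "No AD" in IJWNOADThunk is the nasty little detail. This code did not get the micro-optimization love that is common in interop stubs. It is also highly specific to C++/CLI code. Nasty, because it cannot assume anything about the AppDomain in which the managed code needs to run. In fact, it cannot even assume that the CLR is loaded and initialized.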

Is 8ns the fastest I can reasonably expect to go?

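Yes. In fact, your measurement is on the very low end. Your hardware is a lot beefier than mine; I'm testing this on a mobile Haswell. I'm seeing roughly 26 to 43 nanoseconds for x86 and roughly 40 to 46 nanoseconds for x64. So you are getting times about 3x better, pretty impressive. Frankly, a bit too impressive, but you are seeing the same code that I see, so we must be measuring the same scenario.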

Why does x64 interop cost 17ns when x86 interop costs 8ns?

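This is not optimal code; the Microsoft programmer was very pessimistic about which corners he could cut. I have no real insight into whether that was warranted, and the comments in UMThunkStub.asm don't explain anything about the choices.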

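There is nothing particularly special about a reverse pinvoke. It happens all the time in, say, a GUI program that processes Windows messages. But that is done very differently: such code uses a delegate. That is the way to get ahead and make this faster. Using Marshal::GetFunctionPointerForDelegate() is the key. I tried this approach: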

using namespace System;
using namespace System::Runtime::InteropServices;


void* GetManagedUpdateFunctionPointer() {
    auto dlg = gcnew Action(&ManagedUpdate);
    auto tobereleased = GCHandle::Alloc(dlg);
    return Marshal::GetFunctionPointerForDelegate(dlg).ToPointer();
}

And used it like this in the WinMain() function:

typedef void(__stdcall * testfuncPtr)();
testfuncPtr fptr = (testfuncPtr)GetManagedUpdateFunctionPointer();
//Warm stuff up
for (size_t i = 0; i < 100; i++) fptr();

    //...
    for ( size_t i = 0; i < num; i++ ) fptr();

This made the x86 version a little faster, and the x64 version just as fast.

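If you are going to use this approach, keep in mind that in x64 code an instance method as the delegate target is faster than a static method, because the call stub has less work to do to rearrange the function arguments. And beware that I took a shortcut with the tobereleased variable: there is a possible memory management detail here, and a GCHandle::Free() call might be preferred or necessary in a plug-in scenario.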
