FabricTransientException during load test "Could not ping any of the provided Service Fabric gateway endpoints."

Question

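We have a class that broadcasts messages to a Service Fabric stateless service. This stateless service has a single partition but many replicas. The message should be sent to all replicas in the system, so we query the FabricClient for the single partition and for all replicas of that partition. We use standard HTTP communication (the stateless service has a communication listener with a self-hosted OWIN listener, using WebListener/HttpSys) and a shared HttpClient instance. During a load test we get many errors while sending messages. Note that other services in the same application are communicating as well (WebListener/HttpSys, ServiceProxy and ActorProxy).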

The code in which we see the exception is (the stack trace is below the code sample):

private async Task SendMessageToReplicas(string actionName, string message)
{
  var fabricClient = new FabricClient();
  var eventNotificationHandlerServiceUri = new Uri(ServiceFabricSettings.EventNotificationHandlerServiceName);

  var promises = new List<Task>();
  // There is only one partition of this service, but there are many replicas
  Partition partition = (await fabricClient.QueryManager.GetPartitionListAsync(eventNotificationHandlerServiceUri).ConfigureAwait(false)).First();

  string continuationToken = null;
  do
  {
    var replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id, continuationToken).ConfigureAwait(false);
    foreach(Replica replica in replicas)
    {
      promises.Add(SendMessageToReplica(replica, actionName, message));
    }

    continuationToken = replicas.ContinuationToken;
  } while(continuationToken != null);

  await Task.WhenAll(promises).ConfigureAwait(false);
}


private async Task SendMessageToReplica(Replica replica, string actionName, string message)
{
  if(replica.TryGetEndpoint(out Uri replicaUrl))
  {
    Uri requestUri = UriUtility.Combine(replicaUrl, actionName);
    using(var response = await _httpClient.PostAsync(requestUri, message == null ? null : new JsonContent(message)).ConfigureAwait(false))
    {
      string responseContent = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
      if(!response.IsSuccessStatusCode)
      {
        throw new Exception();
      }
    }
  }
  else
  {
    throw new Exception();
  }
}

The following exception is thrown:

System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C49
at System.Fabric.Interop.NativeClient.IFabricQueryClient9.EndGetPartitionList2(IFabricAsyncOperationContext context)
at System.Fabric.FabricClient.QueryClient.GetPartitionListAsyncEndWrapper(IFabricAsyncOperationContext context)
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Company.ServiceFabric.ServiceFabricEventNotifier.<SendMessageToReplicas>d__7.MoveNext() in c:\work\ServiceFabricEventNotifier.cs:line 138

During the same period, we also see this exception being thrown:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.) ---> System.ComponentModel.Win32Exception (0x80004005): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)

The event logs on the machines in the cluster show these warnings:

Event ID: 4231
Source: Tcpip
Level: Warning
A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use.

Event ID: 4227
Source: Tcpip
Level: Warning
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

Finally, the Microsoft-Service Fabric admin log shows hundreds of warnings similar to:

Event 4121
Source Microsoft-Service-Fabric
Level: Warning
client-02VM4.company.nl:19000/192.168.10.36:19000: error = 2147942452, failureCount=160522. Filter by (type~Transport.St && ~"(?i)02VM4.company.nl:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting.

Event 4097
Source Microsoft-Service-Fabric
Level: Warning
client-02VM4.company.nl:19000 : connect failed, having tried all addresses

After a while, the warnings turn into errors:

Event 4096
Source Microsoft-Service-Fabric
Level: Error
client-02VM4.company.nl:19000 failed to bind to local port for connecting: 0x80072747

Can anyone tell us why this is happening and what we can do to fix it? Are we doing something wrong?

Answer 1

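It looks like you are running into port exhaustion. If so, you either have to figure out how to reuse your connections, or you have to implement some kind of throttling mechanism so that you do not use up all available ports.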

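I am not sure how the fabric client behaves. It may be what is causing the exhaustion, or it may be the SQL Server part, whose code we cannot see (but since you posted it, I assume it is probably unrelated to your ping test).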

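Looking at the reference source for HttpWebResponse (https://github.com/Microsoft/referencesource/blob/master/System/net/System/Net/HttpWebResponse.cs), it could also be that disposing the response (i.e. the using statement around your PostAsync call) closes the HttpClient's connection. That would mean you are not reusing connections but opening new ones all the time.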

I would think that testing a variant that does not dispose the response is fairly easy to do.
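
As an illustration of the throttling option mentioned above, here is a minimal sketch that caps the number of requests in flight with a SemaphoreSlim. The Throttle class, the ForEachAsync method and the maxConcurrency parameter are hypothetical names introduced for this example; they are not part of Service Fabric or of the code in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a Service Fabric API.
public static class Throttle
{
    // Runs one asynchronous action per item, but never more than
    // maxConcurrency of them at the same time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> action)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Wait for a free slot before starting the action.
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await action(item).ConfigureAwait(false);
                }
                finally
                {
                    // Free the slot even if the action failed.
                    gate.Release();
                }
            }).ToList(); // ToList starts all wrappers; the gate limits the real work.

            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}
```

In the question's code, this could replace the promises list, for example by collecting the replicas first and then calling Throttle.ForEachAsync(replicas, 10, r => SendMessageToReplica(r, actionName, message)). The limit of 10 is an arbitrary value that would need tuning.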

Answer 2

What is the reason for calling every existing service instance?

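Normally you should call just one service instance, provided by the SF runtime (it tries to pick one from the same node/process, or from another node if that node is too heavily loaded).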

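If you need to signal some state change/event to all service instances, perhaps this should be done inside the service implementation, so that each instance checks for the state change (possibly from a stateful service) or reads it from a publish-subscribe event queue whenever it needs the information (see for example https://github.com/loekd/ServiceFabric.PubSubActors).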

Another idea is to send many messages to a service instance at once, through a separate operation that supports batched data.

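If you must send individual messages at high frequency from a single source, then keeping the connection open, as suggested in the previous answer, is a good solution.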

Also, the caller should implement connection resiliency; see for example https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication#communicating-with-a-service
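
To make the idea of caller-side resiliency concrete, here is a hand-rolled retry sketch with exponential backoff. It is not the mechanism described in the linked documentation; the Retry class, the WithBackoffAsync method and its parameters are hypothetical names used for illustration only. Note that retrying does not help while ports remain exhausted, so it complements connection reuse rather than replacing it.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not a Service Fabric API.
public static class Retry
{
    // Calls operation up to maxAttempts times. After a failure that
    // isTransient accepts, waits and doubles the delay before retrying.
    // Any other failure, or the failure of the last attempt, propagates.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay)
    {
        var delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Swallow the transient failure and retry after the delay below.
            }

            await Task.Delay(delay).ConfigureAwait(false);
            delay = TimeSpan.FromTicks(delay.Ticks * 2);
        }
    }
}

// Possible usage with the question's query (assumes a shared fabricClient):
//   var partitions = await Retry.WithBackoffAsync(
//       () => fabricClient.QueryManager.GetPartitionListAsync(serviceUri),
//       ex => ex is FabricTransientException,
//       maxAttempts: 5,
//       initialDelay: TimeSpan.FromMilliseconds(200));
```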

Answer 3

We (I work with the OP) have been testing this, and it turned out to be the FabricClient, as Esben Bach suggested.

The documentation on FabricClient also states:

It is highly recommended that you share FabricClients as much as possible. This is because the FabricClient has multiple optimizations such as caching and batching that you would not be able to fully utilize otherwise.

It seems that FabricClient behaves like the HttpClient class, where you should also share the instance; otherwise you run into the same problem, namely port exhaustion.

However, the documentation on common exceptions when using FabricClient also mentions that when a FabricObjectClosedException occurs you should:

Dispose of the FabricClient object you are using and instantiate a new FabricClient object.

Sharing the FabricClient fixed the port exhaustion problem.
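
One way to combine both pieces of advice (share the client, but replace it once it is closed) is a small holder around a single instance. This is only a sketch under assumptions: SharedClient, Get and Reset are hypothetical names, and the holder is kept generic so that it does not depend on the Service Fabric SDK. The commented usage shows how it might wrap the FabricClient from the question.

```csharp
using System;

// Hypothetical helper: keeps one shared instance of a disposable client
// and lets callers replace it when it has become unusable.
public sealed class SharedClient<T> where T : class, IDisposable
{
    private readonly Func<T> _factory;
    private readonly object _lock = new object();
    private T _current;

    public SharedClient(Func<T> factory)
    {
        _factory = factory;
    }

    // Returns the shared instance, creating it on first use.
    public T Get()
    {
        lock (_lock)
        {
            return _current ?? (_current = _factory());
        }
    }

    // Disposes the stale instance so that the next Get() creates a new one.
    // Does nothing if another caller has already replaced it.
    public void Reset(T stale)
    {
        lock (_lock)
        {
            if (stale != null && ReferenceEquals(_current, stale))
            {
                _current = null;
                stale.Dispose();
            }
        }
    }
}

// Possible usage (assumes the System.Fabric types from the question):
//   private static readonly SharedClient<FabricClient> Clients =
//       new SharedClient<FabricClient>(() => new FabricClient());
//
//   var client = Clients.Get();
//   try { await client.QueryManager.GetPartitionListAsync(serviceUri); }
//   catch (FabricObjectClosedException) { Clients.Reset(client); throw; }
```

The reference check in Reset is there so that several callers hitting the same closed client at once do not dispose a replacement that one of them has already created.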

c#

azure-service-fabric