send() via blocking socket timed out but the message arrived at the destination thereafter

Question

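I'm dealing with a distributed "deadlock" situation in my peer-to-peer communication system (written and run in Python 3.5). In this system, each node maintains two connections, called inconn and outconn, with each of its peers. I use select.poll() for multiplexing. The following deadlock sometimes occurs: if two connected peers both try to send to each other through outconn, each peer's select.poll() loop blocks in send(), so neither side can call recv() on its inconn connection.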

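My way of dealing with this deadlock is to call settimeout() on the outconn sockets, and it seems to work. Interestingly, though, after the socket times out, the message still seems to reach its destination. Here are sample logs from the two nodes: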

Node A (192.168.56.109)

INFO: [2016-11-02 11:08:05,172] [COOP] Sending ASK_COOP [2016-11-02 11:08:05.172643] to 192.168.56.110 for segment 2.

WARNING: [2016-11-02 11:08:06,173] [COOP] Cannot send to 192.168.56.110. Error: timed out

INFO: [2016-11-02 11:08:06,174] [COOP] Message from 192.168.56.110 is available on 10.

INFO: [2016-11-02 11:08:06,174] [COOP] Get HEARTBEAT [2016-11-02 11:08:04.503723] from 192.168.56.110 for segment 2.

Node B (192.168.56.110)

INFO: [2016-11-02 11:08:04,503] [COOP] Sending HEARTBEAT [2016-11-02 11:08:04.503723] to 192.168.56.109 for segment 2.

WARNING: [2016-11-02 11:08:05,505] [COOP] Cannot send to 192.168.56.109. Error: timed out

INFO: [2016-11-02 11:08:05,505] [COOP] Message from 192.168.56.109 is available on 11.

INFO: [2016-11-02 11:08:05,505] [COOP] Get ASK_COOP [2016-11-02 11:08:05.172643] from 192.168.56.109 for segment 2.

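Why does this happen? Also, is my way of handling this deadlock good practice? If not, what is the best practice for avoiding this kind of distributed deadlock?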

Answer 1

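In my experience, the best practice for avoiding this problem is to always use non-blocking I/O. If your application never blocks inside send() or recv(), there can be no deadlock (at least, not a deadlock of the kind you describe).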

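Of course, non-blocking I/O brings complications of its own: in particular, your code must handle partial sends and partial receives correctly. In practice, that means your application's event loop might look something like this (pseudocode):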

while true:
   block in select() until at least one socket is ready-for-read (or ready-for-write, if you have data you want to send on that socket)

   for each ready-for-read socket:      
      read as many bytes as you can (without blocking) into a FIFO receive buffer that you have associated with that socket
      parse as many complete messages as you can out of the beginning of the FIFO buffer 
      (pop the parsed bytes out of the FIFO when you're done with them)

   for each ready-for-write socket:
      send as many bytes as you can (without blocking) from a FIFO send buffer that you have associated with that socket
      (pop the sent bytes out of the FIFO when you're done with them)
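
The "parse as many complete messages as you can" step depends on how messages are framed, which the pseudocode leaves open. As a minimal sketch, assuming a hypothetical 4-byte big-endian length prefix in front of every message (the names `frame` and `pop_complete_messages` are illustrative and not taken from the original system), the receive FIFO could be drained roughly like this:

```python
import struct

# Assumption for illustration only: every message is preceded by a
# 4-byte big-endian length. The original post does not say how its
# messages are framed.
HEADER = struct.Struct("!I")


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


def pop_complete_messages(recv_buf: bytearray) -> list:
    """Remove and return every complete message at the front of recv_buf.

    A trailing partial message is left in the buffer until more bytes arrive.
    """
    messages = []
    while len(recv_buf) >= HEADER.size:
        (length,) = HEADER.unpack_from(recv_buf, 0)
        end = HEADER.size + length
        if len(recv_buf) < end:
            break  # the rest of this message has not arrived yet
        messages.append(bytes(recv_buf[HEADER.size:end]))
        del recv_buf[:end]  # pop the parsed bytes out of the FIFO
    return messages
```

Calling `pop_complete_messages` after every successful recv() is enough; a partial receive simply leaves the incomplete tail in the buffer for the next pass.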

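In this design, whenever your application generates new data to send on a socket, it should not call send() directly; instead, it should append that data to the end of the FIFO send buffer associated with that socket. The event loop above will then send the data as soon as possible (after any data already in the send FIFO, of course), without keeping the event loop from carrying out any of its other duties.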
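
To make the idea concrete, here is a rough Python sketch built on select.poll(), which the question already uses. The `Connection` class, its `queue` method, and `run_once` are hypothetical names made up for this illustration, and error handling is reduced to "close on any socket error"; treat it as a starting point under those assumptions rather than a finished implementation.

```python
import select
import socket


class Connection:
    """Hypothetical wrapper: a non-blocking socket plus its two FIFO buffers."""

    def __init__(self, sock, poller):
        sock.setblocking(False)
        self.sock = sock
        self.poller = poller
        self.send_buf = bytearray()
        self.recv_buf = bytearray()
        self.closed = False
        poller.register(sock, select.POLLIN)

    def queue(self, data):
        """Append data to the send FIFO instead of calling send() directly."""
        if self.closed:
            return
        self.send_buf += data
        # Ask for writability events only while there is something to send.
        self.poller.modify(self.sock, select.POLLIN | select.POLLOUT)

    def close(self):
        if not self.closed:
            self.closed = True
            self.poller.unregister(self.sock)
            self.sock.close()

    def on_readable(self):
        try:
            chunk = self.sock.recv(65536)
        except BlockingIOError:
            return  # spurious wakeup; try again on the next poll
        except OSError:
            chunk = b""  # treat a socket error like a closed connection
        if not chunk:
            self.close()
            return
        self.recv_buf += chunk

    def on_writable(self):
        if self.closed or not self.send_buf:
            return
        try:
            sent = self.sock.send(self.send_buf)  # may be a partial send
        except BlockingIOError:
            return
        except OSError:
            self.close()
            return
        del self.send_buf[:sent]  # pop the sent bytes out of the FIFO
        if not self.send_buf:
            self.poller.modify(self.sock, select.POLLIN)


def run_once(poller, conns, timeout_ms=1000):
    """One pass of the event loop; conns maps file descriptors to Connections."""
    for fd, events in poller.poll(timeout_ms):
        conn = conns[fd]
        if events & (select.POLLIN | select.POLLHUP | select.POLLERR):
            conn.on_readable()
        if events & select.POLLOUT:
            conn.on_writable()
```

With this shape, two peers that both have a large amount of data queued for each other keep draining their receive side while they send, so neither one stalls waiting for the other.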

In the worst case (a very slow TCP connection over which you want to send a lot of data), the FIFO may grow large (using extra memory), but it will never "deadlock".


python

sockets

p2p

deadlock

network-programming