Board encoding in Tesauro's TD-Gammon

Question

I am currently trying to get Tesauro's TD-Gammon working. However, I am a bit confused about how the board is encoded as input to the neural network.

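As far as I know, he used 4 units per point on the board for each player (2 * 96 units), two additional units per player for the checkers on the bar and the checkers borne off (2 * 2 units), and two units to indicate whose turn it is. That makes 198 inputs in total. I also fully understand how the different numbers of checkers on each point are encoded.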

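What I am not sure about, however, is the order of the inputs. Do the first 96 inputs encode the white checkers on the board, followed by two inputs for white's bar and borne-off checkers? And are the remaining inputs then dedicated to the black checkers, black's bar, black's borne-off checkers, and the two units indicating the current player?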

Or do 4 consecutive input units encode one point on the board for one color, with the next 4 input units encoding the same point for the other player?

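I would be very glad if someone could share some knowledge, because everything I have found online is quite ambiguous about the input order Tesauro used to encode a particular backgammon position.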

Cheers, Stephen

Answer 1

As stated in this article about the details of TD Gammon:

In preliminary experiments, the input representation only encoded the raw board information (the number of White or Black checkers at each location), and did not utilize any additional pre-computed features relevant to good play, such as, e.g., the strength of a blockade or probability of being hit. These experiments were completely knowledge-free in that there was no initial knowledge built in about how to play good backgammon. In subsequent experiments, a set of handcrafted features (the same set used by Neurogammon) was added to the representation, resulting in higher overall performance.

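It is clear that there were many different versions of TD Gammon, and none of them really defines one specific version as "the" TD Gammon. It seems that it performed decently on the raw board encoding and did even better once some handcrafted features were added.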

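So pinning down the exact features may be something of a fool's errand, and it is probably not what you want anyway. For a fixed feature encoding, you can compare your proposed reinforcement learning method against temporal-difference learning (also described in that article). That would be a fair comparison of your method against temporal-difference learning, rather than calling it a comparison against the actual TD Gammon, since you have little hope of replicating TD Gammon exactly anyway.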

As you add more features, you will probably find that both methods improve, with your method hopefully coming out on top in these comparisons.

Answer 2

I know this is a very old question, but I would still like to share what I found.

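The order of the inputs does not matter at all, as long as it stays the same during training and evaluation, because each hidden unit simply sums all inputs multiplied by their weights. I am researching this topic and collected several encodings from someone's PhD thesis.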
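To make the point about input order concrete, here is a minimal sketch (not taken from the thesis or from TD-Gammon itself). The names `InputOrderDemo`, `weightedSum` and `reorder` are made up for illustration, and the inputs and weights are random placeholders rather than a real position or a trained network. It only shows that reordering the 198 inputs together with their weights leaves a unit's weighted sum unchanged.

```scala
object InputOrderDemo {
  // Weighted sum computed by a single unit of a fully connected layer (bias omitted).
  def weightedSum(inputs: Array[Double], weights: Array[Double]): Double =
    inputs.zip(weights).map { case (x, w) => x * w }.sum

  // Reorder an array: position i of the result holds values(perm(i)).
  def reorder(values: Array[Double], perm: Array[Int]): Array[Double] =
    perm.map(i => values(i))

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    // Input count from the question: 2 * 96 board units, 2 * 2 bar/off units, 2 turn units.
    val n = 2 * 96 + 2 * 2 + 2
    // Placeholder values only (assumption): not a real board, not trained weights.
    val inputs = Array.fill(n)(rng.nextDouble())
    val weights = Array.fill(n)(rng.nextDouble() - 0.5)
    // An arbitrary alternative ordering of the same 198 inputs.
    val perm = rng.shuffle((0 until n).toVector).toArray

    val original = weightedSum(inputs, weights)
    val permuted = weightedSum(reorder(inputs, perm), reorder(weights, perm))

    // The two sums agree up to floating-point rounding.
    println(f"original = $original%.12f, permuted = $permuted%.12f")
    assert(math.abs(original - permuted) < 1e-9)
  }
}
```

Since the weights are learned, the network simply ends up with whatever weight sits at each input position, so both layouts described in the question are equivalent in principle. The per-point encodings collected from the thesis follow.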

Sutton

case 0 => if (men >= 1) 1.0f else 0.0f
case 1 => if (men >= 2) 1.0f else 0.0f
case 2 => if (men >= 3) 1.0f else 0.0f
case 3 => if (men >= 4) (men - 3) / 2.0f else 0.0f

Tesauro89

case 0 => if (men == 1) 1.0f else 0.0f
case 1 => if (men == 2) 1.0f else 0.0f
case 2 => if (men == 3) 1.0f else 0.0f
case 3 => if (men >= 4) (men - 3) / 2.0f else 0.0f

Tesauro92

case 0 => if (men == 1) 1.0f else 0.0f
case 1 => if (men >= 2) 1.0f else 0.0f
case 2 => if (men == 3) 1.0f else 0.0f
case 3 => if (men >= 4) (men - 3) / 2.0f else 0.0f

GNU Backgammon

case 0 => if (men == 1) 1.0f else 0.0f
case 1 => if (men == 2) 1.0f else 0.0f
case 2 => if (men >= 3) 1.0f else 0.0f
case 3 => if (men >= 4) (men - 3) / 2.0f else 0.0f
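The four case lists above are fragments, so here is a hedged sketch of how they might be wrapped into runnable Scala. The object name `PointEncodings`, the `Encoding` type and the helpers `flag`, `overflow` and `encodePoint` are assumptions introduced for illustration; only the per-unit rules are copied from the listings above.

```scala
object PointEncodings {
  // Maps (checkers on a point for one player, unit index 0..3) to an input value.
  type Encoding = (Int, Int) => Float

  private def flag(condition: Boolean): Float = if (condition) 1.0f else 0.0f

  // Fourth unit, shared by all four variants: scaled count of checkers beyond three.
  private def overflow(men: Int): Float = if (men >= 4) (men - 3) / 2.0f else 0.0f

  private def build(u0: Int => Boolean, u1: Int => Boolean, u2: Int => Boolean): Encoding =
    (men, unit) => unit match {
      case 0 => flag(u0(men))
      case 1 => flag(u1(men))
      case 2 => flag(u2(men))
      case 3 => overflow(men)
      case _ => throw new IllegalArgumentException(s"unit index must be 0..3, got $unit")
    }

  val sutton: Encoding    = build(_ >= 1, _ >= 2, _ >= 3)
  val tesauro89: Encoding = build(_ == 1, _ == 2, _ == 3)
  val tesauro92: Encoding = build(_ == 1, _ >= 2, _ == 3)
  val gnuBg: Encoding     = build(_ == 1, _ == 2, _ >= 3)

  // The four input values for one point and one player.
  def encodePoint(encoding: Encoding, men: Int): Array[Float] =
    Array.tabulate(4)(unit => encoding(men, unit))

  def main(args: Array[String]): Unit = {
    val all = Seq("Sutton" -> sutton, "Tesauro89" -> tesauro89,
                  "Tesauro92" -> tesauro92, "GNU BG" -> gnuBg)
    // Print the four units for 0 to 5 checkers under each encoding.
    for ((name, enc) <- all; men <- 0 to 5)
      println(f"$name%-10s men=$men  ${encodePoint(enc, men).mkString("[", ", ", "]")}")
  }
}
```

For three checkers on a point, for instance, the first three units come out as 1, 1, 1 under Sutton, 0, 0, 1 under Tesauro89 and GNU Backgammon, and 0, 1, 1 under Tesauro92, with the fourth unit at 0 in every case. The variants differ only in which of the first three units stay switched on as the count grows.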

I obtained position databases containing precomputed equities for the contact, crashed, and race phases, and ran batch learning against them.

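Sutton's encoding needed more epochs; the others converged faster. Tesauro89 was fastest for the contact phase, and Tesauro92 was fastest for the race phase. The GNU Backgammon encoding fell in between the two Tesauro encodings.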

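I also tried different activation functions for the hidden layer (40 hidden units): sigmoid, tanh, relu, leaky relu, extended elu, softrelu/softplus, symmetric elliott, and log. Zero-centered functions such as tanh and symmetric Elliott converged faster than the others. Again, "faster" refers to the number of epochs; I did not measure wall-clock time. The very cheap relu might be a good choice. Training stopped when the network could no longer improve. All networks ended up at almost the same MSE. So my conclusion is that the implementation details do not matter that much.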
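For reference, here is a rough sketch of some of the activation functions named above, using their usual textbook definitions. The object name `Activations` and the leaky relu slope of 0.01 are assumptions, since the answer does not give its parameters. The "extended elu" and "log" variants are left out because the answer does not define them.

```scala
object Activations {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  def tanh(x: Double): Double = math.tanh(x)

  def relu(x: Double): Double = math.max(0.0, x)

  // The slope for negative inputs is an assumed default, not a value from the answer.
  def leakyRelu(x: Double, slope: Double = 0.01): Double = if (x >= 0) x else slope * x

  // Naive form; fine for moderate x, overflows for very large positive x.
  def softplus(x: Double): Double = math.log(1.0 + math.exp(x))

  // Symmetric Elliott: a cheap, tanh-shaped squashing function with range (-1, 1).
  def symmetricElliott(x: Double): Double = x / (1.0 + math.abs(x))

  def main(args: Array[String]): Unit = {
    val fs = Seq[(String, Double => Double)](
      "sigmoid" -> (sigmoid _), "tanh" -> (tanh _), "relu" -> (relu _),
      "leakyRelu" -> ((x: Double) => leakyRelu(x)), "softplus" -> (softplus _),
      "symmetricElliott" -> (symmetricElliott _))
    // Zero-centered functions map 0 to 0 and are odd: f(-x) == -f(x).
    for ((name, f) <- fs)
      println(f"$name%-17s f(-1)=${f(-1.0)}%.4f  f(0)=${f(0.0)}%.4f  f(1)=${f(1.0)}%.4f")
  }
}
```

In this sketch tanh and symmetric Elliott map 0 to 0 and are symmetric around the origin, whereas sigmoid maps 0 to 0.5 and softplus is always positive. That is what "zero-centered" refers to in the comparison above.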

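A few years ago (maybe 10), Tesauro repeated the TD-Gammon training without any of the handcrafted features used previously. Training was slower, but because computers have become faster, the player with the simple encoding matched the more sophisticated player after a reasonable amount of time.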

I also tried a few more exotic functions, but they did not work at all.

artificial-intelligence

machine-learning

reinforcement-learning