Understanding Monte Carlo tree search
So, I'm trying to create an implementation of AlphaZero using Keras. However, I'm not too sure about MCTS. My understanding of Monte Carlo Tree Search, and how I coded it, is as follows:
import numpy as np
import cupy as cp

class MCTS(object):
    def __init__(self, action_size, movesets, nsims, ndepth):
        self.nsims = nsims          # number of simulated walks
        self.ndepth = ndepth        # depth of each walk
        self.movesets = movesets    # moves sampled per depth step
        self.action_size = action_size

    def evaluate_and_act(self, agent, stateizer, critic, state):
        sims = []
        print("Beginning monte carlo tree search")
        true_state = state
        for i in range(self.nsims):
            random_walk = []
            for j in range(self.ndepth):
                random_actions = []
                print("Searching depth", j, "of simulation", i)
                for k in range(self.movesets):
                    # pick a uniformly random move and score the current state with the critic
                    rand_move = np.random.choice(self.action_size)
                    rand_move_matrix = cp.add(cp.zeros((1, self.action_size)), .0001)
                    rand_move_matrix[0][rand_move] = critic.predict(state, batch_size=64)[0][0]
                    random_actions.append(cp.asnumpy(rand_move_matrix))
                # roll the state forward with the state-prediction network
                random_action_concat = np.concatenate(random_actions, -1)
                state = stateizer.predict(cp.asnumpy(random_action_concat), batch_size=64)
                random_walk.append(random_actions)
            sims.append(random_walk)
            # reset to the real state before the next simulation
            state = true_state
        # keep the walk whose summed critic values are highest
        best_reward = -1000000.0
        for walk in sims:
            sum_reward = np.sum(walk)
            if sum_reward >= best_reward:
                best_walk = walk
                best_reward = sum_reward
        # return the first move set of the best walk
        return best_walk[0]
It seems that I don't need the policy network at all in this implementation, only the critic. Can someone help me understand whether my implementation is correct, and why it isn't correct with respect to AlphaZero? Thanks.
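For context, the class would be driven roughly like this (a hypothetical sketch; the model and state names and the constructor arguments are my own placeholders, not part of the original code):

# Hypothetical usage sketch -- policy_net, state_model, value_net and
# board_state are placeholder names, and the sizes are arbitrary.
mcts = MCTS(action_size=64, movesets=1, nsims=100, ndepth=10)
best_moves = mcts.evaluate_and_act(policy_net, state_model, value_net, board_state)

Note that the agent argument (the policy network) is never read inside evaluate_and_act, which is exactly what the question above is getting at.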
It seems to be missing the U(s, a) term from the AlphaGo paper,

U(s, a) ∝ P(s, a) / (1 + N(s, a)),

where P(s, a) is the probability of the move given by the policy.
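For reference, a minimal sketch of what that selection rule looks like in code, using the full PUCT form from the AlphaGo Zero paper with the sqrt of the parent visit count. The per-node dictionaries N, Q, P keyed by action and the constant c_puct are my assumptions, not the poster's code:

import math

def select_action(node, c_puct=1.0):
    # node.P[a]: prior probability from the policy network
    # node.N[a]: visit count, node.Q[a]: mean action value from search
    total_visits = sum(node.N.values())
    best_score, best_action = -float("inf"), None
    for a in node.P:
        # U(s, a) is large for high-prior, rarely visited actions and
        # decays as N(s, a) grows, so the search balances the policy's
        # suggestions against the values found by the critic.
        u = c_puct * node.P[a] * math.sqrt(total_visits) / (1 + node.N[a])
        score = node.Q[a] + u
        if score > best_score:
            best_score, best_action = score, a
    return best_action

This is why AlphaZero still needs the policy network during search: P(s, a) steers which branches get expanded, while the value head only scores the positions that are actually reached.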