- Bias: The environment version `sudoku-v0`, on which all of these policies were trained and tested, does not let them corrupt the board: a policy cannot modify a correct cell it has previously filled. This is one of the main reasons, if not the only one, that the trained policies are able to complete every Sudoku board. A less biased, closer-to-the-truth version of the environment would let a policy corrupt the board when it tries to overwrite a previously filled correct cell with a new, wrong value. This should make learning much harder, and generalization in that version of the environment would be much more impressive, since error recovery can be a sign of genuine learning. A sketch of such a step rule is given after this list.
- KL Divergence Explosion during the Auxiliary Phase: I initially computed the KL using probabilities drawn from the post-mask logits. Since invalid actions are masked by setting their logits to -inf or -1e9 (for the position mask) before the softmax, their probabilities become zero, which occasionally makes the KL infinite: KL contains the terms log(p) and log(q), and when the masked logits are used to build a distribution, log(p), log(q), or both can be log(0), and log(0) → -∞. An easy fix is to compute the KL between the old and new policy from the raw, pre-mask logits instead (see the second sketch after this list).
- Experiments: The model `sumo_v0_4k` was trained on a single board and on only ~4 million transitions, yet it is remarkably, and for reasons I can't yet explain, able to generalize to boards it has never seen during training, while always beating the baseline in the current version of the environment, namely a random policy's total steps over n episodes (see the third sketch after this list). This suggests that resetting the environment with different boards might not be necessary (at least for the biased version of the environment) and that a policy can learn a general strategy for solving a variety of boards even when all the training data come from a single source. The model `sumo_v0_10k`, on the other hand, was trained on 10 million transitions, again all from a single board, and displayed the same capabilities as the model trained on 4 million transitions. This might imply that more training time is not necessary, and that 4 million transitions or even fewer can be enough for a policy to show signs of learning on the current biased version of the environment.
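A minimal sketch of the stricter, "unbiased" step rule described in the Bias note, assuming a NumPy board, a known solution grid, and an action already decoded into (row, col, value); the function name, reward values, and decoding are illustrative assumptions, not the actual `sudoku-v0` code.

```python
import numpy as np

def step_unbiased(board, solution, row, col, value):
    """Apply a move in a hypothetical unbiased environment.

    Unlike sudoku-v0, the write always happens, so a wrong value can
    overwrite a previously correct cell and the policy must learn to
    recover from its own mistakes.
    """
    board = board.copy()
    was_correct = board[row, col] == solution[row, col]
    board[row, col] = value
    now_correct = board[row, col] == solution[row, col]

    if was_correct and not now_correct:
        reward = -1.0   # the agent corrupted a cell it had already solved
    elif now_correct and not was_correct:
        reward = 1.0    # newly solved cell
    else:
        reward = -0.1   # neutral or wasted move
    done = bool(np.array_equal(board, solution))
    return board, reward, done
```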
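For the KL explosion, here is a minimal PyTorch sketch of the fix, assuming both policies expose their raw, pre-mask logits of shape (batch, num_actions); the function and tensor names are illustrative, not the repository's actual variables.

```python
import torch
import torch.nn.functional as F

def kl_from_raw_logits(old_logits: torch.Tensor, new_logits: torch.Tensor) -> torch.Tensor:
    """KL(old || new) computed from pre-mask logits.

    Because no entry here is -inf / -1e9, every log-probability stays
    finite and the KL penalty cannot blow up to infinity.
    """
    old_log_p = F.log_softmax(old_logits, dim=-1)
    new_log_p = F.log_softmax(new_logits, dim=-1)
    # sum_a p_old(a) * (log p_old(a) - log p_new(a)), averaged over the batch
    return (old_log_p.exp() * (old_log_p - new_log_p)).sum(dim=-1).mean()
```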
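Finally, a hedged sketch of the baseline mentioned in the Experiments note: the total number of steps a policy needs to finish n episodes, using the standard Gymnasium API; the helper name and episode count are assumptions for illustration.

```python
import gymnasium as gym

def total_steps(env: gym.Env, select_action, n_episodes: int = 10) -> int:
    """Total number of environment steps `select_action` needs over n episodes."""
    steps = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, _, terminated, truncated, _ = env.step(select_action(obs))
            done = terminated or truncated
            steps += 1
    return steps

# A trained policy "beats the baseline" if its total step count is lower than
# that of a uniformly random policy on the same environment, e.g.:
# random_baseline = total_steps(env, lambda _obs: env.action_space.sample())
```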