
  • Bias: The environment version sudoku-v0, on which all of these policies were trained and tested, does not let them corrupt the board: a policy cannot overwrite a correct cell it has previously filled. This is one of, if not the only, reasons the trained policies are able to complete every Sudoku board. A less biased, closer-to-reality version of the environment would let a policy corrupt the board by overwriting a previously filled correct cell with a bad value. Learning would be much harder there, and generalization in that setting would be far more impressive, since error recovery is a stronger sign of genuine learning. A minimal sketch of the two variants is given after this list.

  • KL Divergence Explosion during the Auxiliary Phase: I initially computed the KL divergence from the probabilities of the post-masked logits. Since invalid actions are masked by setting their logits to -inf (or -1e9 for the position mask) before the softmax, their probabilities become exactly zero, which occasionally makes the KL infinite: the KL contains log(p) and log(q) terms, and with a masked distribution either one can hit log(0) = -∞. An easy fix is to compute the KL between the old and new policy from the raw, pre-masked logits instead (see the sketch after this list).

  • Experiments: The model sumo_v0_4k was trained on a single board and only ~4 million transitions, yet, for reasons I can't fully explain, it generalizes to boards it never saw during training while always beating the baseline on the current version of the environment (a random policy's total steps over n episodes; a hedged sketch of this comparison follows the list). This suggests that resetting the environment with different boards might not be necessary, at least for the biased version of the environment, and that a policy can learn a general strategy for a variety of boards even when all of its training data come from a single source. The model sumo_v0_10k, on the other hand, was trained on 10 million transitions, also all from a single board, and showed the same capabilities as the 4-million-transition model. This might imply that more training time is not necessary, and that ~4 million transitions, or even fewer, are enough for a policy to show signs of learning in the current biased version of the environment.
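
To make the bias concrete, here is a minimal sketch of the two step behaviours described in the first bullet. This is not the repo's actual sudoku-v0 code: the board/solution representation, the (cell, value) action layout, and the rewards are all assumptions.

```python
# Hedged sketch, not the actual sudoku-v0 implementation. Assumptions:
# board and solution are 9x9 NumPy arrays, 0 marks an empty cell, and an
# action is a (cell index, value) pair.
import numpy as np

def step_biased(board, solution, cell, value):
    """Biased variant: a write to an already-correct cell is ignored, so the
    board can never be corrupted once a cell holds the right value."""
    row, col = divmod(cell, 9)
    if board[row, col] == solution[row, col]:
        return board, 0.0                       # write rejected, board unchanged
    board[row, col] = value
    return board, (1.0 if value == solution[row, col] else -1.0)

def step_unbiased(board, solution, cell, value):
    """Less biased variant: every write is applied, so a bad value can
    overwrite and corrupt a previously correct cell."""
    row, col = divmod(cell, 9)
    board[row, col] = value                     # write always applied
    return board, (1.0 if value == solution[row, col] else -1.0)

if __name__ == "__main__":
    solution = np.arange(81).reshape(9, 9) % 9 + 1      # stand-in, not a real solution
    board = np.zeros((9, 9), dtype=int)
    board, r = step_biased(board, solution, cell=0, value=int(solution[0, 0]))  # correct fill
    board, r = step_biased(board, solution, cell=0, value=9)                    # rejected write
```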
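
The KL issue from the second bullet can be reproduced in a few lines. This is a hedged illustration rather than the repo's auxiliary-phase code; the tensor shapes and masking convention are assumptions.

```python
# Hedged sketch: only illustrates why post-mask probabilities blow up the KL
# and why computing it from the raw pre-mask logits stays finite.
import torch
import torch.nn.functional as F

old_logits = torch.randn(9)   # old policy, pre-mask
new_logits = torch.randn(9)   # new policy, pre-mask

# Invalid actions pushed to -inf before the softmax, as described above.
invalid = torch.tensor([True, True, False, False, False, False, False, False, False])
old_masked = old_logits.masked_fill(invalid, float("-inf"))
new_masked = new_logits.masked_fill(invalid, float("-inf"))

# Naive KL from post-mask probabilities: masked entries have p = q = 0, so
# log(p) and log(q) are -inf and the sum turns into nan (or inf if the masks differ).
p = torch.softmax(old_masked, dim=-1)
q = torch.softmax(new_masked, dim=-1)
kl_bad = (p * (p.log() - q.log())).sum()

# Fix: compute KL from the raw pre-mask logits with log_softmax, which never
# produces exact zeros as long as the logits themselves are finite.
log_p = F.log_softmax(old_logits, dim=-1)
log_q = F.log_softmax(new_logits, dim=-1)
kl_ok = (log_p.exp() * (log_p - log_q)).sum()

print(kl_bad.item(), kl_ok.item())   # nan/inf vs. a finite value
```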
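
Finally, a hedged sketch of the baseline comparison mentioned in the experiments bullet: the baseline is a random policy's total steps over n episodes, and a trained policy beats it by finishing in fewer total steps. The Gymnasium-style interfaces and helper names here are assumptions, not the repo's evaluation code.

```python
# Hedged sketch of the "total steps over n episodes" comparison. Assumes a
# standard Gymnasium env API; `choose_action` is any callable obs -> action.
def total_steps(env, choose_action, n_episodes=10, max_episode_steps=10_000):
    """Run n episodes and return the total number of steps taken."""
    steps = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        for _ in range(max_episode_steps):
            action = choose_action(obs)
            obs, _, terminated, truncated, _ = env.step(action)
            steps += 1
            if terminated or truncated:
                break
    return steps

# Usage (assumed interfaces):
#   random_steps = total_steps(env, lambda obs: env.action_space.sample())
#   policy_steps = total_steps(env, lambda obs: policy.act(obs))
#   print("beats baseline:", policy_steps < random_steps)
```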

About

Phasic Policy Gradient (PPG) implementation on a biased Gymnasium-based Sudoku environment
