Policy Gradient : Illegal Actions

madesjardins
Posts: 2
Joined: Thu Jan 10, 2019 4:09 pm

Post by madesjardins » Thu Jan 10, 2019 5:12 pm

Just to make sure I get this right: I'm implementing policy gradient on a fairly complex game with a lot of possible discrete actions. Some actions are not available in certain states for various reasons.

If I simply give a big negative reward when an 'illegal' move is made during training (and potentially end the game, like in Gridworld), will sufficient training be enough to prevent my PolicyModel from predicting such illegal actions when playing an episode?

I guess "not 100%", so I'll need to check that the predicted action is legal anyway and, if it isn't, pick the best legal action by comparing V(s'), where s' is the state I'd end up in after playing legal action a.

Is there a better way to do this, or is this OK? Cheers
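The fallback described above could be sketched roughly as follows. This is only an illustration: `choose_legal_action`, `next_state` and `value` are hypothetical names, the "predicted action" is assumed to be the argmax of the policy output, and `next_state(a)` is assumed to return the state s' reached by playing legal action a.

```python
from typing import Callable, Hashable, Sequence


def choose_legal_action(
    probs: Sequence[float],
    legal_actions: Sequence[int],
    next_state: Callable[[int], Hashable],
    value: Callable[[Hashable], float],
) -> int:
    """Keep the policy's top action if it is legal, else fall back on V(s')."""
    if not legal_actions:
        raise ValueError("no legal action available in this state")

    # Assumption: the "predicted" action is the most probable one.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    if predicted in legal_actions:
        return predicted

    # Fallback: among legal actions only, pick the one whose successor
    # state s' has the highest estimated value V(s').
    return max(legal_actions, key=lambda a: value(next_state(a)))
```

With this shape, the policy output is trusted whenever it is legal, and the value model is only consulted when the policy picks something forbidden.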

lazyprogrammer
Site Admin
Posts: 9
Joined: Sat Jul 28, 2018 3:46 am

Re: Policy Gradient : Illegal Actions

Post by lazyprogrammer » Thu Jan 10, 2019 9:18 pm

Thanks for your question!

It depends on what you mean by illegal action, i.e. what is the result of that action.

In gridworld, the better solution would be to use TD learning and a reward of -1.

That way, the actual value of that state combined with the action of bumping into the wall should be -infinity by definition.

The problem with MC learning is that the values are not updated until the episode is over.
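To make the TD point concrete, here is a minimal tabular TD(0) sketch. It is not code from the course: the function name `td0_update`, the dict-based value table and the default `alpha` and `gamma` are all assumptions for illustration. The value is updated after every single step, so an agent that bumps into a wall (reward -1, state unchanged) sees that state's value drop immediately instead of waiting for the episode to end.

```python
from typing import Dict, Hashable


def td0_update(
    V: Dict[Hashable, float],
    s: Hashable,
    r: float,
    s_next: Hashable,
    done: bool,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply one TD(0) update to V[s] and return the new value."""
    v_s = V.get(s, 0.0)
    # Bootstrapped target: immediate reward plus discounted value of s'.
    target = r if done else r + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
    return V[s]
```

If the same bump is repeated with `gamma=1.0`, each update lowers the value by `alpha` again, so it keeps falling without bound; with `gamma < 1` it settles at -1 / (1 - gamma) instead.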

madesjardins

Re: Policy Gradient : Illegal Actions

Post by madesjardins » Fri Jan 11, 2019 5:17 pm

Thank you very much for the reply.

By illegal, I meant you don't have the right to play that action and it will result in forfeiting it, which is most of the time a really bad thing (+ in real life it will result in people insulting you and discussing how to resolve this mistake for hours).

To give a bit more detail, this is for an agent that plays the Game of Thrones board game.

In the "planning" phase, you put an "order" token on every "area" you control that has at least one unit on it. There is a gigantic number of states, but if you decompose the planning phase into 1 step per played order, there are about 600 different actions in total (11 different orders x 58 areas), and PolicyModel would predict one of those actions.

Based on your position on an "influence track", you have the right to place up to X "special orders", no more. These special orders have certain advantages that make you want to always play the maximum allowed. The orders are revealed only once all players have placed them (face down). Planning an action does not necessarily mean you will actually do it: it might be used as a threat, or your strategy might change based on what others have played.

Giving a huge negative reward made sense (in my head) for ending up in a state with too many special orders, for playing more "support" orders than you actually have, for playing on an area you don't control, etc.
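One way to picture the action space described above, plus an alternative to the negative-reward idea that is not proposed anywhere in this thread: index each (order, area) pair as a single action and mask illegal ones out before the softmax, so they get probability zero. Everything below is a hypothetical sketch. Which order indices count as "special", the `tokens_left` counts and the set of playable areas are placeholders, not the real game rules.

```python
import math
from typing import List, Sequence, Set, Tuple

N_ORDERS = 11
N_AREAS = 58
N_ACTIONS = N_ORDERS * N_AREAS  # 638, i.e. "about 600"


def encode(order: int, area: int) -> int:
    """Map an (order, area) pair to a flat action index."""
    return order * N_AREAS + area


def decode(action: int) -> Tuple[int, int]:
    """Inverse of encode: flat action index -> (order, area)."""
    return divmod(action, N_AREAS)


def legal_mask(
    playable_areas: Set[int],     # controlled, has a unit, no order placed yet
    tokens_left: Sequence[int],   # remaining tokens per order type
    special_orders: Set[int],     # hypothetical indices of special orders
    specials_left: int,           # X minus special orders already placed
) -> List[bool]:
    """Return a boolean list with True for every currently legal action."""
    mask = [False] * N_ACTIONS
    for order in range(N_ORDERS):
        if tokens_left[order] <= 0:
            continue  # e.g. no "support" token left to place
        if order in special_orders and specials_left <= 0:
            continue  # special-order limit already reached
        for area in playable_areas:
            mask[encode(order, area)] = True
    return mask


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    legal_logits = [l for l, ok in zip(logits, mask) if ok]
    if not legal_logits:
        raise ValueError("no legal action available")
    m = max(legal_logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

With a mask like this, the policy can never sample an illegal order, so the big negative reward is no longer needed to teach legality; whether that fits a given training setup is a judgment call.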

I thought of using MC because the game is episodic and only lasts for 10 turns (which can take hours in real life, with 90% head scratching and 10% actually doing something). I'm collecting data from played games and want to train the agent with it, so there is no need to update while playing. It can wait until after the game, even when I run agent vs agent games for nights and nights in a row. I'll just save the NNs and train again later.
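For the offline Monte Carlo setup described above, the per-step returns of a recorded game can be computed in one backward pass once the episode is finished. This is a minimal sketch: the function name `discounted_returns` and the default `gamma` are assumptions, and the rewards are assumed to be stored as one list per game.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    G = 0.0
    returns: List[float] = []
    # Walk the episode backwards so each return reuses the next one.
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns
```

In a policy gradient update, each G_t would then weight the log-probability of the action taken at step t. Note that a large penalty on the final (illegal) step propagates back to every earlier step of that game.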

Is there something more appropriate I should use instead ? Cheers
