Hi,
I'm a little confused about why you take the Q-value of the best action and use it as the state value. According to the relationship between v and q, the state value should be the average of the Q-values over the actions, weighted by the policy's action probabilities.
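
To make the question concrete, here is a small sketch of the two quantities I mean. The Q-values and policy probabilities are made-up numbers for illustration only and are not taken from your code:

```python
# Hypothetical Q-values for one state with three actions (illustrative numbers only).
q_values = [1.0, 2.0, 4.0]
# Hypothetical stochastic policy pi(a|s) over the same three actions.
policy = [0.25, 0.25, 0.5]

# State value under the policy: V(s) = sum_a pi(a|s) * Q(s, a)
v_expected = sum(p * q for p, q in zip(policy, q_values))

# State value if the policy is greedy: V(s) = max_a Q(s, a)
v_greedy = max(q_values)

print(v_expected, v_greedy)
```

As far as I understand, the two values only coincide when the policy puts all of its probability on the best action.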
Best regards