Hi, and thanks for the Tianshou framework!
I'm using Tianshou's DDPGPolicy in a custom continuous-control environment, and I've noticed that the actor network's gradients become exactly zero after the first update step, while the critic network keeps receiving non-zero gradients and appears to train normally.
I've logged the actor and critic gradients after their .backward() calls in learn(). On the first update the actor gradients are non-zero, but in every subsequent update they are exactly zero across all layers, so the actor stops learning entirely after a single gradient update.
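A minimal way to double-check whether the actor loss still reaches the actor parameters (a sketch run outside the trainer; policy and batch stand for the policy instance and a sampled batch from the setup below) is to query autograd directly:

import torch

# Recompute the actor loss exactly as in learn() and ask autograd which
# actor parameters it can reach. allow_unused=True makes autograd return
# None for any parameter that the loss's graph does not touch.
actor_loss = -policy.critic(batch.obs, policy(batch).act).mean()
grads = torch.autograd.grad(
    actor_loss,
    list(policy.actor.parameters()),
    allow_unused=True,
    retain_graph=True,  # keep the graph intact so learn() could still backward through it
)
for (name, _), g in zip(policy.actor.named_parameters(), grads):
    print(name, "unreached" if g is None else f"norm={g.norm().item():.3e}")

If a parameter comes back as None, it is not part of the loss's computation graph at all; an all-zero tensor instead means the graph reaches it but the gradient vanishes along the way, e.g. through a saturated or clipped activation.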
Here is my custom DDPG policy, with the gradient logging added:
from typing import Any, Generic, Literal, TypeVar

import gymnasium as gym
import numpy as np
import torch

from tianshou.data.types import ActBatchProtocol, RolloutBatchProtocol
from tianshou.exploration import BaseNoise
from tianshou.policy import BasePolicy
from tianshou.policy.base import TLearningRateScheduler
# DDPGTrainingStats / TDDPGTrainingStats are defined alongside tianshou's own
# DDPGPolicy (tianshou.policy.modelfree.ddpg); EVCActor / EVCCritic are my
# custom actor and critic modules.
from tianshou.policy.modelfree.ddpg import DDPGTrainingStats, TDDPGTrainingStats


class DDPGPolicy(BasePolicy[TDDPGTrainingStats], Generic[TDDPGTrainingStats]):
    """Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

    :param actor: The actor network following the rules (s -> actions).
    :param actor_optim: The optimizer for the actor network.
    :param critic: The critic network. (s, a -> Q(s, a))
    :param critic_optim: The optimizer for the critic network.
    :param action_space: Env's action space.
    :param tau: Param for the soft update of the target network.
    :param gamma: Discount factor, in [0, 1].
    :param exploration_noise: The exploration noise, added to the action. Defaults
        to ``GaussianNoise(sigma=0.1)``.
    :param estimation_step: The number of steps to look ahead.
    :param observation_space: Env's observation space.
    :param action_scaling: if True, scale the action from [-1, 1] to the range
        of action_space. Only used if the action_space is continuous.
    :param action_bound_method: method to bound the action to the range [-1, 1].
        Only used if the action_space is continuous.
    :param lr_scheduler: if not None, will be called in `policy.update()`.
    """

    def __init__(
        self,
        *,
        actor: EVCActor,
        actor_optim: torch.optim.Optimizer,
        critic: EVCCritic,
        critic_optim: torch.optim.Optimizer,
        action_space: gym.Space,
        tau: float = 0.005,
        gamma: float = 0.99,
        exploration_noise: BaseNoise | Literal["default"] | None = "default",
        estimation_step: int = 1,
        observation_space: gym.Space | None = None,
        action_scaling: bool = True,
        # tanh not supported, see assert below
        action_bound_method: Literal["clip"] | None = "clip",
        lr_scheduler: TLearningRateScheduler | None = None,
    ) -> None:
        ...

    @staticmethod
    def _mse_optimizer(
        batch: RolloutBatchProtocol,
        critic: torch.nn.Module,
        optimizer: torch.optim.Optimizer,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """A simple wrapper for updating the critic network."""
        weight = getattr(batch, "weight", 1.0)
        current_q = critic(batch.obs, batch.act).flatten()
        target_q = batch.returns.flatten()
        td = current_q - target_q
        # critic_loss = F.mse_loss(current_q1, target_q)
        critic_loss = (td.pow(2) * weight).mean()
        optimizer.zero_grad()
        critic_loss.backward()
        optimizer.step()
        return td, critic_loss

    @staticmethod
    def _log_gradients(module: torch.nn.Module, path: str, header: str) -> None:
        """Append each parameter's gradient norm to a log file.

        If a gradient is entirely zero, the raw array is written instead
        so that the all-zero case stands out.
        """
        with open(path, "a") as f:
            f.write(f"{header}\n")
            for name, param in module.named_parameters():
                if param.requires_grad and param.grad is not None:
                    if torch.all(param.grad == 0).item():
                        f.write(f"{name}: Gradients: {param.grad.cpu().numpy()}\n")
                    else:
                        f.write(f"{name}: Gradients: {param.grad.norm().item()}\n")
                else:
                    f.write(f"{name}: No gradient available\n")

    def learn(self, batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) -> TDDPGTrainingStats:  # type: ignore
        # critic
        td, critic_loss = self._mse_optimizer(batch, self.critic, self.critic_optim)
        batch.weight = td  # prio-buffer
        # Log critic gradients after the critic update
        self._log_gradients(self.critic, "critic_gradients.log", "Critic Network Gradients:")
        # actor
        actor_loss = -self.critic(batch.obs, self(batch).act).mean()
        self.actor_optim.zero_grad()
        actor_loss.backward()
        # Log actor gradients before the optimizer step
        self._log_gradients(self.actor, "actor_gradients.log", "Actor Network Gradients:")
        self.actor_optim.step()
        self.sync_weight()
        return DDPGTrainingStats(  # type: ignore[return-value]
            actor_loss=actor_loss.item(),
            critic_loss=critic_loss.item(),
        )


_TArrOrActBatch = TypeVar("_TArrOrActBatch", bound="np.ndarray | ActBatchProtocol")
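For context, the policy is constructed roughly like this (a sketch; the EVCActor/EVCCritic arguments, learning rates, and action space below are placeholders for my actual setup):

import gymnasium as gym
import torch

# Placeholder construction; the "..." arguments stand in for my actual
# EVCActor / EVCCritic network definitions.
actor = EVCActor(...)
critic = EVCCritic(...)
policy = DDPGPolicy(
    actor=actor,
    actor_optim=torch.optim.Adam(actor.parameters(), lr=1e-4),
    critic=critic,
    critic_optim=torch.optim.Adam(critic.parameters(), lr=1e-3),
    action_space=gym.spaces.Box(low=-1.0, high=1.0, shape=(1,)),
    tau=0.005,
    gamma=0.99,
)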
And the log files look like this:
Actor gradient log (first update, then second update):
Actor Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.0010648445459082723
preprocess.prices_model.0.bias: Gradients: 0.0016365565825253725
preprocess.prices_model.2.weight: Gradients: 0.007918216288089752
preprocess.prices_model.2.bias: Gradients: 0.0036073809023946524
preprocess.battery_left_model.0.weight: Gradients: 0.005425306037068367
preprocess.battery_left_model.0.bias: Gradients: 0.0010850612306967378
preprocess.battery_left_model.2.weight: Gradients: 0.04320951923727989
preprocess.battery_left_model.2.bias: Gradients: 0.002716963179409504
preprocess.time_left_model.0.weight: Gradients: 0.01680506393313408
preprocess.time_left_model.0.bias: Gradients: 0.0012926972704008222
preprocess.time_left_model.2.weight: Gradients: 0.1108398586511612
preprocess.time_left_model.2.bias: Gradients: 0.0030905732419341803
last.model.0.weight: Gradients: 0.236045241355896
last.model.0.bias: Gradients: 0.012233617715537548
Actor Network Gradients: (second update: all gradients exactly zero; the raw zero arrays are truncated below)
preprocess.prices_model.0.weight: Gradients: all zeros
preprocess.prices_model.0.bias: Gradients: all zeros
preprocess.prices_model.2.weight: Gradients: all zeros
preprocess.prices_model.2.bias: Gradients: all zeros
preprocess.battery_left_model.0.weight: Gradients: all zeros
preprocess.battery_left_model.0.bias: Gradients: all zeros
preprocess.battery_left_model.2.weight: Gradients: all zeros
preprocess.battery_left_model.2.bias: Gradients: all zeros
preprocess.time_left_model.0.weight: Gradients: all zeros
preprocess.time_left_model.0.bias: Gradients: all zeros
preprocess.time_left_model.2.weight: Gradients: all zeros
preprocess.time_left_model.2.bias: Gradients: all zeros
last.model.0.weight: Gradients: all zeros
last.model.0.bias: Gradients: all zeros
Critic gradient log (first update, then second update):
Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.019699564203619957
preprocess.prices_model.0.bias: Gradients: 0.030276207253336906
preprocess.prices_model.2.weight: Gradients: 0.2047939896583557
preprocess.prices_model.2.bias: Gradients: 0.10075069218873978
preprocess.battery_left_model.0.weight: Gradients: 0.2434540092945099
preprocess.battery_left_model.0.bias: Gradients: 0.0486908033490181
preprocess.battery_left_model.2.weight: Gradients: 2.296942949295044
preprocess.battery_left_model.2.bias: Gradients: 0.12656116485595703
preprocess.time_left_model.0.weight: Gradients: 0.6150978803634644
preprocess.time_left_model.0.bias: Gradients: 0.047315217554569244
preprocess.time_left_model.2.weight: Gradients: 4.802682399749756
preprocess.time_left_model.2.bias: Gradients: 0.11132071912288666
preprocess.action_model.0.weight: Gradients: 0.038945045322179794
preprocess.action_model.0.bias: Gradients: 0.05049269273877144
preprocess.action_model.2.weight: Gradients: 0.5154708623886108
preprocess.action_model.2.bias: Gradients: 0.11851561814546585
last.model.0.weight: Gradients: 10.407426834106445
last.model.0.bias: Gradients: 0.5554802417755127
Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.02118404023349285
preprocess.prices_model.0.bias: Gradients: 0.032557692378759384
preprocess.prices_model.2.weight: Gradients: 0.2504027485847473
preprocess.prices_model.2.bias: Gradients: 0.12331806123256683
preprocess.battery_left_model.0.weight: Gradients: 0.2629085183143616
preprocess.battery_left_model.0.bias: Gradients: 0.05947006493806839
preprocess.battery_left_model.2.weight: Gradients: 2.6021101474761963
preprocess.battery_left_model.2.bias: Gradients: 0.16108590364456177
preprocess.time_left_model.0.weight: Gradients: 0.7877254486083984
preprocess.time_left_model.0.bias: Gradients: 0.060594264417886734
preprocess.time_left_model.2.weight: Gradients: 5.933137893676758
preprocess.time_left_model.2.bias: Gradients: 0.1374756246805191
preprocess.action_model.0.weight: Gradients: 0.028953827917575836
preprocess.action_model.0.bias: Gradients: 0.055419400334358215
preprocess.action_model.2.weight: Gradients: 0.5418355464935303
preprocess.action_model.2.bias: Gradients: 0.13757666945457458
last.model.0.weight: Gradients: 13.180129051208496
last.model.0.bias: Gradients: 0.7118054628372192
I'm using tianshou 1.2.0.