
DDPG Policy Actor Gradient Question #1268

@Jack251970

Description

Hi, and thanks for the Tianshou framework!

I'm using Tianshou's DDPGPolicy for a custom continuous-control environment, and I've noticed that the actor network's gradients become exactly zero after the first update step, while the critic network continues to receive non-zero gradients and seems to train normally.

I've logged the actor and critic gradients right after .backward() during learn(). In the first update the actor gradients are non-zero, but in every subsequent update they are exactly zero across all layers, so the actor stops learning after a single gradient step.

Here is my custom DDPG policy with the gradient logging added:

class DDPGPolicy(BasePolicy[TDDPGTrainingStats], Generic[TDDPGTrainingStats]):
    """Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

    :param actor: The actor network following the rules (s -> actions)
    :param actor_optim: The optimizer for actor network.
    :param critic: The critic network. (s, a -> Q(s, a))
    :param critic_optim: The optimizer for critic network.
    :param action_space: Env's action space.
    :param tau: Param for soft update of the target network.
    :param gamma: Discount factor, in [0, 1].
    :param exploration_noise: The exploration noise, added to the action. Defaults
        to ``GaussianNoise(sigma=0.1)``.
    :param estimation_step: The number of steps to look ahead.
    :param observation_space: Env's observation space.
    :param action_scaling: if True, scale the action from [-1, 1] to the range
        of action_space. Only used if the action_space is continuous.
    :param action_bound_method: method to bound action to range [-1, 1].
        Only used if the action_space is continuous.
    :param lr_scheduler: if not None, will be called in `policy.update()`.
    """

    def __init__(
            self,
            *,
            actor: EVCActor,
            actor_optim: torch.optim.Optimizer,
            critic: EVCCritic,
            critic_optim: torch.optim.Optimizer,
            action_space: gym.Space,
            tau: float = 0.005,
            gamma: float = 0.99,
            exploration_noise: BaseNoise | Literal["default"] | None = "default",
            estimation_step: int = 1,
            observation_space: gym.Space | None = None,
            action_scaling: bool = True,
            # tanh not supported, see assert below
            action_bound_method: Literal["clip"] | None = "clip",
            lr_scheduler: TLearningRateScheduler | None = None,
    ) -> None:
       ...

    @staticmethod
    def _mse_optimizer(
            batch: RolloutBatchProtocol,
            critic: torch.nn.Module,
            optimizer: torch.optim.Optimizer,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """A simple wrapper script for updating critic network."""
        weight = getattr(batch, "weight", 1.0)
        current_q = critic(batch.obs, batch.act).flatten()
        target_q = batch.returns.flatten()
        td = current_q - target_q
        # critic_loss = F.mse_loss(current_q, target_q)
        critic_loss = (td.pow(2) * weight).mean()
        optimizer.zero_grad()
        critic_loss.backward()
        optimizer.step()
        return td, critic_loss

    @staticmethod
    def _log_gradients(module: torch.nn.Module, header: str, path: str) -> None:
        """Append per-parameter gradient info to a log file.

        Non-zero gradients are logged as norms; all-zero gradients are dumped
        as full arrays so they are unmistakable.
        """
        with open(path, "a") as f:
            f.write(f"{header} Gradients:\n")
            for name, param in module.named_parameters():
                if param.requires_grad and param.grad is not None:
                    if torch.all(param.grad == 0).item():
                        f.write(f"{name}: Gradients: {param.grad.cpu().numpy()}\n")
                    else:
                        f.write(f"{name}: Gradients: {param.grad.norm().item()}\n")
                else:
                    f.write(f"{name}: No gradient available\n")

    def learn(self, batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) -> TDDPGTrainingStats:  # type: ignore
        # critic update
        td, critic_loss = self._mse_optimizer(batch, self.critic, self.critic_optim)
        batch.weight = td  # prio-buffer

        # log critic gradients left over from the critic update
        self._log_gradients(self.critic, "Critic Network", "critic_gradients.log")

        # actor update: maximize Q(s, pi(s)) by minimizing its negation
        actor_loss = -self.critic(batch.obs, self(batch).act).mean()
        self.actor_optim.zero_grad()
        actor_loss.backward()

        # log actor gradients after backward(), before the optimizer step
        self._log_gradients(self.actor, "Actor Network", "actor_gradients.log")

        self.actor_optim.step()
        self.sync_weight()

        return DDPGTrainingStats(actor_loss=actor_loss.item(),
                                 critic_loss=critic_loss.item())  # type: ignore[return-value]

    _TArrOrActBatch = TypeVar("_TArrOrActBatch", bound="np.ndarray | ActBatchProtocol")
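
In addition to the parameter-gradient logging above, I also probed the gradient of the actor loss with respect to the action tensor itself, to tell whether the zeros originate in the critic's action path or inside the actor. This is a minimal sketch I dropped into learn() right after computing actor_loss (plain torch.autograd, nothing Tianshou-specific; self and batch are the names from the code above):

act = self(batch).act  # actions from the online actor, still attached to the graph
probe_loss = -self.critic(batch.obs, act).mean()
# gradient of the actor loss w.r.t. the action tensor itself
(act_grad,) = torch.autograd.grad(probe_loss, act, retain_graph=True)
print("||d(loss)/d(action)||:", act_grad.norm().item())

If this action gradient is already zero, the critic has gone flat along its action input; if it is non-zero while every actor parameter gradient is zero, the gradient is being killed somewhere inside the actor.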

The gradient logs from learn() look like this (first two updates; I've truncated the zero arrays for brevity):

Actor gradient log:

Actor Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.0010648445459082723
preprocess.prices_model.0.bias: Gradients: 0.0016365565825253725
preprocess.prices_model.2.weight: Gradients: 0.007918216288089752
preprocess.prices_model.2.bias: Gradients: 0.0036073809023946524
preprocess.battery_left_model.0.weight: Gradients: 0.005425306037068367
preprocess.battery_left_model.0.bias: Gradients: 0.0010850612306967378
preprocess.battery_left_model.2.weight: Gradients: 0.04320951923727989
preprocess.battery_left_model.2.bias: Gradients: 0.002716963179409504
preprocess.time_left_model.0.weight: Gradients: 0.01680506393313408
preprocess.time_left_model.0.bias: Gradients: 0.0012926972704008222
preprocess.time_left_model.2.weight: Gradients: 0.1108398586511612
preprocess.time_left_model.2.bias: Gradients: 0.0030905732419341803
last.model.0.weight: Gradients: 0.236045241355896
last.model.0.bias: Gradients: 0.012233617715537548
Actor Network Gradients:
preprocess.prices_model.0.weight: Gradients: [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 ...
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
preprocess.prices_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.prices_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.prices_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.battery_left_model.0.weight: Gradients: [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]
preprocess.battery_left_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.battery_left_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.battery_left_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.time_left_model.0.weight: Gradients: [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]
preprocess.time_left_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.time_left_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.time_left_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
last.model.0.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]]
last.model.0.bias: Gradients: [0.]

Critic gradient log:

Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.019699564203619957
preprocess.prices_model.0.bias: Gradients: 0.030276207253336906
preprocess.prices_model.2.weight: Gradients: 0.2047939896583557
preprocess.prices_model.2.bias: Gradients: 0.10075069218873978
preprocess.battery_left_model.0.weight: Gradients: 0.2434540092945099
preprocess.battery_left_model.0.bias: Gradients: 0.0486908033490181
preprocess.battery_left_model.2.weight: Gradients: 2.296942949295044
preprocess.battery_left_model.2.bias: Gradients: 0.12656116485595703
preprocess.time_left_model.0.weight: Gradients: 0.6150978803634644
preprocess.time_left_model.0.bias: Gradients: 0.047315217554569244
preprocess.time_left_model.2.weight: Gradients: 4.802682399749756
preprocess.time_left_model.2.bias: Gradients: 0.11132071912288666
preprocess.action_model.0.weight: Gradients: 0.038945045322179794
preprocess.action_model.0.bias: Gradients: 0.05049269273877144
preprocess.action_model.2.weight: Gradients: 0.5154708623886108
preprocess.action_model.2.bias: Gradients: 0.11851561814546585
last.model.0.weight: Gradients: 10.407426834106445
last.model.0.bias: Gradients: 0.5554802417755127
Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.02118404023349285
preprocess.prices_model.0.bias: Gradients: 0.032557692378759384
preprocess.prices_model.2.weight: Gradients: 0.2504027485847473
preprocess.prices_model.2.bias: Gradients: 0.12331806123256683
preprocess.battery_left_model.0.weight: Gradients: 0.2629085183143616
preprocess.battery_left_model.0.bias: Gradients: 0.05947006493806839
preprocess.battery_left_model.2.weight: Gradients: 2.6021101474761963
preprocess.battery_left_model.2.bias: Gradients: 0.16108590364456177
preprocess.time_left_model.0.weight: Gradients: 0.7877254486083984
preprocess.time_left_model.0.bias: Gradients: 0.060594264417886734
preprocess.time_left_model.2.weight: Gradients: 5.933137893676758
preprocess.time_left_model.2.bias: Gradients: 0.1374756246805191
preprocess.action_model.0.weight: Gradients: 0.028953827917575836
preprocess.action_model.0.bias: Gradients: 0.055419400334358215
preprocess.action_model.2.weight: Gradients: 0.5418355464935303
preprocess.action_model.2.bias: Gradients: 0.13757666945457458
last.model.0.weight: Gradients: 13.180129051208496
last.model.0.bias: Gradients: 0.7118054628372192

I'm using Tianshou 1.2.0.
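
One possibility I'm trying to rule out is saturation inside the actor: a hard clamp (or hard-tanh) whose input sits past the bound back-propagates an exactly-zero gradient, which would zero out every upstream layer just like the log above. Here is the quick check I'm running (a sketch; batch is a sampled batch as in learn()):

with torch.no_grad():
    act = self(batch).act
    print("action min/max:", act.min().item(), act.max().item())
    # fraction of outputs sitting exactly at the [-1, 1] bound
    frac_at_bound = ((act <= -1.0) | (act >= 1.0)).float().mean().item()
    print("fraction at bound:", frac_at_bound)

If the outputs aren't pinned at a bound, is there anything else in DDPGPolicy that could make the actor loss constant after the first update?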
