
DDPG Policy Actor Gradient Question #1268

@Jack251970

Description

Hi, and thanks for the Tianshou framework!

I'm using Tianshou's DDPGPolicy for a custom continuous-control environment, and I've noticed that the actor network's gradients become exactly zero after the first update step, while the critic network continues to receive non-zero gradients and seems to train normally.

I've logged the actor and critic gradients right after .backward() during learn(). In the first update the actor gradients are non-zero, but in every subsequent update they are exactly zero across all layers, so the actor stops learning after a single gradient step.

Here is my custom DDPG policy with the gradient logging added:

class DDPGPolicy(BasePolicy[TDDPGTrainingStats], Generic[TDDPGTrainingStats]):
    """Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

    :param actor: The actor network following the rules (s -> actions)
    :param actor_optim: The optimizer for actor network.
    :param critic: The critic network. (s, a -> Q(s, a))
    :param critic_optim: The optimizer for critic network.
    :param action_space: Env's action space.
    :param tau: Param for soft update of the target network.
    :param gamma: Discount factor, in [0, 1].
    :param exploration_noise: The exploration noise, added to the action. Defaults
        to ``GaussianNoise(sigma=0.1)``.
    :param estimation_step: The number of steps to look ahead.
    :param observation_space: Env's observation space.
    :param action_scaling: if True, scale the action from [-1, 1] to the range
        of action_space. Only used if the action_space is continuous.
    :param action_bound_method: method to bound action to range [-1, 1].
        Only used if the action_space is continuous.
    :param lr_scheduler: if not None, will be called in `policy.update()`.
    """

    def __init__(
            self,
            *,
            actor: EVCActor,
            actor_optim: torch.optim.Optimizer,
            critic: EVCCritic,
            critic_optim: torch.optim.Optimizer,
            action_space: gym.Space,
            tau: float = 0.005,
            gamma: float = 0.99,
            exploration_noise: BaseNoise | Literal["default"] | None = "default",
            estimation_step: int = 1,
            observation_space: gym.Space | None = None,
            action_scaling: bool = True,
            # tanh not supported, see assert below
            action_bound_method: Literal["clip"] | None = "clip",
            lr_scheduler: TLearningRateScheduler | None = None,
    ) -> None:
       ...

    @staticmethod
    def _mse_optimizer(
            batch: RolloutBatchProtocol,
            critic: torch.nn.Module,
            optimizer: torch.optim.Optimizer,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """A simple wrapper script for updating critic network."""
        weight = getattr(batch, "weight", 1.0)
        current_q = critic(batch.obs, batch.act).flatten()
        target_q = batch.returns.flatten()
        td = current_q - target_q
        # critic_loss = F.mse_loss(current_q, target_q)
        critic_loss = (td.pow(2) * weight).mean()
        optimizer.zero_grad()
        critic_loss.backward()
        optimizer.step()
        return td, critic_loss

    @staticmethod
    def _log_gradients(module: torch.nn.Module, header: str, path: str) -> None:
        """Append per-parameter gradient info to a log file.

        Non-zero gradients are logged as norms; all-zero gradients are dumped
        as full arrays so they are unmistakable.
        """
        with open(path, "a") as f:
            f.write(f"{header} Gradients:\n")
            for name, param in module.named_parameters():
                if param.requires_grad and param.grad is not None:
                    if torch.all(param.grad == 0).item():
                        f.write(f"{name}: Gradients: {param.grad.cpu().numpy()}\n")
                    else:
                        f.write(f"{name}: Gradients: {param.grad.norm().item()}\n")
                else:
                    f.write(f"{name}: No gradient available\n")

    def learn(self, batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) -> TDDPGTrainingStats:  # type: ignore
        # critic update
        td, critic_loss = self._mse_optimizer(batch, self.critic, self.critic_optim)
        batch.weight = td  # prio-buffer

        # log critic gradients left over from the critic update
        self._log_gradients(self.critic, "Critic Network", "critic_gradients.log")

        # actor update: maximize Q(s, pi(s)) by minimizing its negation
        actor_loss = -self.critic(batch.obs, self(batch).act).mean()
        self.actor_optim.zero_grad()
        actor_loss.backward()

        # log actor gradients after backward(), before the optimizer step
        self._log_gradients(self.actor, "Actor Network", "actor_gradients.log")

        self.actor_optim.step()
        self.sync_weight()

        return DDPGTrainingStats(actor_loss=actor_loss.item(),
                                 critic_loss=critic_loss.item())  # type: ignore[return-value]

    _TArrOrActBatch = TypeVar("_TArrOrActBatch", bound="np.ndarray | ActBatchProtocol")
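
In addition to the parameter-gradient logging above, I also probed the gradient of the actor loss with respect to the action tensor itself, to tell whether the zeros originate in the critic's action path or inside the actor. This is a minimal sketch I dropped into learn() right after computing actor_loss (plain torch.autograd, nothing Tianshou-specific; self and batch are the names from the code above):

act = self(batch).act  # actions from the online actor, still attached to the graph
probe_loss = -self.critic(batch.obs, act).mean()
# gradient of the actor loss w.r.t. the action tensor itself
(act_grad,) = torch.autograd.grad(probe_loss, act, retain_graph=True)
print("||d(loss)/d(action)||:", act_grad.norm().item())

If this action gradient is already zero, the critic has gone flat along its action input; if it is non-zero while every actor parameter gradient is zero, the gradient is being killed somewhere inside the actor.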

The gradient logs from learn() look like this (first two updates; I've truncated the zero arrays for brevity):

Actor gradient log:

Actor Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.0010648445459082723
preprocess.prices_model.0.bias: Gradients: 0.0016365565825253725
preprocess.prices_model.2.weight: Gradients: 0.007918216288089752
preprocess.prices_model.2.bias: Gradients: 0.0036073809023946524
preprocess.battery_left_model.0.weight: Gradients: 0.005425306037068367
preprocess.battery_left_model.0.bias: Gradients: 0.0010850612306967378
preprocess.battery_left_model.2.weight: Gradients: 0.04320951923727989
preprocess.battery_left_model.2.bias: Gradients: 0.002716963179409504
preprocess.time_left_model.0.weight: Gradients: 0.01680506393313408
preprocess.time_left_model.0.bias: Gradients: 0.0012926972704008222
preprocess.time_left_model.2.weight: Gradients: 0.1108398586511612
preprocess.time_left_model.2.bias: Gradients: 0.0030905732419341803
last.model.0.weight: Gradients: 0.236045241355896
last.model.0.bias: Gradients: 0.012233617715537548
Actor Network Gradients:
preprocess.prices_model.0.weight: Gradients: [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 ...
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
preprocess.prices_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.prices_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.prices_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.battery_left_model.0.weight: Gradients: [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]
preprocess.battery_left_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.battery_left_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.battery_left_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.time_left_model.0.weight: Gradients: [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]
preprocess.time_left_model.0.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
preprocess.time_left_model.2.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
preprocess.time_left_model.2.bias: Gradients: [0. 0. 0. ... 0. 0. 0.]
last.model.0.weight: Gradients: [[0. 0. 0. ... 0. 0. 0.]]
last.model.0.bias: Gradients: [0.]

Critic gradient log:

Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.019699564203619957
preprocess.prices_model.0.bias: Gradients: 0.030276207253336906
preprocess.prices_model.2.weight: Gradients: 0.2047939896583557
preprocess.prices_model.2.bias: Gradients: 0.10075069218873978
preprocess.battery_left_model.0.weight: Gradients: 0.2434540092945099
preprocess.battery_left_model.0.bias: Gradients: 0.0486908033490181
preprocess.battery_left_model.2.weight: Gradients: 2.296942949295044
preprocess.battery_left_model.2.bias: Gradients: 0.12656116485595703
preprocess.time_left_model.0.weight: Gradients: 0.6150978803634644
preprocess.time_left_model.0.bias: Gradients: 0.047315217554569244
preprocess.time_left_model.2.weight: Gradients: 4.802682399749756
preprocess.time_left_model.2.bias: Gradients: 0.11132071912288666
preprocess.action_model.0.weight: Gradients: 0.038945045322179794
preprocess.action_model.0.bias: Gradients: 0.05049269273877144
preprocess.action_model.2.weight: Gradients: 0.5154708623886108
preprocess.action_model.2.bias: Gradients: 0.11851561814546585
last.model.0.weight: Gradients: 10.407426834106445
last.model.0.bias: Gradients: 0.5554802417755127
Critic Network Gradients:
preprocess.prices_model.0.weight: Gradients: 0.02118404023349285
preprocess.prices_model.0.bias: Gradients: 0.032557692378759384
preprocess.prices_model.2.weight: Gradients: 0.2504027485847473
preprocess.prices_model.2.bias: Gradients: 0.12331806123256683
preprocess.battery_left_model.0.weight: Gradients: 0.2629085183143616
preprocess.battery_left_model.0.bias: Gradients: 0.05947006493806839
preprocess.battery_left_model.2.weight: Gradients: 2.6021101474761963
preprocess.battery_left_model.2.bias: Gradients: 0.16108590364456177
preprocess.time_left_model.0.weight: Gradients: 0.7877254486083984
preprocess.time_left_model.0.bias: Gradients: 0.060594264417886734
preprocess.time_left_model.2.weight: Gradients: 5.933137893676758
preprocess.time_left_model.2.bias: Gradients: 0.1374756246805191
preprocess.action_model.0.weight: Gradients: 0.028953827917575836
preprocess.action_model.0.bias: Gradients: 0.055419400334358215
preprocess.action_model.2.weight: Gradients: 0.5418355464935303
preprocess.action_model.2.bias: Gradients: 0.13757666945457458
last.model.0.weight: Gradients: 13.180129051208496
last.model.0.bias: Gradients: 0.7118054628372192

I'm using Tianshou 1.2.0.
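
One possibility I'm trying to rule out is saturation inside the actor: a hard clamp (or hard-tanh) whose input sits past the bound back-propagates an exactly-zero gradient, which would zero out every upstream layer just like the log above. Here is the quick check I'm running (a sketch; batch is a sampled batch as in learn()):

with torch.no_grad():
    act = self(batch).act
    print("action min/max:", act.min().item(), act.max().item())
    # fraction of outputs sitting exactly at the [-1, 1] bound
    frac_at_bound = ((act <= -1.0) | (act >= 1.0)).float().mean().item()
    print("fraction at bound:", frac_at_bound)

If the outputs aren't pinned at a bound, is there anything else in DDPGPolicy that could make the actor loss constant after the first update?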
