
Working with agent dimension in multi-agent workflows based on single policy (parameter sharing) #136

@p-veloso

Description

I am looking for a simple library to implement parameter sharing in multi-agent RL using single-agent RL algorithms. I have just discovered Tianshou and it looks awesome, but I have a problem with the dimension of the data that represents the number of agents.

My project uses a custom grid-based environment where (a minimal sketch of the interface follows this list):

  • the observation has shape (n_agents, n_channels, h, w)
  • the state-action vector has shape (n_agents, n_actions)
  • the action vector is discrete and has shape (n_agents)
  • the reward signal has shape (n_agents)
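
For concreteness, here is a minimal sketch of such an environment interface. The class name GridMultiAgentEnv and all the sizes are hypothetical; only the array shapes matter:

    import numpy as np

    class GridMultiAgentEnv:
        """Toy stand-in for the grid environment; only the array shapes are meaningful."""
        def __init__(self, n_agents=4, n_channels=3, h=8, w=8, n_actions=5):
            self.n_agents, self.n_channels, self.h, self.w = n_agents, n_channels, h, w
            self.n_actions = n_actions

        def reset(self):
            # observation: (n_agents, n_channels, h, w)
            return np.zeros((self.n_agents, self.n_channels, self.h, self.w), dtype=np.float32)

        def step(self, action):
            # action: (n_agents,) of discrete action indices
            assert action.shape == (self.n_agents,)
            obs = np.zeros((self.n_agents, self.n_channels, self.h, self.w), dtype=np.float32)
            rew = np.zeros(self.n_agents, dtype=np.float32)  # (n_agents,)
            return obs, rew, False, {}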

As far as I understand, Tianshou uses the collector both to gather batches during simulation (in one or multiple environments) and to retrieve batches for training. Therefore:

  • the batch observation has shape (batch_size, n_agents, n_channels, h, w)
  • the batch state-action vector has shape (batch_size, n_agents, n_actions)
  • the batch action vector is discrete and has shape (batch_size, n_agents)
  • the batch reward signal has shape (batch_size, n_agents)

Notice that the number of samples in a batch from the perspective of the neural network (batch_size * n_agents) is different from the number of samples from the perspective of the environment (batch_size), which can be problematic. In the simulation, the agents should generate a coherent trajectory, so the n_agents dimension is important to indicate which action vectors should be passed to which environment. I can use the forward method of the neural network model to check whether the n_agents dimension exists; if it does, I merge the batch_size and n_agents dimensions to feed the network and then reshape the resulting Q-values to recover the action vector for each environment.

    def forward(self, s, state=None, info={}):
        s = to_torch(s, device=self.device, dtype=torch.float)
        shape = s.shape
        if len(shape) == 5:  # (batch_size, n_agents, n_channels, h, w)
            # merge the agent dimension into the batch dimension for the shared network
            s = s.view(shape[0] * shape[1], shape[2], shape[3], shape[4])  # (batch_size * n_agents, n_channels, h, w)
            q_values = self.model(s)  # (batch_size * n_agents, n_actions)
            # restore the agent dimension on the Q-values
            q_values = q_values.view(shape[0], shape[1], -1)  # (batch_size, n_agents, n_actions)
        else:  # input is already (batch_size, n_channels, h, w)
            q_values = self.model(s)  # (batch_size, n_actions)
        return q_values, state

However, this creates a problem on the training side, because the observations and rewards are stored in the buffer with the n_agents dimension, which the training algorithm does not expect. For example, at line 91 of dqn.py (Tianshou 0.2.3):

    returns = buffer.rew[now] + self._gamma * returns

Here buffer.rew[now] has shape (batch_size, n_agents) while self._gamma * returns has shape (batch_size,), so the addition would break.
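
A quick NumPy sketch of that shape clash (the concrete numbers are just illustrative):

    import numpy as np

    batch_size, n_agents = 64, 4
    gamma = 0.99
    returns = np.zeros(batch_size)            # (batch_size,), what the 1-D return computation expects
    rew = np.zeros((batch_size, n_agents))    # (batch_size, n_agents), what the buffer holds for my env

    try:
        returns = rew + gamma * returns       # (64, 4) vs (64,): last dims 4 and 64 cannot broadcast
    except ValueError as err:
        print("shape mismatch:", err)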

What is the best way of addressing this? I foresee two possible strategies:

  1. Change the collector to merge the batch and n_agents dimensions before storing samples in the buffer, and only restore the n_agents dimension when passing the action vectors to the environments during simulation (a rough sketch follows this list). The problem is that this would break the sequential nature of the samples in the buffer, preventing n-step algorithms and invalidating the terminal information (although my environment does not have an end, so this might not be critical).
  2. Enable the policy to accept the additional dimension. This would preserve the sequential nature of the buffer, but it would require rewriting some methods.
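
For option 1, the flattening could look roughly like the sketch below. flatten_transition is a hypothetical helper for illustration, not an existing Tianshou API:

    import numpy as np

    def flatten_transition(obs, act, rew, done):
        """Split one joint transition with a leading n_agents dimension into
        n_agents single-agent transitions that a flat buffer can store."""
        n_agents = obs.shape[0]
        return [dict(obs=obs[i], act=act[i], rew=rew[i], done=done) for i in range(n_agents)]

    # usage with the shapes from above (numbers are illustrative)
    obs = np.zeros((4, 3, 8, 8), dtype=np.float32)  # (n_agents, n_channels, h, w)
    act = np.zeros(4, dtype=np.int64)               # (n_agents,)
    rew = np.zeros(4, dtype=np.float32)             # (n_agents,)
    for tr in flatten_transition(obs, act, rew, done=False):
        pass  # each tr would be added to the replay buffer as a separate sample

As noted in point 1, consecutive buffer entries then alternate between agents, which is exactly what breaks the n-step and terminal bookkeeping.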
