
[Question] How to zero out the reward sum ret_ = rewards[r] + gamma * ret_ #2089

Open
gekator opened this issue Feb 22, 2025 · 1 comment
Labels: question (Further information is requested)

gekator commented Feb 22, 2025

❓ How to zero out the reward sum ret_ = rewards[r] + gamma * ret_

Hello, I made a custom environment similar to Atari 2600 Pong, and my custom A2C implementation has this code:

import torch
import torch.nn.functional as F

def update_params(worker_opt, values, logprobs, rewards, clc=0.1, gamma=0.99):
    rewards = torch.Tensor(rewards).flip(dims=(0,)).view(-1)   #A reverse to time-backwards order
    logprobs = torch.stack(logprobs).flip(dims=(0,)).view(-1)
    values = torch.stack(values).flip(dims=(0,)).view(-1)
    Returns = []
    ret_ = torch.Tensor([0])
    for r in range(rewards.shape[0]):                          #B accumulate discounted returns
        if rewards[r] != 1:
            ret_ = torch.Tensor([0]) # reset the sum, since this was a game boundary (pong specific!)
        ret_ = rewards[r] + gamma * ret_
        Returns.append(ret_)
    Returns = torch.stack(Returns).view(-1)
    Returns = F.normalize(Returns, dim=0)
    actor_loss = -1 * logprobs * (Returns - values.detach())   #C policy (actor) loss
    critic_loss = torch.pow(values - Returns, 2)               #D value (critic) loss
    loss = actor_loss.sum() + clc * critic_loss.sum()          #E combined loss
    loss.backward()
    worker_opt.step()
    return actor_loss, critic_loss, len(rewards)

The lines below are responsible for zeroing out the reward sum ret_ = rewards[r] + gamma * ret_ (like here, line 44) when the loop reaches a boundary between game episodes: for example, the first player wins an episode, and there can be several such episodes until someone scores 21 points.

if rewards[r] != 1: 
    ret_ = torch.Tensor([0]) # reset the sum, since this was a game boundary (pong specific!)
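
For reference, the same reset can be written in terms of explicit done flags instead of checking the reward value. A minimal sketch, assuming a dones list is collected alongside the rewards (this helper is hypothetical, not from the code above):

import torch

def discounted_returns(rewards, dones, gamma=0.99):
    """Discounted returns with a reset at every episode boundary (hypothetical helper)."""
    ret_ = torch.tensor([0.0])
    returns = []
    # iterate backwards in time so each step can reuse the running sum of the next step
    for r, done in zip(reversed(rewards), reversed(dones)):
        if done:
            ret_ = torch.tensor([0.0])  # episode ended here: do not carry the sum across the boundary
        ret_ = r + gamma * ret_
        returns.append(ret_)
    returns.reverse()  # back to chronological order
    return torch.cat(returns)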

I found this code in Stable-Baselines3:

rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

and here, I think the relevant part is last_gae_lam:
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam

def compute_returns_and_advantage(self, last_values: th.Tensor, dones: np.ndarray) -> None:
    """
    Post-processing step: compute the lambda-return (TD(lambda) estimate)
    and GAE(lambda) advantage.

    Uses Generalized Advantage Estimation (https://arxiv.org/abs/1506.02438)
    to compute the advantage. To obtain Monte-Carlo advantage estimate (A(s) = R - V(S))
    where R is the sum of discounted reward with value bootstrap
    (because we don't always have full episode), set ``gae_lambda=1.0`` during initialization.

    The TD(lambda) estimator has also two special cases:
    - TD(1) is Monte-Carlo estimate (sum of discounted rewards)
    - TD(0) is one-step estimate with bootstrapping (r_t + gamma * v(s_{t+1}))

    For more information, see discussion in https://github.com/DLR-RM/stable-baselines3/pull/375.

    :param last_values: state value estimation for the last step (one for each env)
    :param dones: if the last step was a terminal step (one bool for each env).
    """
    # Convert to numpy
    last_values = last_values.clone().cpu().numpy().flatten()  # type: ignore[assignment]

    last_gae_lam = 0
    for step in reversed(range(self.buffer_size)):
        if step == self.buffer_size - 1:
            next_non_terminal = 1.0 - dones.astype(np.float32)
            next_values = last_values
        else:
            next_non_terminal = 1.0 - self.episode_starts[step + 1]
            next_values = self.values[step + 1]
        delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
        last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
        self.advantages[step] = last_gae_lam
    # TD(lambda) estimator, see Github PR #375 or "Telescoping in TD(lambda)"
    # in David Silver Lecture 4: https://www.youtube.com/watch?v=PnHCvfgC_ZA
    self.returns = self.advantages + self.values
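
In that loop the reset happens through next_non_terminal: whenever episode_starts[step + 1] is 1 (or dones is True at the last step), next_non_terminal becomes 0, which drops both the bootstrap term gamma * next_values and the running last_gae_lam, i.e. the same effect as setting ret_ back to zero. A small self-contained toy example of that mechanism (plain NumPy, not SB3 code; all numbers are made up):

import numpy as np

# Toy single-environment buffer of 4 steps with an episode boundary after step 1
# (episode_starts[2] == 1 means step 2 starts a new episode).
rewards        = np.array([0.0, 1.0, 0.0, -1.0], dtype=np.float32)
values         = np.array([0.5, 0.4, 0.3, 0.2], dtype=np.float32)
episode_starts = np.array([1.0, 0.0, 1.0, 0.0], dtype=np.float32)
last_value, last_done = 0.1, True  # value estimate and done flag for the step after the buffer
gamma, gae_lambda = 0.99, 0.95

advantages = np.zeros(4, dtype=np.float32)
last_gae_lam = 0.0
for step in reversed(range(4)):
    if step == 3:
        next_non_terminal = 1.0 - float(last_done)
        next_values = last_value
    else:
        next_non_terminal = 1.0 - episode_starts[step + 1]
        next_values = values[step + 1]
    # at a boundary next_non_terminal == 0, so neither the bootstrap value
    # nor the accumulated GAE sum leaks into the previous episode
    delta = rewards[step] + gamma * next_values * next_non_terminal - values[step]
    last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
    advantages[step] = last_gae_lam

returns = advantages + values  # TD(lambda) targets, as in self.returns above
print(advantages)  # advantages[1] == rewards[1] - values[1]; nothing from step 2 onwards leaks in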

And how do I zero out the reward sum like this in Stable-Baselines3's A2C?


araffin (Member) commented Mar 1, 2025

And how do I zero out the reward sum like this in Stable-Baselines3's A2C?

https://stable-baselines3.readthedocs.io/en/master/common/atari_wrappers.html#module-stable_baselines3.common.atari_wrappers
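
For a real ALE Pong environment, AtariWrapper from that module applies the usual preprocessing (episodic life, reward clipping, frame skip, etc.), and the rollout buffer then resets the return at every done signal automatically. For a custom Pong-like environment, the analogous fix is a small wrapper that terminates the episode at the game boundary. A minimal sketch, where the wrapper name and the reward check are assumptions about the custom env, not part of SB3:

import gymnasium as gym
from stable_baselines3 import A2C


class PointBoundaryWrapper(gym.Wrapper):
    """Hypothetical wrapper: mark the episode as terminated whenever a point is
    scored, so the rollout buffer zeroes the return/advantage at that boundary
    (the same idea as EpisodicLifeEnv ending episodes on life loss)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if reward != 0:  # assumption: a nonzero reward marks the end of a rally
            terminated = True
        return obs, reward, terminated, truncated, info


# env = PointBoundaryWrapper(MyCustomPongEnv())   # MyCustomPongEnv = your custom env
# model = A2C("CnnPolicy", env).learn(100_000)

Note that SB3 calls reset() whenever the episode is marked terminated, so the environment has to be able to start the next rally from reset(); EpisodicLifeEnv does that bookkeeping for Atari lives.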
