
[Question] How to zero out the reward sum ret_ = rewards[r] + gamma * ret_ #2089

Open
gekator opened this issue Feb 22, 2025 · 1 comment
Labels: question (Further information is requested)

gekator commented Feb 22, 2025

❓ How to zero out the reward sum ret_ = rewards[r] + gamma * ret_

Hello, I made a custom environment similar to Atari 2600 Pong, and my custom A2C implementation has this code:

import torch
import torch.nn.functional as F

def update_params(worker_opt, values, logprobs, rewards, clc=0.1, gamma=0.99):
    rewards = torch.Tensor(rewards).flip(dims=(0,)).view(-1)   #A reverse to time-backwards order
    logprobs = torch.stack(logprobs).flip(dims=(0,)).view(-1)
    values = torch.stack(values).flip(dims=(0,)).view(-1)
    Returns = []
    ret_ = torch.Tensor([0])
    for r in range(rewards.shape[0]):                          #B accumulate discounted returns
        if rewards[r] != 1:
            ret_ = torch.Tensor([0]) # reset the sum, since this was a game boundary (pong specific!)
        ret_ = rewards[r] + gamma * ret_
        Returns.append(ret_)
    Returns = torch.stack(Returns).view(-1)
    Returns = F.normalize(Returns, dim=0)
    actor_loss = -1 * logprobs * (Returns - values.detach())   #C policy (actor) loss
    critic_loss = torch.pow(values - Returns, 2)               #D value (critic) loss
    loss = actor_loss.sum() + clc * critic_loss.sum()          #E combined loss
    loss.backward()
    worker_opt.step()
    return actor_loss, critic_loss, len(rewards)

The lines below are responsible for zeroing out the reward sum ret_ = rewards[r] + gamma * ret_ (like here, line 44) when the loop reaches a boundary between game episodes: for example, the first player wins an episode, and there can be several such episodes until someone scores 21 points.

if rewards[r] != 1: 
    ret_ = torch.Tensor([0]) # reset the sum, since this was a game boundary (pong specific!)
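
For reference, the same reset can be written in terms of explicit done flags instead of checking the reward value. A minimal sketch, assuming a dones list is collected alongside the rewards (this helper is hypothetical, not from the code above):

import torch

def discounted_returns(rewards, dones, gamma=0.99):
    """Discounted returns with a reset at every episode boundary (hypothetical helper)."""
    ret_ = torch.tensor([0.0])
    returns = []
    # iterate backwards in time so each step can reuse the running sum of the next step
    for r, done in zip(reversed(rewards), reversed(dones)):
        if done:
            ret_ = torch.tensor([0.0])  # episode ended here: do not carry the sum across the boundary
        ret_ = r + gamma * ret_
        returns.append(ret_)
    returns.reverse()  # back to chronological order
    return torch.cat(returns)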

I found this code in Stable-Baselines3:

rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

and here, I think the relevant part is last_gae_lam:
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam

def compute_returns_and_advantage(self, last_values: th.Tensor, dones: np.ndarray) -> None:
    """
    Post-processing step: compute the lambda-return (TD(lambda) estimate)
    and GAE(lambda) advantage.

    Uses Generalized Advantage Estimation (https://arxiv.org/abs/1506.02438)
    to compute the advantage. To obtain Monte-Carlo advantage estimate (A(s) = R - V(S))
    where R is the sum of discounted reward with value bootstrap
    (because we don't always have full episode), set ``gae_lambda=1.0`` during initialization.

    The TD(lambda) estimator has also two special cases:
    - TD(1) is Monte-Carlo estimate (sum of discounted rewards)
    - TD(0) is one-step estimate with bootstrapping (r_t + gamma * v(s_{t+1}))

    For more information, see discussion in https://github.com/DLR-RM/stable-baselines3/pull/375.

    :param last_values: state value estimation for the last step (one for each env)
    :param dones: if the last step was a terminal step (one bool for each env).
    """
    # Convert to numpy
    last_values = last_values.clone().cpu().numpy().flatten()  # type: ignore[assignment]

    last_gae_lam = 0
    for step in reversed(range(self.buffer_size)):
        if step == self.buffer_size - 1:
            next_non_terminal = 1.0 - dones.astype(np.float32)
            next_values = last_values
        else:
            next_non_terminal = 1.0 - self.episode_starts[step + 1]
            next_values = self.values[step + 1]
        delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
        last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam
        self.advantages[step] = last_gae_lam
    # TD(lambda) estimator, see Github PR #375 or "Telescoping in TD(lambda)"
    # in David Silver Lecture 4: https://www.youtube.com/watch?v=PnHCvfgC_ZA
    self.returns = self.advantages + self.values
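
In that loop the reset happens through next_non_terminal: whenever episode_starts[step + 1] is 1 (or dones is True at the last step), next_non_terminal becomes 0, which drops both the bootstrap term gamma * next_values and the running last_gae_lam, i.e. the same effect as setting ret_ back to zero. A small self-contained toy example of that mechanism (plain NumPy, not SB3 code; all numbers are made up):

import numpy as np

# Toy single-environment buffer of 4 steps with an episode boundary after step 1
# (episode_starts[2] == 1 means step 2 starts a new episode).
rewards        = np.array([0.0, 1.0, 0.0, -1.0], dtype=np.float32)
values         = np.array([0.5, 0.4, 0.3, 0.2], dtype=np.float32)
episode_starts = np.array([1.0, 0.0, 1.0, 0.0], dtype=np.float32)
last_value, last_done = 0.1, True  # value estimate and done flag for the step after the buffer
gamma, gae_lambda = 0.99, 0.95

advantages = np.zeros(4, dtype=np.float32)
last_gae_lam = 0.0
for step in reversed(range(4)):
    if step == 3:
        next_non_terminal = 1.0 - float(last_done)
        next_values = last_value
    else:
        next_non_terminal = 1.0 - episode_starts[step + 1]
        next_values = values[step + 1]
    # at a boundary next_non_terminal == 0, so neither the bootstrap value
    # nor the accumulated GAE sum leaks into the previous episode
    delta = rewards[step] + gamma * next_values * next_non_terminal - values[step]
    last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
    advantages[step] = last_gae_lam

returns = advantages + values  # TD(lambda) targets, as in self.returns above
print(advantages)  # advantages[1] == rewards[1] - values[1]; nothing from step 2 onwards leaks in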

And how do I zero out the reward sum like this in Stable-Baselines3's A2C?


araffin (Member) commented Mar 1, 2025

And how do I zero out the reward sum like this in Stable-Baselines3's A2C?

https://stable-baselines3.readthedocs.io/en/master/common/atari_wrappers.html#module-stable_baselines3.common.atari_wrappers
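
For a real ALE Pong environment, AtariWrapper from that module applies the usual preprocessing (episodic life, reward clipping, frame skip, etc.), and the rollout buffer then resets the return at every done signal automatically. For a custom Pong-like environment, the analogous fix is a small wrapper that terminates the episode at the game boundary. A minimal sketch, where the wrapper name and the reward check are assumptions about the custom env, not part of SB3:

import gymnasium as gym
from stable_baselines3 import A2C


class PointBoundaryWrapper(gym.Wrapper):
    """Hypothetical wrapper: mark the episode as terminated whenever a point is
    scored, so the rollout buffer zeroes the return/advantage at that boundary
    (the same idea as EpisodicLifeEnv ending episodes on life loss)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if reward != 0:  # assumption: a nonzero reward marks the end of a rally
            terminated = True
        return obs, reward, terminated, truncated, info


# env = PointBoundaryWrapper(MyCustomPongEnv())   # MyCustomPongEnv = your custom env
# model = A2C("CnnPolicy", env).learn(100_000)

Note that SB3 calls reset() whenever the episode is marked terminated, so the environment has to be able to start the next rally from reset(); EpisodicLifeEnv does that bookkeeping for Atari lives.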
