Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imporve the way of handling BUFFER_PG during PFC storm #7

Closed
wants to merge 3 commits into from

Conversation

stephenxs
Copy link
Owner

What I did
One action for the PFC storm is to set a zero buffer profile on the PG under PFC storm.
The zero buffer won't be removed until the PFC storm has gone.
If the user wants to modify the buffer profile for the PG, the bufferorch will return "task_need_retry".
General speaking it doesn't matter unless that:

  • the system can't be warm rebooted until the PFC storm has gone.
  • the "task_need_retry" will block the update coming from any entry of the BUFFER_PG table from being programmed to ASIC.
    In this sense, we need a better solution.

In the new solution, it will:

  • record the new profile user wants to apply during PFC storm as the "pending profile" for that PG and return "task_success" if the PG is under PFC storm.
  • apply the pending profile once the PG is unlocked.
  • the latest pending profile will take effect in case user tries updating the profile for more than 1 times.

Signed-off-by: Stephen Sun [email protected]

Why I did it

How I verified it
It can be verified in the following steps:

  1. configure PFC watchdog. the buffer profile should be the original value
  2. start PFC storm on PG 4.
NOTICE swss#orchagent: :- startWdActionOnQueue: Receive notification, storm
NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet64, queue index 4, queue id 0x150000000006e5 and port id 0x10000000006de.
  1. modify PG 3-4
Oct 16 10:31:50.869634 mtbc-sonic-01-2410 NOTICE swss#orchagent: :- processPriorityGroup: Set buffer PG Ethernet64:3-4 to profile
Oct 16 10:31:50.869893 mtbc-sonic-01-2410 WARNING swss#orchagent: :- processPriorityGroup: Priority group 4 on port Ethernet64 is locked, pending until unlocked
  1. stop PFC storm
Oct 16 10:33:29.099110 mtbc-sonic-01-2410 NOTICE swss#orchagent: :- startWdActionOnQueue: Receive notification, restore
Oct 16 10:33:29.099110 mtbc-sonic-01-2410 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog storm restored on port Ethernet64, queue index 4, queue id 0x150000000006e5 and port id 0x10000000006de.
Oct 16 10:33:29.117272 mtbc-sonic-01-2410 NOTICE swss#orchagent: :- ~PfcWdZeroBufferHandler: Priority group 4 on port Ethernet64 has been resotred to pending profile 0x19000000000ab1

Details if related

One action for the PFC storm is to set a zero buffer profile on the PG under PFC storm.
The zero buffer won't be removed until the PFC storm has gone.
If the user wants to modify the buffer profile for the PG, the bufferorch will return "task_need_retry".
General speaking it doesn't matter unless that:
- the system can't be warm rebooted until the PFC storm has gone.
- the "task_need_retry" will block the update of the entire BUFFER_PG table from being programmed to ASIC.
In this sense, we need a better solution.

The new solution will:
- record the new profile user wants to apply during PFC storm as the "pending profile" for that PG and return "task_success" if the PG is under PFC storm.
- apply the pending profile once the PG is unlocked.
- the latest pending profile will take effect in case user tries updating the profile for more than 1 times.

Signed-off-by: Stephen Sun <[email protected]>
@stephenxs
Copy link
Owner Author

Community PR 1480 opened.

@stephenxs stephenxs closed this Oct 24, 2020
stephenxs pushed a commit that referenced this pull request Aug 14, 2023
**What I did**

Fix the Mem Leak by moving the raw pointers in type_maps to use smart pointers

**Why I did it**

```
Indirect leak of 83776 byte(s) in 476 object(s) allocated from:
    #0 0x7f0a2a414647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5555590cc923 in __gnu_cxx::new_allocator, std::allocator > const, referenced_object> > >::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0x5555590cc923 in std::allocator_traits, std::allocator > const, referenced_object> > > >::allocate(std::allocator, std::allocator > const, referenced_object> > >&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0x5555590cc923 in std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_get_node() /usr/include/c++/10/bits/stl_tree.h:584
    #4 0x5555590cc923 in std::_Rb_tree_node, std::allocator > const, referenced_object> >* std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_create_node, std::allocator > const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:634
    #5 0x5555590cc923 in std::_Rb_tree_iterator, std::allocator > const, referenced_object> > std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_emplace_hint_unique, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator, std::allocator > const, referenced_object> >, std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:2461
    #6 0x5555590e8757 in std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::operator[](std::__cxx11::basic_string, std::allocator > const&) /usr/include/c++/10/bits/stl_map.h:501
    #7 0x5555590d48b0 in Orch::setObjectReference(std::map, std::allocator >, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*, std::less, std::allocator > >, std::allocator, std::allocator > const, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*> > >&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&) orchagent/orch.cpp:450
    #8 0x5555594ff66b in QosOrch::handleQueueTable(Consumer&, std::tuple, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::vector, std::allocator >, std::__cxx11::basic_string, std::allocator > >, std::allocator, std::allocator >, std::__cxx11::basic_string, std::allocator > > > > >&) orchagent/qosorch.cpp:1763
    #9 0x5555594edbd6 in QosOrch::doTask(Consumer&) orchagent/qosorch.cpp:2179
    #10 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:241
    #11 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:238
    #12 0x5555590c8743 in Consumer::execute() orchagent/orch.cpp:235
    #13 0x555559090dad in OrchDaemon::start() orchagent/orchdaemon.cpp:755
    #14 0x555558e9be25 in main orchagent/main.cpp:766
    #15 0x7f0a299b6d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
```
stephenxs pushed a commit that referenced this pull request Feb 27, 2025
…tries in gRouteBulker (sonic-net#3526)

<!--
Please make sure you have read and understood the contribution guildlines:
https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

1. Make sure your commit includes a signature generted with `git commit -s`
2. Make sure your commit title follows the correct format: [component]: description
3. Make sure your commit message contains enough details about the change and related tests
4. Make sure your pull request adds related reviewers, asignees, labels

Please also provide the following information in this pull request:
-->

**What I did**

Avoid removing a VRF routing table when there are pending creation entries in gRouteBulker

1. Remove a VRF routing table when a routing entry is removed only if there is no pending creation entry in gRouteBulker
2. Avoid uninitialized value SAI IP address/prefix structure

**Why I did it**

Fix issue: out of range exception can be thrown in `addRoutePost` due to non exist VRF

```
(gdb) bt
#0 0x00007f5791aedebc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5791a9efb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f5791a89472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f5791de0919 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f5791debe1a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f5791debe85 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f5791dec0d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007f5791de3240 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00005594e856d956 in std::map<unsigned long, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > > > > >::at (this=<optimized out>, __k=<optimized out>) at /usr/include/c++/12/bits/stl_map.h:551
#9 0x00005594e8564beb in RouteOrch::addRoutePost (this=this@entry=0x5594ea13e080, ctx=..., nextHops=...) at ./orchagent/routeorch.cpp:2145
#10 0x00005594e856b0b2 in RouteOrch::doTask (this=0x5594ea13e080, consumer=...) at ./orchagent/routeorch.cpp:1021
#11 0x00005594e85282d2 in Orch::doTask (this=0x5594ea13e080) at ./orchagent/orch.cpp:553
#12 0x00005594e851909a in OrchDaemon::start (this=this@entry=0x5594ea0a0950) at ./orchagent/orchdaemon.cpp:895
#13 0x00005594e8485632 in main (argc=<optimized out>, argv=<optimized out>) at ./orchagent/main.cpp:818
```

**How I verified it**

Unit (mock) test

**Details if related**

Originally, it cleaned up a VRF routing table whenever a prefix of the VRF was removed if
1. there was no routing entry in the VRF routing table and
2. the prefix was not pending creation in gRouteBulker

The motivation is to remove a VRF routing table if there is no routing entry in the VRF and no routing entry pending creation for that VRF. However, condition 2 does not guarantee that.

The ideal way of the 2nd condition is to check pending creation entries of a certain VRF, which we can not do.
So, we are using strict conditions here as the following:
1. there is no routing entry in the VRF routing table and
2. there is no pending creating routing entry in gRouteBulker regardless of which VRF it belongs to
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants