PMON container crashes in latest SONiC images #5986
Comments
@vdahiya12 The last image I tried was build# 493. Pmon was crashing on this. |
Can you please try build# 496, as it has also been built? |
@vdahiya12 Downloading build# 496. Once it's complete, I will load it and share the results. |
@vdahiya12 PMON crash is seen in build# 496 also. Please let me know which image will have your fix. |
This issue might have been fixed, but the latest master image has a different issue with xcvrd, as described in issue #5994. |
@vdahiya12 @lguohan @jleveque @yxieca I tried build# 500. Most of the services are down. Still seeing the following error
Also seeing an orchagent crash:
|
Here is the dump from build# 500: |
@ciju-juniper: It appears as though your sonic_platform wheel package was not found when the PMon container started up, so it was not installed in the container. Can you please investigate why? |
@jleveque, which package are you referring to? I do not see that they build their own wheel package. https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/lastSuccessfulBuild/artifact/target/debs/buster/sonic-platform-juniper-qfx5210_1.1_amd64.deb.log |
I see references to files in the sonic_platform package in the log, but I don't see a sonic_platform-1.0-py2-none-any.whl package being built. |
@jleveque I started seeing that error from build# 482. No platform changes are involved. |
@jleveque Moreover, the images are built by jenkins jobs. Were there any changes in the package selection / build rules from build# 482 onwards? |
@ciju-juniper: You can see the commits which went into build # 482 here: https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/482/ There are no changes to package selection/build rules that I can see. |
@ciju-juniper are you still encountering this issue? |
@daall Let me start downloading the latest master image. I will update this issue shortly. |
@daall I tried with build# 522. The problem is still seen: PMON exits after several startup attempts.
Please let me know if you have seen similar issues and have any suggestions to rectify it. |
@daall I did some debugging by selectively enabling syseepromd and psud. With xcvrd disabled, pmon starts up, although the 'No module named sonic_platform' error is still there. The problem appears once xcvrd is started. This is what I get when I manually start xcvrd from pmon:
This sonic_platform is imported in the xcvrd daemon init:
It's clear that the sonic_platform library is not available for xcvrd to start. In the last good build# 481 (in master), I see that xcvrd is available at /usr/bin/xcvrd. It looks like something is broken in the library packaging / initialization. |
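For anyone reproducing this, a quick way to confirm whether a package is visible to the interpreter is to ask Python's import machinery directly. The following is a minimal sketch, not part of any SONiC tooling: `module_available` is a hypothetical helper name, and it assumes you run it with the same Python interpreter that xcvrd uses inside the pmon container.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located on the current import path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package or a module with a broken spec
        # is treated as "not available".
        return False


if __name__ == "__main__":
    # sonic_platform is the vendor package discussed in this thread;
    # sonic_platform_base is the common base package it builds on.
    for name in ("sonic_platform", "sonic_platform_base"):
        print(name, "found" if module_available(name) else "MISSING")
```

If `sonic_platform` is reported as missing here, the daemon's import fails for the same reason, which points at packaging or installation rather than at the daemon code itself.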
I found out what's happening. After the xcvrd code moved into the Python library, it started conflicting with the 'sonic_platform' platform library implementation (which contains chassis.py & platform.py). As an experiment, I removed the 'sonic_platform' package from the platform directory and built an image. There were no xcvrd crashes and the PMON docker is fine. I do see these messages:
Out of these errors, the 'reboot_cause' error is expected, as the platform implementation was in chassis.py. Despite all these errors, psud and syseepromd are running fine. I'm not sure about the ImportError('No module named sonic_platform',). @daall @jleveque @vdahiya12 What would you suggest to get rid of these errors? |
@lguohan @jleveque @vdahiya12 @daall I took a deeper look at error 6 listed in the above comment. It is due to the changes introduced to support chassis-based systems. Similar errors appear in syseepromd, watchdog, xcvrd, etc.
This mandates an implementation of 'sonic_platform.platform.Platform().get_chassis()'. Why was the new implementation done without backward compatibility? The subsequent code block ensures that psud remains functional for pizza-box platforms, which explains how the PMON daemons keep running even after hitting a 'No module' error.
I'm OK with making any changes in the platform scripts to get rid of this error, but IMHO the platform common implementation could have been better. Thoughts? |
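The behavior described in the comment above (try the new platform API first, log the failure, then fall back to the legacy plugin) can be sketched roughly as follows. This is an illustration of the pattern only, not the actual psud source: `load_chassis`, `get_num_psus`, and the injectable `loader` parameter are hypothetical names, and `legacy_util` stands in for whatever object the old plugin path provides.

```python
import importlib


def load_chassis(loader=importlib.import_module):
    """Try the new platform API; return (chassis, None) or (None, error)."""
    try:
        platform_mod = loader("sonic_platform.platform")
        return platform_mod.Platform().get_chassis(), None
    except Exception as exc:
        # The real daemons log an error at this point (the 'No module named
        # sonic_platform' message in syslog) and keep going.
        return None, exc


def get_num_psus(chassis, legacy_util):
    """Prefer the chassis object; fall back to the legacy plugin."""
    if chassis is not None:
        try:
            return chassis.get_num_psus()
        except NotImplementedError:
            pass  # platform has not implemented this API yet
    return legacy_util.get_num_psus()
```

With this shape, a platform that ships only the old plugin still gets a working daemon, but the failed import is logged on every start, which matches the symptom reported here.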
@ciju-juniper: The old platform plugins are deprecated, and all vendors should move to the new Please see https://github.com/Azure/SONiC/wiki/Porting-Guide |
Hi, I am using an Edgecore switch with SONiC on it, and the same problem seems to occur here too: PMON crashes. I built this image from the master branch. Is there any branch that has a fix for it?
|
Description
Seeing issues with 'pmon' container startup and the following error in the syslog. All the platform monitoring daemons are killed, and pmon also stops after a few attempts.
Initial Triage
The last good build on the master branch was build# 481; pmon crashes are seen from build# 482 onwards. Of the commits that went in between, I suspect the following one introduced the breakage:
[submodule]: update sonic-platform-daemons (#5868)
Platform details
This problem likely affects most platforms. I tested it on Juniper QFX5210 & QFX5200 platforms.
show techsupport
Here is the techsupport archive:
sonic_dump_sonic_20201120_175417.tar.gz
@vdahiya12 Could you please take a look? Please let me know if you need any further details. Also, please let me know if any platform-side changes are required after PR #5868.