Reboot pattern "...Rebooting with /sbin/kexec -e..." is missing from syslog after fast-reboot #11993

Closed
arfeigin opened this issue Sep 7, 2022 · 7 comments
Labels: MSFT, Triaged

Comments

@arfeigin
Contributor

arfeigin commented Sep 7, 2022

Description

The reboot pattern message is not consistently printed to syslog during fast-reboot. After running fast-reboot multiple times, the reboot-cause history shows that several fast-reboots were performed, but the syslog file(s) do not contain the message for each of them.

Steps to reproduce the issue:

  1. Run: sudo fast-reboot
  2. Check whether the reboot pattern is printed to the syslog file, then check the reboot-cause history and compare the two.
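
The comparison in step 2 can be scripted. Below is a minimal sketch, assuming the syslog lines have already been read into a list and the number of fast-reboots has been counted separately from the reboot-cause history. The function names are hypothetical; only the pattern string comes from this issue's title.

```python
import re

# Pattern string taken from the issue title; nothing else about the
# surrounding log line is assumed.
REBOOT_PATTERN = re.compile(r"Rebooting with /sbin/kexec -e")


def count_reboot_messages(syslog_lines):
    """Return how many syslog lines contain the kexec reboot pattern."""
    return sum(1 for line in syslog_lines if REBOOT_PATTERN.search(line))


def missing_reboot_messages(syslog_lines, fast_reboot_count):
    """Return how many fast-reboots have no matching syslog message.

    fast_reboot_count is supplied by the caller (for example, counted by
    hand from the reboot-cause history); it is not parsed here.
    """
    return max(0, fast_reboot_count - count_reboot_messages(syslog_lines))
```

A non-zero result from `missing_reboot_messages` would indicate the inconsistency described here.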

Describe the results you received:

As mentioned, the message is missing from syslog in some cases.

Describe the results you expected:

This message should be printed every time fast-reboot is performed.

Output of show version:

SONiC.202205_rc.5-e9edd6649_Internal

@zhangyanzhao added the Triaged and MSFT labels Sep 14, 2022

@dprital
Collaborator

dprital commented Oct 6, 2022

@vaibhavhd, what is your plan to fix it?

@vaibhavhd
Contributor

I missed this issue somehow. Do you have any failure logs that you can share?
I am trying to repro this too.

@dprital
Collaborator

dprital commented Oct 7, 2022

> I missed this issue somehow. Do you have any failure logs that you can share? I am trying to repro this too.

It happens occasionally. We saw that the reset happens before the kexec log message is written to the log. If we put a 2-second sleep between writing the log message and the reset, the message always gets into the log, but this would impact the scheduling, so we will not do it (it was done only for testing).

@dprital
Collaborator

dprital commented Oct 18, 2022

@vaibhavhd - Any update here?

@dprital
Collaborator

dprital commented Nov 24, 2022

@vaibhavhd - Can you please give an update on this issue?

@vaibhavhd
Contributor

Based on my analysis:
This issue is seen only on 202205 (not on 202012 or master).

The issue is caused by some services not being killed gracefully; they get into a hung state when the database container goes down.
These services then produce a flood of "Unable to read redis reply" error logs.

The identified services are hostcfgd.service, system-health.service, and caclmgrd.service.
Several thousand of the logs below are printed after all containers go down and before the device boots into the new image:

Dec  1 20:11:42.100048 str-msn2700-04 ERR hostcfgd: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249122 str-msn2700-04 ERR caclmgrd[4430]: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249452 str-msn2700-04 ERR healthd: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249824 str-msn2700-04 WARNING healthd: sel.select() did not return swsscommon.Select.OBJECT

When I add an explicit service stop for these services, the error logs no longer appear and the kexec string appears in the logs.
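
To see which daemons contribute to the error flood described above, the syslog lines can be tallied per process. This is a rough sketch, assuming the line layout matches the excerpt above (severity, then the process name with an optional PID in brackets); the function name and regex are illustrative and not part of any SONiC tooling.

```python
import re
from collections import Counter

# Matches ERR lines such as:
#   "... ERR caclmgrd[4430]: :- poll_descriptors: readData error: Unable to read redis reply"
# The layout is assumed from the excerpt in this comment.
REDIS_ERROR = re.compile(
    r"\bERR\s+(?P<daemon>[^\s:\[]+)(?:\[\d+\])?:.*Unable to read redis reply"
)


def count_redis_errors(syslog_lines):
    """Count 'Unable to read redis reply' ERR lines per daemon name."""
    counts = Counter()
    for line in syslog_lines:
        match = REDIS_ERROR.search(line)
        if match:
            counts[match.group("daemon")] += 1
    return counts
```

Running this over the window between container shutdown and the new image booting should show whether hostcfgd, caclmgrd, and healthd are the only sources of these errors.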

@vaibhavhd
Contributor

We don't need a new fix in 202205 for this issue. The real cause is that we missed cherry-picking two changes to 202205:

sonic-net/sonic-utilities#2133 and
#10510

I have added labels for cherrypicking the changes to 202205. That should fix this issue.

With these changes, shutdown is handled by systemd and follows the systemd service dependencies, which takes care of unexpected errors or hung services in the shutdown path.
For example, in the master image, these services are killed by systemd as shown below:

Dec  1 19:02:12.314404 str-msn2700-04 INFO hostcfgd: HostCfgd: signal 'SIGTERM' is caught and exiting...
Dec  1 19:02:12.366216 str-msn2700-04 NOTICE healthd: Caught SIGTERM - exiting...
Dec  1 19:02:13.311258 str-msn2700-04 INFO caclmgrd[908772]: DaemonBase: Caught SIGTERM - exiting...
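
For illustration, the graceful-exit behavior seen in these logs follows a common pattern: the daemon installs a SIGTERM handler and leaves its main loop once the signal arrives, instead of continuing to poll a database that is already gone. The sketch below is not the actual hostcfgd, healthd, or caclmgrd code; the class name and the `poll_once` callable are hypothetical stand-ins.

```python
import signal
import time


class GracefulDaemon:
    """Hypothetical daemon loop that stops polling once SIGTERM arrives."""

    def __init__(self):
        self.stop_requested = False

    def _on_sigterm(self, signum, frame):
        # Only record the request; the main loop does the actual exit.
        self.stop_requested = True

    def run(self, poll_once, interval=1.0):
        """Call poll_once until SIGTERM is received.

        Returns the number of completed iterations. Must be called from
        the main thread, since signal.signal() requires it.
        """
        signal.signal(signal.SIGTERM, self._on_sigterm)
        iterations = 0
        while not self.stop_requested:
            poll_once()
            iterations += 1
            if not self.stop_requested:
                # A signal arriving during the sleep is handled, but the
                # loop only notices it after the sleep finishes.
                time.sleep(interval)
        return iterations
```

With a handler like this in place, a `systemctl stop` (which sends SIGTERM by default) lets the process exit on its own rather than hanging in the shutdown path.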

@dprital closed this as completed Dec 13, 2022