Reboot pattern "...Rebooting with /sbin/kexec -e..." is missing from syslog after fast-reboot #11993

arfeigin · 2022-09-07T13:37:04Z

Description

The reboot pattern message is not printed to syslog consistently during fast-reboot. After running fast-reboot multiple times, the reboot-cause history shows that several fast-reboots were performed, but the message is not present in the syslog(s) for each of them.

Steps to reproduce the issue:

Run: sudo fast-reboot
Check whether the reboot pattern is printed to the syslog file, then check the reboot-cause history and compare.
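
The comparison in the second step can be scripted. Below is a minimal sketch, assuming the syslog contents are available as a string and that the message contains the substring `Rebooting with /sbin/kexec -e` from the issue title. The function names are hypothetical, and the number of fast-reboots is assumed to be read manually from the reboot-cause history.

```python
# Substring taken from the issue title; the full syslog line format is an assumption.
REBOOT_PATTERN = "Rebooting with /sbin/kexec -e"


def count_reboot_pattern(syslog_text: str) -> int:
    """Count syslog lines that contain the kexec reboot pattern."""
    return sum(1 for line in syslog_text.splitlines() if REBOOT_PATTERN in line)


def missing_reboot_messages(syslog_text: str, fast_reboot_count: int) -> int:
    """Return how many fast-reboots have no matching pattern line in syslog.

    fast_reboot_count is the number of fast-reboot entries taken from the
    reboot-cause history (hypothetical input, entered by hand).
    """
    return max(0, fast_reboot_count - count_reboot_pattern(syslog_text))
```

A non-zero result from `missing_reboot_messages` would indicate that at least one fast-reboot did not leave the pattern in the syslog text that was checked.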

Describe the results you received:

As mentioned, the message is missing from syslog in some cases.

Describe the results you expected:

This message should be printed anytime fast-reboot is performed.

Output of `show version`:

SONiC.202205_rc.5-e9edd6649_Internal

Output of `show techsupport`:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):


dprital · 2022-10-06T07:17:43Z

@vaibhavhd, what is your plan to fix this?

vaibhavhd · 2022-10-07T00:27:32Z

I missed this issue somehow. Do you have any failure logs that you can share?
I am trying to repro this too.

dprital · 2022-10-07T11:29:20Z

> I missed this issue somehow. Do you have any failure logs that you can share? I am trying to repro this too.

It happens occasionally. We saw that the reset happens before the kexec log message is written to the log. If we put a 2-second sleep between writing the log message and the reset, the message always gets into the log, but this would impact the scheduling, so we will not do it (it was done only for testing).

dprital · 2022-10-18T07:33:56Z

@vaibhavhd - Any update here?

dprital · 2022-11-24T08:16:52Z

@vaibhavhd - Can you please provide an update on this issue?

vaibhavhd · 2022-12-02T19:13:30Z

Based on my analysis:
This issue is seen only on 202205 (not on 202012 or master).

The issue is caused by some services not being killed gracefully; they end up in a hung state when the database container goes down.
As a result, these services produce a very large number of `Unable to read redis reply` error logs.

The identified services are hostcfgd.service, system-health.service, and caclmgrd.service.
Several thousand of the logs below are printed after all containers go down and before the device boots into the new image:

Dec  1 20:11:42.100048 str-msn2700-04 ERR hostcfgd: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249122 str-msn2700-04 ERR caclmgrd[4430]: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249452 str-msn2700-04 ERR healthd: :- poll_descriptors: readData error: Unable to read redis reply
Dec  1 20:11:42.249824 str-msn2700-04 WARNING healthd: sel.select() did not return swsscommon.Select.OBJECT

When I add an explicit service stop for these services, the error logs no longer appear and the kexec string appears in the logs.
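
To see which processes are producing the flood, the error lines can be tallied per process. Below is a minimal sketch, assuming syslog lines follow the layout of the excerpt above (month, day, time, hostname, severity, process name, optional `[pid]`). The regular expression and the function name are illustrative, not part of any SONiC tool.

```python
import re
from collections import Counter

# Error text copied from the log excerpt in this comment.
ERROR_TEXT = "Unable to read redis reply"

# Assumed line layout: month, day, time, hostname, severity, process[pid]:
LINE_RE = re.compile(
    r"^\S+\s+\d+\s+\S+\s+\S+\s+(?P<severity>\S+)\s+"
    r"(?P<process>[^\s:\[]+)(?:\[\d+\])?:"
)


def count_redis_errors(lines):
    """Return a Counter mapping process name to number of redis read errors."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            counts[match.group("process")] += 1
    return counts
```

Note that the process names in the logs (for example `healthd`) are not the same as the systemd unit names listed above (for example system-health.service), so the mapping between the two has to be done by hand.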

vaibhavhd · 2022-12-02T19:26:41Z

We don't need a new fix in 202205 for this issue. The real reason is that we missed cherry-picking two changes to 202205:

sonic-net/sonic-utilities#2133 and
#10510

I have added labels for cherrypicking the changes to 202205. That should fix this issue.

With these changes, the shutdown is handled by systemd and follows systemd service dependencies, which takes care of unexpected errors or hung services in the shutdown path.
For example, in the master image, these services are killed by systemd as shown below:

Dec  1 19:02:12.314404 str-msn2700-04 INFO hostcfgd: HostCfgd: signal 'SIGTERM' is caught and exiting...
Dec  1 19:02:12.366216 str-msn2700-04 NOTICE healthd: Caught SIGTERM - exiting...
Dec  1 19:02:13.311258 str-msn2700-04 INFO caclmgrd[908772]: DaemonBase: Caught SIGTERM - exiting...
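
A quick way to confirm the graceful shutdown on a given image is to check that each of the three daemons logged a SIGTERM exit line. Below is a minimal sketch, assuming the log lines look like the excerpt above; the daemon names are taken from that excerpt and the function name is hypothetical.

```python
import re

# Process names as they appear in the log excerpt above.
EXPECTED_DAEMONS = ("hostcfgd", "healthd", "caclmgrd")


def daemons_without_sigterm(lines, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that never logged a SIGTERM line."""
    seen = set()
    for line in lines:
        if "SIGTERM" not in line:
            continue
        for name in expected:
            # Match "name:" or "name[pid]:" as the process field.
            if re.search(rf"\b{re.escape(name)}(\[\d+\])?:", line):
                seen.add(name)
    return [name for name in expected if name not in seen]
```

An empty list means every expected daemon reported catching SIGTERM in the lines that were checked.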

zhangyanzhao assigned vaibhavhd Sep 14, 2022

zhangyanzhao added the Triaged and MSFT labels Sep 14, 2022

dprital closed this as completed Dec 13, 2022
