restart every devices after restart HA #528

Open
williB78 opened this issue Feb 12, 2025 · 24 comments

@williB78

If I restart HA, all the Meross devices (in my case mss305 8.0.0) turn off and on, so whatever is plugged into them gets switched off and on too.

@Zet-an

Zet-an commented Feb 12, 2025

I've noticed this also, and I have my PC plugged into a Meross plug, so that turns off without warning. One time I thought my GPU broke as it was making a terrible noise afterwards, but somehow a cable had got sucked into one of the fans during the process. :s

I hope a fix can be found as I have to remember to turn off my pc before updating or restarting HA.

Using mss305 hardware 8.0.0 firmware 8.3.15

@mikozaman

mikozaman commented Feb 12, 2025

Same issue.
And since I have my Home Assistant host connected to a Meross smart plug, it goes offline.
Using mss210 and mss305.
I removed this integration and switched to the Meross integration, and the Meross plugs no longer restart when I restart HA.

krahabb added the "todo" label (This item needs processing) on Feb 13, 2025
@krahabb
Owner

krahabb commented Feb 13, 2025

I'm sorry for this behavior but, at the moment, I have no idea what's going on, especially since the component hasn't been updated for quite a long time, so the issue might lie in the way the latest HA release initializes components.
I'll try testing the reboot process...
In the meantime, do you have any warnings or significant log entries related to meross_lan in the HA log?

@williB78
Author

williB78 commented Feb 13, 2025

Image

In the log file, the restart was at 10:27.

And yes, it has been happening since the HA update to 25.1.4.

@jonlicence

The same is happening to me. Not sure if it is a coincidence, but I updated the firmware on the Meross mss305 to version 8.3.15 this morning.
My HA core version is still on 2024.12.0, so I don't think it's the HA update, as this didn't happen before the mss305 firmware update.

@Archaiel

Archaiel commented Feb 15, 2025

I'm using Docker on HA, AU plugs - mss210 (hw 7.0.0, fw 7.3.9) and mss310 (hw 8.0.0.0, fw 6.3.23) - and it's working fine on HA 25.2.4.

Seems restricted to mss305 devices?

@Zet-an

Zet-an commented Feb 15, 2025

I didn't experience this before I updated the firmware on my mss305s, but I haven't been using this setup for long. At the time I applied the firmware update and then did a HA update, so I can't link it to one or the other, but it appears to be the firmware that's the issue.

I don't know if re-linking the plugs would be of any use or if HA will have a workaround.

@krahabb
Owner

krahabb commented Feb 16, 2025

I'll throw my point of view into the issue:
When HA/meross_lan starts it does nothing 'special' except querying the device for its general status before starting the 'regular' query poll.
So the issue might be induced by the 'general status query' done at startup.

I think the switch toggling off/on might be due to the device rebooting because of improper querying during HA startup. In general, the devices are 'resilient' to malformed queries but this is not always true. Also, the aforementioned 'general status query' is a really basic one, universally accepted by all device types (in fact the issue only arises on this special device/fw).

I'd ask if you can collect a 'diagnostic' from any of the misbehaving devices (the start of the diagnostic could trigger the toggling off/on though)

Also, after the 'toggling glitch' does the device work correctly in HA/meross_lan (i.e. state update, toggling and so on..)?

@jonlicence

jonlicence commented Feb 17, 2025

Hopefully here is the diagnostic file from one of my devices, and yes it did toggle the switch off/on.

meross_lan_diagnostic.json

Let me know if you need anything else, and I'll try to get it for you.

Once the initial poll happens and the device powers off/on everything works fine with them in HA. The power toggles work and states update correctly.

@Xaymar

Xaymar commented Feb 20, 2025

Sadly also affected by this with 31 MSS305 power outlets. All of mine already had the latest firmware installed prior to this issue occurring. The issue was not present before the 2025 series of Home Assistant updates, as I had an automation restarting my router, managed LAN switches, and RPi (Home Assistant) at 4:00. Now I need to exclude the RPi and hope the Voice Assistant stuff doesn't get stuck.

Similar to jonlicence, everything works as intended after the initial On/Off/OriginalState toggle. Since my Home Assistant is also behind one of these, it sometimes results in an endless loop of On/Off/OriginalState until I get home. As a temporary solution, I put all the important things behind battery power backups now - not efficient (+30W base load) but it offsets the sudden power loss.

Considering all of my devices are misbehaving on reboot of Home Assistant, should I grab it from all of them?

@krahabb
Owner

krahabb commented Feb 21, 2025

I don't have many tools to inspect this behavior right now, but I suspect (just speculation) that the devices don't like a special query used when coming online (I already guessed at this before...)

You could help me by trying manually querying any of the 'offended' devices by using the meross_lan.request action in the HA - Developer tools->ACTIONS UI.
Please fill in the required parameters as:

action: meross_lan.request
data:
  protocol: auto
  method: GET
  namespace: Appliance.System.Debug
  payload: "{}"
  device_id: PUT_THE_HEX_FORMATTED_DEVICE_ID_HERE

You can find the aforementioned device_id in the device configuration UI (or use the host address field instead, if you know that better).

That message (for namespace Appliance.System.Debug) is one of the 'single shot' messages used while onlining the device and it is also sent when using device diagnostics (so that the same toggling behavior might be triggered then)

If that message doesn't trigger the toggling you could try querying for another namespace : Appliance.System.All (this too is used when onlining but I'd really hope this doesn't trigger the toggling because it is a fundamental query to inspect device state...)
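For anyone who prefers scripting these test queries instead of typing them into Developer tools, here is a minimal Python sketch. It only builds the call for Home Assistant's REST service endpoint (`/api/services/<domain>/<service>`) with the same fields as the action data above. The base URL, the token and the helper names `build_action_data` / `build_rest_call` are placeholders made up for illustration, and nothing is sent unless you uncomment the last line.

```python
import json
import urllib.request

# Namespaces suggested above for manual testing.
NAMESPACES = ("Appliance.System.Debug", "Appliance.System.All")


def build_action_data(device_id: str, namespace: str) -> dict:
    """Return the data block for the meross_lan.request action."""
    return {
        "protocol": "auto",
        "method": "GET",
        "namespace": namespace,
        "payload": "{}",
        "device_id": device_id,
    }


def build_rest_call(base_url: str, token: str, data: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to HA's service endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/meross_lan/request",
        data=json.dumps(data).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    for ns in NAMESPACES:
        req = build_rest_call(
            "http://homeassistant.local:8123",  # placeholder URL
            "LONG_LIVED_ACCESS_TOKEN",  # placeholder token
            build_action_data("PUT_THE_HEX_FORMATTED_DEVICE_ID_HERE", ns),
        )
        print(req.get_method(), req.full_url, req.data.decode())
        # urllib.request.urlopen(req)  # uncomment to actually send it
```

Trying the two namespaces one at a time, with a pause in between, keeps it clear which query (if any) triggers the toggling.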

@jonlicence

Hi krahabb, I'm away from home this weekend but will definitely try this on Monday and let you know the results.

@Xaymar

Xaymar commented Feb 22, 2025

Tried both, neither of them caused the issue.

Edit: Enabling debug logs has my log file spammed with this:

Traceback (most recent call last):
  File "/config/custom_components/meross_lan/meross_device.py", line 1590, in _async_polling_callback
    await self._async_request_updates(epoch, ns_all_handler.ns.name)
  File "/config/custom_components/meross_lan/meross_device.py", line 1459, in _async_request_updates
    await handler.polling_strategy(handler, self)  # type: ignore
          ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Edit: config_entry-meross_lan-01J4Z3FFBK2WTHV76RTGK9T1TK(1).json

Trace

  • GET Appliance.System.All
  • GET Appliance.System.Ability
  • PUSH Appliance.Config.Info with {}
  • PUSH Appliance.Config.Info with {"info":{"homekit":{}}}
  • GET Appliance.Config.Info
  • GET Appliance.Config.OverTemp
  • PUSH Appliance.Config.StandbyKiller with {}
  • 3x GET Appliance.Config.StandbyKiller
  • PUSH Appliance.Control.AlertReport
  • GET Appliance.Control.ConsumptionConfig
  • GET Appliance.Control.ConsumptionH
  • GET Appliance.Control.ConsumptionX
  • GET Appliance.Control.Electricity
  • GET Appliance.Control.OverTemp
  • GET Appliance.Control.ToggleX
  • GET Appliance.System.Debug
  • GET Appliance.System.Runtime

Edit: None of the commands above appear to trigger it.

@krahabb
Owner

krahabb commented Feb 22, 2025

@Xaymar,
Thank you for checking this.
The logged error should be related to a transient condition due to usage of the 'diagnostic entities' feature, so it should do no harm in standard usage. Nevertheless, it's nice you spotted it so that I'm able to fix it.
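For readers following the traceback: the TypeError means the attribute being called was None at that moment. Below is a minimal stand-alone sketch of that failure mode and of one possible guard. The `Handler` class, `poll` and `noop_strategy` are hypothetical stand-ins, not the actual meross_lan code and not necessarily the fix that will be applied.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class Handler:
    """Hypothetical stand-in for a namespace handler."""

    def __init__(self, polling_strategy: Optional[Callable[..., Awaitable[None]]]):
        self.polling_strategy = polling_strategy


async def poll(handler: Handler, log: list) -> None:
    strategy = handler.polling_strategy
    if strategy is None:
        # Without this check, `await handler.polling_strategy(...)` raises
        # TypeError: 'NoneType' object is not callable.
        log.append("skipped")
        return
    await strategy(handler)
    log.append("polled")


async def noop_strategy(handler: Handler) -> None:
    """Placeholder strategy that does nothing."""


def demo() -> list:
    log: list = []
    asyncio.run(poll(Handler(None), log))
    asyncio.run(poll(Handler(noop_strategy), log))
    return log
```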

As for the toggling issue it leaves us with no apparent solutions at the moment...

@Xaymar

Xaymar commented Feb 22, 2025

Scanning through the diagnostic data, I found exactly one that had a different trace compared to the rest: config_entry-meross_lan-01J59XDGT8K29MBVMW2K4AKZAN.json. This one has a file 7KB bigger than the average.

Trace
  1. GETACK Appliance.System.All (no matching GET request)
  2. GETACK Appliance.System.Ability (no matching GET request)
  3. PUSH Appliance.Config.Info with {}
  4. SETACK Appliance.Control.Multiple (no matching SET/PUT/PUSH), containing responses for:
    • GET Appliance.Control.Electricity
    • GET Appliance.Control.ConsumptionH
  5. GET Appliance.Control.ConsumptionX
    • GETACK Appliance.Control.ConsumptionX
  6. PUSH Appliance.Config.Info with {}
  7. GET Appliance.Config.Info with {"info":[]}
    • GETACK Appliance.Config.Info
  8. GET Appliance.Config.OverTemp with {"overTemp":{}}
    • GETACK Appliance.Config.OverTemp
  9. PUSH Appliance.Config.StandbyKiller with {}
  10. PUSH Appliance.Config.StandbyKiller with {"config":[{"channel":0,"power":0,"time":300,"enable":2,"alert":2}]}
  11. GET Appliance.Config.StandbyKiller with {"standbyKiller": []}
    • HTTP ERROR GET Appliance.Config.StandbyKiller (messageId:9d971769b6814f4881f35e339736bcb8 ServerDisconnectedError:Server disconnected)
    • GETACK Appliance.Config.StandbyKiller
    • Handler undefined for method:GETACK namespace:Appliance.Config.StandbyKiller payload:{'config': []}?
  12. GET Appliance.Config.StandbyKiller with {"config":[{"channel":0}]}
    • GETACK Appliance.Config.StandbyKiller
    • Handler undefined for method:GETACK namespace:Appliance.Config.StandbyKiller payload:{'config': [{'channel': 0, 'power': 0, 'time': 300, 'enable': 2, 'alert': 2}]}
  13. PUSH Appliance.Control.AlertReport with {}
    • HTTP ERROR PUSH Appliance.Control.AlertReport (messageId:82505164b7fe44f8866fdb4695ff9694 ServerDisconnectedError:Server disconnected)
  14. GET Appliance.Control.AlertReport
    • HTTP ERROR GET Appliance.Control.AlertReport (messageId:b38b015ea23d46178a2dd503fe048807 ServerDisconnectedError:Server disconnected)
  15. GET Appliance.Control.ConsumptionConfig with {"config":{}}
    • GETACK Appliance.Control.ConsumptionConfig
  16. GET Appliance.Control.ConsumptionH
    • GETACK Appliance.Control.ConsumptionH
  17. GET Appliance.Control.ConsumptionX
    • GETACK Appliance.Control.ConsumptionX
  18. GET Appliance.Control.Electricity
    • GETACK Appliance.Control.Electricity
  19. GET Appliance.Control.OverTemp
    • HTTP ERROR GET Appliance.Control.OverTemp (messageId:658f2e51b0b74a40a9ca74a69c527152 ServerDisconnectedError:Server disconnected)
    • No reply to this at all.
  20. GET Appliance.Control.ToggleX with {"togglex":[]}
    • GETACK Appliance.Control.ToggleX
  21. GET Appliance.System.Debug with {"debug":{}}
    • GETACK Appliance.System.Debug
  22. GET Appliance.System.Runtime with {"runtime":{}}
    • GETACK Appliance.System.Runtime

I do not see anything immediately obvious here. It seems that the diagnostics start in the middle of work instead of at the beginning, so we might be missing the real issue entirely.
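Checking a trace by hand for requests that never got a reply (or replies with no request) is tedious, so here is a small sketch of the idea. It assumes the trace has already been reduced to ordered (method, namespace) pairs; the real diagnostic JSON layout is not reproduced here, and the function name `unmatched` is made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Entry = Tuple[str, str]  # (method, namespace)

# Which reply method answers which request method.
ACK_FOR = {"GET": "GETACK", "SET": "SETACK"}
REQUEST_FOR = {ack: req for req, ack in ACK_FOR.items()}


def unmatched(trace: Iterable[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Return (requests without a reply, replies without a request)."""
    pending: Counter = Counter()
    orphan_acks: List[Entry] = []
    for method, namespace in trace:
        if method in ACK_FOR:
            pending[(method, namespace)] += 1
        elif method in REQUEST_FOR:
            key = (REQUEST_FOR[method], namespace)
            if pending[key] > 0:
                pending[key] -= 1
            else:
                orphan_acks.append((method, namespace))
        # Other methods (e.g. PUSH) are ignored in this sketch.
    unanswered = [key for key, count in pending.items() for _ in range(count)]
    return unanswered, orphan_acks


sample = [
    ("GETACK", "Appliance.System.All"),  # reply with no request in the trace
    ("GET", "Appliance.Control.Electricity"),
    ("GETACK", "Appliance.Control.Electricity"),
    ("GET", "Appliance.Control.OverTemp"),  # never answered
]
print(unmatched(sample))
```

Orphan replies at the very top of a trace are a hint that the capture started mid-conversation, which matches the suspicion above.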

@Xaymar

Xaymar commented Feb 22, 2025

I think I found it! These chains of commands trigger the reset:

Chain 1:

  1. GET Appliance.Control.Electricity
  2. GET Appliance.Control.OverTemp

Chain 2:

  1. PUSH Appliance.Config.StandbyKiller
  2. One of:
    • GET Appliance.Config.StandbyKiller
    • PUSH Appliance.Control.AlertReport

Chain 3:

  1. Any other command that isn't Appliance.System.Debug
  2. GET Appliance.Control.OverTemp

Edit: From what it looks like based on Wi-Fi sniffing, the entire device crashes and hard resets if it encounters a malformed request.
Edit 2: And now I can't get it to happen again. Is it a buffer overflow crash of some kind?
Edit 3: Curious, if I spam one device with commands, other MSS305 devices restart.
Edit 4: Considering that it stopped happening entirely now, I think I'm being rate limited by my own devices.

@krahabb
Owner

krahabb commented Feb 23, 2025

An internal buffer overflow is a strong candidate for the issue. To overcome it, we'd need to understand what's causing it.

The sequences of commands triggering the reset don't really make sense though, since they look pretty uncorrelated.

I suspect instead that the issue arises when commands are sent back-to-back. This could explain why devices typically reset when HA/meross_lan initializes them since, at that time, there's a bit of message spamming: the initialization queues at least 3 queries (likely 4) 'almost' back to back.
Now, these queries should not overlap 'HTTP-wise' since the underlying library code (aiohttp connector) is instructed to keep only 1 connection per device at a time; moreover meross_lan in general, after sending a query, waits for the reply before proceeding. Nevertheless, they might really be forwarded to the device with almost no delay between the device reply and the start of the next query.
Also, there's an exception to the general meross_lan behavior of send query -> wait reply, and it is exactly the Appliance.System.Debug query, which is sent once at the start. This doesn't break the underlying connector behavior of serializing HTTP connections, but in the end it might throw out 2 queries 'back-to-back' with almost no delay between them, and this is where (I could guess) the device fw fails.
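To make the 'back-to-back' idea concrete, here is a minimal asyncio sketch of a sender that keeps one request in flight per device and also enforces a pause after each reply before the next query starts. This is not how meross_lan is implemented; `ThrottledSender`, `min_gap` and the fake `send` coroutine are assumptions for illustration only.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


class ThrottledSender:
    """Serialize requests to one device and keep a minimum gap between them."""

    def __init__(self, send: Callable[[str], Awaitable[str]], min_gap: float):
        self._send = send
        self._min_gap = min_gap  # seconds between a reply and the next request
        self._lock = asyncio.Lock()
        self._last_done = float("-inf")

    async def request(self, namespace: str) -> str:
        async with self._lock:  # only one request in flight
            wait = self._last_done + self._min_gap - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            try:
                return await self._send(namespace)
            finally:
                self._last_done = time.monotonic()


async def demo(min_gap: float = 0.05) -> List[float]:
    starts: List[float] = []

    async def fake_send(namespace: str) -> str:
        starts.append(time.monotonic())  # record when the "device" is hit
        return f"GETACK {namespace}"

    sender = ThrottledSender(fake_send, min_gap)
    # Fire the queries concurrently, the way a startup burst would.
    await asyncio.gather(
        sender.request("Appliance.System.All"),
        sender.request("Appliance.System.Ability"),
        sender.request("Appliance.System.Debug"),
    )
    return starts
```

Serializing connections alone does not create a gap; the explicit sleep is what would keep a 'fire and forget' query from landing right behind the previous reply.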

Now, short of patching the software to avoid this edge case, the query for the Appliance.System.Debug namespace can be disabled as a 'side effect' of setting the protocol to HTTP in the meross_lan device configuration. If this trick works then we've found how to overcome the issue and I'll proceed to patch the code to avoid this subtlety.

@Xaymar

Xaymar commented Feb 24, 2025

I found an easier way to repro the problem: Disable the "Outlet" Entity, then enable it again. After exactly 30 seconds it will toggle Off and On, just like it would after restarting Home Assistant. That should make debugging the problem much easier, since we no longer have to restart everything. Below are two json diagnostic traces:

Edit: The order of files above is a bit weird. The one without a number should be the one downloaded prior to the entity disable, and the one with (1) and (2) should be after.
Edit: Another set from the same device:
  1. Prior to disabling the Entity: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-before-disable.json
  2. After disabling the Entity: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-before-enable.json
  3. After enabling the Entity and waiting precisely 31s: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-after-30s.json
Edit: No reset on:
  • GET Appliance.System.All
  • GET Appliance.System.Ability
  • PUSH Appliance.System.Online
  • GET Appliance.Control.ConsumptionH
    • Had to guess at how this is called since it is missing from the trace. Couldn't get it completely right.
  • GET Appliance.Control.ConsumptionX
  • GET Appliance.System.Runtime
    • Original request uses a payload, but it is entirely optional.
  • PUSH Appliance.Config.Info
  • GET Appliance.Config.Info
  • GET Appliance.Config.OverTemp
  • PUSH Appliance.Config.NtpSite
  • SET Appliance.Config.NtpSite
    • Only causes an error to be logged as there's no handler for SET Appliance.Config.NtpSite
  • GET Appliance.Config.NtpSite
    • This request takes a while to complete and times out sometimes. On average it takes 23s to get a response.
  • GET Appliance.System.Debug
  • GET Appliance.System.Runtime
  • GET Appliance.Control.OverTemp
    • Takes on average 5s to get a response.
  • PUSH Appliance.Control.ChangeWiFi ()
    • Doesn't actually do anything? No change in WiFi traffic.
  • PUSH Appliance.Control.ChangeWiFi (changeWiFi)
    • Doesn't actually do anything? No change in WiFi traffic.
  • GET Appliance.Control.ConsumptionX
  • GET Appliance.Control.ConsumptionH
  • GET Appliance.Control.Electricity
  • GET Appliance.Control.ConsumptionConfig
  • GET Appliance.Control.ToggleX
  • PUSH Appliance.Control.StandbyKiller
  • GET Appliance.Control.StandbyKiller (standbyKiller)
    • Response time averages 10s, potential invalid request for MSS305.
  • GET Appliance.Control.StandbyKiller (config)
    • Instant response with data.
  • GET Appliance.Control.StandbyKiller (config, channel 0)
    • Instant response with data.
  • PUSH Appliance.Control.AlertReport
  • GET Appliance.Control.AlertReport (alertReport)
    • Response time averages 10s, potential invalid request for MSS305.

Trace ends here. I followed the chain of requests as best as I could, but as I don't have the same speed as a machine, I don't seem to be triggering the buffer overflow. Either that or the trace is missing requests to the entity itself, which would be pretty bad for us trying to find out what's going on.

@Xaymar

Xaymar commented Feb 26, 2025

I'm pretty much stuck. I tried running a script that emits these events in their original order as fast as possible, but it did not trigger the issue. There is definitely data missing from the trace, but unfortunately I do not have the capacity to intercept encrypted Wi-Fi traffic fully to figure out what is missing.

@krahabb
Owner

krahabb commented Mar 1, 2025

I've just published a 'fresh' release (v5.4.2) fixing some compatibility issues against the incoming HA core 2025.3, and I've also added a specific feature so we can experiment a bit with this issue.

In the device configuration there's now an option to 'disable multiple requests'. With the option left disabled, the software works as in previous versions, but you can try activating it and see if this mitigates the issue.

@Xaymar

Xaymar commented Mar 3, 2025

Turning on the "Disable multiple request packing" seems to address the issue. I was able to restart Home Assistant w/o entering an infinite loop of Home Assistant restarting.

@krahabb
Owner

krahabb commented Mar 3, 2025

Ok, got it.
In my opinion, the device supports the 'multiple packing' but when the software starts it has no knowledge of the maximum buffer length, so meross_lan starts with a reasonable limit and reduces (or increases) the estimated maximum buffer size according to successful or failing replies.
Usually a request for an excessive buffer size doesn't hurt (it just fails) and this is leveraged to reduce the next request a bit.
That's why, after the initial 'crash' (and meross_lan reducing the next packing limits), the software kept working even with multiple requests.

I'll see if it's possible to set a better initial estimated buffer size by reducing it for these devices, so that the multiple requests never hit the maximum limit, thus avoiding the device reboot at start.
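The adaptive estimate described above can be sketched roughly like this. It is a toy model under stated assumptions: the class name `PackingEstimator`, the starting size, the bounds, the shrink/grow factors and the per-request reply sizes are all invented for illustration and are not the values or the code meross_lan uses.

```python
from typing import Dict, List


class PackingEstimator:
    """Toy model of an adaptive limit for packed (multiple) requests."""

    def __init__(self, initial: int = 2000, floor: int = 500, ceiling: int = 8000):
        self.limit = initial  # estimated max reply size in bytes (hypothetical)
        self.floor = floor
        self.ceiling = ceiling

    def on_failure(self) -> None:
        # The device failed to answer a packed request: assume it was too big.
        self.limit = max(self.floor, int(self.limit * 0.75))

    def on_success(self) -> None:
        # The device answered: cautiously probe a slightly larger size.
        self.limit = min(self.ceiling, int(self.limit * 1.1))

    def pack(self, requests: List[Dict], expected_sizes: List[int]) -> List[List[Dict]]:
        """Group requests so each batch's expected reply stays under the limit."""
        batches: List[List[Dict]] = []
        current: List[Dict] = []
        used = 0
        for request, size in zip(requests, expected_sizes):
            if current and used + size > self.limit:
                batches.append(current)
                current, used = [], 0
            current.append(request)  # an oversized request still goes out alone
            used += size
        if current:
            batches.append(current)
        return batches


requests = [
    {"namespace": "Appliance.Control.Electricity"},
    {"namespace": "Appliance.Control.ConsumptionH"},
    {"namespace": "Appliance.Control.ToggleX"},
]
sizes = [300, 1500, 400]  # made-up expected reply sizes

estimator = PackingEstimator(initial=2000)
print(len(estimator.pack(requests, sizes)))  # batches with the initial estimate
estimator.on_failure()
print(estimator.limit, len(estimator.pack(requests, sizes)))  # after one failure
```

In this model, a lower `initial` value for a given device type means the very first packed request is already small, which is the kind of change proposed above.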

@jvanderweken

Incorrect timestamp: 1741548706 seconds behind HA (174154870 on average)

That's the error in the log, and all the mss305 plugs restart after a reboot of HA.

@krahabb
Owner

krahabb commented Mar 9, 2025

@jvanderweken,
With the new v5.4.2 you can try the 'disable multiple requests' option and see if the restarting behavior gets mitigated.
As for the incorrect timestamp, that's it: the device doesn't have proper internal time, and this is just a 'warning'. It usually comes from the device being unable to query NTP servers to synchronize its time.
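As a rough illustration of why that warning shows such a huge number: if the device clock never got set, it sits near the Unix epoch, so the offset is roughly the current epoch time itself. A small sketch follows; the function name `clock_offset` is made up, and a device time of 0 is an assumption about this particular case.

```python
from datetime import datetime, timezone


def clock_offset(ha_epoch: float, device_epoch: float) -> float:
    """Seconds the device clock is behind (positive) or ahead (negative) of HA."""
    return ha_epoch - device_epoch


# Value taken from the log line quoted above.
reported_offset = 1741548706

# If the device clock is stuck at 0 (assumption), HA's clock at that moment
# equals the reported offset, so it can be read as a date.
ha_time = datetime.fromtimestamp(reported_offset, tz=timezone.utc)
print(ha_time.isoformat())
print(clock_offset(ha_epoch=reported_offset, device_epoch=0))
```

The resulting date falls on the day the log line was posted, which fits a device that never reached an NTP server.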
