restart every devices after restart HA #528

williB78 · 2025-02-12T12:09:43Z

If I restart HA, all the Meross devices (in my case mss305 8.0.0) turn off and on, so whatever is plugged into them loses power and comes back on.

Zet-an · 2025-02-12T15:50:39Z

I've noticed this too. I have my PC plugged into a Meross plug, so it turns off without warning. One time I thought my GPU had broken because it was making a terrible noise afterwards, but somehow a cable had been sucked into one of the fans during the process. :s

I hope a fix can be found as I have to remember to turn off my pc before updating or restarting HA.

Using mss305 hardware 8.0.0 firmware 8.3.15

mikozaman · 2025-02-12T20:59:06Z

Same issue.
And since my Home Assistant host is connected to a Meross smart plug, it goes offline.
Using mss210 and mss305.
I removed this integration and changed to the Meross integration, and the Meross plugs no longer restart when I restart HA.

krahabb · 2025-02-13T08:53:04Z

I'm sorry for this behavior but, at the moment, I have no idea what's going on, especially since the component has not been updated for quite a long time, so the issue might lie in the way the latest HA release initializes components.
I'll try testing the reboot process...
In the meantime, do you have any warnings or significant log entries related to meross_lan in the HA log?

williB78 · 2025-02-13T09:07:45Z

In the log file, I restarted at 10:27.

And yes, it has been happening since the HA update to 25.1.4.

jonlicence · 2025-02-14T12:35:06Z

The same is happening to me. Not sure if it is a coincidence, but I updated the firmware on the Meross mss305 to version 8.3.15 this morning.
My HA core version is still on 2024.12.0, so I don't think it's the HA update, as this didn't happen before the mss305 firmware update.

Archaiel · 2025-02-15T05:53:18Z

I'm using Docker on HA, AU plugs - mss210 (hw 7.0.0, fw 7.3.9) and mss310 (hw 8.0.0.0 and fw 6.3.23) and it's working fine - HA 25.2.4.

Seems restricted to mss305 devices?

Zet-an · 2025-02-15T12:35:28Z

I didn't experience this before I updated the firmware on my mss305s, but I haven't been using this setup for long. At the time I applied the firmware update and then did an HA update, so I can't link it to one or the other, but it appears to be the firmware that's the issue.

I don't know if re-linking the plugs would be of any use or if HA will have a workaround.

krahabb · 2025-02-16T14:22:07Z

I'll throw my point of view into the issue:
When HA/meross_lan starts it does nothing 'special' except querying the device for its general status before starting the 'regular' query poll.
So the issue might be induced by the 'general status query' done at startup.

I think the switch toggling off/on might be due to the device rebooting because of improper querying during HA startup. In general, the devices are 'resilient' to malformed queries but this is not always true. Also, the aforementioned 'general status query' is a really basic one, universally accepted by all device types (in fact the issue only arises on this particular device/fw).

I'd ask if you can collect a 'diagnostic' from any of the misbehaving devices (the start of the diagnostic could trigger the toggling off/on though)

Also, after the 'toggling glitch' does the device work correctly in HA/meross_lan (i.e. state update, toggling and so on..)?

jonlicence · 2025-02-17T08:21:30Z

Hopefully here is the diagnostic file from one of my devices, and yes it did toggle the switch off/on.

meross_lan_diagnostic.json

Let me know if you need anything else, and I'll try to get it for you.

Once the initial poll happens and the device powers off/on everything works fine with them in HA. The power toggles work and states update correctly.

Xaymar · 2025-02-20T19:26:25Z

Sadly also affected by this with 31 MSS305 power outlets. All of mine already had the latest firmware installed prior to this issue occurring. The issue was not present before the 2025 series of Home Assistant updates, as I had an automation restarting my router, managed LAN switches, and RPi (Home Assistant) at 4:00. Now I need to exclude the RPi and hope the Voice Assistant stuff doesn't get stuck.

Similar to jonlicence, everything works as intended after the initial On/Off/OriginalState toggle. Since my Home Assistant is also behind one of these, it sometimes results in an endless loop of On/Off/OriginalState until I get home. As a temporary solution, I put all the important things behind battery power backups now - not efficient (+30W base load) but it offsets the sudden power loss.

Considering all of my devices are misbehaving on reboot of Home Assistant, should I grab it from all of them?

krahabb · 2025-02-21T20:30:29Z

I don't have many tools to inspect this behavior right now, but I suspect (just speculation) that the devices don't like a special query used when coming online (as I already guessed before).

You could help me by trying manually querying any of the 'offended' devices by using the meross_lan.request action in the HA - Developer tools->ACTIONS UI.
Please fill in the required parameters as:

action: meross_lan.request
data:
  protocol: auto
  method: GET
  namespace: Appliance.System.Debug
  payload: "{}"
  device_id: PUT_THE_HEX_FORMATTED_DEVICE_ID_HERE

You can find the aforementioned device_id in the device configuration UI (or use the host address field instead, if you know that better).

That message (for namespace Appliance.System.Debug) is one of the 'single shot' messages used while onlining the device and it is also sent when using device diagnostics (so that the same toggling behavior might be triggered then)

If that message doesn't trigger the toggling you could try querying for another namespace : Appliance.System.All (this too is used when onlining but I'd really hope this doesn't trigger the toggling because it is a fundamental query to inspect device state...)

jonlicence · 2025-02-21T20:43:17Z

Hi krahabb, I'm away from home this weekend but will definitely try this on Monday and let you know the results.

Xaymar · 2025-02-22T07:19:43Z

Tried both, neither of them caused the issue.

Edit: Enabling debug logs has my log file spammed with this:

Traceback (most recent call last):
  File "/config/custom_components/meross_lan/meross_device.py", line 1590, in _async_polling_callback
    await self._async_request_updates(epoch, ns_all_handler.ns.name)
  File "/config/custom_components/meross_lan/meross_device.py", line 1459, in _async_request_updates
    await handler.polling_strategy(handler, self)  # type: ignore
          ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Edit: config_entry-meross_lan-01J4Z3FFBK2WTHV76RTGK9T1TK(1).json

Trace

GET Appliance.System.All
GET Appliance.System.Ability
PUSH Appliance.Config.Info with {}
PUSH Appliance.Config.Info with {"info":{"homekit":{}}}
GET Appliance.Config.Info
GET Appliance.Config.OverTemp
PUSH Appliance.Config.StandbyKiller with {}
3x GET Appliance.Config.StandbyKiller
PUSH Appliance.Control.AlertReport
GET Appliance.Control.ConsumptionConfig
GET Appliance.Control.ConsumptionH
GET Appliance.Control.ConsumptionX
GET Appliance.Control.Electricity
GET Appliance.Control.OverTemp
GET Appliance.Control.ToggleX
GET Appliance.System.Debug
GET Appliance.System.Runtime

Edit: None of the commands above appear to trigger it.

krahabb · 2025-02-22T10:28:28Z

@Xaymar,
Thank you for checking this.
The logged error should be related to a transient condition due to usage of the 'diagnostic entities' feature, so it should do no harm in standard usage. Nevertheless, it's good that you spotted it so that I'm able to fix it.

As for the toggling issue it leaves us with no apparent solutions at the moment...

Xaymar · 2025-02-22T19:28:38Z

Scanning through the diagnostic data, I found exactly one that had a different trace compared to the rest: config_entry-meross_lan-01J59XDGT8K29MBVMW2K4AKZAN.json. This one has a file 7KB bigger than the average.

Trace

GETACK Appliance.System.All (no matching GET request)
GETACK Appliance.System.Ability (no matching GET request)
PUSH Appliance.Config.Info with {}
SETACK Appliance.Control.Multiple (no matching SET/PUT/PUSH), containing responses for:
- GET Appliance.Control.Electricity
- GET Appliance.Control.ConsumptionH
GET Appliance.Control.ConsumptionX
- GETACK Appliance.Control.ConsumptionX
PUSH Appliance.Config.Info with {}
GET Appliance.Config.Info with {"info":[]}
- GETACK Appliance.Config.Info
GET Appliance.Config.OverTemp with {"overTemp":{}}
- GETACK Appliance.Config.OverTemp
PUSH Appliance.Config.StandbyKiller with {}
PUSH Appliance.Config.StandbyKiller with {"config":[{"channel":0,"power":0,"time":300,"enable":2,"alert":2}]}
GET Appliance.Config.StandbyKiller with {"standbyKiller": []}
- HTTP ERROR GET Appliance.Config.StandbyKiller (messageId:9d971769b6814f4881f35e339736bcb8 ServerDisconnectedError:Server disconnected)
- GETACK Appliance.Config.StandbyKiller
- Handler undefined for method:GETACK namespace:Appliance.Config.StandbyKiller payload:{'config': []}?
GET Appliance.Config.StandbyKiller with {"config":[{"channel":0}]}
- GETACK Appliance.Config.StandbyKiller
- Handler undefined for method:GETACK namespace:Appliance.Config.StandbyKiller payload:{'config': [{'channel': 0, 'power': 0, 'time': 300, 'enable': 2, 'alert': 2}]}
PUSH Appliance.Control.AlertReport with {}
- HTTP ERROR PUSH Appliance.Control.AlertReport (messageId:82505164b7fe44f8866fdb4695ff9694 ServerDisconnectedError:Server disconnected)
GET Appliance.Control.AlertReport
- HTTP ERROR GET Appliance.Control.AlertReport (messageId:b38b015ea23d46178a2dd503fe048807 ServerDisconnectedError:Server disconnected)
GET Appliance.Control.ConsumptionConfig with {"config":{}}
- GETACK Appliance.Control.ConsumptionConfig
GET Appliance.Control.ConsumptionH
- GETACK Appliance.Control.ConsumptionH
GET Appliance.Control.ConsumptionX
- GETACK Appliance.Control.ConsumptionX
GET Appliance.Control.Electricity
- GETACK Appliance.Control.Electricity
GET Appliance.Control.OverTemp
- HTTP ERROR GET Appliance.Control.OverTemp (messageId:658f2e51b0b74a40a9ca74a69c527152 ServerDisconnectedError:Server disconnected)
- No reply to this at all.
GET Appliance.Control.ToggleX with {"togglex":[]}
- GETACK Appliance.Control.ToggleX
GET Appliance.System.Debug with {"debug":{}}
- GETACK Appliance.System.Debug
GET Appliance.System.Runtime with {"runtime":{}}
- GETACK Appliance.System.Runtime

I do not see anything immediately obvious here. It seems that the diagnostics start in the middle of work instead of at the beginning, so we might be missing the real issue entirely.
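
One way to make gaps like the ones in the trace above easier to spot is to pair each request with its ACK mechanically. The following is only a rough sketch: the `trace` list is a hypothetical, hand-transcribed sample (not an export format of meross_lan), and the helper name `unmatched` is made up for illustration.

```python
from collections import Counter

# Hypothetical, hand-transcribed trace entries as (method, namespace) pairs.
# Illustrative values only; this is not a format produced by meross_lan.
trace = [
    ("GETACK", "Appliance.System.All"),
    ("GET", "Appliance.Control.ConsumptionX"),
    ("GETACK", "Appliance.Control.ConsumptionX"),
    ("GET", "Appliance.Control.OverTemp"),
    ("GET", "Appliance.Control.ToggleX"),
    ("GETACK", "Appliance.Control.ToggleX"),
]


def unmatched(entries):
    """Return (requests without a reply, replies without a request)."""
    pending = Counter()
    orphan_acks = []
    for method, namespace in entries:
        if method.endswith("ACK"):
            request = method[:-3]  # "GETACK" -> "GET"
            if pending[(request, namespace)] > 0:
                pending[(request, namespace)] -= 1
            else:
                orphan_acks.append((method, namespace))
        else:
            pending[(method, namespace)] += 1
    # Expand the remaining counts into one entry per unanswered request.
    unanswered = [key for key, count in pending.items() for _ in range(count)]
    return unanswered, orphan_acks


unanswered, orphan_acks = unmatched(trace)
print("unanswered:", unanswered)
print("orphan ACKs:", orphan_acks)
```

An ACK with no request in front of it is consistent with the diagnostic having started mid-conversation, while a request with no ACK marks where the device stopped answering.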

Xaymar · 2025-02-22T19:34:31Z

I think I found it! These chains of commands trigger the reset:

Chain 1:

GET Appliance.Control.Electricity
GET Appliance.Control.OverTemp

Chain 2:

PUSH Appliance.Config.StandbyKiller
One of:
- GET Appliance.Config.StandbyKiller
- PUSH Appliance.Control.AlertReport

Chain 3:

Any other command that isn't Appliance.System.Debug
GET Appliance.Control.OverTemp

Edit: From what it looks like based on Wi-Fi sniffing, the entire device crashes and hard resets if it encounters a malformed request.
Edit 2: And now I can't get it to happen again. Is it a buffer overflow crash of some kind?
Edit 3: Curious, if I spam the same device with commands, other MSS305 devices restart.
Edit 4: Considering that it stopped happening entirely now, I think I'm being rate limited by my own devices.

krahabb · 2025-02-23T18:51:16Z

An internal buffer overflow is a strong candidate for the issue. To overcome it, we'd need to understand what's causing it.

The sequences of commands triggering the reset don't really make sense though, since they look pretty uncorrelated.

I suspect instead that the issue arises when commands are sent back-to-back. This could explain why devices typically reset when HA/meross_lan initializes them since, at that time, there's a bit of message spamming: the initialization queues out at least 3 queries (likely 4) 'almost' back to back.
Now, these queries should not be overlapping 'HTTP-wise', since the underlying library code (aiohttp connector) is instructed to keep only 1 single connection per device at a time. Moreover, meross_lan in general, after sending a query, waits for the reply before proceeding. Nevertheless, they might really be forwarded to the device with almost no delay between the device reply and the start of the next query.
Also, there's an exception to the general meross_lan behavior of send query -> wait reply, and this is exactly for the Appliance.System.Debug query, which is sent once at the start. This, in turn, doesn't break the underlying connector behavior of serializing HTTP connections, but in the end it might throw out 2 queries 'back-to-back' with almost no delay between them, and this is where (I could guess) the device fw fails.
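
If the back-to-back hypothesis holds, one possible mitigation is to enforce a small idle gap after every reply. The sketch below is not the meross_lan implementation: `PacedSender`, `fake_send` and the 50 ms gap are assumptions made up for illustration, and whether any gap actually avoids the reset is untested.

```python
import asyncio
import time


class PacedSender:
    """Serializes requests and enforces a minimum idle gap between them."""

    def __init__(self, send, min_gap):
        self._send = send        # hypothetical coroutine: send(namespace) -> reply
        self._min_gap = min_gap  # seconds of idle time required after each reply
        self._lock = asyncio.Lock()
        self._last_done = float("-inf")

    async def request(self, namespace):
        async with self._lock:  # only one request in flight at a time
            wait = self._last_done + self._min_gap - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            try:
                return await self._send(namespace)
            finally:
                self._last_done = time.monotonic()


async def demo():
    starts = []

    async def fake_send(namespace):
        # Stand-in for the real transport: records when each request starts.
        starts.append((namespace, time.monotonic()))
        await asyncio.sleep(0.01)
        return {"namespace": namespace}

    sender = PacedSender(fake_send, min_gap=0.05)
    # Both requests are issued at once, but the sender spaces them out.
    await asyncio.gather(
        sender.request("Appliance.System.All"),
        sender.request("Appliance.System.Debug"),
    )
    return starts


starts = asyncio.run(demo())
print(f"gap between request starts: {starts[1][1] - starts[0][1]:.3f}s")
```

The lock covers the 'fire and forget' case as well: even a query that nobody awaits individually has to queue behind the previous reply plus the gap.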

Now, except patching the software to avoid this edge case, the query for the Appliance.System.Debug namespace could be disabled as a 'side effect' of setting the protocol to HTTP in meross_lan device configuration. If this trick works then we've found how to overcome the issue and I'll proceed to patch the code to avoid this subtlety.

Xaymar · 2025-02-24T09:21:53Z

I found an easier way to repro the problem: Disable the "Outlet" Entity, then enable it again. After exactly 30 seconds it will toggle Off and On, just like it would after restarting Home Assistant. That should make debugging the problem much easier, since we no longer have to restart everything. Below are two json diagnostic traces:

Edit: The order of files above is a bit weird. The one without a number should be the one downloaded prior to the entity disable, and the one with (1) and (2) should be after.

Edit: Another set from the same device:

Prior to disabling the Entity: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-before-disable.json
After disabling the Entity: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-before-enable.json
After enabling the Entity and waiting precisely 31s: config_entry-meross_lan-01J59XE2DCF9MZP0D77J6F3BBH-after-30s.json

Edit: No reset on:

GET Appliance.System.All
GET Appliance.System.Ability
PUSH Appliance.System.Online
GET Appliance.Control.ConsumptionH
- Had to guess at how this is called since it is missing from the trace. Couldn't get it completely right.
GET Appliance.Control.ConsumptionX
GET Appliance.System.Runtime
- Original request uses a payload, but it is entirely optional.
PUSH Appliance.Config.Info
GET Appliance.Config.Info
GET Appliance.Config.OverTemp
PUSH Appliance.Config.NtpSite
SET Appliance.Config.NtpSite
- Only causes an error to be logged as there's no handler for SET Appliance.Config.NtpSite
GET Appliance.Config.NtpSite
- This request takes a while to complete and times out sometimes. On average it takes 23s to get a response.
GET Appliance.System.Debug
GET Appliance.System.Runtime
GET Appliance.Control.OverTemp
- Takes on average 5s to get a response.
PUSH Appliance.Control.ChangeWiFi ()
- Doesn't actually do anything? No change in WiFi traffic.
PUSH Appliance.Control.ChangeWiFi (changeWiFi)
- Doesn't actually do anything? No change in WiFi traffic.
GET Appliance.Control.ConsumptionX
GET Appliance.Control.ConsumptionH
GET Appliance.Control.Electricity
GET Appliance.Control.ConsumptionConfig
GET Appliance.Control.ToggleX
PUSH Appliance.Control.StandbyKiller
GET Appliance.Control.StandbyKiller (standbyKiller)
- Response time averages 10s, potential invalid request for MSS305.
GET Appliance.Control.StandbyKiller (config)
- Instant response with data.
GET Appliance.Control.StandbyKiller (config, channel 0)
- Instant response with data.
PUSH Appliance.Control.AlertReport
GET Appliance.Control.AlertReport (alertReport)
- Response time averages 10s, potential invalid request for MSS305.

Trace ends here. I followed the chain of requests as best as I could, but as I don't have the same speed as a machine, I don't seem to be triggering the buffer overflow. Either that or the trace is missing requests to the entity itself, which would be pretty bad for us trying to find out what's going on.

Xaymar · 2025-02-26T13:28:16Z

I'm pretty much stuck. I tried running a script that emits these events in their original order as fast as possible, but it did not trigger the issue. There is definitely data missing from the trace, but unfortunately I do not have the capacity to intercept encrypted Wi-Fi traffic fully to figure out what is missing.

krahabb · 2025-03-01T21:56:55Z

I've just published a 'fresh' release (v5.4.2) fixing some compatibility issues with the incoming HA core 2025.3, but I've also added a specific feature in order to 'experiment' a bit with this issue.

In the device configuration there's now an option to 'disable multiple requests'. With the option left disabled, the software works as in previous versions, but you can try activating it and see if this mitigates the issue.

Xaymar · 2025-03-03T10:52:23Z

Turning on the "Disable multiple request packing" seems to address the issue. I was able to restart Home Assistant w/o entering an infinite loop of Home Assistant restarting.

krahabb · 2025-03-03T11:26:20Z

Ok, got it.
In my opinion, the device supports 'multiple packing', but when the software starts it has no knowledge of the maximum buffer length, so meross_lan starts with a reasonable limit and reduces (or increases) the estimated maximum buffer size according to successful or failing replies.
Usually a request for an excessive buffer size doesn't hurt (it just fails) and this is leveraged to reduce the next request a bit.
That's why, after the initial 'crash' (and meross_lan consequently reducing the next packing limits), the software kept working even with multiple requests.

I'll see if it's possible to better set the initial estimated buffer size by reducing it for these devices, so that the multiple requests never hit the maximum limit, thus avoiding the device reboot at start.
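
To illustrate the idea described above, here is a minimal sketch of an adaptive size limit combined with greedy packing. It is not the actual meross_lan code: the class, the function, the per-namespace size estimates and all the numbers are invented for the example.

```python
class ResponseSizeEstimator:
    """Tracks a guessed maximum reply size and adapts it to outcomes."""

    def __init__(self, initial, floor, ceiling, step):
        self.limit = initial    # current guess, in bytes (hypothetical values)
        self.floor = floor
        self.ceiling = ceiling
        self.step = step

    def on_success(self):
        # A packed reply fit: cautiously raise the guess.
        self.limit = min(self.ceiling, self.limit + self.step)

    def on_failure(self):
        # A packed request failed: halve the guess, but not below the floor.
        self.limit = max(self.floor, self.limit // 2)


def pack(estimates, limit):
    """Greedily group namespaces so each batch's estimated reply fits the limit."""
    batches, current, used = [], [], 0
    for namespace, size in estimates.items():
        if current and used + size > limit:
            batches.append(current)
            current, used = [], 0
        current.append(namespace)
        used += size
    if current:
        batches.append(current)
    return batches


# Made-up reply size estimates, in bytes.
estimates = {
    "Appliance.Control.Electricity": 400,
    "Appliance.Control.ConsumptionH": 900,
    "Appliance.Control.ToggleX": 100,
}
estimator = ResponseSizeEstimator(initial=2000, floor=500, ceiling=4000, step=250)
first = pack(estimates, estimator.limit)
estimator.on_failure()
second = pack(estimates, estimator.limit)
print(first)   # one batch holding all three namespaces
print(second)  # two batches once the limit has dropped to 1000
```

With a scheme like this the first batch is the risky one, because it is sized from the initial guess alone. That matters on a device that appears to reboot on an oversized request instead of simply failing it, which is why a lower starting value for these devices looks like the natural fix.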

jvanderweken · 2025-03-09T19:38:01Z

Incorrect timestamp: 1741548706 seconds behind HA (174154870 on average)

That's the error in the log, and all the mss305 restart after a reboot of HA.

krahabb · 2025-03-09T20:15:49Z

@jvanderweken,
With the new v5.4.2 you can try the 'disable multiple requests' option and see if the restarting behavior gets mitigated.
As for the incorrect timestamp, that's just it: the device doesn't have a proper internal time, and this is only a 'warning'. It usually comes from the device being unable to query NTP servers to synchronize its clock.
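
As a quick sanity check of that reading, the offset from the warning quoted above can be interpreted as a Unix timestamp. This is only an interpretation of the log line, not something taken from the meross_lan code.

```python
from datetime import datetime, timezone

# Offset copied from the warning: "1741548706 seconds behind HA".
offset_seconds = 1741548706

# Read as seconds since the Unix epoch, the offset is a date in its own right.
as_date = datetime.fromtimestamp(offset_seconds, tz=timezone.utc)
print(as_date.isoformat())  # 2025-03-09T19:31:46+00:00
```

The result falls a few minutes before the report was posted, which would fit a device clock sitting near zero (the Unix epoch), i.e. one that has never been synchronized.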

krahabb added the todo This item needs processing label Feb 13, 2025

krahabb added a commit that referenced this issue Mar 1, 2025

fix async behavior while polling namespaces (#528)

ed5d4c8

krahabb added a commit that referenced this issue Mar 1, 2025

add configurable option to enable/disable multiple requests (trying mitigate #528)

7a965ec

krahabb mentioned this issue Mar 1, 2025

Moonlight.4.2 #532

Merged

krahabb added a commit that referenced this issue Mar 4, 2025

set a more meaningful device_response_size_max for multiple requests (trying fix #528)

2aa19dd
