
Observation reset #214

Merged
lwis merged 4 commits into home-assistant-libs:aio-observation-stability from alex3305:observation-reset
Feb 15, 2019

Conversation

@alex3305
Contributor

This commit resets the observations when they hit a timeout and also prevents duplicate state changes. Observations are also cancelled whenever possible instead of erroring out.

  1. Whenever the light or socket is already in a specific state (e.g. the light is on and dimmed to 20%) and a change is requested, a filter is applied and only the changed states are sent to the device.
  2. Observations are reset whenever no state change has been received for a limited amount of time. This depends on the first change, because Tradfri does not send a success callback when no change was applied.
  3. Cancelling observations instead of erroring out produces less error logging. Some error logging remains, but that comes from the aiocoap library.
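For illustration, the first change amounts to a state diff before sending. This is a minimal sketch with hypothetical names, not the actual pytradfri code: only the key/value pairs that differ from the device's known state are forwarded.

```python
def filter_changed_state(current_state: dict, requested_state: dict) -> dict:
    """Return only the key/value pairs that differ from the current state.

    Hypothetical helper illustrating change 1; not the real pytradfri API.
    """
    return {
        key: value
        for key, value in requested_state.items()
        if current_state.get(key) != value
    }


# The light is already on and dimmed to 20% (dimmer 51 of 254); only the
# changed brightness needs to be sent to the gateway.
current = {"state": 1, "dimmer": 51}
requested = {"state": 1, "dimmer": 254}
print(filter_changed_state(current, requested))  # {'dimmer': 254}
```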

This was tested for more than a week on my own production installation, and the first fix makes the pytradfri library about four times more stable for me. The second change resets the observations when something bad happens and works wonderfully for me.

There is still something odd happening with observation cancellation, which must be errored out instead of cancelled, because otherwise no new observation will be launched by Home Assistant. This is because the aiocoap library does not report when a cancellation happened, but it does on errors.

Some others and I have tested these changes from https://github.com/alex3305/hass-components-debug and reception has generally been positive.

@ggravlingen
Member

Thank you for your continued effort on this issue!

@lwis
Collaborator

lwis commented Dec 15, 2018

Thanks for the PR, and for continuing to look at stability.

I'm concerned about the second point, and the time you've set - 15 seconds? Do all observations receive a payload within that window? Looking at this PR I fear it may be more appropriate to switch problematic installations to polling rather than aggressively closing sockets.

@alex3305
Contributor Author

@lwis Let me try and ease a couple of your concerns.

I'm concerned about the second point, and the time you've set - 15 seconds? Do all observations receive a payload within that window?

I've spent almost two weeks fine-tuning this specific timeout and found that most (90%+) of the requests receive an observation callback within a second of the request. The others receive a callback between 1 and 5 seconds after firing the request. So for me, 15 seconds was more than enough.

I've also tested it with 35 seconds (5 seconds sleep, 30 seconds timeout) and that works too. But when the Tradfri observations fail, the time it takes to reach the user is quite long IMHO. This is still something we could tune, for sure.

Looking at this PR I fear it may be more appropriate to switch problematic installations to polling rather than aggressively closing sockets.

Observations work great though, so that would be a shame. The connection is now lost for me only once every 4 to 5 days. That's much better than before and gives me hope that this could be a solution.

Furthermore, I suspect that the IKEA Tradfri app works the same way. I found that the app activity on Android times itself out, forcing it to re-open observations. It seems they are working in a similar manner. That's one of the reasons I've chosen this approach.

Also, the aiocoap library seems to be closing connections properly whenever possible. I've had zero issues so far with my Tradfri hub.

Do you have any (additional) concerns or changes that I could make to get this PR merged?

@alex3305
Contributor Author

@lwis did you have any time to review this?

@ggravlingen can you give your opinion on this PR?

@ggravlingen
Member

Thank you @alex3305 for addressing the concerns of @lwis. Given the previously closed PRs and your further analysis of the questions that have arisen, it seems there has been comprehensive testing. Albeit only by one user, but that is not unique to this PR. Pushing this to Home Assistant's beta branch will quite quickly give us more test user data, while also leaving room to pull the update back.

That said, it's @lwis who's done most of the work on the API and knows what's going on under the hood, so let's hear his views.

Given his approval, please also make sure the tests pass.

This slightly modifies the way a command is sent to the backend API. I think
this is a more concise way, because now the `set_values` function in
`command.py` always returns a value. This works the same as the other
solution.

This also fixes the unit tests.
@alex3305
Contributor Author

@ggravlingen I didn't notice the tests failing. I've fixed that.

I know my previous PRs weren't the best regarding this solution; initially I was looking for a more pragmatic approach. Since then I've moved on and searched for a good, stable solution, coming up with this one.

I would love to have this integrated into Home Assistant. I had multiple users test this within my own repo and the issue on Home Assistant. See also:

@coveralls

Coverage Status

Coverage decreased (-1.6%) to 77.384% when pulling b099b7a on alex3305:observation-reset into 685027b on ggravlingen:master.

@coveralls

coveralls commented Dec 17, 2018

Coverage Status

Coverage decreased (-1.8%) to 77.233% when pulling 17638af on alex3305:observation-reset into 685027b on ggravlingen:master.

@ggravlingen
Member

ggravlingen commented Dec 21, 2018

Thank you! When I talked to lwis the other day, he indicated he's quite busy at the moment but will revert ASAP. As I mentioned above, for the sake of stability, no changes to the API's core will be merged without his approval.

@alex3305
Contributor Author

Thanks for the feedback. It seems that home-assistant/core#19222 also had a positive impact on Home Assistant and Tradfri stability. I am currently quite busy myself with the holidays, so I understand 😄.

Happy holidays!

Apparently light transitions are always preceded by an on or off command.
This change removes the filter for the state attribute.

I don't know if this should be an issue in Home Assistant...
Collaborator

@lwis lwis left a comment


Sorry it's taken a while to review this. Some of the changes aren't clear, while others are good stability improvements.


return self._psk

@classmethod
Collaborator

Given that an APIFactory will be created for each gateway, rather than each observation, should this be instanced instead?

Contributor Author

Since this function is called from the callback defined in _observe(), when trying to use an instance method, the callback doesn't have any relation to self or an instance of APIFactory anymore. That's why I put this on a 'global' timer.

If you can point me to how to use an instance method in the aforementioned callback, I would gladly implement that change.
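For what it's worth, a callback defined inside an instance method does close over `self`, so an instance attribute could hold the timestamp. A minimal sketch with hypothetical names (not the actual APIFactory code):

```python
import time


class ApiFactorySketch:
    """Hypothetical stand-in for APIFactory; names are illustrative."""

    def __init__(self):
        self._last_changed = 0.0

    def _observe(self, register_callback):
        def success_callback(result):
            # The closure captures `self`, so instance state is reachable here
            # without any class-level 'global' timer.
            self._last_changed = time.time()

        register_callback(success_callback)


factory = ApiFactorySketch()
# Simulate aiocoap invoking the registered callback with a payload.
factory._observe(lambda cb: cb("payload"))
print(factory._last_changed > 0)  # True
```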

Collaborator

success_callback is created in scope of self?

Contributor Author

success_callback only has its own scope within the method.

self._observations_err_callbacks.append(ob.error)
self._observations.append(ob)

async def _check_observations(self):
Collaborator

Method name needs improving as this isn't just checking for dead observations, please also add a docstring.

Contributor Author

Amended in 3fb5c87

self._psk_id = psk_id
self._loop = loop
self._observations_err_callbacks = []
self._is_checking = False
Collaborator

Are you expecting a race of some sort? Maybe we should consider using asyncio.Lock instead?

Contributor Author

Not specifically a race condition. But because the asyncio.sleep function is called within _remove_timedout_observations, it can happen that an observation is checked multiple times. This shouldn't cause any problems, but it is just unnecessary. Using an asyncio.Lock would have the same behaviour, just sequential instead of parallel.

This check is only done to make sure we check and reset the observations only once at a time, even when controlling multiple lights at once.
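The boolean guard described above can be sketched like this (hypothetical class and method names, not the actual pytradfri code). The behavioural difference from asyncio.Lock is that concurrent triggers return immediately instead of queueing up behind the running check.

```python
import asyncio


class ObservationChecker:
    """Hypothetical sketch of the boolean guard around the observation check."""

    def __init__(self):
        self._is_checking = False
        self.runs = 0

    async def check_observations(self):
        if self._is_checking:
            return  # a check is already in flight; skip it, don't queue
        self._is_checking = True
        try:
            await asyncio.sleep(0.01)  # stands in for the observation sleep time
            self.runs += 1
        finally:
            self._is_checking = False


checker = ObservationChecker()


async def trigger_many():
    # Controlling several lights at once fires several concurrent triggers,
    # but only the first one actually performs the check.
    await asyncio.gather(*(checker.check_observations() for _ in range(5)))


asyncio.run(trigger_many())
print(checker.runs)  # 1
```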

SUPPORT_XY_COLOR = 16

OBSERVATION_SLEEP_TIME = 5
OBSERVATION_TIMEOUT = 10
Collaborator

I'm still not sold on these values; if I don't touch any of my lights for a few hours, they won't send any events, IIRC. That means the _check_observations function is taking a guess at best, and it'll consider all my observations to be 'dead' when I make a request.

I'm cautious as many users (such as myself) have good stability on observations, I can have connections open for weeks without issue. This change potentially closes observations unnecessarily.

Contributor Author

That behaviour should never happen though. In my testing it still hasn't. Even after a few hours or close to a day, the observations don't just reset; they just work. What happens is the following flow:

  1. You send a command to the Tradfri gateway
  2. The check method for dead observations is triggered
    1. This method sets checking = True
    2. This method waits for 5 seconds initially
    3. go to 3 below
    4. The last updated global timer is checked against the values defined in const
      1. If the time is above the defined times, the observations will be reset
      2. If the time is below the defined times, nothing happens
    5. Set checking = False
  3. The Tradfri hub sends a response in the observation callback
    1. The state of the object is changed from the callback
    2. The last updated global time is updated
    3. go to 2.4 above

In my experience most of this happens in a fraction of a second. Almost all of my requests had sub-second response times. In some edge cases it took a bit longer, but not that long. We can always raise this value, though. But I encourage you to test it out for yourself, perhaps with debug logging like:

_LOGGER.debug('Tradfri response_time %f', time.time() - APIFactory.get_last_changed())
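The timeout check in steps 2.2 to 2.4 amounts to something like the following sketch. Function and variable names are illustrative; only the OBSERVATION_SLEEP_TIME and OBSERVATION_TIMEOUT constants appear in the actual diff.

```python
import time

OBSERVATION_SLEEP_TIME = 5   # step 2.2: initial wait before checking
OBSERVATION_TIMEOUT = 10     # step 2.4: allowed age of the last callback

last_changed = time.time()   # refreshed by the observation callback (step 3.2)
observations_reset = False


def check_observations_sketch(now):
    """Steps 2.3-2.4 above, without the asyncio plumbing (hypothetical)."""
    global observations_reset
    # In the real flow this runs after `await asyncio.sleep(OBSERVATION_SLEEP_TIME)`.
    if now - last_changed > OBSERVATION_TIMEOUT:
        observations_reset = True  # step 2.4.1: reset all observations
    # step 2.4.2: otherwise nothing happens


check_observations_sketch(last_changed + 1)  # gateway responded promptly
print(observations_reset)  # False
check_observations_sketch(last_changed + OBSERVATION_TIMEOUT + 1)  # no callback in time
print(observations_reset)  # True
```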

Collaborator

3.1, what's causing the observation callback to be triggered? If there's no activity on any of the devices, what's causing the gateway to send a message? I think this is my largest point of confusion in this change.

Contributor Author

If the device responds, the observation callback will be triggered. It's the same as before.

But if no response from the Tradfri gateway is received within the timeout window, the observations will be reset. That's the only functional change.

self._is_resetting = True

while self._observations:
ob = self._observations.pop()
Collaborator

Do we need to let aiocoap know that we're ending the observation? If we discard an active observation accidentally does that mean we'll potentially leak a connection?

Contributor Author

The connections should end as defined in aiocoap.protocol._BaseClientObservation. I've reworked this a bit in 17638af so it should also cancel out.

Some initial, small changes that came up during the review of home-assistant-libs#214. This
changes the following:

 * Better naming of checking observations
 * Cancelling observations instead of erroring them out during cancellation
@alex3305
Contributor Author

@lwis Thanks for the review. Some great points you've got there. I'm hoping you can test these changes out for yourself, even if your setup is rock solid.

@alex3305
Contributor Author

@lwis Any updates?

@alex3305
Contributor Author

@lwis Any updates?

Still working wonderfully for me, luckily. Very stable, even with longer transitions and such. I'm hoping that you or @ggravlingen can test this branch for yourselves too.

Collaborator

@lwis lwis left a comment


I'm still somewhat confused by the mysterious message the gateway is sending every ~5s?

Moreover, I'd rather this health check were entirely instance based. Let's keep the Command callback and our gateway health callback separate, where we register the callback for the command (ob.register_callback(...)). aiocoap accepts multiple callbacks on its register methods; we should register a method on the gateway as a callback for each observation to tidy this code.
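The fan-out suggested here could look roughly like this sketch (hypothetical classes; aiocoap's actual registration API differs in detail): each observation accepts multiple callbacks, so a gateway-level health callback can ride alongside the command's own callback.

```python
class Observation:
    """Hypothetical observation that fans out to multiple callbacks."""

    def __init__(self):
        self._callbacks = []

    def register_callback(self, cb):
        self._callbacks.append(cb)

    def callback(self, payload):
        for cb in self._callbacks:
            cb(payload)


class Gateway:
    """Hypothetical gateway holding a shared health timestamp."""

    def __init__(self):
        self.last_payload = None

    def health_callback(self, payload):
        # Every observation payload refreshes the gateway-level state.
        self.last_payload = payload


gateway = Gateway()
ob = Observation()
received = []
ob.register_callback(received.append)        # the command's own callback
ob.register_callback(gateway.health_callback)  # shared gateway health check
ob.callback({"state": 1})
print(received, gateway.last_payload)  # [{'state': 1}] {'state': 1}
```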


@alex3305
Contributor Author

alex3305 commented Feb 2, 2019

The gateway is not sending out a message every 5 seconds. Since the request to the gateway and the response from the gateway are async, my code waits 5 seconds to ensure that the gateway has indeed responded. When the gateway does not respond within 5 seconds of the request, the observations will be reset. We don't want to make that a blocking request...

Registering a health check on every observation wouldn't be very efficient. When the observations fail, they all fail, not just one, since it is the connection with the gateway that fails. Because it's the gateway connection that breaks and not an individual device, a callback per observation would be redundant.

@lwis lwis changed the base branch from master to aio-observation-stability February 15, 2019 21:13
@lwis lwis merged commit 2c32894 into home-assistant-libs:aio-observation-stability Feb 15, 2019