Observation reset#214
Conversation
This commit resets the observations when they hit a timeout, and also prevents
duplicate state changes. Also observations are cancelled whenever possible
instead of erroring out.
1. Whenever the light or socket already contains a specific state (ie. the
light is on and is dimmed at 20%), and a change is requested, a filter will
be applied and only the changed states will be send to the device.
2. Observations are reset whenever no state change was received for a limited
amount of time. This is dependent on the first change, because Tradfri does
not send a successful callback back to us, when no change was applied.
3. Cancelling observations instead of erroring out will create less error
logging. There is still some error logging left, but this is because of the
`aiocoap` library.
This was tested for more than a week on my own production installation and the
first fix, makes the pytradfri lib about 4 times more stable for me. The second
change resets the observations when something bad happens and works wonderfully
for me.
There is still something odd happening with observation cancellation which must
be errored out instead of cancelled, because otherwise no new observation will
be launched by Home Assistant. This is because the `aiocoap` library does not
report when a cancellation happened, but it does on errors.
I and some others have tested these changes from: https://github.com/alex3305/hass-components-debug
and reception has generally been positive.
|
Thank you for your continued effort on this issue! |
|
Thanks for the PR, and for continuing to look at stability. I'm concerned about the second point, and the time you've set - 15 seconds? Do all observations receive a payload within that window? Looking at this PR I fear it may be more appropriate to switch problematic installations to polling rather than aggressively closing sockets. |
|
@lwis Let me try and ease a couple of your concerns.
I've spent almost two weeks fine tuning this specific timeout and found out that most (90%+) of the requests receive an observation within a second after the request. The others receive a callback between 1 and 5 seconds after firing the request. So for me 15 seconds ware more than enough. I've also tested it with 35 seconds (5 seconds sleep, 30 seconds timeout) and that works too. But when the Tradfri observations fail, the time it takes to reach the user is quite long IMHO. However this is still something we could tune for sure.
Observations work great though, so that should be a shame. The connection is now lost for me once every 4 - 5 days. That's much better than before and gives me hope that this could be a solution. Furthermore I suspect that the IKEA Tradfri app works the same way. I found out that the app activity on Android times out itself, forcing to re-open observations. It seems they are working in a similar manner. That's one of the reasons I've chosen this approach. Also the Do you have any (additional) concerns or changes that I could make to get this PR merged? |
|
@lwis did you have any time to review this? @ggravlingen can you give your opinion on this PR? |
|
Thank you @alex3305 for addressing the concerns of @lwis. Given the previously closed PRs and your further analysis of the questions that have arisen, it seems there has been comprehensive testing. Albeit for one user only, but that is not anything unique for this PR. Pushing this to Home Assistant's beta branch will quite quickly give us more test user data, while also leaving room to pull the update back. That said, it's @lwis who's done most of the work on the API and knows what's going on under the hood, so let's hear his views. Given his approval, please also make sure the tests pass. |
This slightly modifies the way a command is send to the backend API. I think this is a more concise way, because now the `set_values` function in `command.py` always returns a value. This works the same as the other solution. This also fixes the unit tests.
|
@ggravlingen I didn't notice the tests failing. I've fixed that. I know my previous PR's weren't the best regarding this solution, initially I was looking for a more pragmatic approach. Since than I've moved on and searched for a good and stable solution. Coming up with this one. I would love to have this integrated into Home Assistant. I had multiple users test this within my own repo and the issue on Home Assistant. See also: |
|
Thank you! When I talked to Lwis the other day he indicated he's quite busy at the moment but will revert asap. As I mentioned above, for the sake of stability, no changes to the api's core will be merged without his approval. |
|
Thanks for the feedback. It seems that home-assistant/core#19222 also had a positive impact on Home Assistant and Tradfri stability. I am currently quite busy myself regarding the holidays. So I understand 😄. Happy holidays! |
Appearently light transitions are always preceded with an on or off command. This change removes the filter for the state attribute. I don't know if this should be an issue in Home Assistant...
lwis
left a comment
There was a problem hiding this comment.
Sorry it's taken a while to review this. Some of the changes aren't clear, while others are good stability improvements.
|
|
||
| return self._psk | ||
|
|
||
| @classmethod |
There was a problem hiding this comment.
Given that an APIFactory will be created for each gateway, rather than each observation, should this be instanced instead?
There was a problem hiding this comment.
Since this function is called from the callback defined in _observe(), when trying to use an instance method, the callback doesn't have any relations to self or an instance of APIFactory anymore. That's why I put this on a 'global' timer.
If you can point me out on how to use an instance method in the aforementioned callback, I would gladly implement that change.
There was a problem hiding this comment.
success_callback is created in scope of self?
There was a problem hiding this comment.
success_callback only has it's own scope within the method.
pytradfri/api/aiocoap_api.py
Outdated
| self._observations_err_callbacks.append(ob.error) | ||
| self._observations.append(ob) | ||
|
|
||
| async def _check_observations(self): |
There was a problem hiding this comment.
Method name needs improving as this isn't just checking for dead observations, please also add a docstring.
| self._psk_id = psk_id | ||
| self._loop = loop | ||
| self._observations_err_callbacks = [] | ||
| self._is_checking = False |
There was a problem hiding this comment.
Are you expecting a race of some sort? Maybe we should consider using asyncio.Lock instead?
There was a problem hiding this comment.
Not specifically a race condition. But because the asyncio.sleep function is called within _remove_timedout_observations, it can happen that a observation is checked multiple times. This shouldn't cause any problems, but is just unnecessary. Using an asyncio.Lock will have the same behaviour, but just seqential instead of parallel.
This check is only done to make sure we check and reset the observations only once at a time, even when controlling multiple lights at once.
| SUPPORT_XY_COLOR = 16 | ||
|
|
||
| OBSERVATION_SLEEP_TIME = 5 | ||
| OBSERVATION_TIMEOUT = 10 |
There was a problem hiding this comment.
I'm still not sold on these values, if I don't touch any of my lights for a few hours they won't send any events IIRC. Meaning that the _check_observations function taking a guess at best, and it'll consider all my observations to be 'dead' when I make a request.
I'm cautious as many users (such as myself) have good stability on observations, I can have connections open for weeks without issue. This change potentially closes observations unnecessarily.
There was a problem hiding this comment.
That behaviour should never happen though. In my testing it still hasn't. Even after a few hours or close to a day, the observations don't just reset. They just work. What happens is the following flow:
- You send a command to the Tradfri gateway
- The check method for dead observations is triggered
- This method sets checking = True
- This method waits for 5 seconds initially
- go to 3 below
- The last updated global timer is checked against the values defined in const
- If the time is above the defined times, the observations will be reset
- If the time is below the defined times, nothing happens
- Set checking = False
- The Tradfri hub sends a response in the observation callback
- The state of the object is changed from the callback
- The last updated global time is updated
- go to 2.4 above
In my experience most of this happens in a fraction of a second. Almost all of my requests were sub second response times. In some edge cases it took a bit longer, but not that long. We can always up this value though. But I encourage you to test it out for yourself, perhaps with debug logging like:
_LOGGER.debug('Tradfri response_time %f', time.time() - APIFactory.get_last_changed())There was a problem hiding this comment.
3.1, what's causing the observation callback to be triggered? If there's no activity on any of the devices, what's causing the gateway to send a message? I think this is my largest point of confusion in this change.
There was a problem hiding this comment.
If the device responds, the observation callback will be triggered. It's the same as before.
But if no response from the Tradfri gateway will be received within the timeout window, the observations will be reset. That's the only functional change.
| self._is_resetting = True | ||
|
|
||
| while self._observations: | ||
| ob = self._observations.pop() |
There was a problem hiding this comment.
Do we need to let aiocoap know that we're ending the observation? If we discard an active observation accidentally does that mean we'll potentially leak a connection?
There was a problem hiding this comment.
The connections should end as defined in aiocoap.protocol._BaseClientObservation. I've reworked this a bit in 17638af so it should also cancel out.
Some initial, small changes that came up during the review of home-assistant-libs#214. This changes the following: * Better naming of checking observations * Cancelling observations instead of erroring them out during cancellation
|
@lwis Thanks for the review. Some great points you've got there. I'm hoping you can test these changes out for yourself, even if your set up is rock solid. |
|
@lwis Any updates? |
Still working wonderfully for me luckily. Very stable, even with longer transitions and such. I'm hoping that you or @ggravlingen can test this branch for yourself too. |
lwis
left a comment
There was a problem hiding this comment.
I'm still somewhat confused by the mysterious message the gateway is sending every ~5s?
Moreover, I'd rather this health check is entirely instance based. Lets keep the Command callback and our gateway health callback seperate, where register the callback for the command (ob.register_callback(...). aiocoap accepts multiple callbacks on it's register methods, we should register a method on the gateway as a callback for each observation to tidy this code.
|
|
||
| return self._psk | ||
|
|
||
| @classmethod |
There was a problem hiding this comment.
success_callback is created in scope of self?
| SUPPORT_XY_COLOR = 16 | ||
|
|
||
| OBSERVATION_SLEEP_TIME = 5 | ||
| OBSERVATION_TIMEOUT = 10 |
There was a problem hiding this comment.
3.1, what's causing the observation callback to be triggered? If there's no activity on any of the devices, what's causing the gateway to send a message? I think this is my largest point of confusion in this change.
|
The gateway is not sending out a message every 5 seconds. Since the request to the gateway and the response from the gateway is async, my code will wait 5 seconds to assure that the gatweay has indeed responded. When the gateway does not respond within 5 seconds after the request to the gateway, the observations will be reset. We don't want to make that a blocking request... Registering a health check on every observation wouldn't be very efficient. When the observations fail, all the observations fail, not only one, since the connection with the gateway seems to fail. Since the connection to the gateway fails and not the device itself, it would be very inefficient to register a callback for each observation. |
This commit resets the observations when they hit a timeout, and also prevents duplicate state changes. Also observations are cancelled whenever possible instead of erroring out.
aiocoaplibrary.This was tested for more than a week on my own production installation and the first fix, makes the pytradfri lib about 4 times more stable for me. The second change resets the observations when something bad happens and works wonderfully for me.
There is still something odd happening with observation cancellation which must be errored out instead of cancelled, because otherwise no new observation will be launched by Home Assistant. This is because the
aiocoaplibrary does not report when a cancellation happened, but it does on errors.I and some others have tested these changes from: https://github.com/alex3305/hass-components debug and reception has generally been positive.
References: