
Observation reset #214

Merged
lwis merged 4 commits into home-assistant-libs:aio-observation-stability from alex3305:observation-reset
Feb 15, 2019

Conversation

@alex3305
Contributor

This commit resets the observations when they hit a timeout and also prevents duplicate state changes. Observations are also cancelled whenever possible instead of erroring out.

  1. Whenever the light or socket is already in a specific state (e.g. the light is on and dimmed to 20%) and a change is requested, a filter is applied and only the changed states are sent to the device.
  2. Observations are reset whenever no state change has been received for a limited amount of time. This depends on the first change, because Tradfri does not send a success callback when no change was applied.
  3. Cancelling observations instead of erroring out produces less error logging. Some error logging remains, but that comes from the aiocoap library.
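For illustration, the first change amounts to a state diff before sending. This is a minimal sketch with hypothetical names, not the actual pytradfri code: only the key/value pairs that differ from the device's known state are forwarded.

```python
def filter_changed_state(current_state: dict, requested_state: dict) -> dict:
    """Return only the key/value pairs that differ from the current state.

    Hypothetical helper illustrating change 1; not the real pytradfri API.
    """
    return {
        key: value
        for key, value in requested_state.items()
        if current_state.get(key) != value
    }


# The light is already on and dimmed to 20% (dimmer 51 of 254); only the
# changed brightness needs to be sent to the gateway.
current = {"state": 1, "dimmer": 51}
requested = {"state": 1, "dimmer": 254}
print(filter_changed_state(current, requested))  # {'dimmer': 254}
```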

This was tested for more than a week on my own production installation, and the first fix makes the pytradfri library about four times more stable for me. The second change resets the observations when something bad happens and works wonderfully for me.

There is still something odd happening with observation cancellation, which must be errored out instead of cancelled, because otherwise no new observation will be launched by Home Assistant. This is because the aiocoap library does not report when a cancellation happened, but it does on errors.

Some others and I have tested these changes from https://github.com/alex3305/hass-components-debug and reception has generally been positive.

@ggravlingen
Member

Thank you for your continued effort on this issue!

@lwis
Collaborator

lwis commented Dec 15, 2018

Thanks for the PR, and for continuing to look at stability.

I'm concerned about the second point, and the time you've set - 15 seconds? Do all observations receive a payload within that window? Looking at this PR I fear it may be more appropriate to switch problematic installations to polling rather than aggressively closing sockets.

@alex3305
Contributor Author

@lwis Let me try and ease a couple of your concerns.

I'm concerned about the second point, and the time you've set - 15 seconds? Do all observations receive a payload within that window?

I've spent almost two weeks fine-tuning this specific timeout and found that most (90%+) of the requests receive an observation callback within a second of the request. The others receive a callback between 1 and 5 seconds after firing the request. So for me, 15 seconds was more than enough.

I've also tested it with 35 seconds (5 seconds sleep, 30 seconds timeout) and that works too. But when the Tradfri observations fail, the time it takes to reach the user is quite long IMHO. This is still something we could tune, for sure.

Looking at this PR I fear it may be more appropriate to switch problematic installations to polling rather than aggressively closing sockets.

Observations work great though, so that would be a shame. The connection is now lost for me only once every 4 to 5 days. That's much better than before and gives me hope that this could be a solution.

Furthermore, I suspect that the IKEA Tradfri app works the same way. I found that the app activity on Android times itself out, forcing it to re-open observations. It seems they are working in a similar manner. That's one of the reasons I've chosen this approach.

Also, the aiocoap library seems to be closing connections properly whenever possible. I've had zero issues so far with my Tradfri hub.

Do you have any (additional) concerns or changes that I could make to get this PR merged?

@alex3305
Contributor Author

@lwis did you have any time to review this?

@ggravlingen can you give your opinion on this PR?

@ggravlingen
Member

Thank you @alex3305 for addressing the concerns of @lwis. Given the previously closed PRs and your further analysis of the questions that have arisen, it seems there has been comprehensive testing. Albeit only by one user, but that is not unique to this PR. Pushing this to Home Assistant's beta branch will quite quickly give us more test user data, while also leaving room to pull the update back.

That said, it's @lwis who's done most of the work on the API and knows what's going on under the hood, so let's hear his views.

Given his approval, please also make sure the tests pass.

This slightly modifies the way a command is sent to the backend API. I think
this is a more concise way, because now the `set_values` function in
`command.py` always returns a value. This works the same as the other
solution.

This also fixes the unit tests.
@alex3305
Contributor Author

@ggravlingen I didn't notice the tests failing. I've fixed that.

I know my previous PRs weren't the best regarding this solution; initially I was looking for a more pragmatic approach. Since then I've moved on and searched for a good, stable solution, coming up with this one.

I would love to have this integrated into Home Assistant. I had multiple users test this within my own repo and the issue on Home Assistant. See also:

@coveralls

Coverage Status

Coverage decreased (-1.6%) to 77.384% when pulling b099b7a on alex3305:observation-reset into 685027b on ggravlingen:master.

@coveralls

coveralls commented Dec 17, 2018

Coverage Status

Coverage decreased (-1.8%) to 77.233% when pulling 17638af on alex3305:observation-reset into 685027b on ggravlingen:master.

@ggravlingen
Member

ggravlingen commented Dec 21, 2018

Thank you! When I talked to lwis the other day, he indicated he's quite busy at the moment but will revert ASAP. As I mentioned above, for the sake of stability, no changes to the API's core will be merged without his approval.

@alex3305
Contributor Author

Thanks for the feedback. It seems that home-assistant/core#19222 also had a positive impact on Home Assistant and Tradfri stability. I am currently quite busy myself with the holidays, so I understand 😄.

Happy holidays!

Apparently light transitions are always preceded by an on or off command.
This change removes the filter for the state attribute.

I don't know if this should be an issue in Home Assistant...
Collaborator

@lwis lwis left a comment


Sorry it's taken a while to review this. Some of the changes aren't clear, while others are good stability improvements.


return self._psk

@classmethod
Collaborator

Given that an APIFactory will be created for each gateway, rather than each observation, should this be instanced instead?

Contributor Author

Since this function is called from the callback defined in _observe(), when trying to use an instance method, the callback doesn't have any relation to self or an instance of APIFactory anymore. That's why I put this on a 'global' timer.

If you can point me to how to use an instance method in the aforementioned callback, I would gladly implement that change.
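For what it's worth, a callback defined inside an instance method does close over `self`, so an instance attribute could hold the timestamp. A minimal sketch with hypothetical names (not the actual APIFactory code):

```python
import time


class ApiFactorySketch:
    """Hypothetical stand-in for APIFactory; names are illustrative."""

    def __init__(self):
        self._last_changed = 0.0

    def _observe(self, register_callback):
        def success_callback(result):
            # The closure captures `self`, so instance state is reachable here
            # without any class-level 'global' timer.
            self._last_changed = time.time()

        register_callback(success_callback)


factory = ApiFactorySketch()
# Simulate aiocoap invoking the registered callback with a payload.
factory._observe(lambda cb: cb("payload"))
print(factory._last_changed > 0)  # True
```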

Collaborator

success_callback is created in scope of self?

Contributor Author

success_callback only has its own scope within the method.

self._observations_err_callbacks.append(ob.error)
self._observations.append(ob)

async def _check_observations(self):
Collaborator

Method name needs improving as this isn't just checking for dead observations, please also add a docstring.

Contributor Author

Amended in 3fb5c87

self._psk_id = psk_id
self._loop = loop
self._observations_err_callbacks = []
self._is_checking = False
Collaborator

Are you expecting a race of some sort? Maybe we should consider using asyncio.Lock instead?

Contributor Author

Not specifically a race condition. But because the asyncio.sleep function is called within _remove_timedout_observations, it can happen that an observation is checked multiple times. This shouldn't cause any problems, but it is just unnecessary. Using an asyncio.Lock would have the same behaviour, just sequential instead of parallel.

This check is only done to make sure we check and reset the observations only once at a time, even when controlling multiple lights at once.
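The boolean guard described above can be sketched like this (hypothetical class and method names, not the actual pytradfri code). The behavioural difference from asyncio.Lock is that concurrent triggers return immediately instead of queueing up behind the running check.

```python
import asyncio


class ObservationChecker:
    """Hypothetical sketch of the boolean guard around the observation check."""

    def __init__(self):
        self._is_checking = False
        self.runs = 0

    async def check_observations(self):
        if self._is_checking:
            return  # a check is already in flight; skip it, don't queue
        self._is_checking = True
        try:
            await asyncio.sleep(0.01)  # stands in for the observation sleep time
            self.runs += 1
        finally:
            self._is_checking = False


checker = ObservationChecker()


async def trigger_many():
    # Controlling several lights at once fires several concurrent triggers,
    # but only the first one actually performs the check.
    await asyncio.gather(*(checker.check_observations() for _ in range(5)))


asyncio.run(trigger_many())
print(checker.runs)  # 1
```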

SUPPORT_XY_COLOR = 16

OBSERVATION_SLEEP_TIME = 5
OBSERVATION_TIMEOUT = 10
Collaborator

I'm still not sold on these values; if I don't touch any of my lights for a few hours, they won't send any events, IIRC. That means the _check_observations function is taking a guess at best, and it'll consider all my observations to be 'dead' when I make a request.

I'm cautious as many users (such as myself) have good stability on observations, I can have connections open for weeks without issue. This change potentially closes observations unnecessarily.

Contributor Author

That behaviour should never happen though. In my testing it still hasn't. Even after a few hours or close to a day, the observations don't just reset; they just work. What happens is the following flow:

  1. You send a command to the Tradfri gateway
  2. The check method for dead observations is triggered
    1. This method sets checking = True
    2. This method waits for 5 seconds initially
    3. go to 3 below
    4. The last updated global timer is checked against the values defined in const
      1. If the time is above the defined times, the observations will be reset
      2. If the time is below the defined times, nothing happens
    5. Set checking = False
  3. The Tradfri hub sends a response in the observation callback
    1. The state of the object is changed from the callback
    2. The last updated global time is updated
    3. go to 2.4 above

In my experience most of this happens in a fraction of a second. Almost all of my requests had sub-second response times. In some edge cases it took a bit longer, but not that long. We can always raise this value, though. But I encourage you to test it out for yourself, perhaps with debug logging like:

_LOGGER.debug('Tradfri response_time %f', time.time() - APIFactory.get_last_changed())
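The timeout check in steps 2.2 to 2.4 amounts to something like the following sketch. Function and variable names are illustrative; only the OBSERVATION_SLEEP_TIME and OBSERVATION_TIMEOUT constants appear in the actual diff.

```python
import time

OBSERVATION_SLEEP_TIME = 5   # step 2.2: initial wait before checking
OBSERVATION_TIMEOUT = 10     # step 2.4: allowed age of the last callback

last_changed = time.time()   # refreshed by the observation callback (step 3.2)
observations_reset = False


def check_observations_sketch(now):
    """Steps 2.3-2.4 above, without the asyncio plumbing (hypothetical)."""
    global observations_reset
    # In the real flow this runs after `await asyncio.sleep(OBSERVATION_SLEEP_TIME)`.
    if now - last_changed > OBSERVATION_TIMEOUT:
        observations_reset = True  # step 2.4.1: reset all observations
    # step 2.4.2: otherwise nothing happens


check_observations_sketch(last_changed + 1)  # gateway responded promptly
print(observations_reset)  # False
check_observations_sketch(last_changed + OBSERVATION_TIMEOUT + 1)  # no callback in time
print(observations_reset)  # True
```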

Collaborator

3.1, what's causing the observation callback to be triggered? If there's no activity on any of the devices, what's causing the gateway to send a message? I think this is my largest point of confusion in this change.

Contributor Author

If the device responds, the observation callback will be triggered. It's the same as before.

But if no response from the Tradfri gateway is received within the timeout window, the observations will be reset. That's the only functional change.

self._is_resetting = True

while self._observations:
ob = self._observations.pop()
Collaborator

Do we need to let aiocoap know that we're ending the observation? If we discard an active observation accidentally does that mean we'll potentially leak a connection?

Contributor Author

The connections should end as defined in aiocoap.protocol._BaseClientObservation. I've reworked this a bit in 17638af so it should also cancel out.

Some initial, small changes that came up during the review of home-assistant-libs#214. This
changes the following:

 * Better naming of checking observations
 * Cancelling observations instead of erroring them out during cancellation
@alex3305
Contributor Author

@lwis Thanks for the review. Some great points you've got there. I'm hoping you can test these changes out for yourself, even if your setup is rock solid.

@alex3305
Contributor Author

@lwis Any updates?

@alex3305
Contributor Author

@lwis Any updates?

Still working wonderfully for me, luckily. Very stable, even with longer transitions and such. I'm hoping that you or @ggravlingen can test this branch for yourselves too.

Collaborator

@lwis lwis left a comment


I'm still somewhat confused by the mysterious message the gateway is sending every ~5s?

Moreover, I'd rather this health check were entirely instance based. Let's keep the Command callback and our gateway health callback separate, where we register the callback for the command (ob.register_callback(...)). aiocoap accepts multiple callbacks on its register methods; we should register a method on the gateway as a callback for each observation to tidy this code.
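The fan-out suggested here could look roughly like this sketch (hypothetical classes; aiocoap's actual registration API differs in detail): each observation accepts multiple callbacks, so a gateway-level health callback can ride alongside the command's own callback.

```python
class Observation:
    """Hypothetical observation that fans out to multiple callbacks."""

    def __init__(self):
        self._callbacks = []

    def register_callback(self, cb):
        self._callbacks.append(cb)

    def callback(self, payload):
        for cb in self._callbacks:
            cb(payload)


class Gateway:
    """Hypothetical gateway holding a shared health timestamp."""

    def __init__(self):
        self.last_payload = None

    def health_callback(self, payload):
        # Every observation payload refreshes the gateway-level state.
        self.last_payload = payload


gateway = Gateway()
ob = Observation()
received = []
ob.register_callback(received.append)        # the command's own callback
ob.register_callback(gateway.health_callback)  # shared gateway health check
ob.callback({"state": 1})
print(received, gateway.last_payload)  # [{'state': 1}] {'state': 1}
```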


@alex3305
Contributor Author

alex3305 commented Feb 2, 2019

The gateway is not sending out a message every 5 seconds. Since the request to the gateway and the response from the gateway are async, my code waits 5 seconds to ensure that the gateway has indeed responded. When the gateway does not respond within 5 seconds of the request, the observations will be reset. We don't want to make that a blocking request...

Registering a health check on every observation wouldn't be very efficient. When the observations fail, they all fail, not just one, since it is the connection with the gateway that fails. Because it's the gateway connection that breaks and not an individual device, a callback per observation would be redundant.

@lwis lwis changed the base branch from master to aio-observation-stability February 15, 2019 21:13
@lwis lwis merged commit 2c32894 into home-assistant-libs:aio-observation-stability Feb 15, 2019