
Lost samples in reliable QoS #313

Closed
drsk0 opened this issue Jul 19, 2024 · 9 comments · Fixed by #319

drsk0 commented Jul 19, 2024

I have a basic reader/writer setup analogous to the (async) examples, with a ReliabilityQosPolicy set to Reliable for both reader and writer and a very high maximum blocking time. The writer loop is a lot faster than the reader loop. I see that a lot of samples are simply lost and the reader never receives them. I would have expected the writer to block until the reader takes the next sample, thus applying backpressure. Is this expected, or is it an issue with the implementation?
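
For reference, the QoS setup is roughly the following (a simplified sketch; the exact dust_dds type and field names may differ slightly from what is shown here). The reader uses the same Reliable reliability policy.

```rust
use dust_dds::infrastructure::{
    qos::DataWriterQos,
    qos_policy::{ReliabilityQosPolicy, ReliabilityQosPolicyKind},
    time::{Duration, DurationKind},
};

// Sketch only; exact dust_dds signatures may differ.
fn reliable_writer_qos() -> DataWriterQos {
    DataWriterQos {
        reliability: ReliabilityQosPolicy {
            kind: ReliabilityQosPolicyKind::Reliable,
            // Very high max blocking time: the expectation was that
            // writes would block (backpressure) rather than drop samples.
            max_blocking_time: DurationKind::Finite(Duration::new(3600, 0)),
        },
        ..Default::default()
    }
}
```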

jrebelo commented Jul 22, 2024

Hi @drsk0,

Thank you for submitting this issue. The first thing to note is that the blocking feature for Reliable writers is indeed not yet implemented, so there is no blocking on send. On a fast network this shouldn't cause samples to be lost, though.

However, even when the functionality is in place it will not work as you describe. The writer only knows whether the samples have arrived at every discovered reader (i.e. an acknowledgement was sent from the reader to the writer); it has no information whatsoever on whether the user has read/taken samples on the reader side. That means you can't create backpressure this way. In principle, DDS is designed so that the publishing and subscribing sides are independent of each other: the state of a single reader shouldn't affect the publishing, which could in turn affect sample delivery to other readers.

The better DDS way to handle this is to let the writer publish samples at its own rhythm and set an appropriate size for its History QoS. On the reader side you could either change the History QoS to KEEP_ALL, so that you have a buffer to account for the slower reading loop, or consider using a TimeBasedFilterQosPolicy if the reader is only interested in samples at a given interval. This lets you think about the two sides of your system independently, which makes it easier to implement and generally more future-proof. A sketch of the first option follows below.
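
For example, something along these lines (a sketch only; the exact dust_dds QoS struct fields may vary):

```rust
use dust_dds::infrastructure::{
    qos::{DataReaderQos, DataWriterQos},
    qos_policy::{HistoryQosPolicy, HistoryQosPolicyKind},
};

fn main() {
    // Writer: keep publishing at its own rhythm with a bounded history.
    let _writer_qos = DataWriterQos {
        history: HistoryQosPolicy {
            kind: HistoryQosPolicyKind::KeepLast(10), // size to your burst tolerance
        },
        ..Default::default()
    };

    // Reader: buffer all unread samples so a slower reading loop
    // does not lose any.
    let _reader_qos = DataReaderQos {
        history: HistoryQosPolicy {
            kind: HistoryQosPolicyKind::KeepAll,
        },
        ..Default::default()
    };
}
```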

drsk0 commented Jul 22, 2024

Hi @jrebelo. Thanks for the clarification! I tried setting the history QoS to KEEP_ALL on the reader side and to KEEP_LAST(1) on the writer side, but this makes the reader grind to a complete halt after around 5 received samples. The writer, on the other hand, keeps working fine.

jrebelo commented Jul 22, 2024

When you say it grinds to a halt, do you mean your program crashes, or that you don't get any more samples?

drsk0 commented Jul 22, 2024

The program doesn't crash, but the reader stops receiving samples (via take_next_sample). The writer keeps sending at the normal rate.

drsk0 commented Jul 31, 2024

@jrebelo Can you confirm that this is a bug? If yes, I would start looking into it.

jrebelo commented Aug 1, 2024

Hi @drsk0, I haven't managed to reproduce the behavior you describe.

I have created a Reliable writer with a KeepLast(1) history kind which publishes in a fast loop (every 100 ms), and a Reliable reader with a KeepAll history on which I call reader.take(1, ANY_SAMPLE_STATE, ANY_VIEW_STATE, ANY_INSTANCE_STATE) every 2 seconds. The call to take returns the oldest samples one by one, as I expected. I run these in two separate executables.
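
The reader side of my test looks roughly like this, with `reader` being the DataReader created during setup (simplified; error handling omitted, and the exact dust_dds method signatures may differ slightly):

```rust
use std::{thread, time::Duration};
use dust_dds::subscription::sample_info::{
    ANY_INSTANCE_STATE, ANY_SAMPLE_STATE, ANY_VIEW_STATE,
};

loop {
    // Take at most one sample per iteration, simulating a slow consumer.
    if let Ok(samples) =
        reader.take(1, ANY_SAMPLE_STATE, ANY_VIEW_STATE, ANY_INSTANCE_STATE)
    {
        for sample in samples {
            println!("received: {:?}", sample.data());
        }
    }
    thread::sleep(Duration::from_secs(2));
}
```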

Do you maybe have a minimal code example with the problem that you could share? Or do you see something different between what I am describing and what you are doing?

drsk0 commented Aug 1, 2024

Hi @jrebelo, thanks for trying to reproduce! The only difference between your setup and mine seems to be that I'm using the async runner. My fast loop is also faster than the 100 ms cadence. I'll try to reproduce it in a minimal example in the coming week and add it here.

jrebelo commented Aug 1, 2024

Yes, I see the issue now when publishing without any sleep. I will work on a solution in the coming days.

@jrebelo jrebelo linked a pull request Aug 4, 2024 that will close this issue
@jrebelo jrebelo self-assigned this Aug 4, 2024
jrebelo commented Aug 6, 2024

The issue was happening because samples were being removed from the Writer History Cache, due to the History QoS, without having had a chance to be sent or acknowledged first. Adding the functionality to block the writer until a sample is acknowledged by all matched reliable readers before removing it from the history solves the problem.
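
Schematically, the fix does something like this (a toy model of the behavior, not the actual code from the linked pull request):

```rust
use std::collections::VecDeque;

// Toy model of the fix, not the actual dust_dds implementation: before
// the KeepLast history evicts its oldest sample, the writer now blocks
// until that sample has been acknowledged by all matched reliable
// readers (bounded by the reliability max_blocking_time).
struct ToyWriter {
    history: VecDeque<u64>, // sequence numbers standing in for samples
    depth: usize,           // KeepLast depth
}

impl ToyWriter {
    fn write(&mut self, seq: u64) {
        if self.history.len() == self.depth {
            let oldest = self.history[0];
            // Previously the oldest sample was evicted here immediately,
            // even if it had never been sent or acknowledged.
            self.wait_until_acknowledged(oldest);
            self.history.pop_front();
        }
        self.history.push_back(seq);
    }

    fn wait_until_acknowledged(&self, _seq: u64) {
        // Placeholder: in the real fix this waits for acknowledgments
        // from every matched reliable reader.
    }
}
```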
