
Lost samples in reliable QoS #313

Closed
drsk0 opened this issue Jul 19, 2024 · 9 comments · Fixed by #319

drsk0 commented Jul 19, 2024

I have a basic reader/writer setup analogous to the (async) examples, with a ReliabilityQosPolicy set to Reliable for both reader and writer and a very high maximum blocking time. The writer loop is a lot faster than the reader loop. I see that a lot of samples are simply lost and the reader never receives them. I would have expected the writer to block until the reader takes the next sample, thus applying backpressure. Is this expected, or is it an issue with the implementation?
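
For reference, the QoS setup is roughly the following (a simplified sketch; the exact dust_dds type and field names may differ slightly from what is shown here). The reader uses the same Reliable reliability policy.

```rust
use dust_dds::infrastructure::{
    qos::DataWriterQos,
    qos_policy::{ReliabilityQosPolicy, ReliabilityQosPolicyKind},
    time::{Duration, DurationKind},
};

// Sketch only; exact dust_dds signatures may differ.
fn reliable_writer_qos() -> DataWriterQos {
    DataWriterQos {
        reliability: ReliabilityQosPolicy {
            kind: ReliabilityQosPolicyKind::Reliable,
            // Very high max blocking time: the expectation was that
            // writes would block (backpressure) rather than drop samples.
            max_blocking_time: DurationKind::Finite(Duration::new(3600, 0)),
        },
        ..Default::default()
    }
}
```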

jrebelo commented Jul 22, 2024

Hi @drsk0,

Thank you for submitting this issue. The first thing to note is that the blocking feature for Reliable writers is indeed not yet implemented, so there is no blocking on send. On a fast network this shouldn't cause samples to be lost, though.

However, even when the functionality is in place it will not work as you describe. The writer only knows whether the samples have arrived at every discovered reader (i.e. an acknowledgement was sent from the reader to the writer); it has no information whatsoever on whether the user has read/taken samples on the reader side. That means you can't create backpressure this way. In principle, DDS is designed so that the publishing and subscribing sides are independent of each other: the state of a single reader shouldn't affect the publishing, which could in turn affect sample delivery to other readers.

The better DDS way to handle this is to let the writer publish samples at its own rhythm and set an appropriate size for its History QoS. On the reader side you could either change the History QoS to KEEP_ALL, so that you have a buffer to account for the slower reading loop, or consider using a TimeBasedFilterQosPolicy if the reader is only interested in samples at a given interval. This lets you think about the two sides of your system independently, which makes it easier to implement and generally more future-proof. A sketch of the first option follows below.
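
For example, something along these lines (a sketch only; the exact dust_dds QoS struct fields may vary):

```rust
use dust_dds::infrastructure::{
    qos::{DataReaderQos, DataWriterQos},
    qos_policy::{HistoryQosPolicy, HistoryQosPolicyKind},
};

fn main() {
    // Writer: keep publishing at its own rhythm with a bounded history.
    let _writer_qos = DataWriterQos {
        history: HistoryQosPolicy {
            kind: HistoryQosPolicyKind::KeepLast(10), // size to your burst tolerance
        },
        ..Default::default()
    };

    // Reader: buffer all unread samples so a slower reading loop
    // does not lose any.
    let _reader_qos = DataReaderQos {
        history: HistoryQosPolicy {
            kind: HistoryQosPolicyKind::KeepAll,
        },
        ..Default::default()
    };
}
```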

drsk0 commented Jul 22, 2024

Hi @jrebelo. Thanks for the clarification! I tried setting the history QoS to KEEP_ALL on the reader side and to KEEP_LAST(1) on the writer side, but this makes the reader grind to a complete halt after around 5 received samples. The writer, on the other hand, keeps working fine.

jrebelo commented Jul 22, 2024

When you say it grinds to a halt, do you mean your program crashes, or that you don't get any more samples?

drsk0 commented Jul 22, 2024

The program doesn't crash, but the reader stops receiving samples (via take_next_sample). The writer keeps sending at the normal rate.

drsk0 commented Jul 31, 2024

@jrebelo Can you confirm that this is a bug? If yes, I would start looking into it.

jrebelo commented Aug 1, 2024

Hi @drsk0, I haven't managed to reproduce the behavior you describe.

I have created a Reliable writer with a KeepLast(1) history kind which publishes in a fast loop (every 100 ms), and a Reliable reader with a KeepAll history on which I call reader.take(1, ANY_SAMPLE_STATE, ANY_VIEW_STATE, ANY_INSTANCE_STATE) every 2 seconds. The call to take returns the oldest samples one by one, as I expected. I run these in two separate executables.
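
The reader side of my test looks roughly like this, with `reader` being the DataReader created during setup (simplified; error handling omitted, and the exact dust_dds method signatures may differ slightly):

```rust
use std::{thread, time::Duration};
use dust_dds::subscription::sample_info::{
    ANY_INSTANCE_STATE, ANY_SAMPLE_STATE, ANY_VIEW_STATE,
};

loop {
    // Take at most one sample per iteration, simulating a slow consumer.
    if let Ok(samples) =
        reader.take(1, ANY_SAMPLE_STATE, ANY_VIEW_STATE, ANY_INSTANCE_STATE)
    {
        for sample in samples {
            println!("received: {:?}", sample.data());
        }
    }
    thread::sleep(Duration::from_secs(2));
}
```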

Do you maybe have a minimal code example with the problem that you could share? Or do you see something different between what I am describing and what you are doing?

drsk0 commented Aug 1, 2024

Hi @jrebelo, thanks for trying to reproduce! The only difference between your setup and mine seems to be that I'm using the async runner. My fast loop is also faster than the 100 ms cadence. I'll try to reproduce it in a minimal example in the coming week and add it here.

jrebelo commented Aug 1, 2024

Yes, I see the issue now when publishing without any sleep. I will work on a solution in the coming days.

@jrebelo jrebelo linked a pull request Aug 4, 2024 that will close this issue
@jrebelo jrebelo self-assigned this Aug 4, 2024
jrebelo commented Aug 6, 2024

The issue was happening because samples were being removed from the Writer History Cache, due to the History QoS, without having had a chance to be sent or acknowledged first. Adding the functionality to block the writer until a sample is acknowledged by all matched reliable readers before removing it from the history solves the problem.
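
Schematically, the fix does something like this (a toy model of the behavior, not the actual code from the linked pull request):

```rust
use std::collections::VecDeque;

// Toy model of the fix, not the actual dust_dds implementation: before
// the KeepLast history evicts its oldest sample, the writer now blocks
// until that sample has been acknowledged by all matched reliable
// readers (bounded by the reliability max_blocking_time).
struct ToyWriter {
    history: VecDeque<u64>, // sequence numbers standing in for samples
    depth: usize,           // KeepLast depth
}

impl ToyWriter {
    fn write(&mut self, seq: u64) {
        if self.history.len() == self.depth {
            let oldest = self.history[0];
            // Previously the oldest sample was evicted here immediately,
            // even if it had never been sent or acknowledged.
            self.wait_until_acknowledged(oldest);
            self.history.pop_front();
        }
        self.history.push_back(seq);
    }

    fn wait_until_acknowledged(&self, _seq: u64) {
        // Placeholder: in the real fix this waits for acknowledgments
        // from every matched reliable reader.
    }
}
```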
