Dropped samples after system clock adjustments #5019

Open
1 task done
ma30002000 opened this issue Jul 2, 2024 · 3 comments
Labels
in progress Issue or PR which is being reviewed

Comments

@ma30002000
Contributor

ma30002000 commented Jul 2, 2024

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

When the system clock is adjusted (manually or by time-server synchronization), my subscriber drops all samples once the system clock has been set back into the past.
This seems to be due to DataReaderHistory::received_change_keep_last / DataReaderHistory::completed_change_keep_last comparing each incoming sample's sourceTimestamp (which will be a timestamp in the past once the clock has been changed) to the first change's, causing all subsequent samples to be dropped:

        CacheChange_t* first_change = instance_changes.at(0);
        if (change->sourceTimestamp >= first_change->sourceTimestamp)
        {
            // As the instance is ordered by source timestamp, we can always remove the first one.
            ret_value = remove_change_sub(first_change);
        }
        else
        {
            // Received change is older than oldest, and should be discarded
            return true;
        }

If I remove the if and simply always drop the first sample, everything seems to work unaffected by the system clock adjustments.

Note that I would expect the current Fast DDS behaviour if DestinationOrderQosPolicy were implemented (which it is not) and set to BY_SOURCE_TIMESTAMP_DESTINATIONORDER_QOS. However, according to the manual, the behaviour should be BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS.

Note that I have created a pull request for additional observations when the system clock is adjusted (PR #5018).

This might be related to #4850.

Current behavior

Samples get dropped when publisher's system clock is set into the past.

Steps to reproduce

Set back system clock after disabling automatic synchronization via NTP.

Fast DDS version/commit

2.14.2

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

Shared Memory Transport (SHM)

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@ma30002000 ma30002000 added the triage Issue pending classification label Jul 2, 2024
@i-and

i-and commented Jul 7, 2024

In my opinion, the stack functionality should be as independent of the system clock as possible. Otherwise, hard-to-diagnose errors will occur at random points in time (when the system clock jumps forward or backward), as @ma30002000 pointed out above. From this point of view, the use of sourceTimestamp (which is based on the system clock) should be minimized. Currently, sourceTimestamp is used in the stack in the following five places:

  1. Where @ma30002000 indicated, namely in the DataReaderHistory::received_change_keep_last() and DataReaderHistory::completed_change_keep_last() methods. This condition for discarding received samples contradicts the QoS BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS, i.e. the reader history should not be sorted by source timestamp. Also, these samples with larger sequence numbers are silently discarded without notifying the user (no REJECTED_BY_... callback is invoked). I suggest that these sample-drop conditions be removed from the code for QoS BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS.
  2. The reader's history is sorted by the source timestamp using the following comparison function:

        inline bool history_order_cmp(
                const CacheChange_t* lhs,
                const CacheChange_t* rhs)
        {
            return lhs->writerGUID == rhs->writerGUID ?
                    lhs->sequenceNumber < rhs->sequenceNumber :
                    lhs->sourceTimestamp < rhs->sourceTimestamp;
        }

     This does not correspond to the QoS parameter BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS. I propose implementing the sorting in accordance with BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS, without using sourceTimestamp. In this case, it will probably be enough to use the sample sequence numbers from the corresponding DataWriter.
  3. In the implementation of the QoS Lifespan for DataWriter. From a practical point of view, this parameter should work in terms of local time and not depend on system clock jumps (after a backward jump, a sample will mistakenly remain in the history for longer than the given Lifespan duration; after a forward jump, it will be removed from the history ahead of time). Technically, this can be implemented by adding a steady-clock timestamp to CacheChange_t and evaluating it in DataWriterImpl::lifespan_expired().
  4. In the implementation of the QoS Lifespan for DataReader. The reader's implementation assumes that the writer's and reader's system clocks are synchronized. If this is not the case, received samples will be discarded in the DataReaderImpl::on_new_cache_change_added() method. Incorrect operation will also occur here when time jumps forward or backward (by analogy with point 3 above). It would be more useful for the user to have a mechanism for controlling sample obsolescence in the reader's history that works from the local (steady) clock and does not depend on the quality of system clock synchronization. I suggest considering such an implementation (perhaps activated by a new QoS parameter for DataReader, to preserve the old default behaviour of the Lifespan QoS).
  5. In the implementation of disable_positive_acks on the StatefulWriter side. I suggest considering the use of a steady clock here as well, stamping it into CacheChange_t (see point 3 above).

Please comment on the five points presented above.

@elianalf
Contributor

elianalf commented Jul 9, 2024

Hi @ma30002000,
Thanks for using Fast DDS.
We are trying to reproduce the issue and investigate. We will come back with feedback.

@elianalf elianalf added in progress Issue or PR which is being reviewed and removed triage Issue pending classification labels Jul 9, 2024
@ma30002000
Contributor Author

Any indications or hints concerning the root cause (and a possible fix) would be highly appreciated.

3 participants