Dropped samples after system clock adjustments #5019

Open
1 task done
ma30002000 opened this issue Jul 2, 2024 · 3 comments
Labels
in progress Issue or PR which is being reviewed

Comments

@ma30002000
Contributor

ma30002000 commented Jul 2, 2024

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

When the system clock is adjusted (manually or by time-server synchronization), my subscriber drops all samples once the system clock has been set back into the past.
This seems to be due to DataReaderHistory::received_change_keep_last / DataReaderHistory::completed_change_keep_last comparing each incoming sample's sourceTimestamp (which will be a timestamp in the past once the clock has been changed) to the first change's, causing all subsequent samples to be dropped:

        CacheChange_t* first_change = instance_changes.at(0);
        if (change->sourceTimestamp >= first_change->sourceTimestamp)
        {
            // As the instance is ordered by source timestamp, we can always remove the first one.
            ret_value = remove_change_sub(first_change);
        }
        else
        {
            // Received change is older than oldest, and should be discarded
            return true;
        }

If I remove the if and simply always drop the first sample, everything seems to work unaffected by the system clock adjustments.

Note that I would expect the current Fast DDS behaviour if DestinationOrderQosPolicy were implemented (which it is not) and set to BY_SOURCE_TIMESTAMP_DESTINATIONORDER_QOS. However, according to the manual, the behaviour should be BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS.

Note that I have created a pull request for additional observations when the system clock is adjusted (PR #5018).

This might be related to #4850.

Current behavior

Samples get dropped when publisher's system clock is set into the past.

Steps to reproduce

Set back system clock after disabling automatic synchronization via NTP.

Fast DDS version/commit

2.14.2

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

Shared Memory Transport (SHM)

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@ma30002000 ma30002000 added the triage Issue pending classification label Jul 2, 2024
@i-and

i-and commented Jul 7, 2024

In my opinion, the stack functionality should be as independent of the system clock as possible. Otherwise, hard-to-diagnose errors will occur at random points in time (when the system clock jumps forward or backward), as @ma30002000 pointed out above. From this point of view, the use of sourceTimestamp (which is based on the system clock) should be minimized. Currently, sourceTimestamp is used in the stack in the following five places:

  1. Where @ma30002000 indicated, namely in the DataReaderHistory::received_change_keep_last() and DataReaderHistory::completed_change_keep_last() methods. This condition for discarding received samples contradicts the QoS BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS, i.e. the reader history should not be sorted by source timestamp. Also, these samples with larger sequence numbers are silently discarded without notifying the user (no REJECTED_BY_... callback is invoked). I suggest that these sample-drop conditions be removed from the code for QoS BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS.
  2. The reader's history is sorted by the source timestamp using the following comparison function:

        inline bool history_order_cmp(
                const CacheChange_t* lhs,
                const CacheChange_t* rhs)
        {
            return lhs->writerGUID == rhs->writerGUID ?
                    lhs->sequenceNumber < rhs->sequenceNumber :
                    lhs->sourceTimestamp < rhs->sourceTimestamp;
        }

     This does not correspond to the QoS parameter BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS. I propose implementing the sorting in accordance with BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS, without using sourceTimestamp. In this case, it will probably be enough to use the sample sequence numbers from the corresponding DataWriter.
  3. In the implementation of the QoS Lifespan for DataWriter. From a practical point of view, this parameter should work in terms of local time and not depend on system clock jumps (after a backward jump, a sample will mistakenly remain in the history for longer than the given Lifespan duration; after a forward jump, it will be removed from the history ahead of time). Technically, this can be implemented by adding a steady-clock timestamp to CacheChange_t and evaluating it in DataWriterImpl::lifespan_expired().
  4. In the implementation of the QoS Lifespan for DataReader. The reader's implementation assumes that the writer's and reader's system clocks are synchronized. If this is not the case, received samples will be discarded in the DataReaderImpl::on_new_cache_change_added() method. Incorrect operation will also occur here when time jumps forward or backward (by analogy with point 3 above). It would be more useful for the user to have a mechanism for controlling sample obsolescence in the reader's history that works from the local (steady) clock and does not depend on the quality of system clock synchronization. I suggest considering such an implementation (perhaps activated by a new QoS parameter for DataReader, to preserve the old default behaviour of the Lifespan QoS).
  5. In the implementation of disable_positive_acks on the StatefulWriter side. I suggest considering the use of a steady clock here as well, stamping it into CacheChange_t (see point 3 above).

Please comment on the five points presented above.

@elianalf
Contributor

elianalf commented Jul 9, 2024

Hi @ma30002000,
Thanks for using Fast DDS.
We are trying to reproduce the issue and investigate. We will come back with feedback.

@elianalf elianalf added in progress Issue or PR which is being reviewed and removed triage Issue pending classification labels Jul 9, 2024
@ma30002000
Contributor Author

Any indications or hints concerning the root cause (and a possible fix) would be highly appreciated.

3 participants