
Synchronize puts to KafkaConsumer protocol buffer during async sends #1733

Merged (1 commit, Mar 12, 2019)

Conversation

@dpkp (Owner) commented Mar 12, 2019

If we attempt to maintain a FIFO queue of in-flight requests in the BrokerConnection, we have to be careful that once a request is added to the _protocol buffer, its response future is also added to the in-flight-requests queue, and that the order of requests in the _protocol buffer matches the order of the IFR queue. Otherwise we end up with a mismatch between the actual response and the expected response as determined by the IFR queue. Currently we do not synchronize this code, so two requests could come in at the same time and get placed (1) R1 -> _protocol buffer, (2) R2 -> _protocol buffer, (3) R2 -> IFR queue, (4) R1 -> IFR queue.
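To make the race concrete, here is a minimal sketch (illustrative names only, not the kafka-python internals) of the exact interleaving described above: requests enter the protocol buffer in one order but the IFR queue in another, so the first response off the wire is matched against the wrong expected request.

```python
# Sketch of the unsynchronized interleaving: two stand-in queues, not the
# real BrokerConnection state.
from collections import deque

protocol_buffer = deque()   # order requests are actually encoded/sent
ifr_queue = deque()         # order responses are expected back (FIFO)

# Interleaving from the description: (1) R1 -> protocol buffer,
# (2) R2 -> protocol buffer, (3) R2 -> IFR queue, (4) R1 -> IFR queue.
protocol_buffer.append('R1')
protocol_buffer.append('R2')
ifr_queue.append('R2')
ifr_queue.append('R1')

# The broker answers in protocol-buffer order, so the first response
# corresponds to R1, but the FIFO IFR queue expects R2 first: a mismatch.
first_response_for = protocol_buffer.popleft()   # 'R1'
expected_by_ifr = ifr_queue.popleft()            # 'R2'
print(first_response_for == expected_by_ifr)     # False
```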

To address this issue I've modified BrokerConnection to use the correlation_id that is already tracked by the _protocol buffer, and switched the in-flight request tracking from a FIFO queue to a simple dict of correlation id => tracking data. We've already pushed the low-level request/response correlation to the _protocol buffer, so I don't think there is any need to maintain a second FIFO queue at the BrokerConnection level. In addition, I added a small lock to BrokerConnection to prevent more than one thread from adding a new request to the _protocol buffer concurrently. This should address the async send synchronization issue and allow the KafkaClient to use BrokerConnection.send() without acquiring an external lock (for all other network access we are still relying on the external KafkaClient lock to synchronize access).
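The shape of the fix can be sketched as follows. This is a hedged approximation, not the actual kafka-python code: the class and attribute names (`_lock`, `_next_correlation_id`, the `(future, timestamp)` tuple) are illustrative, but it shows the two ideas the PR describes, a lock serializing request registration, and a dict keyed by correlation id replacing the FIFO queue so responses can be matched regardless of arrival order.

```python
# Hedged sketch of dict-based in-flight tracking guarded by a lock.
import threading
import time

class BrokerConnectionSketch:
    def __init__(self):
        self._lock = threading.Lock()     # guards protocol-buffer access
        self._next_correlation_id = 0     # stand-in for the protocol buffer's counter
        self.in_flight_requests = {}      # correlation_id -> (future, timestamp)

    def send(self, request):
        with self._lock:
            # Assigning the correlation id and recording the in-flight entry
            # happen atomically, so two concurrent sends cannot interleave
            # between the "encode" and "track" steps.
            self._next_correlation_id += 1
            correlation_id = self._next_correlation_id
            future = object()             # placeholder for a response future
            self.in_flight_requests[correlation_id] = (future, time.time())
        return correlation_id

    def recv_response(self, correlation_id, response):
        # Responses are matched by correlation id, so ordering no longer matters.
        future, _ = self.in_flight_requests.pop(correlation_id)
        return future, response
```

With this structure, a response arriving for the second request before the first is handled correctly, since lookup is by key rather than by queue position.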

I believe this should fix the out-of-sync errors reported in #1728



@jeffwidman (Collaborator) left a comment


PR looks good for tackling the issues mentioned.

Tested in production, and so far this has fixed the exceptions being thrown in #1728. I'll double-check tomorrow when the load is heavier, as that's when we tend to see more issues.

However, if I understand correctly, the broader response processing design assumes that responses are returned in the same order that the requests were sent, and I'm not sure that's always a valid assumption. Let's continue that discussion over in the issue ticket: #1728 (comment).

@tvoinarovskyi (Collaborator) commented

LGTM

```diff
 def requests_timed_out(self):
     if self.in_flight_requests:
-        (_, _, oldest_at) = self.in_flight_requests[0]
+        get_timestamp = lambda v: v[1]
+        oldest_at = min(map(get_timestamp,
```

Hope this will not introduce any speed problems.
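For context on the snippet above: with a dict keyed by correlation id there is no longer an "oldest entry at index 0", so the oldest send time must be found with `min()` over the values, which is O(n) in the number of in-flight requests (the speed concern raised here). A self-contained sketch of the check, with illustrative names and a `(future, timestamp)` value layout assumed rather than taken from the actual implementation:

```python
# Hedged sketch of a timeout check over dict-based in-flight tracking.
import time

def requests_timed_out(in_flight_requests, request_timeout_ms):
    """Return True if the oldest in-flight request has exceeded the timeout.

    in_flight_requests: dict of correlation_id -> (future, sent_timestamp)
    """
    if in_flight_requests:
        get_timestamp = lambda v: v[1]
        # O(n) scan: no ordering is maintained, so find the oldest explicitly.
        oldest_at = min(map(get_timestamp, in_flight_requests.values()))
        if time.time() >= oldest_at + request_timeout_ms / 1000.0:
            return True
    return False
```

In practice n (requests in flight to a single broker) is small, which is presumably why the linear scan is acceptable here.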

@jeffwidman (Collaborator) commented

I've seen no further errors after letting this run overnight/this morning on a fairly busy production host where I was previously consistently observing the #1728 problems.
