
Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Jan 9, 2020

What changes were proposed in this pull request?

This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845).

Why are the changes needed?

The explanation of the new Kafka producer pool configuration is missing, whereas the doc already covers the Kafka consumer pool configuration.

Does this PR introduce any user-facing change?

Yes. This is a documentation change.

(Screenshot attached: Screen Shot 2020-01-12 at 11 16 19 PM)

How was this patch tested?

N/A

@HeartSaVioR
Contributor Author

cc. @vanzin @gaborgsomogyi

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116365 has finished for PR 27146 at commit 31e9c64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR HeartSaVioR changed the title [SPARK-21869][SS][DOCS][FOLLOWUP] Document producer pool configuration [SPARK-21869][SS][DOCS][FOLLOWUP] Document Kafka producer pool configuration Jan 10, 2020

Given that the Kafka producer instance is designed to be thread-safe, Spark initializes one Kafka producer instance per caching key and shares it across tasks.

The caching key is built up from the following information:
Member

Just a question. Does "The following information" cover all configurations used in paramsToSeq? It seems that setAuthenticationConfigIfNeeded injects more config in addition to the following two.

Contributor Author

Technically you're right; the cache key will contain the full configuration, including the auth config Spark injects. I haven't mentioned it because that starts explaining internals, and end users may be OK with knowing only the abstracted info.

And the same configuration maps to the same cache entry, except in the case where the delegation token is renewed. I'm not 100% sure about the details; @gaborgsomogyi, could you help me confirm this?

Contributor

@gaborgsomogyi gaborgsomogyi commented Jan 10, 2020

@dongjoon-hyun really good point! If a delegation token is used, then from time to time a new producer must be created and the old one evicted, otherwise the query will fail. There are multiple ways to achieve that (I haven't yet analyzed how it's done in the latest change made by @HeartSaVioR, but I'm on it):

  • Either the cache key contains authentication information (dynamic JAAS config). This way the creation of the new producer and the eviction of the old one would be automatic. Without super deep consideration, that's my suggested way.
  • Or the cache key does NOT contain authentication information (dynamic JAAS config). This way additional logic must be added to handle this scenario. At first glance, I have the feeling it would just add complexity and make this part of the code brittle.

As I understand from @HeartSaVioR's comment, the first approach is implemented at the moment. If so, I'm fine with that, but I would mention two things here:

  • The key may contain authentication information.
  • There could be situations where more than one producer is instantiated. This is important because producers consume a significant amount of memory, as @zsxwing pointed out.
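The two bullets above can be made concrete with a minimal sketch (purely illustrative Python, not Spark's actual implementation) of a pool whose key is derived from the full producer parameter map, including the injected auth config; a renewed delegation token then naturally maps to a new entry, while the stale producer stays until evicted:

```python
# Illustrative sketch only: a producer pool keyed by the full Kafka
# parameter map, including injected auth config. Renewing the delegation
# token changes the key, so a fresh "producer" is created; the old one
# remains until an idle-eviction pass removes it (eviction not shown).

def cache_key(params: dict) -> tuple:
    """Build a deterministic cache key from the Kafka parameters."""
    return tuple(sorted(params.items()))

class ProducerPool:
    def __init__(self):
        self._pool = {}

    def acquire(self, params: dict) -> str:
        key = cache_key(params)
        if key not in self._pool:
            # Stand-in for creating a real (memory-hungry) KafkaProducer.
            self._pool[key] = f"producer-{len(self._pool)}"
        return self._pool[key]

pool = ProducerPool()
base = {"bootstrap.servers": "host:9092", "sasl.jaas.config": "token-1"}
p1 = pool.acquire(base)
p2 = pool.acquire(dict(base))  # same params -> same cached producer
p3 = pool.acquire({**base, "sasl.jaas.config": "token-2"})  # renewed token -> new producer
```

This also illustrates the memory-sizing point: while the old token's producer is still cached, two producers exist for what is logically the same sink.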

Contributor Author

I think we take the first approach; I just followed the way we did it before.

Back to the topic, would we like to add these details to the guide doc? I'll address it if we'd like to let end users know about it; otherwise we can leave it as it is.

Contributor

I agree the first bullet point is questionable since it's a deep detail, but the second is important memory-sizing information. Let's hear what @dongjoon-hyun thinks.

Contributor

I think this would be more useful if you want to explain how a user can force separate producers to be used. Otherwise it's an internal detail that doesn't really affect users.
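One way to read that suggestion: since the pool key is derived from the producer parameters, a user could force separate producers simply by varying any producer option per query. A hypothetical sketch (the `kafka.client.id` option here is chosen only as an example of a differing parameter, not a documented Spark mechanism):

```python
# Hypothetical illustration: if the pool key is derived from the full
# producer parameter map, any per-query difference in options yields a
# distinct key, and therefore a distinct cached producer.

def cache_key(params: dict) -> tuple:
    return tuple(sorted(params.items()))

common = {"kafka.bootstrap.servers": "host:9092"}
query_a = {**common, "kafka.client.id": "query-a"}
query_b = {**common, "kafka.client.id": "query-b"}

# Distinct keys -> the pool would hold two separate producers.
assert cache_key(query_a) != cache_key(query_b)
```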

Member

+1 for @vanzin's advice.

Contributor Author

Hmm... would we like to explain how to tweak it, or just hide it? The reason I explained this is that end users may encounter an issue with the producer pool and want to debug a bit; but if it feels too internal, I'm OK with hiding it as well.

Contributor Author

Just added some explanation, since it makes sense to let end users know about it for debugging.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Thanks for taking care of this, @HeartSaVioR.

<tr>
<td>spark.kafka.producer.cache.evictorThreadRunInterval</td>
<td>The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run.</td>
<td>1m (1 minutes)</td>
Contributor

s/1 minutes/1 minute
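For context, a hedged example of how this pool setting could be supplied (assuming the usual `spark-submit --conf` mechanism for `spark.kafka.*` options; the value and the elided arguments are illustrative):

```shell
spark-submit \
  --conf spark.kafka.producer.cache.evictorThreadRunInterval=1m \
  ...
```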


@gaborgsomogyi
Contributor

@HeartSaVioR @dongjoon-hyun I've just answered a question on the user channel and realized that the following doc page is somehow collapsed: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html

The following configurations are optional:

</tr> </table> ### Consumer Caching It's time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time 

Since the docs generated in my environment are collapsed the same way, could you double-check that the mentioned page renders correctly without this change? I don't see any obvious issue that should cause such a collapse.

@HeartSaVioR
Contributor Author

The doc collapsed issue was fixed via #27098

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116486 has finished for PR 27146 at commit 6e29d01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

The doc collapsed issue was fixed via #27098

Then only a rebuild is needed.

@HeartSaVioR
Contributor Author

Yeah, I searched a bit and realized the published doc seems to be broken only for preview/preview2. If so, it may not be a big deal.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116584 has finished for PR 27146 at commit d3c3094.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @HeartSaVioR , @gaborgsomogyi , and @vanzin .
Merged to master.

BTW, @HeartSaVioR , for a documentation PR, we need a screenshot. I attached one for you.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 13, 2020

Thanks all for reviewing and merging!

@dongjoon-hyun What about enhancing the contribution steps for documentation? (Either adding this to the 'contributing' page, or even adding it to the PR template.) I'm not sure we explicitly state such requirements anywhere, and it would be helpful if we standardized this.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 13, 2020

Btw, I'm also seeing different understandings of the section "Does this PR introduce any user-facing change?" across many open PRs.

My understanding of the intention of the section is to emphasize and enumerate any behavioral or API changes that would likely require end users to change their query/code. (So if the answer to the section is yes, the patch should be reviewed carefully.) Expanding this to anything end users can see would make the answer "yes" for most PRs, diluting the meaning of the section. I might be missing something; discussion around this is welcome.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Late LGTM.
