
Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Jan 9, 2020

What changes were proposed in this pull request?

This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845).

Why are the changes needed?

The explanation of the new Kafka producer pool configuration is missing, whereas the doc already covers the Kafka consumer pool configuration.

Does this PR introduce any user-facing change?

Yes. This is a documentation change.

(Screenshot attached: Screen Shot 2020-01-12 at 11 16 19 PM)

How was this patch tested?

N/A

@HeartSaVioR
Contributor Author

cc. @vanzin @gaborgsomogyi

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116365 has finished for PR 27146 at commit 31e9c64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR HeartSaVioR changed the title [SPARK-21869][SS][DOCS][FOLLOWUP] Document producer pool configuration [SPARK-21869][SS][DOCS][FOLLOWUP] Document Kafka producer pool configuration Jan 10, 2020

Given that the Kafka producer instance is designed to be thread-safe, Spark initializes one Kafka producer instance per caching key and shares it across tasks.

The caching key is built up from the following information:
Member

Just a question. Does "The following information" cover all configurations used in paramsToSeq? It seems that setAuthenticationConfigIfNeeded injects more config in addition to the following two.

Contributor Author

Technically you're right; the cache key will contain the full configuration, including the auth config Spark injects. I haven't mentioned it because that starts explaining internals, and end users may be OK with knowing only the abstracted info.

And the same configuration maps to the same cache entry, except in the case where the delegation token is renewed. I'm not 100% sure about the details; @gaborgsomogyi, could you help me confirm this?

Contributor

@gaborgsomogyi gaborgsomogyi commented Jan 10, 2020

@dongjoon-hyun really good point! If a delegation token is used, then from time to time a new producer must be created and the old one evicted, otherwise the query will fail. There are multiple ways to achieve that (I haven't yet analyzed how it's done in the latest change made by @HeartSaVioR, but I'm on it):

  • Either the cache key contains authentication information (dynamic JAAS config). This way the creation of the new producer and the eviction of the old one would be automatic. Without super deep consideration, that's my suggested way.
  • Or the cache key does NOT contain authentication information (dynamic JAAS config). This way additional logic must be added to handle this scenario. At first glance, I have the feeling it would just add complexity and make this part of the code brittle.

As I understand from @HeartSaVioR's comment, the first approach is implemented at the moment. If so, I'm fine with that, but I would mention two things here:

  • The key may contain authentication information.
  • There could be situations where more than one producer is instantiated. This is important because producers consume a significant amount of memory, as @zsxwing pointed out.
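The two bullets above can be made concrete with a minimal sketch (purely illustrative Python, not Spark's actual implementation) of a pool whose key is derived from the full producer parameter map, including the injected auth config; a renewed delegation token then naturally maps to a new entry, while the stale producer stays until evicted:

```python
# Illustrative sketch only: a producer pool keyed by the full Kafka
# parameter map, including injected auth config. Renewing the delegation
# token changes the key, so a fresh "producer" is created; the old one
# remains until an idle-eviction pass removes it (eviction not shown).

def cache_key(params: dict) -> tuple:
    """Build a deterministic cache key from the Kafka parameters."""
    return tuple(sorted(params.items()))

class ProducerPool:
    def __init__(self):
        self._pool = {}

    def acquire(self, params: dict) -> str:
        key = cache_key(params)
        if key not in self._pool:
            # Stand-in for creating a real (memory-hungry) KafkaProducer.
            self._pool[key] = f"producer-{len(self._pool)}"
        return self._pool[key]

pool = ProducerPool()
base = {"bootstrap.servers": "host:9092", "sasl.jaas.config": "token-1"}
p1 = pool.acquire(base)
p2 = pool.acquire(dict(base))  # same params -> same cached producer
p3 = pool.acquire({**base, "sasl.jaas.config": "token-2"})  # renewed token -> new producer
```

This also illustrates the memory-sizing point: while the old token's producer is still cached, two producers exist for what is logically the same sink.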

Contributor Author

I think we take the first approach; I just followed the way we did it before.

Back to the topic, would we like to add these details to the guide doc? I'll address it if we'd like to let end users know about it; otherwise we can leave it as it is.

Contributor

I agree the first bullet point is questionable since it's a deep detail, but the second is important memory-sizing information. Let's hear what @dongjoon-hyun thinks.

Contributor

I think this would be more useful if you want to explain how a user can force separate producers to be used. Otherwise it's an internal detail that doesn't really affect users.
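One way to read that suggestion: since the pool key is derived from the producer parameters, a user could force separate producers simply by varying any producer option per query. A hypothetical sketch (the `kafka.client.id` option here is chosen only as an example of a differing parameter, not a documented Spark mechanism):

```python
# Hypothetical illustration: if the pool key is derived from the full
# producer parameter map, any per-query difference in options yields a
# distinct key, and therefore a distinct cached producer.

def cache_key(params: dict) -> tuple:
    return tuple(sorted(params.items()))

common = {"kafka.bootstrap.servers": "host:9092"}
query_a = {**common, "kafka.client.id": "query-a"}
query_b = {**common, "kafka.client.id": "query-b"}

# Distinct keys -> the pool would hold two separate producers.
assert cache_key(query_a) != cache_key(query_b)
```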

Member

+1 for @vanzin's advice.

Contributor Author

Hmm... would we like to explain how to tweak it, or just hide it? The reason I explained this is that end users may encounter an issue with the producer pool and want to debug a bit; but if it feels too internal, I'm OK with hiding it as well.

Contributor Author

Just added some explanation, since it makes sense to let end users know about it for debugging.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Thanks for taking care of this, @HeartSaVioR.

<tr>
<td>spark.kafka.producer.cache.evictorThreadRunInterval</td>
<td>The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run.</td>
<td>1m (1 minutes)</td>
Contributor

s/1 minutes/1 minute
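For context, a hedged example of how this pool setting could be supplied (assuming the usual `spark-submit --conf` mechanism for `spark.kafka.*` options; the value and the elided arguments are illustrative):

```shell
spark-submit \
  --conf spark.kafka.producer.cache.evictorThreadRunInterval=1m \
  ...
```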


@gaborgsomogyi
Contributor

@HeartSaVioR @dongjoon-hyun I've just answered a question on the user channel and realized that the following doc page is somehow collapsed: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html

The following configurations are optional:

</tr> </table> ### Consumer Caching It's time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time 

Since the docs generated in my environment are collapsed the same way, could you double-check that the mentioned page renders correctly without this change? I don't see any obvious issue that should cause such a collapse.

@HeartSaVioR
Contributor Author

The doc collapsed issue was fixed via #27098

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116486 has finished for PR 27146 at commit 6e29d01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

The doc collapsed issue was fixed via #27098

Then only a rebuild is needed.

@HeartSaVioR
Contributor Author

Yeah, I searched a bit and realized the published doc seems to be broken only for preview/preview2. If so, it may not be a big deal.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116584 has finished for PR 27146 at commit d3c3094.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @HeartSaVioR , @gaborgsomogyi , and @vanzin .
Merged to master.

BTW, @HeartSaVioR , for a documentation PR, we need a screenshot. I attached one for you.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 13, 2020

Thanks all for reviewing and merging!

@dongjoon-hyun What about enhancing the contribution steps for documentation? (Either adding this to the 'contributing' page, or even adding it to the PR template.) I'm not sure we explicitly state such requirements anywhere, and it would be helpful if we standardized this.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jan 13, 2020

Btw, I'm also seeing different understandings of the section "Does this PR introduce any user-facing change?" across many open PRs.

My understanding of the intention of the section is to emphasize and enumerate any behavioral or API changes that would likely require end users to change their query/code. (So if the answer to the section is yes, the patch should be reviewed carefully.) Expanding this to anything end users can see would make the answer "yes" for most PRs, diluting the meaning of the section. I might be missing something; discussion around this is welcome.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Late LGTM.
