[Pull-based Ingestion][WIP] Introduce the new pull-based ingestion engine, APIs, and Kafka plugin #16958

Open · wants to merge 25 commits into base: main

Conversation


@yupeng9 yupeng9 commented Jan 6, 2025

Description

This PR implements the basics of the pull-based ingestion described in this RFC, including:

  1. The APIs for the pull-based ingestion source
  2. A Kafka plugin that implements the ingestion source API
  3. A new IngestionEngine that pulls data from the ingestion sources

This is currently a WIP; there are a few improvements to make and test coverage to increase.

Related Issues

Resolves #16927, #16929, #16928

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the enhancement and Indexing labels on Jan 6, 2025
Contributor

github-actions bot commented Jan 6, 2025

❌ Gradle check result for 16dd9d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines 126 to 140
new Translog.Snapshot() {
    @Override
    public void close() {}

    @Override
    public int totalOperations() {
        return 0;
    }

    @Override
    public Translog.Operation next() {
        return null;
    }
}
);
Collaborator

Maybe create a static EMPTY_TRANSLOG_SNAPSHOT and reuse across this and NoOpEngine
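For reference, a minimal sketch of what such a shared constant could look like, essentially lifting the anonymous class above into a static field (the name and where it would live, e.g. on Translog, are assumptions):

    // Hypothetical shared constant reusable by this engine and NoOpEngine
    public static final Translog.Snapshot EMPTY_TRANSLOG_SNAPSHOT = new Translog.Snapshot() {
        @Override
        public void close() {}

        @Override
        public int totalOperations() {
            return 0;
        }

        @Override
        public Translog.Operation next() {
            return null;
        }
    };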

Author

Good suggestion. Let me explore that

Comment on lines +147 to +151
String clientId = engineConfig.getIndexSettings().getNodeName()
+ "-"
+ engineConfig.getIndexSettings().getIndex().getName()
+ "-"
+ engineConfig.getShardId().getId();
Collaborator

Should we use ids instead of names, like index uuid, node id, etc.?

Author

This is mainly for monitoring and operations; for example, Kafka supports quotas set by client-id. As long as we can uniquely identify a streaming consumer, it's sufficient. Any suggestions?
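For illustration, an id-based variant of the snippet above might look roughly like the following; Index#getUUID() is available on the Index object already in use, while the node part is left as the node name since a node id is not obviously reachable from this snippet:

    // Hypothetical id-based client id (per the reviewer's suggestion); only Index#getUUID()
    // is assumed beyond the getters already used in the original snippet.
    String clientId = engineConfig.getIndexSettings().getNodeName()
        + "-"
        + engineConfig.getIndexSettings().getIndex().getUUID()
        + "-"
        + engineConfig.getShardId().getId();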

Collaborator

@Bukhtawar Bukhtawar left a comment

Curious how the FGAC security model would work, especially with the security plugin, which intercepts transport actions to validate whether authorized users can perform bulk actions on certain indices. Is the intent to handle permissions at a Kafka "partition level"?
Another aspect is maintaining Kafka checkpoints durably. I have yet to read that part, but it would be good to understand how we are handling failovers and recoveries.

*
* @opensearch.api
*/
public interface IngestionConsumerPlugin {
Member

Let's put the @ExperimentalApi annotation on this as well
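i.e., something like the following (a sketch of the suggestion, with the interface body elided; the annotation is assumed to be OpenSearch's existing @ExperimentalApi):

    @ExperimentalApi
    public interface IngestionConsumerPlugin {
        // ...
    }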

*/

/** Indices ingestion module package. */
package org.opensearch.indices.ingest;
Member

@andrross andrross Jan 8, 2025

The term "ingest" is definitely overloaded. _bulk is a type of ingestion, there are ingest pipelines, etc. I'd suggest using polling.ingest or pollingingest or anything else that helps disambiguate this area of the code from the other ingest related pieces.

Author

Agreed. pollingingest sounds good to me.
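With the agreed name, the package declaration in this hunk would become something like:

    /** Indices polling-based ingestion module package. */
    package org.opensearch.indices.pollingingest;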

private final TranslogManager translogManager;
private final DocumentMapperForType documentMapperForType;
private final IngestionConsumerFactory ingestionConsumerFactory;
protected StreamPoller streamPoller;
Member

It looks like streamPoller is assigned in the constructor and never accessed outside this class. Why is it not private final?

Author

I had some tests that used it, but I changed to a different way of testing. Let me make it private.

}

versions << [
'kafka': '2.8.2',
Member

This looks quite old (September 2022 according to https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients). Why not use the newest available?

Author

Good catch. I set it to test against Uber's internal Kafka, which is on 2.8.2. Let me upgrade it.

    result = blockingQueue.poll(1000, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
    // TODO: add metric
    logger.debug("ConcurrentSiaStreamsPoller poll interruptedException", e);
Member

ConcurrentSiaStreamsPoller?

Comment on lines +178 to +179
streamPoller = new DefaultStreamPoller(startPointer, persistedPointers, ingestionShardConsumer, this, resetState);
streamPoller.start();
Member

The pattern here of sending the this pointer out of the constructor is a bit concerning to me. This allows DefaultStreamPoller to observe a partially-constructed instance of IngestionEngine. If, for example, a future refactoring added a new final field to this class and it were initialized after this line, then DefaultStreamPoller would be able to observe an uninitialized final field. Alternatively, if streamPoller.start() throws an exception then that streamPoller instance will still have a reference to this instance, which never successfully completed its constructor. Is there a way to structure these classes to avoid these problems?

Author

I mainly need it to pass the engine to MessageProcessor, because MessageProcessor can invoke the index/delete operations on the engine.

You have a valid point. Ideally we could call engine.start, which would pass the engine pointer to the poller at start time. Do you feel that's viable with the existing Engine interface design?
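For illustration, a minimal generic sketch of the two-phase shape being discussed, where construction completes before the poller ever receives the engine reference (the names here are illustrative, not the PR's actual classes):

    // Illustrative only: the owner constructs the engine first, then calls start(),
    // so `this` never escapes a partially constructed object.
    final class PollingEngine {
        private StreamPollerLike poller;   // not assigned in the constructor

        void start(StreamPollerLike poller) {
            this.poller = poller;
            poller.start(this);            // safe: the constructor has already completed
        }
    }

    interface StreamPollerLike {
        void start(PollingEngine engine);
    }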

Author

yupeng9 commented Jan 9, 2025

Curious how the FGAC security model would work, especially with the security plugin, which intercepts transport actions to validate whether authorized users can perform bulk actions on certain indices. Is the intent to handle permissions at a Kafka "partition level"? Another aspect is maintaining Kafka checkpoints durably. I have yet to read that part, but it would be good to understand how we are handling failovers and recoveries.

That's a good question. For Kafka-related access control, the policies can be passed as ingestion config params, so the created consumer handles Kafka access; this is separate from OpenSearch permission management. However, for the access control in your example (whether authorized users can perform bulk actions on certain indices), we could carry over some existing permission-handling policy and implement the control in the MessageHandler. This requires some thinking and modeling, and I can create an issue to track it.

api "org.apache.kafka:kafka-clients:${versions.kafka}"

// test
api "com.github.docker-java:docker-java-api:${versions.docker}"
Member

I think this docker dependency should also be testImplementation scope?

Author

yes

@@ -194,4 +194,15 @@ grant {
permission java.io.FilePermission "/sys/fs/cgroup/cpuacct/-", "read";
permission java.io.FilePermission "/sys/fs/cgroup/memory", "read";
permission java.io.FilePermission "/sys/fs/cgroup/memory/-", "read";

//TODO: enable these policies to plugin-security.policy in kafka policy
Member

You can safely remove these added permissions and rely on the plugin policy. Though you'll need to wrap the privileged call in AccessController.doPrivileged during consumer creation -

        return AccessController.doPrivileged((PrivilegedAction<Consumer<byte[], byte[]>>) () -> {
            return new KafkaConsumer<>(consumerProp, new ByteArrayDeserializer(), new ByteArrayDeserializer());
        });
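For completeness, the imports this sketch would rely on (assuming consumerProp is the already-built consumer properties map):

        import java.security.AccessController;
        import java.security.PrivilegedAction;
        import org.apache.kafka.clients.consumer.Consumer;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.serialization.ByteArrayDeserializer;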

Author

Awesome, that works. Thanks!

import java.util.stream.Collectors;
import java.util.stream.Stream;

public class KafkaPluginIT extends OpenSearchIntegTestCase {
Member

A nitpick: you can combine this with the other IT so the tests run against the same cluster and save a bit of time. The default test scope is SUITE, which spins up one test cluster and runs all cases.
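A sketch of what a combined IT could look like, making the default SUITE scope explicit (the class name is hypothetical; ClusterScope/Scope are the OpenSearch test framework annotations):

    // Hypothetical merged IT reusing one suite-scoped test cluster for all Kafka cases
    @OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.SUITE)
    public class KafkaIngestionIT extends OpenSearchIntegTestCase {
        // test cases from KafkaPluginIT and the other IT would live together here
    }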

private EngineFactory getEngineFactory(final IndexSettings idxSettings) {
    final IndexMetadata indexMetadata = idxSettings.getIndexMetadata();
    if (indexMetadata != null && indexMetadata.getState() == IndexMetadata.State.CLOSE) {
        // NoOpEngine takes precedence as long as the index is closed
        return NoOpEngine::new;
    }

    // streaming ingestion
    if (indexMetadata != null && indexMetadata.useIngestionSource()) {
        return new IngestionEngineFactory();
Member

@mch2 mch2 Jan 9, 2025

I think we can add IngestionConsumerFactory as a 2nd ctor param to IngestionEngine and pass through here, wdyt? Then we don't need to plumb it through IndexService/Shard.

You could also build the IngestionSource config object here and avoid having to send it over the wire & bake it into IndexMetadata.
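A rough sketch of the suggestion, assuming the plugin-provided consumer factory is reachable where getEngineFactory lives and that EngineFactory can be supplied as a lambda over EngineConfig:

    // Hypothetical wiring: hand the consumer factory to the engine here,
    // instead of plumbing it through IndexService/IndexShard.
    if (indexMetadata != null && indexMetadata.useIngestionSource()) {
        return engineConfig -> new IngestionEngine(engineConfig, ingestionConsumerFactory);
    }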

Contributor

❌ Gradle check result for b8923ca: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
enhancement (Enhancement or improvement to existing feature or request), Indexing (Indexing, Bulk Indexing and anything related to indexing)

Successfully merging this pull request may close these issues.

[Feature Request] Pull-based ingestion source APIs
4 participants