Iceberg with data migrations #24780

bashtanov · 2025-01-11T02:32:33Z

https://redpandadata.atlassian.net/browse/CORE-8439

Add a test that Iceberg can still read from a table whose topic was deleted
Fix minor data migration test issues
Add a test that runs Iceberg translation for topics that are unmounted and then, optionally, mounted again
For recovered and mounted topics, make Redpanda preserve most topic properties, including Iceberg ones (fixes https://redpandadata.atlassian.net/browse/CORE-563)
When unmounting, make sure all messages are translated for Iceberg

Backports Required

Release Notes

Features

Make Iceberg and topic mount/unmount work well together

bashtanov · 2025-01-11T02:33:23Z

/dt

vbotbuildovich · 2025-01-13T19:20:18Z

CI test results

test results on build#60655

test_id	test_kind	job_url	test_status	passed
idempotency_tests_rpunit.idempotency_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60655#01946058-a634-40d0-9eaa-a36681179d0c	FLAKY	1/2
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60655#019460b1-911e-438a-9908-47112e140ed0	FLAKY	2/6

test results on build#60858

test_id	test_kind	job_url	test_status	passed
idempotency_tests_rpunit.idempotency_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60858#0194702b-4eab-49b6-a88e-86bc997a49fa	FLAKY	1/2
rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3.scenario=remount	ducktape	https://buildkite.com/redpanda/redpanda/builds/60858#01947072-773f-429b-924e-88e414d4e4b8	FAIL	0/1

test results on build#61039

test_id	test_kind	job_url	test_status	passed
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61039#01948df4-24e0-4a1b-adc7-427bf0a8eec8	FLAKY	1/2
rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete	ducktape	https://buildkite.com/redpanda/redpanda/builds/61039#01948e11-1c37-43dc-9d36-c64694dbeea3	FLAKY	1/2

test results on build#61291

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61291#0194ad0b-4239-400c-99f3-ec7ba3ebc4e8	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/61291#0194ad0b-4239-4508-9eb4-142a5a4211f1	FLAKY	1/2
storage_single_thread_rpunit.storage_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61291#0194acab-90cb-41b8-8d84-aa2c0d1c0c2e	FLAKY	1/2

test results on build#61361

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61361#0194b306-2928-4576-bc99-42bbc006d21c	FLAKY	1/2
rptest.tests.enterprise_features_license_test.EnterpriseFeaturesTest.test_enable_features.feature=Feature.oidc.install_license=True.disable_trial=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e9-4daf-8a19-487158760573	FLAKY	1/2
rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.FAST_MOVES.cleanup_policy=compact	ducktape	https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e8-4103-bbfb-cf32306b71a9	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e7-4f8b-96d2-fb381349c39f	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e7-4f8b-96d2-fb381349c39f	FLAKY	1/2

tests/rptest/tests/datalake/datalake_verifier.py

src/v/cluster/archival/ntp_archiver_service.cc

src/v/datalake/translation/partition_translator.cc

vbotbuildovich · 2025-01-16T20:45:14Z

Retry command for Build#60858

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/simple_connect_test.py::RedpandaConnectIcebergTest.test_translating_avro_serialized_records@{"cloud_storage_type":1,"scenario":"remount"}

bashtanov · 2025-01-17T00:08:44Z

Meh. It does not fail when I run it locally with repeat.

bashtanov · 2025-01-17T08:50:38Z

I increased the timeout in the test because of the coordinator loop: one last use of the long one may still be in progress while we are waiting. I also added some logging and removed dead code. Please re-review.

bashtanov · 2025-01-22T11:08:53Z

/dt

mmaslankaprv · 2025-01-24T11:17:34Z

src/v/cluster/data_migration_worker.cc

-        co_return co_await flush(partition);
+        auto block_offset = block_res.value();
+
+        auto deadline = model::timeout_clock::now() + 5s;


It seems to me that this timeout value is very low considering the amount of work that may need to be done. Should we consider making it larger and configurable?

It will be retried by migration reconciliation, including when controller leadership changes. Since the partition is already blocked, the translation backlog will eventually get processed. The translator flush doesn't do any active work; it just waits for a certain offset to be translated. The only problem is that the cloud storage flush will be invoked every time. @WillemKauf @andrwng if a partition does not receive any further writes, is it much overhead to flush its cloud data every few seconds?

With this call we are waiting for the translator to actually execute the translation work, and this may take a while depending on the translation gap. If all the calls are idempotent then this is great: it will just be retried and will eventually succeed.
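The retry argument in this thread can be illustrated with a minimal Python sketch. It is not the code in data_migration_worker.cc: `wait_until_translated`, `reconcile`, and `FakeTranslator` are hypothetical names, and the translator is reduced to a counter. The point it shows is that the wait only observes progress, so timing out and calling it again is safe.

```python
import time


def wait_until_translated(get_translated_offset, target_offset, timeout_s, poll_s=0.01):
    """Passively wait until the translated offset reaches target_offset.

    Returns True on success and False if the deadline expires. The function
    only observes progress, so calling it again after a timeout is harmless.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_translated_offset() >= target_offset:
            return True
        time.sleep(poll_s)
    # One last check so that progress made during the final sleep is not missed.
    return get_translated_offset() >= target_offset


def reconcile(get_translated_offset, target_offset, timeout_s, max_attempts):
    """Hypothetical reconciliation loop: repeat the short wait until it succeeds.

    Returns the attempt number that succeeded, or None if every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        if wait_until_translated(get_translated_offset, target_offset, timeout_s):
            return attempt
    return None


class FakeTranslator:
    """Stand-in translator that advances by one offset each time it is polled."""

    def __init__(self):
        self.translated = 0

    def poll(self):
        self.translated += 1
        return self.translated
```

Under these assumptions a short per-attempt deadline only bounds how long one attempt holds on; the overall time to completion is governed by how fast the backlog is translated and by how often the caller retries.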

bashtanov · 2025-01-28T11:24:42Z

/dt

Check that with redpanda.iceberg.delete=false old table data remains available even before we recreate the topic.

And switch back to normal admin after disruptions are over.

add log lines, fix typos

If we unmount the topic before this, the table may lack metadata.

Introduce "offline mode" that cuts all ties to the topic in Redpanda cluster. It carries on querying the query engine and verifying results using info cached before going into offline mode.
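A rough Python sketch of the offline-mode idea follows. The class and its callbacks are hypothetical and do not reproduce datalake_verifier.py; it assumes the only topic information the verifier needs is the high watermark, which it snapshots before cutting ties with the cluster.

```python
class OfflineCapableVerifier:
    """Hypothetical verifier that keeps checking a table after losing the topic."""

    def __init__(self, fetch_topic_high_watermark, query_table_max_offset):
        self._fetch_hwm = fetch_topic_high_watermark  # talks to the cluster
        self._query_max = query_table_max_offset      # talks to the query engine
        self._cached_hwm = None
        self._offline = False

    def go_offline(self):
        # Snapshot everything needed from the topic, then stop using it.
        self._cached_hwm = self._fetch_hwm()
        self._offline = True

    def expected_max_offset(self):
        hwm = self._cached_hwm if self._offline else self._fetch_hwm()
        return hwm - 1  # the high watermark is one past the last offset

    def table_caught_up(self):
        # Only the query engine is contacted here once offline mode is on.
        return self._query_max() >= self.expected_max_offset()
```

After `go_offline()` the cluster callback is never invoked again, so the topic can be unmounted or deleted while verification of the table continues.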

to make sure the functionality is tested while the topic is being actively used

Make it possible to configure the number of messages produced by stream

Add scenarios: 1) On unmount, all messages that made their way to the topic eventually become available via the query engine; 2) Upon remount and further produce, both old and new messages are in the topic and in the table.

to prevent archiver shutdown while waiting

This is mostly to preserve iceberg properties, but also to make sure any newly introduced topic properties are preserved by default.
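The "preserved by default" behaviour can be sketched as copy-then-override, as opposed to copying an explicit allowlist of known properties. This is a Python illustration only; `properties_for_mounted_topic` is a hypothetical helper and the property names in the usage are examples, not a statement of what the Redpanda code handles.

```python
def properties_for_mounted_topic(stored_properties, overrides):
    """Start from every stored property and change only what must change.

    Copying first and overriding second means a property introduced in the
    future is carried over without touching this function.
    """
    result = dict(stored_properties)  # copy, so the stored mapping is untouched
    result.update(overrides)
    return result


stored = {
    "redpanda.iceberg.mode": "key_value",
    "retention.ms": "1000",
    "some.future.property": "x",  # stands in for a property added later
}
mounted = properties_for_mounted_topic(stored, {"retention.ms": "2000"})
```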

Allows using it for subscriptions where feedback from the called function is necessary, such as a future or an error code. All functions are supposed to return the same type.
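The idea of a notification list whose callbacks return a value can be sketched in Python as below. `NotificationList` here is a hypothetical illustration and does not mirror the signature of the C++ utility changed in this PR; it only shows subscribers sharing one return type and the caller collecting the results.

```python
from typing import Callable, Dict, Generic, List, TypeVar

Arg = TypeVar("Arg")
Ret = TypeVar("Ret")


class NotificationList(Generic[Arg, Ret]):
    """Registry of callbacks that all take Arg and all return Ret."""

    def __init__(self) -> None:
        self._next_id = 0
        self._callbacks: Dict[int, Callable[[Arg], Ret]] = {}

    def register(self, callback: Callable[[Arg], Ret]) -> int:
        callback_id = self._next_id
        self._next_id += 1
        self._callbacks[callback_id] = callback
        return callback_id

    def unregister(self, callback_id: int) -> None:
        self._callbacks.pop(callback_id, None)

    def notify(self, arg: Arg) -> List[Ret]:
        # Iterate over a copy so a callback may unregister itself safely.
        return [cb(arg) for cb in list(self._callbacks.values())]
```

With error codes as the shared return type, a caller can check `all(code == 0 for code in subscribers.notify(offset))` to learn whether every subscriber succeeded.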

Make offset_monitor more universal so that it can be used for different data types.
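A type-agnostic monitor of this kind can be sketched with asyncio. `ValueMonitor` is a hypothetical stand-in, not the Seastar-based offset_monitor: it assumes only that the monitored values are totally ordered, so integers, tuples, or strings all work.

```python
import asyncio
import heapq
import itertools


class ValueMonitor:
    """Wake waiters once the observed value reaches the value they wait for."""

    def __init__(self, initial):
        self._current = initial
        self._waiters = []  # min-heap of (target, tiebreak, future)
        self._tiebreak = itertools.count()

    async def wait_for(self, target):
        if self._current >= target:
            return self._current
        future = asyncio.get_running_loop().create_future()
        # The unique tiebreak keeps the heap from ever comparing futures.
        heapq.heappush(self._waiters, (target, next(self._tiebreak), future))
        return await future

    def notify(self, value):
        self._current = value
        while self._waiters and self._waiters[0][0] <= self._current:
            _, _, future = heapq.heappop(self._waiters)
            if not future.done():
                future.set_result(self._current)


async def demo():
    monitor = ValueMonitor(0)
    waiter = asyncio.ensure_future(monitor.wait_for(5))
    await asyncio.sleep(0)   # let the waiter register itself
    monitor.notify(3)        # below the target: the waiter stays parked
    await asyncio.sleep(0)
    still_waiting = not waiter.done()
    monitor.notify(7)        # at or above the target: the waiter is released
    return still_waiting, await waiter
```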

Also create and subscribe one of these actions: flush data to cloud.

Wait for the offset to be translated when asked by partition to "flush".

When blocking writes, collect the offset of the blocking message. Then use it to dispatch an all-components flush through the partition (leading to a cloud storage flush that ignores the offset parameter and a datalake translator flush that waits for the corresponding Kafka offset).
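The flow described here (block writes, take the blocking offset, fan a flush out to every component, bound the wait with a deadline) can be sketched in Python with asyncio. All names are hypothetical, and the sketch skips the conversion from the blocking offset to a Kafka offset that the real translator needs.

```python
import asyncio


async def cloud_storage_flush(_offset):
    # Assumption: stands in for a flush that uploads whatever is pending and
    # ignores the offset it is given.
    return "cloud flushed"


class FakeTranslator:
    """Stand-in translator whose flush waits for an offset to be translated."""

    def __init__(self):
        self.translated = -1
        self._changed = asyncio.Event()

    def advance(self, offset):
        self.translated = offset
        self._changed.set()

    async def flush(self, offset):
        # No active work happens here: just wait for progress made elsewhere.
        while self.translated < offset:
            self._changed.clear()
            await self._changed.wait()
        return f"translated up to {self.translated}"


async def block_writes_and_flush(block_writes, flushers, timeout_s):
    block_offset = block_writes()  # offset of the message that blocks writes
    results = await asyncio.wait_for(
        asyncio.gather(*(flush(block_offset) for flush in flushers)),
        timeout_s,
    )
    return block_offset, results


async def demo():
    translator = FakeTranslator()
    # Simulate translation catching up shortly after writes are blocked.
    asyncio.get_running_loop().call_later(0.01, translator.advance, 42)
    return await block_writes_and_flush(
        lambda: 42, [cloud_storage_flush, translator.flush], timeout_s=1.0
    )
```

If the deadline expires, `asyncio.wait_for` raises `asyncio.TimeoutError` and the caller is free to run the whole step again, which matches the retry-on-reconciliation behaviour discussed in the review thread.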

github-actions bot added area/build area/redpanda labels Jan 11, 2025

bashtanov force-pushed the iceberg-w-data-migrations branch from 2241484 to b140307 January 13, 2025 08:46

bashtanov marked this pull request as ready for review January 13, 2025 09:25

bashtanov force-pushed the iceberg-w-data-migrations branch 6 times, most recently from 5d9c8d7 to d493031 January 13, 2025 15:45

bashtanov requested review from mmaslankaprv, bharathv and ztlpn January 14, 2025 08:36