Iceberg with data migrations #24780

Merged
merged 23 commits into from
Jan 30, 2025

Conversation

bashtanov
Contributor

@bashtanov bashtanov commented Jan 11, 2025

https://redpandadata.atlassian.net/browse/CORE-8439

  • Add a test for Iceberg reading from a table whose topic was deleted
  • Fix minor data migration test issues
  • Add a test that runs Iceberg translation for topics that are unmounted and then, optionally, mounted again
  • For recovered and mounted topics, make Redpanda preserve most topic properties, including Iceberg ones (fixes https://redpandadata.atlassian.net/browse/CORE-563)
  • When unmounting, make sure all messages are translated for Iceberg

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Features

  • Make Iceberg and topic mount/unmount work well together

@bashtanov
Contributor Author

/dt

@bashtanov bashtanov force-pushed the iceberg-w-data-migrations branch from 2241484 to b140307 Compare January 13, 2025 08:46
@bashtanov bashtanov marked this pull request as ready for review January 13, 2025 09:25
@bashtanov bashtanov force-pushed the iceberg-w-data-migrations branch 6 times, most recently from 5d9c8d7 to d493031 Compare January 13, 2025 15:45
@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 13, 2025

CI test results

test results on build#60655

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| idempotency_tests_rpunit.idempotency_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60655#01946058-a634-40d0-9eaa-a36681179d0c | FLAKY | 1/2 |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60655#019460b1-911e-438a-9908-47112e140ed0 | FLAKY | 2/6 |

test results on build#60858

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| idempotency_tests_rpunit.idempotency_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60858#0194702b-4eab-49b6-a88e-86bc997a49fa | FLAKY | 1/2 |
| rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3.scenario=remount | ducktape | https://buildkite.com/redpanda/redpanda/builds/60858#01947072-773f-429b-924e-88e414d4e4b8 | FAIL | 0/1 |

test results on build#61039

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic | ducktape | https://buildkite.com/redpanda/redpanda/builds/61039#01948df4-24e0-4a1b-adc7-427bf0a8eec8 | FLAKY | 1/2 |
| rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete | ducktape | https://buildkite.com/redpanda/redpanda/builds/61039#01948e11-1c37-43dc-9d36-c64694dbeea3 | FLAKY | 1/2 |

test results on build#61291

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/61291#0194ad0b-4239-400c-99f3-ec7ba3ebc4e8 | FLAKY | 1/2 |
| rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61291#0194ad0b-4239-4508-9eb4-142a5a4211f1 | FLAKY | 1/2 |
| storage_single_thread_rpunit.storage_single_thread_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/61291#0194acab-90cb-41b8-8d84-aa2c0d1c0c2e | FLAKY | 1/2 |

test results on build#61361

| test_id | test_kind | job_url | test_status | passed |
|---|---|---|---|---|
| rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery | ducktape | https://buildkite.com/redpanda/redpanda/builds/61361#0194b306-2928-4576-bc99-42bbc006d21c | FLAKY | 1/2 |
| rptest.tests.enterprise_features_license_test.EnterpriseFeaturesTest.test_enable_features.feature=Feature.oidc.install_license=True.disable_trial=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e9-4daf-8a19-487158760573 | FLAKY | 1/2 |
| rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.FAST_MOVES.cleanup_policy=compact | ducktape | https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e8-4103-bbfb-cf32306b71a9 | FLAKY | 1/2 |
| rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e7-4f8b-96d2-fb381349c39f | FLAKY | 1/2 |
| rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic | ducktape | https://buildkite.com/redpanda/redpanda/builds/61361#0194b30b-c7e7-4f8b-96d2-fb381349c39f | FLAKY | 1/2 |

@bashtanov bashtanov force-pushed the iceberg-w-data-migrations branch from d493031 to 796d262 Compare January 16, 2025 17:29
@vbotbuildovich
Collaborator

Retry command for Build#60858

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/simple_connect_test.py::RedpandaConnectIcebergTest.test_translating_avro_serialized_records@{"cloud_storage_type":1,"scenario":"remount"}

@bashtanov
Contributor Author

Meh. It does not fail when I run it locally with repeat.

@bashtanov bashtanov marked this pull request as draft January 17, 2025 00:09
@bashtanov bashtanov force-pushed the iceberg-w-data-migrations branch from 796d262 to 9e89a98 Compare January 17, 2025 08:49
@bashtanov bashtanov marked this pull request as ready for review January 17, 2025 08:49
@bashtanov
Contributor Author

I increased the timeout in the test because of the coordinator loop: one last use of the long one may still be in progress while we are waiting. I also added some logging and removed dead code. Please re-review.

@bashtanov
Contributor Author

/dt

co_return co_await flush(partition);
auto block_offset = block_res.value();

auto deadline = model::timeout_clock::now() + 5s;
Member

It seems to me that this timeout value is very low considering the amount of work that may need to be done. Should we consider making it larger and configurable?

Contributor Author

It will be retried by migration reconciliation, including when controller leadership changes. Since the partition is already blocked, the translation backlog will eventually get processed. Translator flush doesn't do any active work; it just waits for a certain offset to be translated. The only problem is that the cloud storage flush will be invoked every time. @WillemKauf @andrwng if a partition does not receive any further writes, is it much overhead to flush its cloud data every few seconds?

Member

With this call we are waiting for the translator to actually execute the translation work, and this may take a while depending on the translation gap. If all the calls are idempotent then this is great: it will just be retried and will eventually succeed.
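
The retry argument in this thread can be illustrated with a small standalone sketch. It is not Redpanda code: `translator_stub`, `flush_until`, and `reconcile` are hypothetical names, time is modelled as loop ticks, and the real code uses asynchronous coroutines (see the `co_await` in the snippet under review) rather than a synchronous loop. It only shows why a short deadline is tolerable when the flush merely waits for an offset and the caller retries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the datalake translator: it advances its
// translated offset on every tick, independently of any flush call.
struct translator_stub {
    int64_t translated = 0;
    int64_t step = 10;
    void tick() { translated += step; }
};

enum class flush_result { done, timed_out };

// Waits for at most max_ticks until `target` has been translated.
// It performs no translation work itself, so repeating the call is harmless.
flush_result flush_until(translator_stub& t, int64_t target, int max_ticks) {
    for (int i = 0; i < max_ticks; ++i) {
        if (t.translated >= target) {
            return flush_result::done;
        }
        t.tick(); // models time passing while we wait
    }
    return t.translated >= target ? flush_result::done
                                  : flush_result::timed_out;
}

// Hypothetical reconciliation loop: keeps retrying the flush until it
// succeeds and returns the number of attempts that were needed.
int reconcile(translator_stub& t, int64_t block_offset, int ticks_per_attempt) {
    assert(t.step > 0 && ticks_per_attempt > 0); // otherwise no progress
    int attempts = 0;
    while (true) {
        ++attempts;
        if (flush_until(t, block_offset, ticks_per_attempt)
            == flush_result::done) {
            return attempts;
        }
    }
}
```

Because writes are already blocked, the target offset is fixed, so each timed-out attempt leaves less backlog for the next one. The cost that remains is whatever each retry triggers besides waiting, which is the cloud storage flush raised in the question above.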

@bashtanov
Contributor Author

/dt

Check that with redpanda.iceberg.delete=false old table data remains available even before we recreate the topic.
And switch back to normal admin after disruptions are over.
Add log lines, fix typos.
If we unmount the topic before this, the table may lack metadata.
Introduce "offline mode" that cuts all ties to the topic in the Redpanda cluster. It carries on querying the query engine and verifying results using info cached before going into offline mode.
To make sure functionality is tested while the topic is being actively used.
Make it possible to configure the number of messages produced by stream.
Add scenarios:
1) On unmount, all messages that made their way to the topic eventually become available via the query engine
2) Upon remount and further produce, both old and new messages are in the topic and in the table
This is mostly to preserve Iceberg properties, but also to make sure any newly introduced topic properties are preserved by default.
Allows using it for subscriptions where feedback from a called function is necessary, such as a future or an error code. All functions are supposed to return the same type.
Make offset_monitor more universal so that it can be used for different data types.
Also create and subscribe one of these actions: flush data to cloud.
Wait for the offset to be translated when asked by the partition to "flush".
When blocking writes, collect the offset of the blocking message. Then use it to dispatch an all-components flush through the partition (leading to a cloud storage flush that ignores the offset parameter and a datalake translator that waits for the corresponding Kafka offset).
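
The flush dispatch described in the last few commit messages can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code from this PR: `action_list`, `errc`, `components`, `make_subscribers`, and `flush_all` are hypothetical names, the subscribers are synchronous instead of returning futures, the translator subscriber reports a timeout immediately instead of waiting, and the conversion from log offset to Kafka offset is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical error code shared by all flush subscribers.
enum class errc { success, timeout };

// A list of callbacks that all return the same type R, so the caller can
// collect feedback from every subscriber.
template <typename R, typename... Args>
class action_list {
public:
    void subscribe(std::function<R(Args...)> fn) {
        _fns.push_back(std::move(fn));
    }

    // Calls every subscriber and returns their results in subscription order.
    std::vector<R> notify(Args... args) const {
        std::vector<R> results;
        results.reserve(_fns.size());
        for (const auto& fn : _fns) {
            results.push_back(fn(args...));
        }
        return results;
    }

private:
    std::vector<std::function<R(Args...)>> _fns;
};

// Hypothetical state of the two components that take part in a flush.
struct components {
    int64_t translated_offset = 0; // progress of the datalake translator
    int cloud_flushes = 0;         // how often cloud storage was flushed
};

// Subscribes both components; `c` must outlive the returned list.
action_list<errc, int64_t> make_subscribers(components& c) {
    action_list<errc, int64_t> subs;
    // Cloud storage flush: ignores the offset parameter.
    subs.subscribe([&c](int64_t) {
        ++c.cloud_flushes;
        return errc::success;
    });
    // Datalake translator: succeeds only once the offset is translated.
    subs.subscribe([&c](int64_t offset) {
        return c.translated_offset >= offset ? errc::success : errc::timeout;
    });
    return subs;
}

// Dispatches the flush to all subscribers; reports the first failure.
errc flush_all(const action_list<errc, int64_t>& subs, int64_t block_offset) {
    for (errc e : subs.notify(block_offset)) {
        if (e != errc::success) {
            return e;
        }
    }
    return errc::success;
}
```

In this sketch every call to `flush_all` runs the cloud storage subscriber again even when the translator is the only component still lagging, which mirrors the overhead question raised in the review thread.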
@bashtanov bashtanov force-pushed the iceberg-w-data-migrations branch from 9e89a98 to ae89710 Compare January 29, 2025 15:50
@mmaslankaprv mmaslankaprv self-requested a review January 30, 2025 08:50
@bashtanov bashtanov merged commit 6d82667 into redpanda-data:dev Jan 30, 2025
18 checks passed
3 participants