
[server][common][vpj] Introduce ComplexVenicePartitioner to materialized view #1509

Open · wants to merge 3 commits into base: main from value-based-partitioner

Conversation

@xunyin8 (Contributor) commented on Feb 7, 2025

[server][common][vpj] Introduce ComplexVenicePartitioner to materialized view

The change will not work if the record is actually large and chunked. Proper chunking support is needed and will be addressed in a separate PR.

  1. Introduced ComplexVenicePartitioner, which extends VenicePartitioner and offers a new API to partition by value, allowing a possible one-to-many partition mapping.

  2. Added a value provider of type Lazy<GenericRecord> to VeniceViewWriter's processRecord API to access the deserialized value when needed, e.g. when a ComplexVenicePartitioner is involved.

  3. MergeConflictResultWrapper and WriteComputeResultWrapper will now provide the deserialized value on a best-effort basis. This is useful when we have already deserialized the value for a partial update operation, so the deserialized value can be passed directly to the materialized view writer.

  4. Refactored VeniceWriter to expose some APIs to child classes, and introduced ComplexVeniceWriter, which extends VeniceWriter. The reasoning is that ComplexVeniceWriter will have different APIs, used in MaterializedViewWriter and CompositeVeniceWriter, to write to materialized view partition(s), potentially involving a ComplexVenicePartitioner. Alternatively, we could push common logic from VeniceWriter down to AbstractVeniceWriter. However, ComplexVeniceWriter shares so much common logic with VeniceWriter (chunking, DIV support, pubSubAdapter, etc.) that this would make AbstractVeniceWriter too specialized and unable to offer the flexibility needed to support something like CompositeVeniceWriter.

  5. Overrode putLargeValue in ComplexVeniceWriter to skip chunking and writing large messages. Once we have proper chunking support, we need to be careful not to re-chunk when writing the same value to different partitions in ComplexVeniceWriter.
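The core idea in item 1, a partitioner that maps one record to many partitions based on its value, can be pictured with a minimal sketch. All class shapes below are assumptions for illustration only (the real VenicePartitioner API and the GenericRecord value type differ); only the value-based fan-out concept comes from the PR description.

```java
import java.util.Arrays;

// Assumed stand-in for the classic key-based partitioner: one key, one partition.
abstract class VenicePartitioner {
  public abstract int getPartitionId(byte[] keyBytes, int numPartitions);
}

// Sketch of the new concept: an additional API that partitions by value and
// may return multiple target partitions for a single record.
abstract class ComplexVenicePartitioner extends VenicePartitioner {
  // The real API works with a deserialized GenericRecord; Object is used here
  // to keep the sketch self-contained.
  public abstract int[] getPartitionId(byte[] keyBytes, Object value, int numPartitions);
}

// Hypothetical example: the value carries a list of region ids, and the record
// is fanned out to one partition per region.
class RegionFanOutPartitioner extends ComplexVenicePartitioner {
  @Override
  public int getPartitionId(byte[] keyBytes, int numPartitions) {
    return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
  }

  @Override
  public int[] getPartitionId(byte[] keyBytes, Object value, int numPartitions) {
    int[] regions = (int[]) value;  // illustrative value shape, not Venice's
    int[] partitions = new int[regions.length];
    for (int i = 0; i < regions.length; i++) {
      partitions[i] = Math.floorMod(regions[i], numPartitions);
    }
    return partitions;
  }
}
```

The key contrast with the base class is the return type: one record can land in several materialized view partitions, which is why the view writer needs access to the deserialized value (item 2).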

How was this PR tested?

Added a new integration test with A/A, W/C, and a new test value-based partitioner.
Will add new unit tests once we have consensus on the API changes.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@FelixGV (Contributor) left a comment:


Just some early thoughts... I did not read the whole PR yet, but hopefully this is useful for discussing the API changes.

@xunyin8 force-pushed the value-based-partitioner branch from 0d80a11 to fa6c001 on February 13, 2025 07:18
@xunyin8 changed the title from [server][common][vpj] Introduce VeniceComplexPartitioner to materialized view to [server][common][vpj] Introduce ComplexVenicePartitioner to materialized view on Feb 13, 2025
@xunyin8 force-pushed the value-based-partitioner branch from fa6c001 to fee9bc7 on February 13, 2025 07:30
…zed view

The change will not work if record is actually large and chunked. Proper chunking
support is needed and will be addressed in a separate PR.

1. Introduced VeniceComplexPartitioner which extends VenicePartitioner and offers
a new API to partition by value, allowing a possible one-to-many partition mapping.

2. Added a value provider of type Lazy<GenericRecord> to VeniceViewWriter's processRecord
API to access the deserialized value when needed, e.g. when a VeniceComplexPartitioner is
involved.

3. MergeConflictResult will now provide the deserialized value on a best-effort basis.
This is useful when we have already deserialized the value for a partial update operation,
so the deserialized value can be passed directly to the materialized view writer.

4. Refactored VeniceWriter to expose an API to write to a desired partition with new
DIV. This is only used by the new method writeWithComplexPartitioner for now to handle
the partitioning and writes of the same value to multiple partitions. However, this newly
exposed API should also come in handy when we build proper chunking support to forward chunks
to predetermined view topic partitions.

5. writeWithComplexPartitioner in VeniceWriter will re-chunk when writing to each partition.
This should be optimized when we build proper chunking support.
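Items 4 and 5 describe a fan-out write path: partition by value, then write the same value once per target partition. The following is a minimal sketch of that flow with assumed names (the real VeniceWriter also manages chunking, DIV state, and callbacks, none of which is modeled here).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fan-out write path; all names are illustrative stand-ins.
class FanOutWriterSketch {
  // Assumed shape of the value-based partitioning callback.
  interface ValuePartitioner {
    int[] getPartitionIds(byte[] key, Object value, int numPartitions);
  }

  final List<String> log = new ArrayList<>();
  final int numPartitions;

  FanOutWriterSketch(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Stand-in for the newly exposed per-partition write API (item 4).
  void writeToPartition(byte[] key, byte[] serializedValue, int partition) {
    log.add("partition-" + partition + ":" + serializedValue.length + "B");
  }

  // Sketch of writeWithComplexPartitioner: partition by value, then write the
  // same serialized value to each target partition. This loop is where the PR
  // currently re-chunks per partition (item 5), flagged as a future optimization.
  void writeWithComplexPartitioner(
      byte[] key, Object value, byte[] serialized, ValuePartitioner partitioner) {
    for (int p : partitioner.getPartitionIds(key, value, numPartitions)) {
      writeToPartition(key, serialized, p);
    }
  }
}
```

With proper chunking support, the chunks could be produced once and forwarded to each predetermined partition instead of being regenerated inside the loop.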
@xunyin8 force-pushed the value-based-partitioner branch from fee9bc7 to ca52bc5 on February 18, 2025 07:23
@gaojieliu (Contributor) left a comment:


The code change looks good overall.
I do think we need to take care of the comment I just left; it is very tricky because it is a race condition.

Lazy<GenericRecord> oldValueProvider = Lazy.of(() -> {
  ChunkedValueManifestContainer oldValueManifestContainer = new ChunkedValueManifestContainer();
  int oldValueReaderSchemaId = schemaRepository.getSupersetSchema(storeName).getId();
  return readStoredValueRecord(

The readStoredValueRecord method reads the most-recent data, meaning it tries the transient record cache first and then RocksDB.
For a WC-enabled store, a delete will update the transient record to null for the key, so I think this method will always return null.

Even if we perform the lookup before updating the transient record cache, it would still be wrong because it is a lazy function: by the time the ViewWriter tries to produce to view topics, it will read the most recent value, which is null. The situation becomes worse when parallel compute for AA/WC workloads is enabled, because all the updates to the same key in the same batch will be executed (updating the transient cache) before producing to version/view topics. That means for a delete operation, the lazy function can read the most-recent value, which may have been populated by a later put in the same batch.

Can we always do a non-lazy lookup until we find a more optimized solution?
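The race described above boils down to when the lookup is evaluated: a lazy provider captures the lookup, not the value, so it observes the cache state at produce time rather than at processing time. A minimal, self-contained demonstration (all names here, including readStoredValueRecord and the cache shape, are simplified stand-ins for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Demonstrates why a lazy old-value lookup observes the post-delete cache
// state, while an eager lookup captures the value before the mutation.
class TransientCacheRaceDemo {
  static final Map<String, String> transientCache = new HashMap<>();

  // Simplified stand-in: reads most-recent data, transient cache first
  // (the real method falls back to RocksDB, omitted here).
  static String readStoredValueRecord(String key) {
    return transientCache.get(key);
  }

  // Returns {lazyResult, eagerResult} for the scenario in the comment.
  static String[] run() {
    transientCache.put("k", "v1");

    // Lazy provider: defers the lookup until .get() is called.
    Supplier<String> lazyOldValue = () -> readStoredValueRecord("k");
    // Eager (non-lazy) lookup: captures the value immediately.
    String eagerOldValue = readStoredValueRecord("k");

    // The delete updates the transient record cache to null for the key...
    transientCache.put("k", null);

    // ...so when the view writer finally produces, the lazy read sees null.
    return new String[] { lazyOldValue.get(), eagerOldValue };
  }
}
```

This is why the reviewer suggests an always-eager lookup as a stopgap: the eager read is pinned to the pre-mutation state, at the cost of deserializing even when the view writer never uses the value.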
