feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor #111

Open
wants to merge 12 commits into base: main

Conversation

@tolik0 (Contributor) commented Dec 4, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced concurrent cursor capabilities for improved data processing.
    • Added methods for managing partitioned cursors, enhancing state management.
  • Bug Fixes

    • Streamlined request handling in the SimpleRetriever class for consistency.
  • Documentation

    • Updated tests to reflect changes in cursor handling and ensure comprehensive coverage.
  • Tests

    • Added unit tests for ConcurrentDeclarativeSource and updated tests for record filtering to match new cursor parameterization.

@github-actions bot added the enhancement (New feature or request) label on Dec 4, 2024

coderabbitai bot (Contributor) commented Dec 4, 2024

📝 Walkthrough

This pull request enhances the Airbyte CDK's concurrent processing of streams. Key updates include new cursor types, modifications to existing classes to support those cursors, and refinements to request handling. The ConcurrentDeclarativeSource class is updated to incorporate the new cursor logic, and methods across several files are adjusted for compatibility with the new features. Unit tests are added or updated to validate the changes to stream handling and cursor management.
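
For a concrete mental model of how these pieces fit together, here is a minimal, self-contained sketch (not the CDK implementation; every class name and the `updated_at` cursor field are illustrative):

```python
from typing import Any, Callable, Dict, Mapping


class SimpleCursor:
    """Toy cursor: tracks the maximum value of a single cursor field."""

    def __init__(self, stream_state: Mapping[str, Any]) -> None:
        self.state: Dict[str, Any] = dict(stream_state)

    def observe(self, record: Mapping[str, Any]) -> None:
        value = record["updated_at"]
        if value > self.state.get("updated_at", ""):
            self.state["updated_at"] = value


class CursorFactorySketch:
    """Builds a fresh cursor for each partition from a create function."""

    def __init__(self, create_function: Callable[..., SimpleCursor]) -> None:
        self._create_function = create_function

    def create(self, stream_state: Mapping[str, Any]) -> SimpleCursor:
        return self._create_function(stream_state=stream_state)


class PerPartitionCursorSketch:
    """Routes records to one cursor per partition and merges their states."""

    def __init__(self, factory: CursorFactorySketch) -> None:
        self._factory = factory
        self._cursor_per_partition: Dict[str, SimpleCursor] = {}

    def observe(self, partition_key: str, record: Mapping[str, Any]) -> None:
        if partition_key not in self._cursor_per_partition:
            self._cursor_per_partition[partition_key] = self._factory.create({})
        self._cursor_per_partition[partition_key].observe(record)

    @property
    def state(self) -> Dict[str, Any]:
        return {
            "states": [
                {"partition": key, "cursor": cursor.state}
                for key, cursor in self._cursor_per_partition.items()
            ]
        }


cursor = PerPartitionCursorSketch(CursorFactorySketch(SimpleCursor))
cursor.observe("account_1", {"updated_at": "2024-12-01"})
cursor.observe("account_2", {"updated_at": "2024-12-03"})
assert len(cursor.state["states"]) == 2
```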

Changes

  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py: Added an import for PerPartitionWithGlobalCursor and updated the _group_streams method to handle the new cursor logic for concurrent processing.
  • airbyte_cdk/sources/declarative/incremental/__init__.py: Added ConcurrentCursorFactory and ConcurrentPerPartitionCursor to the __all__ list and corrected a formatting issue.
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py: Introduced the ConcurrentCursorFactory and ConcurrentPerPartitionCursor classes with methods for managing partitioned cursors and state.
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py: Added methods for creating concurrent cursors and modified an existing method to enhance the request options logic.
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py: Updated request handling methods to use request_option_provider instead of stream_slicer.
  • airbyte_cdk/sources/streams/concurrent/cursor.py: Added a close_partition_without_emit method to ConcurrentCursor for closing partitions without emitting state messages.
  • airbyte_cdk/sources/declarative/extractors/record_filter.py: Consolidated the cursor parameters in the ClientSideIncrementalRecordFilterDecorator constructor and updated the filtering logic.
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py: Added a should_be_synced method and modified is_greater_than_or_equal for stricter validation.
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py: Added unit tests for ConcurrentDeclarativeSource with parameterized test cases.
  • unit_tests/sources/declarative/extractors/test_record_filter.py: Updated tests to reflect changes in the ClientSideIncrementalRecordFilterDecorator constructor parameters.

Possibly related PRs

Suggested reviewers

  • maxi297
  • aaronsteers

Would you like to consider any additional reviewers or related PRs? Wdyt?


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

59-66: Consider simplifying the constructor parameters

The __init__ method of ConcurrentPerPartitionCursor has many parameters, which can make it harder to maintain and understand. Would it be beneficial to encapsulate related parameters into data classes or reduce the number of parameters if possible? Wdyt?
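
One shape that encapsulation could take, as a hedged sketch (the StreamIdentifier and CursorContext names are hypothetical, not existing CDK types):

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class StreamIdentifier:
    """Groups the stream-identity arguments that always travel together."""

    name: str
    namespace: Optional[str] = None


@dataclass(frozen=True)
class CursorContext:
    """Bundles the remaining per-cursor construction arguments."""

    stream: StreamIdentifier
    stream_state: Mapping[str, Any]
    cursor_field: str


class ConcurrentPerPartitionCursorSketch:
    # Two cohesive objects replace a long list of loose parameters.
    def __init__(self, context: CursorContext) -> None:
        self._context = context


cursor = ConcurrentPerPartitionCursorSketch(
    CursorContext(
        stream=StreamIdentifier(name="posts"),
        stream_state={},
        cursor_field="updated_at",
    )
)
```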

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)

23-25: Unused import statement

We import PerPartitionWithGlobalCursor, but it's not used elsewhere in the code. Should we remove this import to keep the code clean? Wdyt?


309-361: Refactor duplicated code in _group_streams method

There seems to be duplicated code in the conditional blocks handling DatetimeBasedCursorModel streams. Could we refactor these blocks into a helper function to reduce redundancy and improve readability? Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

915-970: Consider reducing code duplication

The methods create_concurrent_cursor_from_datetime_based_cursor and create_concurrent_cursor_from_perpartition_cursor share similar logic. Could we extract the common parts into a shared helper function to simplify maintenance? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Consider extracting common logic between close methods?

The new close_partition_without_emit shares a lot of logic with close_partition. What do you think about extracting the common logic into a private method to reduce duplication? Something like this, wdyt?

+    def _close_partition_internal(self, partition: Partition) -> bool:
+        slice_count_before = len(self.state.get("slices", []))
+        self._add_slice_to_state(partition)
+        should_merge = slice_count_before < len(self.state["slices"])
+        if should_merge:
+            self._merge_partitions()
+        self._has_closed_at_least_one_slice = True
+        return should_merge
+
     def close_partition_without_emit(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
-        self._has_closed_at_least_one_slice = True
+        self._close_partition_internal(partition)

     def close_partition(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
+        if self._close_partition_internal(partition):
             self._emit_state_message()
-        self._has_closed_at_least_one_slice = True
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between acb6630 and a3304b9.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
  • airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

25-29: Question about returning only the first cursor

In the ConcurrentCursorFactory, the create method returns only the first cursor from _create_function. Since _create_function returns a tuple of cursors, should we ensure that we're not discarding any necessary cursors? Perhaps we should handle all cursors returned. Wdyt?


221-224: Ensure partition key serialization handles edge cases

In _to_partition_key and _to_dict, we use self._partition_serializer. Do we need to ensure that partition keys are properly sanitized or encoded to handle special characters and prevent serialization issues? Wdyt?
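
For illustration, one way to get stable, reversible partition keys is JSON serialization with sorted keys. This is a sketch of the goal, not necessarily what self._partition_serializer actually does:

```python
import json
from typing import Any, Mapping


def to_partition_key(partition: Mapping[str, Any]) -> str:
    # sort_keys makes the key independent of dict insertion order;
    # ensure_ascii escapes special characters into plain ASCII.
    return json.dumps(partition, sort_keys=True, ensure_ascii=True)


def from_partition_key(key: str) -> Mapping[str, Any]:
    return json.loads(key)


original = {"región": "ça va?", "id": 42}
assert from_partition_key(to_partition_key(original)) == original
```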

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-318: Type checking with isinstance

We use isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor) to check the type. Should we consider using duck typing or interface checks to allow for subclasses or alternative implementations that fulfill the same protocol? Wdyt?
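
A sketch of the duck-typed alternative using typing.Protocol (the protocol name and method set here are illustrative, not the CDK's actual interface):

```python
from typing import Any, Iterable, Mapping, Protocol, runtime_checkable


@runtime_checkable
class PartitionedStreamSlicer(Protocol):
    """Structural type: any object that can enumerate stream slices."""

    def stream_slices(self) -> Iterable[Mapping[str, Any]]: ...


class CustomSlicer:
    # No inheritance required; providing the method is enough.
    def stream_slices(self) -> Iterable[Mapping[str, Any]]:
        yield {"partition": "a"}


# Caveat: runtime_checkable protocols only verify that the method exists,
# not its signature, so this is a looser check than a concrete isinstance.
assert isinstance(CustomSlicer(), PartitionedStreamSlicer)
```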

airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)

41-66: Clarify the purpose of DeclarativePartitionFactory1

The new class DeclarativePartitionFactory1 appears to be similar to DeclarativePartitionFactory but with a different approach to handling the retriever. Is this intended to replace the original factory or serve a different purpose? Providing more context or renaming the class for clarity might help. Wdyt?

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

181-181: LGTM!

The change to use request_option_provider instead of stream_slicer improves modularity by separating concerns.

Comment on lines +90 to +105
def state(self) -> MutableMapping[str, Any]:
    states = []
    for partition_tuple, cursor in self._cursor_per_partition.items():
        cursor_state = cursor._connector_state_converter.convert_to_state_message(
            cursor._cursor_field, cursor.state
        )
        if cursor_state:
            states.append(
                {
                    "partition": self._to_dict(partition_tuple),
                    "cursor": copy.deepcopy(cursor_state),
                }
            )
    state: dict[str, Any] = {"states": states}
    return state


🛠️ Refactor suggestion

Potential performance issue when generating state

The state property method iterates over self._cursor_per_partition.items() and performs deep copies of cursor states. This could become a performance bottleneck with a large number of partitions. Should we consider optimizing this by avoiding deep copies or processing states incrementally? Wdyt?


Is there a reason we're doing a deepcopy here? If so, I think we should document it
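
For context, if convert_to_state_message returns a flat, freshly built mapping (an assumption, not something verified against the CDK), a shallow copy would be enough and cheaper than deepcopy:

```python
from typing import Any, Dict, List


def collect_partition_states(
    cursor_states: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    # dict(...) copies only the top level; that suffices when the converted
    # state is flat, and avoids deepcopy's per-partition overhead.
    return [
        {"partition": partition, "cursor": dict(state)}
        for partition, state in cursor_states.items()
    ]


states = collect_partition_states({'{"id": 1}': {"updated_at": "2024-12-01"}})
assert states[0]["cursor"] == {"updated_at": "2024-12-01"}
```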

Comment on lines +152 to +164
def _ensure_partition_limit(self) -> None:
    """
    Ensure the maximum number of partitions is not exceeded. If so, the oldest added partition will be dropped.
    """
    while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1:
        self._over_limit += 1
        oldest_partition = self._cursor_per_partition.popitem(last=False)[
            0
        ]  # Remove the oldest partition
        logger.warning(
            f"The maximum number of partitions has been reached. Dropping the oldest partition: {oldest_partition}. Over limit: {self._over_limit}."
        )


⚠️ Potential issue

Clarification on partition limit logic

In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, in limit_reached, we check if _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Is this the intended behavior? Should the condition be adjusted to properly reflect when the limit is truly exceeded? Wdyt?

Comment on lines +165 to +167
def limit_reached(self) -> bool:
    return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER


⚠️ Potential issue

Possible off-by-one error in limit_reached method

The method limit_reached returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Given how _over_limit is incremented, could this condition lead to an off-by-one error? Should we revisit this logic to ensure accurate limit checks? Wdyt?
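
To make the concern concrete, here is one possible reading of the intended semantics as a self-contained sketch (the cap value and names are illustrative): count evictions, and treat any eviction as the limit having been reached.

```python
from collections import OrderedDict

DEFAULT_MAX_PARTITIONS_NUMBER = 3


def ensure_partition_limit(cursors: OrderedDict) -> int:
    """Evict the oldest partitions until under the cap; return the drop count."""
    dropped = 0
    while len(cursors) > DEFAULT_MAX_PARTITIONS_NUMBER:
        cursors.popitem(last=False)  # FIFO eviction of the oldest partition
        dropped += 1
    return dropped


cursors = OrderedDict((f"partition_{i}", object()) for i in range(5))
over_limit = ensure_partition_limit(cursors)

assert over_limit == 2 and len(cursors) == DEFAULT_MAX_PARTITIONS_NUMBER
# Under this interpretation, limit_reached is simply "was anything dropped":
assert (over_limit > 0) is True
```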

Comment on lines 319 to 332
stream_state = state_manager.get_stream_state(
    stream_name=declarative_stream.name, namespace=declarative_stream.namespace
)
partition_router = declarative_stream.retriever.stream_slicer._partition_router

cursor = self._constructor.create_concurrent_cursor_from_perpartition_cursor(
    state_manager=state_manager,
    model_type=DatetimeBasedCursorModel,
    component_definition=incremental_sync_component_definition,
    stream_name=declarative_stream.name,
    stream_namespace=declarative_stream.namespace,
    config=config or {},
    stream_state=stream_state,
    partition_router=partition_router,

⚠️ Potential issue

Handling potential None in stream_state

When retrieving stream_state, do we need to handle the case where it might be None? Ensuring that stream_state is properly initialized could prevent unexpected errors during cursor creation. Wdyt?
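
A minimal sketch of the kind of guard being suggested (normalized_stream_state is a hypothetical helper, not an existing CDK function):

```python
from typing import Any, Mapping, Optional


def normalized_stream_state(raw: Optional[Mapping[str, Any]]) -> Mapping[str, Any]:
    # Treat a missing state as an empty mapping so downstream cursor
    # creation never has to branch on None.
    return raw if raw is not None else {}


assert normalized_stream_state(None) == {}
assert normalized_stream_state({"states": []}) == {"states": []}
```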

Comment on lines 333 to 348
)

partition_generator = StreamSlicerPartitionGenerator(
    DeclarativePartitionFactory(
        declarative_stream.name,
        declarative_stream.get_json_schema(),
        self._retriever_factory(
            name_to_stream_mapping[declarative_stream.name],
            config,
            stream_state,
        ),
        self.message_repository,
    ),
    cursor,
)

🛠️ Refactor suggestion

Error handling for partition generator creation

In constructing the partition_generator, are there potential edge cases where dependencies might not be properly initialized? Should we add error handling or input validation to make the code more robust against such scenarios? Wdyt?

Comment on lines +85 to +86
ConcurrentCursorFactory,
ConcurrentPerPartitionCursor,

⚠️ Potential issue

Missing comma in import statement

In the import statement for ConcurrentCursorFactory and ConcurrentPerPartitionCursor, there's a missing comma between them. Should we add a comma to prevent syntax errors? Wdyt?

Comment on lines +18 to +19
"ConcurrentCursorFactory"
"ConcurrentPerPartitionCursor",

⚠️ Potential issue

Comma missing in __all__ list

In the __all__ list, there's a missing comma after "ConcurrentCursorFactory". This could lead to import errors. Should we add the comma to fix this issue? Wdyt?
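
For anyone unfamiliar with why this matters: the missing comma is not a syntax error, because Python implicitly concatenates adjacent string literals. A runnable demonstration:

```python
# With the comma missing, the two names silently merge into one bogus export:
broken = [
    "ConcurrentCursorFactory"
    "ConcurrentPerPartitionCursor",
]
assert broken == ["ConcurrentCursorFactoryConcurrentPerPartitionCursor"]

# With the comma restored, both names are listed as intended:
fixed = [
    "ConcurrentCursorFactory",
    "ConcurrentPerPartitionCursor",
]
assert len(fixed) == 2
```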

Comment on lines 50 to 53
The DeclarativePartitionFactory takes a retriever_factory and not a retriever directly. The reason is that our components are not
thread safe and classes like `DefaultPaginator` may not work because multiple threads can access and modify a shared field across each other.
In order to avoid these problems, we will create one retriever per thread which should make the processing thread-safe.
"""

⚠️ Potential issue

Update docstring to match the new constructor

The docstring for DeclarativePartitionFactory1 mentions that it takes a retriever_factory, but the constructor now accepts a retriever instance directly. Should we update the docstring to reflect this change and prevent confusion? Wdyt?

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

152-156: Could we simplify the partition limit condition in _ensure_partition_limit method?

Currently, the condition is len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1. Would it improve readability to use len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER instead? This change might make the maximum partition limit clearer. Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

910-965: Consider extracting common validation logic and fixing indentation.

A few suggestions to improve this new method:

  1. The component type validation logic (lines 922-926) is duplicated from create_concurrent_cursor_from_datetime_based_cursor. Should we extract it into a helper method to follow DRY?
  2. The indentation in the return statement (lines 955-964) seems inconsistent with the rest of the file. Should we align it with the standard indentation?
  3. Instead of using type ignore on line 961, we could properly type message_repository. Wdyt?
     def create_concurrent_cursor_from_perpartition_cursor(
         self,
         state_manager: ConnectorStateManager,
         model_type: Type[BaseModel],
         component_definition: ComponentDefinition,
         stream_name: str,
         stream_namespace: Optional[str],
         config: Config,
         stream_state: MutableMapping[str, Any],
         partition_router,
             **kwargs: Any,
     ) -> ConcurrentPerPartitionCursor:
-        component_type = component_definition.get("type")
-        if component_definition.get("type") != model_type.__name__:
-            raise ValueError(
-                f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
-            )
+        self._validate_component_type(component_definition, model_type)

         # ... rest of the method ...

         return ConcurrentPerPartitionCursor(
-                cursor_factory=cursor_factory,
-                partition_router=partition_router,
-                stream_name=stream_name,
-                stream_namespace=stream_namespace,
-                stream_state=stream_state,
-                message_repository=self._message_repository,  # type: ignore
-                connector_state_manager=state_manager,
-                cursor_field=cursor_field,
-            )
+            cursor_factory=cursor_factory,
+            partition_router=partition_router,
+            stream_name=stream_name,
+            stream_namespace=stream_namespace,
+            stream_state=stream_state,
+            message_repository=self._message_repository,
+            connector_state_manager=state_manager,
+            cursor_field=cursor_field,
+        )

Helper method to add:

def _validate_component_type(self, component_definition: ComponentDefinition, expected_type: Type[BaseModel]) -> None:
    component_type = component_definition.get("type")
    if component_definition.get("type") != expected_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {expected_type.__name__}, but received {component_type} instead"
        )
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)

276-276: Should the test case name match the function name for consistency?

In the parameterization at line 276, the test name is "test_incremental_parent_state", but the test function is named test_incremental_parent_state_no_incremental_dependency. Aligning the test case name with the function name could enhance clarity. Wdyt?


483-495: Could we improve the docstring formatting for better readability?

The docstring contains a detailed explanation of the test, but it's indented inconsistently. Adjusting the indentation could enhance readability and maintain consistency with PEP 257 guidelines. Wdyt?


519-523: Is comparing only the last state sufficient to validate the final state?

In the assertion at lines 519-523, we compare final_state[-1] with expected_state. Should we consider comparing all elements of final_state to ensure that all state messages throughout the sync match the expectations? Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4ddbb84 and dfcf17f.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_filter.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_filter.py (1)

78-81: Is it safe to use an empty stream_name when creating Record instances?

In the filter_records method, we're creating Record instances with an empty stream_name. Could this lead to issues if other parts of the codebase expect stream_name to be non-empty? Should we consider passing the actual stream name to prevent potential problems? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Could we clarify the behavior of close_partition_without_emit method?

In the close_partition_without_emit method, we update _has_closed_at_least_one_slice to True, but we don't emit a state message. Is this intended? Should we ensure that the state is eventually emitted elsewhere to keep the state consistent? Wdyt?

Comment on lines 306 to 313
if self._to_partition_key(record.associated_slice.partition) not in self._cursor_per_partition:
    partition_state = (
        self._state_to_migrate_from
        if self._state_to_migrate_from
        else self._NO_CURSOR_STATE
    )
    cursor = self._create_cursor(partition_state)


⚠️ Potential issue

Should we add a check for record.associated_slice being None in should_be_synced?

In the should_be_synced method, we access record.associated_slice.partition without verifying if record.associated_slice is not None. This could raise an AttributeError if record.associated_slice is None. Should we add a check to ensure record.associated_slice is not None before proceeding? Wdyt?

Comment on lines +1969 to 1971
) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
# Many of the custom component implementations of DatetimeBasedCursor override get_request_params() (or other methods).
# Because we're decoupling RequestOptionsProvider from the Cursor, custom components will eventually need to reimplement

⚠️ Potential issue

Simplify the type checking condition.

The current condition seems overly complex and might cause issues with valid subclasses:

not isinstance(stream_slicer, DatetimeBasedCursor) or type(stream_slicer) is not DatetimeBasedCursor

Should we simplify it to just use isinstance? The current combination of isinstance and type checks with OR could unexpectedly reject valid subclasses of DatetimeBasedCursor. Wdyt?

-        if (
-            not isinstance(stream_slicer, DatetimeBasedCursor)
-            or type(stream_slicer) is not DatetimeBasedCursor
-        ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
+        if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):

Comment on lines 264 to 268
source_config=manifest, config=config, catalog=catalog, state=state
)
messages = list(
source.read(logger=source.logger, config=config, catalog=catalog, state=[])
)

⚠️ Potential issue

Should we pass the initial_state to the read method?

I noticed that in the _run_read function, we initialize ConcurrentDeclarativeSource with the state parameter, but when calling source.read, we're passing state=[] instead of state=state. Passing the initial_state to read might be necessary to ensure the state is correctly utilized during the read operation. Wdyt?

@tolik0 (Contributor, Author) commented Dec 6, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made, those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

❌ Job failed.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

Line range hint 341-432: Comprehensive test coverage for cursor types! 👍

The parameterized tests cover all important cursor scenarios:

  • Basic datetime cursor
  • Global substream cursor (with and without state)
  • Per-partition cursor with various state configurations

One thought: would it be valuable to add a test case for error handling when an invalid cursor type is passed? wdyt?
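
A sketch of what such a test could look like; RecordFilterSketch and its validation are hypothetical stand-ins, since the real decorator may not validate its cursor argument at all:

```python
import pytest


class RecordFilterSketch:
    """Toy stand-in for ClientSideIncrementalRecordFilterDecorator."""

    def __init__(self, cursor: object) -> None:
        # Assumed validation; the real constructor may accept anything.
        if not hasattr(cursor, "should_be_synced"):
            raise ValueError(f"Unsupported cursor type: {type(cursor).__name__}")
        self._cursor = cursor


def test_invalid_cursor_type_raises() -> None:
    with pytest.raises(ValueError, match="Unsupported cursor type"):
        RecordFilterSketch(cursor="not-a-cursor")
```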

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dfcf17f and b84e68a.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/extractors/test_record_filter.py (2 hunks)
🔇 Additional comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

293-293: Nice API simplification! 🎉

I see you've consolidated the cursor parameters (date_time_based_cursor and substream_cursor) into a single cursor parameter. This makes the API cleaner and more flexible. The tests demonstrate it works well with all cursor types (DatetimeBasedCursor, GlobalSubstreamCursor, and PerPartitionWithGlobalCursor).

Also applies to: 431-431

@tolik0 (Contributor, Author) commented Dec 6, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made, those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

305-316: Consider extracting condition to a descriptive method

The nested conditions make the code harder to read and maintain. Would you consider extracting these conditions into a separate method with a descriptive name? Wdyt?

Something like:

+def _supports_concurrent_perpartition_cursor(
+    self,
+    declarative_stream: DeclarativeStream,
+    incremental_sync_component_definition: Mapping[str, Any]
+) -> bool:
+    return (
+        incremental_sync_component_definition
+        and incremental_sync_component_definition.get("type", "") == DatetimeBasedCursorModel.__name__
+        and self._stream_supports_concurrent_partition_processing(declarative_stream=declarative_stream)
+        and hasattr(declarative_stream.retriever, "stream_slicer")
+        and isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)
+    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b84e68a and a6b5daa.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

84-85: ⚠️ Potential issue

Add missing comma between imports

There's a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor in the import statement.

Apply this diff:

-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (2)

Line range hint 320-358: LGTM! Well-structured comparison logic

The implementation includes proper null checks and validation for partition matching. The error messages are clear and descriptive.


306-319: ⚠️ Potential issue

Add null check for record.associated_slice

The code accesses record.associated_slice.partition without verifying that record.associated_slice is not None. This could lead to an AttributeError.

Consider adding a null check like this:

def should_be_synced(self, record: Record) -> bool:
+   if not record.associated_slice:
+       raise ValueError("Record must have an associated slice")
    if (
        self._to_partition_key(record.associated_slice.partition)
        not in self._cursor_per_partition
    ):
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-331: Consider handling potential None stream_state

When retrieving stream_state, we might want to handle the case where it could be None to prevent unexpected errors during cursor creation. Wdyt?

Let's verify the current handling of stream_state:

#!/bin/bash
# Check how stream_state is handled in other parts of the codebase
rg -A 3 "get_stream_state" | grep -v "test"

Comment on lines +923 to +978
def create_concurrent_cursor_from_perpartition_cursor(
    self,
    state_manager: ConnectorStateManager,
    model_type: Type[BaseModel],
    component_definition: ComponentDefinition,
    stream_name: str,
    stream_namespace: Optional[str],
    config: Config,
    stream_state: MutableMapping[str, Any],
    partition_router,
    **kwargs: Any,
) -> ConcurrentPerPartitionCursor:
    component_type = component_definition.get("type")
    if component_definition.get("type") != model_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
        )

    datetime_based_cursor_model = model_type.parse_obj(component_definition)

    if not isinstance(datetime_based_cursor_model, DatetimeBasedCursorModel):
        raise ValueError(
            f"Expected {model_type.__name__} component, but received {datetime_based_cursor_model.__class__.__name__}"
        )

    interpolated_cursor_field = InterpolatedString.create(
        datetime_based_cursor_model.cursor_field,
        parameters=datetime_based_cursor_model.parameters or {},
    )
    cursor_field = CursorField(interpolated_cursor_field.eval(config=config))

    # Create the cursor factory
    cursor_factory = ConcurrentCursorFactory(
        partial(
            self.create_concurrent_cursor_from_datetime_based_cursor,
            state_manager=state_manager,
            model_type=model_type,
            component_definition=component_definition,
            stream_name=stream_name,
            stream_namespace=stream_namespace,
            config=config,
        )
    )

    # Return the concurrent cursor and state converter
    return ConcurrentPerPartitionCursor(
        cursor_factory=cursor_factory,
        partition_router=partition_router,
        stream_name=stream_name,
        stream_namespace=stream_namespace,
        stream_state=stream_state,
        message_repository=self._message_repository,  # type: ignore
        connector_state_manager=state_manager,
        cursor_field=cursor_field,
    )


🛠️ Refactor suggestion

Consider adding error handling for cursor creation

The cursor creation logic could benefit from additional error handling. For instance, what happens if the cursor factory creation fails? Wdyt?

Here's a suggested enhancement:

    def create_concurrent_cursor_from_perpartition_cursor(
        self,
        ...
    ) -> ConcurrentPerPartitionCursor:
+       try:
            component_type = component_definition.get("type")
            if component_definition.get("type") != model_type.__name__:
                raise ValueError(
                    f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
                )

            # Rest of the implementation...

            return ConcurrentPerPartitionCursor(
                cursor_factory=cursor_factory,
                partition_router=partition_router,
                stream_name=stream_name,
                stream_namespace=stream_namespace,
                stream_state=stream_state,
                message_repository=self._message_repository,
                connector_state_manager=state_manager,
                cursor_field=cursor_field,
            )
+       except Exception as e:
+           raise ValueError(
+               f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+           ) from e

@tolik0 (Contributor, Author) commented Dec 6, 2024

@maxi297 (Contributor) left a comment

I'm very happy with the progress on this! I've added a couple more comments, mostly about code maintenance and structure, but the functional part seemed fine. I want to check the tests eventually (tomorrow, hopefully), but I can still leave a couple of comments here.

@@ -299,6 +302,60 @@ def _group_streams(
                     cursor=final_state_cursor,
                 )
             )
+        elif (

Should we filter only on list partition router until we support the global cursor part?

@@ -77,7 +75,7 @@ def filter_records(
         records = (
             record
             for record in records
-            if (self._substream_cursor or self._date_time_based_cursor).should_be_synced(
+            if self._cursor.should_be_synced(

This change is so beautiful that I want to cry ❤️



class ConcurrentCursorFactory:
    def __init__(self, create_function: Callable[..., Cursor]):

Should the input type for Callable be StreamState?

        self._set_initial_state(stream_state)

    @property
    def cursor_field(self) -> CursorField:

Should we add this in the Cursor interface? If we put it there, it seems like the return type will need to be Optional because of the FinalStateCursor
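
A sketch of what that interface change could look like, with deliberately simplified types (cursor_field as Optional[str] purely for illustration):

```python
from abc import ABC, abstractmethod
from typing import Optional


class CursorSketch(ABC):
    @property
    @abstractmethod
    def cursor_field(self) -> Optional[str]:
        """None for cursors with no field to expose, e.g. a final-state cursor."""


class FinalStateCursorSketch(CursorSketch):
    @property
    def cursor_field(self) -> Optional[str]:
        return None  # nothing to expose; this cursor only reports final state


assert FinalStateCursorSketch().cursor_field is None
```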

def state(self) -> MutableMapping[str, Any]:
    states = []
    for partition_tuple, cursor in self._cursor_per_partition.items():
        cursor_state = cursor._connector_state_converter.convert_to_state_message(

I have two confusions regarding this method:

Why is Cursor.state public?
I know it was already like that before your changes, but I'm trying to understand where it is used and why it is public. Before your changes, it didn't seem like we needed it to be public.

Is there a way we can access the information we need without accessing private parameters?
It feels like what we need to make public here is the representation of the data that is being set as an AirbyteMessage of type state. In other words, the one in ConcurrentCursor._emit_state_message, because when PerPartitionCursor wants to emit the state, it does not care about the implementation of the cursor but cares about setting it in a state message.

Based on the two points above, can we have the state method be more explicit (something like as_state_message) with a comment saying it is used for PerPartitionCursor and is NOT used by other classes that are not cursors to emit states?

@@ -240,6 +240,15 @@ def observe(self, record: Record) -> None:
     def _extract_cursor_value(self, record: Record) -> Any:
         return self._connector_state_converter.parse_value(self._cursor_field.extract_value(record))

+    def close_partition_without_emit(self, partition: Partition) -> None:

It feels like if we add a public method here, it should also be in the public interface. Based on that, I'm wondering if it would be easier to have the ConcurrentCursorFactory set the message_repository to a NoopMessageRepository to prevent the cursors from emitting anything. This solution would work with any type of cursor as well (say we want to support custom cursors soon).
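
A sketch of that idea (NoopMessageRepositorySketch stands in for the CDK's NoopMessageRepository, and the wiring is hypothetical):

```python
from typing import Any, Callable


class NoopMessageRepositorySketch:
    """Swallows messages so wrapped per-partition cursors never emit state."""

    def emit_message(self, message: Any) -> None:
        pass  # intentionally dropped; the wrapper emits the merged state itself


def create_silent_cursor(create_function: Callable[..., Any], **kwargs: Any) -> Any:
    # Injecting the no-op repository lets close_partition() keep its normal
    # signature while suppressing per-partition state messages.
    return create_function(message_repository=NoopMessageRepositorySketch(), **kwargs)


class DemoCursor:
    def __init__(self, message_repository: Any) -> None:
        self._message_repository = message_repository

    def close_partition(self) -> None:
        self._message_repository.emit_message({"type": "STATE"})  # dropped


create_silent_cursor(DemoCursor).close_partition()
```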


@@ -178,7 +178,7 @@ def _request_headers(
             stream_slice,
             next_page_token,
             self._paginator.get_request_headers,
-            self.stream_slicer.get_request_headers,
+            self.request_option_provider.get_request_headers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brianjlai was there a reason we were still using the stream_slicer here instead of the request_option_provider?
