TMP development to show how things could work with concurrent cursor #228

Draft
wants to merge 1 commit into
base: main

maxi297 (Contributor) commented Jan 17, 2025

Note that this change might impact @tolik0's work here.

@@ -487,6 +488,7 @@ def __init__(
self._message_repository = message_repository or InMemoryMessageRepository(
self._evaluate_log_level(emit_connector_builder_messages)
)
self._state_manager = state_manager
Contributor Author

My guess is that storing the state manager here makes sense, given that we eventually want to instantiate cursors with their state and avoid a separate set_initial_state method call.
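A minimal sketch of the difference between the two approaches. `LegacyCursor`, `StatefulCursor`, and `build_cursor` are hypothetical names for illustration, not CDK classes, and the shape of the state mapping is assumed:

```python
from typing import Any, Mapping, Optional


class LegacyCursor:
    """Hypothetical cursor that is only usable after a second call."""

    def __init__(self, cursor_field: str) -> None:
        self._cursor_field = cursor_field
        self._state: Optional[Mapping[str, Any]] = None

    def set_initial_state(self, state: Mapping[str, Any]) -> None:
        self._state = dict(state)

    def start_value(self) -> Any:
        # Forgetting set_initial_state is only detected at read time.
        if self._state is None:
            raise RuntimeError("set_initial_state was never called")
        return self._state.get(self._cursor_field)


class StatefulCursor:
    """Hypothetical cursor that receives its state in the constructor."""

    def __init__(self, cursor_field: str, state: Mapping[str, Any]) -> None:
        self._cursor_field = cursor_field
        self._state = dict(state)

    def start_value(self) -> Any:
        return self._state.get(self._cursor_field)


def build_cursor(
    cursor_field: str,
    state_by_stream: Mapping[str, Mapping[str, Any]],
    stream_name: str,
) -> StatefulCursor:
    # Assumed factory step: the factory holds the state (here a plain mapping
    # standing in for a state manager) and passes it to the cursor directly.
    return StatefulCursor(cursor_field, state_by_stream.get(stream_name, {}))
```

With the state available in the factory, a cursor can never exist in a half-initialized form, which is the motivation for keeping a reference to the state manager.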

@@ -1476,6 +1476,17 @@ def _merge_stream_slicers(
stream_cursor=cursor_component,
)
elif model.incremental_sync:
if model.retriever.type == "AsyncRetriever":
if model.incremental_sync.type != "DatetimeBasedCursor":
# TODO explain why it isn't supported
Contributor Author

TODO: I'm not exactly sure why other cursor types wouldn't be supported, but I was only doing this for source-amazon-ads, so I preferred to be too restrictive rather than too permissive.

Note that the Global/PerPartition cursors were not updated, and we will need them for source-amazon-ads.
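The restriction described above could be sketched as an allow-list check. The function name, the `SUPPORTED_ASYNC_CURSOR_TYPES` constant, and the error message are assumptions for illustration. Only the two `type` comparisons come from the diff:

```python
from typing import Any

# Assumed allow-list: only the cursor type exercised so far is accepted.
SUPPORTED_ASYNC_CURSOR_TYPES = {"DatetimeBasedCursor"}


def check_async_incremental_support(model: Any) -> None:
    """Raise if an AsyncRetriever is combined with an unsupported cursor type."""
    incremental_sync = getattr(model, "incremental_sync", None)
    if not incremental_sync:
        return  # Nothing to validate for full-refresh streams.
    if model.retriever.type != "AsyncRetriever":
        return  # The restriction only applies to async retrievers.
    if incremental_sync.type not in SUPPORTED_ASYNC_CURSOR_TYPES:
        raise ValueError(
            f"AsyncRetriever does not support cursor type {incremental_sync.type!r}; "
            f"supported types: {sorted(SUPPORTED_ASYNC_CURSOR_TYPES)}"
        )
```

Keeping the supported types in one set means that adding Global/PerPartition cursors later would be a one-line change, once they have been updated.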

@@ -151,7 +153,7 @@ def __json_serializable__(self) -> Any:
return self._stream_slice

def __hash__(self) -> int:
- return hash(orjson.dumps(self._stream_slice, option=orjson.OPT_SORT_KEYS))
+ return SliceHasher.hash("dummy_name", self._stream_slice)
Contributor Author

This had to be updated to support AsyncPartition: SliceHasher takes __json_serializable__ into account, but orjson does not. I figure we should use the same slice-hashing logic everywhere, and if we want to move that logic to orjson, we should do it once, in SliceHasher.

TODO: I'm not sure why we require the name of the stream for the slice hasher.
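To illustrate the point about `__json_serializable__`, here is a rough sketch of a hasher that honors that hook. It uses only the standard library. `JsonSerializableEncoder`, `hash_slice`, and `Partition` are hypothetical names, and the choice of SHA-256 and of the `name:payload` format is an assumption, not the actual SliceHasher implementation:

```python
import hashlib
import json
from typing import Any, Mapping, Optional


class JsonSerializableEncoder(json.JSONEncoder):
    """Encoder that falls back to an object's __json_serializable__ hook."""

    def default(self, obj: Any) -> Any:
        hook = getattr(obj, "__json_serializable__", None)
        if callable(hook):
            return hook()
        return super().default(obj)  # Raises TypeError for unknown types.


def hash_slice(stream_name: str, stream_slice: Optional[Mapping[str, Any]]) -> int:
    # sort_keys makes the hash independent of key insertion order.
    payload = json.dumps(stream_slice or {}, sort_keys=True, cls=JsonSerializableEncoder)
    digest = hashlib.sha256(f"{stream_name}:{payload}".encode("utf-8")).hexdigest()
    return int(digest, 16)


class Partition:
    """Hypothetical stand-in for a non-JSON-native value such as AsyncPartition."""

    def __init__(self, jobs: list) -> None:
        self._jobs = list(jobs)

    def __json_serializable__(self) -> Any:
        return {"jobs": self._jobs}
```

In this sketch the stream name is simply folded into the hashed string, so identical slices from two streams produce different hashes. Whether that is the reason the real hasher takes a name is not confirmed by this PR.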

@@ -322,6 +322,7 @@
"http_method": "GET",
},
},
"incremental_sync": {"$ref": "#/definitions/incremental_cursor"},
Contributor Author

Added to ensure that async_retriever streams with incremental syncs are also concurrent.
