Document loaders datalake support #180
Conversation
Looks good! Just had some smaller comments, but overall is looking solid.
libs/azure-storage/tests/integration_tests/test_document_loaders.py (outdated, resolved)
libs/azure-storage/tests/integration_tests/test_document_loaders.py (outdated, resolved)
Force-pushed from 4e78e22 to 1a4c148
Looks good! I just had some more suggestions on how to structure the test updates. We are getting close though.
```python
def _is_adls_directory(self, blob: BlobProperties) -> bool:
    return blob.size == 0 and blob.metadata.get("hdi_isfolder") == "true"
```
In between the `blob.size` and the `hdi_isfolder` checks, let's add a `blob.metadata` truthiness check. So:

```python
blob.size == 0 and blob.metadata and blob.metadata.get("hdi_isfolder") == "true"
```

Mainly, blobs could be empty without being directories, and their metadata can come back as None. This additional clause makes sure we short-circuit early to avoid calling `.get()` on a None value, which would cause the entire listing to error out.
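The guarded check can be sketched as a standalone function (the name and the plain-argument signature here are illustrative stand-ins; the real method takes a `BlobProperties` object):

```python
from typing import Any, Optional


def is_adls_directory(size: int, metadata: Optional[dict[str, Any]]) -> bool:
    """Return True only for ADLS directory marker blobs.

    The middle truthiness check short-circuits when metadata is None,
    so empty non-directory blobs no longer trigger an AttributeError.
    """
    return size == 0 and bool(metadata) and metadata.get("hdi_isfolder") == "true"
```

An empty blob with `metadata=None` now returns False instead of raising, while a genuine directory marker (`hdi_isfolder == "true"`) is still detected.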
```python
def get_datalake_test_blobs(
    blob_names: Optional[Union[str, Iterable[str]]] = None, prefix: Optional[str] = None
) -> list[dict[str, Any]]:
    if blob_names is not None:
```
Instead of duplicating the code from get_test_blobs(), I'm wondering if we can expose a new _get_test_blobs() utility that both get_test_blobs() and get_datalake_test_blobs() call out to, passing in the list of blobs to use? Mainly, I think this can help us reuse code, especially if we want to expand on the scaffolding in the future.
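One way the shared helper could look. This is a sketch: the module-level blob lists and their contents here are hypothetical stand-ins for the real test data in `tests/utils.py`:

```python
from typing import Any

# Hypothetical stand-ins for the module-level test data.
TEST_BLOBS = [{"blob_name": "text_file.txt", "blob_content": "test content"}]
DATALAKE_TEST_BLOBS = [
    {"blob_name": "directory/test_file.txt", "blob_content": "{'test': 'test content'}"}
]


def _get_test_blobs(blobs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Shared scaffolding: derive size/metadata consistently for both account types.
    return [
        {
            **blob,
            "size": len(blob["blob_content"]),
            "metadata": blob.get("metadata", None),
        }
        for blob in blobs
    ]


def get_test_blobs() -> list[dict[str, Any]]:
    return _get_test_blobs(TEST_BLOBS)


def get_datalake_test_blobs() -> list[dict[str, Any]]:
    return _get_test_blobs(DATALAKE_TEST_BLOBS)
```

Both public utilities then stay one-liners, and any future field added to the scaffolding lives in exactly one place.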
libs/azure-storage/tests/utils.py (outdated)
| "blob_content": "{'test': 'test content'}", | ||
| }, | ||
| { | ||
| "blob_name": "text_file.txt", |
For completeness, maybe let's make this blob name directory/test_file.txt so it fits the more common pattern, where an ADLS directory has a blob underneath it, once we get to the datalake blobs list?
libs/azure-storage/tests/utils.py (outdated)
```python
{
    **blob,
    "size": len(blob["blob_content"]),
    "metadata": blob.get("metadata", {}),
```
For any of the metadata, if it is missing, let's set the value to None instead of an empty dictionary; when testing against the blob service, the end value from the SDK is None if there is no metadata.
```python
for blob in get_datalake_test_blobs(prefix=prefix):
    mock_blob = MagicMock()
    mock_blob.name = blob["blob_name"]
    mock_blob.size = int(blob["size"])
    mock_blob.metadata = blob["metadata"]
    yield mock_blob
```
I'm wondering if it makes sense to take the BlobProperties scaffolding logic and roll it up into the get_test_blobs() and get_datalake_test_blobs() utilities. To get mocks back, we could either expose an as_mock boolean or expose new utilities like get_test_mock_blobs(). I don't have a strong preference on either option, but I think it should help us consolidate the logic in all of the places we need to build up these mock blobs, which has expanded quite a bit now.
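A sketch of the `as_mock` variant of this idea. The inline test data is hypothetical, and in the real suite the mock would be built with `MagicMock(spec=BlobProperties)`; the spec is omitted here so the sketch runs without the Azure SDK:

```python
from typing import Any
from unittest.mock import MagicMock


def get_datalake_test_blobs(as_mock: bool = False) -> list[Any]:
    # Hypothetical inline test data; the real module defines this elsewhere.
    raw = [
        {"blob_name": "directory", "blob_content": "", "metadata": {"hdi_isfolder": "true"}},
        {"blob_name": "directory/test_file.txt", "blob_content": "text"},
    ]
    blobs = [
        {**b, "size": len(b["blob_content"]), "metadata": b.get("metadata", None)}
        for b in raw
    ]
    if not as_mock:
        return blobs
    mocks = []
    for blob in blobs:
        # In the real suite: MagicMock(spec=BlobProperties).
        mock_blob = MagicMock()
        mock_blob.name = blob["blob_name"]
        mock_blob.size = blob["size"]
        mock_blob.metadata = blob["metadata"]
        mocks.append(mock_blob)
    return mocks
```

Callers that need raw dictionaries and callers that need mock `BlobProperties` objects then share a single source of truth for the test data.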
```python
    ]:
        yield blob_name
    for blob in get_test_blobs(prefix=prefix):
        mock_blob = MagicMock()
```
When mocking the blob properties, we should be passing in a BlobProperties spec to give a little more safety on public interfaces.
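The safety the spec buys: a spec'd mock rejects attribute reads that don't exist on the real class, so typos and interface drift fail loudly in tests. The `BlobProperties` class below is a local stand-in so the sketch runs without the Azure SDK installed:

```python
from unittest.mock import MagicMock


# Local stand-in for azure.storage.blob.BlobProperties, so this sketch
# runs without the SDK; the real class exposes these attributes.
class BlobProperties:
    name = ""
    size = 0
    metadata = None


mock_blob = MagicMock(spec=BlobProperties)
mock_blob.name = "directory/test_file.txt"

# Without a spec, a typo like mock_blob.nmae silently returns a new child
# mock; with the spec, the same access raises AttributeError instead.
try:
    mock_blob.nmae
    spec_caught_typo = False
except AttributeError:
    spec_caught_typo = True
```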
libs/azure-storage/tests/utils.py (outdated)
```python
]


def get_expected_datalake_blobs() -> list[dict[str, Any]]:
```
Instead of exposing a new utility, let's add a new include_directories boolean parameter, defaulting to False, to get_datalake_test_blobs() to make it a little more explicit when ADLS directories are included in the list.
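A minimal sketch of the flag, assuming the ADLS directory markers are identified by the `hdi_isfolder` metadata key (the inline test data is hypothetical):

```python
from typing import Any

# Hypothetical test data: one ADLS directory marker plus one blob beneath it.
_DATALAKE_BLOBS: list[dict[str, Any]] = [
    {"blob_name": "directory", "blob_content": "", "metadata": {"hdi_isfolder": "true"}},
    {"blob_name": "directory/test_file.txt", "blob_content": "text"},
]


def get_datalake_test_blobs(include_directories: bool = False) -> list[dict[str, Any]]:
    # Directories are excluded by default; callers opt in explicitly.
    blobs = _DATALAKE_BLOBS
    if not include_directories:
        blobs = [
            b
            for b in blobs
            if (b.get("metadata") or {}).get("hdi_isfolder") != "true"
        ]
    return blobs
```

Tests that assert on returned documents use the default, while setup code that uploads the full container layout passes `include_directories=True`.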
Looks good. Just a little more feedback and we should be set.
```diff
 )
 container_client.create_container()
-for blob in get_datalake_test_blobs():
+for blob in get_datalake_test_blobs(include_directories=True):
```
For the datalake tests, should we be including the directories in what we upload? To my understanding, ADLS will automatically create the directory and that is more representative of how customers will have ADLS directories in the first place.
libs/azure-storage/tests/utils.py (outdated)
```python
mock_blob = MagicMock(spec=BlobProperties)
mock_blob.name = blob["blob_name"]
mock_blob.size = len(blob["blob_content"])
mock_blob.metadata = blob.get("metadata", {})
```
Let's set the metadata value to None if it is not present, e.g.:

```python
mock_blob.metadata = blob.get("metadata", None)
```
libs/azure-storage/tests/utils.py (outdated)
```python
return [
    {
        **blob,
        "size": len(blob["blob_content"]),
        "metadata": blob.get("metadata", None),
    }
    for blob in blobs
]
```
For the size and metadata, we should just lower this into our _get_test_blobs() helper function so that we are consistent on what data is returned. In general, size and metadata will be returned no matter the account type.
libs/azure-storage/tests/utils.py (outdated)
```python
updated_blobs = [
    blob
    for blob in blob_list
    if prefix is None or blob["blob_name"].startswith(prefix)
]
```
Do we need to expose prefix logic in this helper method? Since prefix is already exposed in both the get_test_blobs() and get_datalake_test_blobs() utilities, I'd assume we would not need it here.
Looks good! 🚢
Added support to exclude directories from being returned as documents when a Data Lake account is used.