Document loaders datalake support #180
Conversation
Looks good! Just had some smaller comments, but overall is looking solid.
libs/azure-storage/tests/integration_tests/test_document_loaders.py (outdated, resolved)
libs/azure-storage/tests/integration_tests/test_document_loaders.py (outdated, resolved)
Force-pushed from 4e78e22 to 1a4c148
Looks good! I just had some more suggestions on how to structure the test updates. We are getting close though.
```python
def _is_adls_directory(self, blob: BlobProperties) -> bool:
    return blob.size == 0 and blob.metadata.get("hdi_isfolder") == "true"
```
In between the `blob.size` and the `hdi_isfolder` checks, let's add a `blob.metadata` truthiness check. So:

```python
blob.size == 0 and blob.metadata and blob.metadata.get("hdi_isfolder") == "true"
```

Mainly, blobs could be empty without being directories, and their metadata can come back as None. This additional clause makes sure we short-circuit early to avoid calling `.get()` on a None value, which would cause the entire listing to error out.
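The guarded check can be sketched as a standalone function (the name and the plain-argument signature here are illustrative stand-ins; the real method takes a `BlobProperties` object):

```python
from typing import Any, Optional


def is_adls_directory(size: int, metadata: Optional[dict[str, Any]]) -> bool:
    """Return True only for ADLS directory marker blobs.

    The middle truthiness check short-circuits when metadata is None,
    so empty non-directory blobs no longer trigger an AttributeError.
    """
    return size == 0 and bool(metadata) and metadata.get("hdi_isfolder") == "true"
```

An empty blob with `metadata=None` now returns False instead of raising, while a genuine directory marker (`hdi_isfolder == "true"`) is still detected.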
```python
def get_datalake_test_blobs(
    blob_names: Optional[Union[str, Iterable[str]]] = None, prefix: Optional[str] = None
) -> list[dict[str, Any]]:
    if blob_names is not None:
```
Instead of duplicating the code from get_test_blobs(), I'm wondering if we can expose a new _get_test_blobs() utility that both get_test_blobs() and get_datalake_test_blobs() call out to, passing in the list of blobs to use? Mainly, I think this can help us reuse code, especially if we want to expand on the scaffolding in the future.
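One way the shared helper could look. This is a sketch: the module-level blob lists and their contents here are hypothetical stand-ins for the real test data in `tests/utils.py`:

```python
from typing import Any

# Hypothetical stand-ins for the module-level test data.
TEST_BLOBS = [{"blob_name": "text_file.txt", "blob_content": "test content"}]
DATALAKE_TEST_BLOBS = [
    {"blob_name": "directory/test_file.txt", "blob_content": "{'test': 'test content'}"}
]


def _get_test_blobs(blobs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Shared scaffolding: derive size/metadata consistently for both account types.
    return [
        {
            **blob,
            "size": len(blob["blob_content"]),
            "metadata": blob.get("metadata", None),
        }
        for blob in blobs
    ]


def get_test_blobs() -> list[dict[str, Any]]:
    return _get_test_blobs(TEST_BLOBS)


def get_datalake_test_blobs() -> list[dict[str, Any]]:
    return _get_test_blobs(DATALAKE_TEST_BLOBS)
```

Both public utilities then stay one-liners, and any future field added to the scaffolding lives in exactly one place.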
libs/azure-storage/tests/utils.py (outdated)
| "blob_content": "{'test': 'test content'}", | ||
| }, | ||
| { | ||
| "blob_name": "text_file.txt", |
For completeness, maybe let's make this blob name directory/test_file.txt so it fits the more common pattern, where an ADLS directory has a blob underneath it, once we get to the datalake blobs list?
libs/azure-storage/tests/utils.py (outdated)
```python
{
    **blob,
    "size": len(blob["blob_content"]),
    "metadata": blob.get("metadata", {}),
```
For any of the metadata, if it is missing, let's set the value to None instead of an empty dictionary; when testing against the blob service, the end value from the SDK is None if there is no metadata.
```python
for blob in get_datalake_test_blobs(prefix=prefix):
    mock_blob = MagicMock()
    mock_blob.name = blob["blob_name"]
    mock_blob.size = int(blob["size"])
    mock_blob.metadata = blob["metadata"]
    yield mock_blob
```
I'm wondering if it makes sense to take the BlobProperties scaffolding logic and roll it up into the get_test_blobs() and get_datalake_test_blobs() utilities. To get mocks back, we could either expose an as_mock boolean or expose new utilities like get_test_mock_blobs(). I don't have a strong preference on either option, but I think it should help us consolidate the logic in all of the places we need to build up these mock blobs, which has expanded quite a bit now.
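A sketch of the `as_mock` variant of this idea. The inline test data is hypothetical, and in the real suite the mock would be built with `MagicMock(spec=BlobProperties)`; the spec is omitted here so the sketch runs without the Azure SDK:

```python
from typing import Any
from unittest.mock import MagicMock


def get_datalake_test_blobs(as_mock: bool = False) -> list[Any]:
    # Hypothetical inline test data; the real module defines this elsewhere.
    raw = [
        {"blob_name": "directory", "blob_content": "", "metadata": {"hdi_isfolder": "true"}},
        {"blob_name": "directory/test_file.txt", "blob_content": "text"},
    ]
    blobs = [
        {**b, "size": len(b["blob_content"]), "metadata": b.get("metadata", None)}
        for b in raw
    ]
    if not as_mock:
        return blobs
    mocks = []
    for blob in blobs:
        # In the real suite: MagicMock(spec=BlobProperties).
        mock_blob = MagicMock()
        mock_blob.name = blob["blob_name"]
        mock_blob.size = blob["size"]
        mock_blob.metadata = blob["metadata"]
        mocks.append(mock_blob)
    return mocks
```

Callers that need raw dictionaries and callers that need mock `BlobProperties` objects then share a single source of truth for the test data.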
```python
    ]:
        yield blob_name
    for blob in get_test_blobs(prefix=prefix):
        mock_blob = MagicMock()
```
When mocking the blob properties, we should be passing in a BlobProperties spec to give a little more safety on public interfaces.
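The safety the spec buys: a spec'd mock rejects attribute reads that don't exist on the real class, so typos and interface drift fail loudly in tests. The `BlobProperties` class below is a local stand-in so the sketch runs without the Azure SDK installed:

```python
from unittest.mock import MagicMock


# Local stand-in for azure.storage.blob.BlobProperties, so this sketch
# runs without the SDK; the real class exposes these attributes.
class BlobProperties:
    name = ""
    size = 0
    metadata = None


mock_blob = MagicMock(spec=BlobProperties)
mock_blob.name = "directory/test_file.txt"

# Without a spec, a typo like mock_blob.nmae silently returns a new child
# mock; with the spec, the same access raises AttributeError instead.
try:
    mock_blob.nmae
    spec_caught_typo = False
except AttributeError:
    spec_caught_typo = True
```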
libs/azure-storage/tests/utils.py (outdated)
```python
]


def get_expected_datalake_blobs() -> list[dict[str, Any]]:
```
Instead of exposing a new utility, let's add a new include_directories boolean parameter, defaulting to False, to get_datalake_test_blobs() to make it a little more explicit when ADLS directories are included in the list.
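A minimal sketch of the flag, assuming the ADLS directory markers are identified by the `hdi_isfolder` metadata key (the inline test data is hypothetical):

```python
from typing import Any

# Hypothetical test data: one ADLS directory marker plus one blob beneath it.
_DATALAKE_BLOBS: list[dict[str, Any]] = [
    {"blob_name": "directory", "blob_content": "", "metadata": {"hdi_isfolder": "true"}},
    {"blob_name": "directory/test_file.txt", "blob_content": "text"},
]


def get_datalake_test_blobs(include_directories: bool = False) -> list[dict[str, Any]]:
    # Directories are excluded by default; callers opt in explicitly.
    blobs = _DATALAKE_BLOBS
    if not include_directories:
        blobs = [
            b
            for b in blobs
            if (b.get("metadata") or {}).get("hdi_isfolder") != "true"
        ]
    return blobs
```

Tests that assert on returned documents use the default, while setup code that uploads the full container layout passes `include_directories=True`.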
Looks good. Just a little more feedback and we should be set.
```diff
 )
 container_client.create_container()
-for blob in get_datalake_test_blobs():
+for blob in get_datalake_test_blobs(include_directories=True):
```
For the datalake tests, should we be including the directories in what we upload? To my understanding, ADLS will automatically create the directory and that is more representative of how customers will have ADLS directories in the first place.
libs/azure-storage/tests/utils.py (outdated)
```python
mock_blob = MagicMock(spec=BlobProperties)
mock_blob.name = blob["blob_name"]
mock_blob.size = len(blob["blob_content"])
mock_blob.metadata = blob.get("metadata", {})
```
Let's set the metadata value to None if it is not present, e.g.:

```python
mock_blob.metadata = blob.get("metadata", None)
```
libs/azure-storage/tests/utils.py (outdated)
```python
return [
    {
        **blob,
        "size": len(blob["blob_content"]),
        "metadata": blob.get("metadata", None),
    }
    for blob in blobs
]
```
For the size and metadata, we should just lower this into our _get_test_blobs() helper function so that we are consistent on what data is returned. In general, size and metadata will be returned no matter the account type.
libs/azure-storage/tests/utils.py (outdated)
```python
updated_blobs = [
    blob
    for blob in blob_list
    if prefix is None or blob["blob_name"].startswith(prefix)
]
```
Do we need to expose prefix logic in this helper method? Since prefix is already exposed in both the get_test_blobs() and get_datalake_test_blobs() utilities, I'd assume we would not need it here.
Looks good! 🚢
Added support to exclude directories from being returned as documents when a Data Lake account is used.