Conversation

@anjaliratnam-msft (Collaborator)

Added support for excluding directories from being returned as documents when a DataLake account is used.

@kyleknap (Collaborator) left a comment

Looks good! Just had some smaller comments, but overall it's looking solid.

@kyleknap (Collaborator) left a comment

Looks good! I just had some more suggestions on how to structure the test updates. We are getting close though.

)

def _is_adls_directory(self, blob: BlobProperties) -> bool:
    return blob.size == 0 and blob.metadata.get("hdi_isfolder") == "true"

In between the blob.size and the hdi_isfolder checks, let's add a blob.metadata check. So:

blob.size == 0 and blob.metadata and blob.metadata.get("hdi_isfolder") == "true"

Mainly, blobs could be empty without being directories and have metadata that equates to None. This additional clause makes sure we short-circuit early, so we avoid calling .get() on a None value and causing the entire listing to error out.
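
A minimal sketch of the guarded check, mirroring the line above:

def _is_adls_directory(self, blob: BlobProperties) -> bool:
    # ADLS Gen2 represents directories as zero-length blobs tagged with
    # hdi_isfolder metadata; the blob.metadata truthiness check short-circuits
    # before .get() can be called on a None value.
    return bool(
        blob.size == 0
        and blob.metadata
        and blob.metadata.get("hdi_isfolder") == "true"
    )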

def get_datalake_test_blobs(
    blob_names: Optional[Union[str, Iterable[str]]] = None, prefix: Optional[str] = None
) -> list[dict[str, Any]]:
    if blob_names is not None:

Instead of duplicating the code for get_test_blobs(), I'm wondering if we can expose a new _get_test_blobs() utility that both get_test_blobs() and get_datalake_test_blobs() can call out to by passing in the list of blobs to use? Mainly I think this can help us reuse code, especially if we want to expand on the scaffolding in the future.
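
A rough sketch of that factoring (BLOB_FIXTURES and DATALAKE_BLOB_FIXTURES are hypothetical stand-ins for the existing fixture lists):

from typing import Any, Iterable, Optional, Union

def _get_test_blobs(
    blob_list: list[dict[str, Any]],
    blob_names: Optional[Union[str, Iterable[str]]] = None,
) -> list[dict[str, Any]]:
    # Shared filtering logic; each public utility passes in its own fixture list.
    if isinstance(blob_names, str):
        blob_names = [blob_names]
    if blob_names is not None:
        blob_list = [b for b in blob_list if b["blob_name"] in blob_names]
    return blob_list

def get_test_blobs(blob_names=None, prefix=None):
    blobs = _get_test_blobs(BLOB_FIXTURES, blob_names)
    return [b for b in blobs if prefix is None or b["blob_name"].startswith(prefix)]

def get_datalake_test_blobs(blob_names=None, prefix=None):
    blobs = _get_test_blobs(DATALAKE_BLOB_FIXTURES, blob_names)
    return [b for b in blobs if prefix is None or b["blob_name"].startswith(prefix)]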

"blob_content": "{'test': 'test content'}",
},
{
"blob_name": "text_file.txt",

For completeness, maybe let's make this blob name directory/test_file.txt so it fits the more common pattern, where there is an ADLS directory and a blob underneath it, once we get to the datalake blobs list?

{
    **blob,
    "size": len(blob["blob_content"]),
    "metadata": blob.get("metadata", {}),

For any of the metadata, if it is missing, let's set the value to None instead of an empty dictionary, since when testing against the blob service the end value from the SDK is None if there is no metadata.

Comment on lines 196 to 201
for blob in get_datalake_test_blobs(prefix=prefix):
    mock_blob = MagicMock()
    mock_blob.name = blob["blob_name"]
    mock_blob.size = int(blob["size"])
    mock_blob.metadata = blob["metadata"]
    yield mock_blob

I'm wondering if it makes sense to take the BlobProperties scaffolding logic and roll it up into the get_test_blobs() and get_datalake_test_blobs() utilities. To get mocks back, we could either expose an as_mock boolean or expose new utilities like get_test_mock_blobs(). I don't have a strong preference for either option, but I think this should help us consolidate the logic in all the places we need to build up these mock blobs, which has expanded quite a bit now.
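
One possible shape for the as_mock option (a sketch, reusing the hypothetical _get_test_blobs() helper and fixture list from earlier):

from unittest.mock import MagicMock
from azure.storage.blob import BlobProperties

def get_datalake_test_blobs(blob_names=None, prefix=None, as_mock=False):
    blobs = _get_test_blobs(DATALAKE_BLOB_FIXTURES, blob_names)
    if not as_mock:
        return blobs
    mock_blobs = []
    for blob in blobs:
        # spec=BlobProperties keeps the mocks honest about the public interface.
        mock_blob = MagicMock(spec=BlobProperties)
        mock_blob.name = blob["blob_name"]
        mock_blob.size = blob["size"]
        mock_blob.metadata = blob["metadata"]
        mock_blobs.append(mock_blob)
    return mock_blobs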

]:
    yield blob_name
for blob in get_test_blobs(prefix=prefix):
    mock_blob = MagicMock()

When mocking the blob properties, we should be throwing in a BlobProperties spec to give a little more safety on public interfaces.
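
For example, a minimal sketch:

from unittest.mock import MagicMock
from azure.storage.blob import BlobProperties

# With a spec, accessing an attribute BlobProperties doesn't define raises AttributeError.
mock_blob = MagicMock(spec=BlobProperties)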

]


def get_expected_datalake_blobs() -> list[dict[str, Any]]:

Instead of exposing a new utility, let's add a new include_directories boolean parameter that defaults to False to get_datalake_test_blobs(), to make it a little more explicit when ADLS directories are included in the list.
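
A sketch of what that could look like (the helper and fixture names are the hypothetical ones from earlier; the hdi_isfolder convention matches the directory check above):

def get_datalake_test_blobs(
    blob_names=None, prefix=None, include_directories: bool = False
):
    blobs = _get_test_blobs(DATALAKE_BLOB_FIXTURES, blob_names)
    if not include_directories:
        # Drop ADLS directory markers: zero-length blobs flagged via hdi_isfolder.
        blobs = [
            blob
            for blob in blobs
            if (blob.get("metadata") or {}).get("hdi_isfolder") != "true"
        ]
    return blobs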

@kyleknap (Collaborator) left a comment

Looks good. Just a little more feedback and we should be set.

)
container_client.create_container()
- for blob in get_datalake_test_blobs():
+ for blob in get_datalake_test_blobs(include_directories=True):

For the datalake tests, should we be including the directories in what we upload? To my understanding, ADLS will automatically create the directory and that is more representative of how customers will have ADLS directories in the first place.

mock_blob = MagicMock(spec=BlobProperties)
mock_blob.name = blob["blob_name"]
mock_blob.size = len(blob["blob_content"])
mock_blob.metadata = blob.get("metadata", {})

Let's set the metadata value to None if it is not present, e.g.:

mock_blob.metadata = blob.get("metadata", None)

Comment on lines 101 to 108
return [
    {
        **blob,
        "size": len(blob["blob_content"]),
        "metadata": blob.get("metadata", None),
    }
    for blob in blobs
]

For the size and metadata, we should just lower this into our _get_test_blobs() helper function so that we are consistent about what data is returned. In general, size and metadata will be returned no matter the account type.
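
A sketch of the helper with that lowered in (building on the hypothetical _get_test_blobs() factoring from earlier):

def _get_test_blobs(blob_list, blob_names=None):
    if isinstance(blob_names, str):
        blob_names = [blob_names]
    if blob_names is not None:
        blob_list = [b for b in blob_list if b["blob_name"] in blob_names]
    # size and metadata are returned for every account type, so compute them here.
    return [
        {
            **blob,
            "size": len(blob["blob_content"]),
            "metadata": blob.get("metadata", None),
        }
        for blob in blob_list
    ]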

Comment on lines 114 to 118
updated_blobs = [
    blob
    for blob in blob_list
    if prefix is None or blob["blob_name"].startswith(prefix)
]

Do we need to expose prefix logic in this helper method? I'd assume that because prefix is exposed in both the get_test_blobs and get_datalake_test_blobs utilities, we would not need it here.

@kyleknap (Collaborator) left a comment

Looks good! 🚢

@anjaliratnam-msft anjaliratnam-msft merged commit 859a51c into langchain-ai:main Oct 17, 2025
8 checks passed