Commit b90cc41

Update document loader proposal based on feedback
Specifically, added the following:
* Combine the two document loaders into a single AzureBlobStorageLoader class.
* Encourage the blob_parser parameter more as a future consideration. This would be helpful if a customer did not necessarily want to use the blob loader interfaces but still wanted more control over how to parse blobs.
1 parent 762a619 commit b90cc41

File tree

1 file changed

+83
-58
lines changed


libs/azure-storage/proposals/document_loaders.md

Lines changed: 83 additions & 58 deletions
@@ -186,47 +186,29 @@ Below is the proposed specification for the Azure Blob Storage document loaders.
186186
All Azure Storage document loaders will live in the [`langchain_azure_storage` package][langchain-azure-storage-pkg]
187187
under a new `document_loaders` module.
188188

189-
There will be two document loaders introduced:
189+
There will be a single document loader introduced, `AzureBlobStorageLoader`. This single loader will encompass
190+
functionality from both the community-sourced `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
191+
document loaders.
190192

191-
* `AzureBlobStorageFileLoader` - Loads a `Document` from a single blob in Azure Blob Storage.
192-
* `AzureBlobStorageContainerLoader` - Loads `Document` objects from all blobs in a given container in Azure Blob Storage.
193-
Assuming no chunking is happening, each `Document` loaded will correspond 1:1 with a blob in the container.
194-
195-
Each document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
193+
The document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
196194
and asynchronous loading of documents, as well as lazy loading of documents.
197195

198-
Below show the proposed constructor signatures for each document loader:
196+
Below shows the proposed constructor signature for the document loader:
199197

200198
```python
201-
from typing import Optional, Union, Callable
199+
from typing import Optional, Union, Callable, Iterable
202200
import azure.core.credentials
203201
import azure.core.credentials_async
204202
from langchain_core.document_loaders import BaseLoader
205203

206204

207-
class AzureBlobStorageFileLoader(BaseLoader):
205+
class AzureBlobStorageLoader(BaseLoader):
208206
def __init__(self,
209207
account_url: str,
210208
container_name: str,
211-
blob_name: str,
209+
blob_names: Optional[Union[str, Iterable[str]]] = None,
212210
*,
213-
credential: Optional[
214-
Union[
215-
azure.core.credentials.AzureSasCredential,
216-
azure.core.credentials.TokenCredential,
217-
azure.core.credentials_async.AsyncTokenCredential,
218-
]
219-
] = None,
220-
loader_factory: Optional[Callable[[str], BaseLoader]] = None,
221-
): ...
222-
223-
224-
class AzureBlobStorageContainerLoader(BaseLoader):
225-
def __init__(self,
226-
account_url: str,
227-
container_name: str,
228-
*,
229-
prefix: str = "",
211+
prefix: Optional[str] = None,
230212
credential: Optional[
231213
Union[
232214
azure.core.credentials.AzureSasCredential,
@@ -241,13 +223,16 @@ class AzureBlobStorageContainerLoader(BaseLoader):
241223
In terms of parameters supported:
242224
* `account_url` - The URL to the storage account (e.g., `https://<account>.blob.core.windows.net`)
243225
* `container_name` - The name of the container within the storage account
244-
* `blob_name` - (File loader only) The name of the blob within the container to load.
226+
* `blob_names` - The name of the blob(s) within the container to load. If provided, only the specified blob(s)
227+
in the container will be loaded. If not provided, the loader will list blobs from the container to load, which
228+
will be all blobs unless `prefix` is specified.
245229
* `credential` - The credential object to use for authentication. If not provided,
246230
the loader will use [Azure default credentials][azure-default-credentials]. The
247231
`credential` field only supports token-based credentials and SAS credentials. It does
248232
not support access key based credentials nor anonymous access.
249-
* `prefix` - (Container loader only) An optional prefix to filter blobs within the container.
250-
Only blobs whose names start with the specified prefix will be loaded.
233+
* `prefix` - An optional prefix to filter blobs when listing from the container. Only blobs whose names start with the
234+
specified prefix will be loaded. This parameter is incompatible with `blob_names` and will raise a `ValueError` if both
235+
are provided.
251236
* `loader_factory` - A callable that returns a custom document loader (e.g., `UnstructuredLoader`) to use
252237
for parsing blobs downloaded. When provided, the Azure Storage document loader will download each blob to
253238
a temporary local file and then call `loader_factory` with the path to the temporary file to get a document
@@ -263,30 +248,46 @@ Below are some example usage patterns for the Azure Blob Storage document loader
263248
Below shows how to load a document from a single blob in Azure Blob Storage:
264249

265250
```python
266-
from langchain_azure_storage.document_loaders import AzureBlobStorageFileLoader
251+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
267252

268-
loader = AzureBlobStorageFileLoader("https://<account>.blob.core.windows.net", "<container>", "<blob>")
253+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>", "<blob>")
269254
for doc in loader.lazy_load():
270255
print(doc.page_content) # Prints content of blob. There should only be one document loaded.
271256
```
272257

258+
#### Load from a list of blobs
259+
Below shows how to load documents from a list of blobs in Azure Blob Storage:
260+
261+
```python
262+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
263+
264+
loader = AzureBlobStorageLoader(
265+
"https://<account>.blob.core.windows.net",
266+
"<container>",
267+
["blob1", "blob2", "blob3"]
268+
)
269+
for doc in loader.lazy_load():
270+
print(doc.page_content) # Prints content of each blob from list.
271+
```
272+
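The selection rules described for `blob_names` and `prefix` (a single name or an iterable of names, listing the container when no names are given, and a `ValueError` when both are provided) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the helper name `resolve_blob_selection` and the error message are hypothetical.

```python
from typing import Iterable, List, Optional, Union


def resolve_blob_selection(
    blob_names: Optional[Union[str, Iterable[str]]] = None,
    prefix: Optional[str] = None,
) -> Optional[List[str]]:
    """Hypothetical helper, not part of the proposal.

    Returns the explicit list of blob names to load, or None when the
    loader should list blobs from the container instead (optionally
    filtered by ``prefix``).
    """
    # The proposal states that blob_names and prefix are incompatible.
    if blob_names is not None and prefix is not None:
        raise ValueError("Cannot specify both blob_names and prefix.")
    if blob_names is None:
        return None  # Fall back to listing the container.
    # A bare string is a single blob name, not an iterable of characters.
    if isinstance(blob_names, str):
        return [blob_names]
    return list(blob_names)
```

Checking for `str` before treating the argument as an iterable matters here, because a string is itself iterable and would otherwise be split into single-character blob names.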
273273
#### Load from a container
274274

275275
Below shows how to load documents from all blobs in a given container in Azure Blob Storage:
276276

277277
```python
278-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
278+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
279279

280-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
280+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
281281
for doc in loader.lazy_load():
282282
print(doc.page_content) # Prints content of each blob in the container.
283283
```
284284

285285
Below shows how to load documents from blobs in a container with a given prefix:
286286

287287
```python
288-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
289-
loader = AzureBlobStorageContainerLoader(
288+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
289+
290+
loader = AzureBlobStorageLoader(
290291
"https://<account>.blob.core.windows.net", "<container>", prefix="some/prefix/"
291292
)
292293
for doc in loader.lazy_load():
@@ -297,11 +298,11 @@ for doc in loader.lazy_load():
297298
Below shows how to load documents asynchronously. This is achieved by calling the `aload()` or `alazy_load()` methods on the document loader. For example:
298299

299300
```python
300-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
301+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
301302

302303

303304
async def main():
304-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
305+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
305306
async for doc in loader.alazy_load():
306307
print(doc.page_content) # Prints content of each blob in the container.
307308
```
@@ -312,9 +313,10 @@ Below shows how to override the default credentials used by the document loader:
312313
```python
313314
from azure.core.credentials import AzureSasCredential
314315
from azure.identity import ManagedIdentityCredential
316+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
315317

316318
# Override with SAS token
317-
loader = AzureBlobStorageContainerLoader(
319+
loader = AzureBlobStorageLoader(
318320
"https://<account>.blob.core.windows.net",
319321
"<container>",
320322
credential=AzureSasCredential("<sas-token>")
@@ -323,7 +325,7 @@ loader = AzureBlobStorageContainerLoader(
323325

324326
# Override with more specific token credential than the entire
325327
# default credential chain (e.g., system-assigned managed identity)
326-
loader = AzureBlobStorageContainerLoader(
328+
loader = AzureBlobStorageLoader(
327329
"https://<account>.blob.core.windows.net",
328330
"<container>",
329331
credential=ManagedIdentityCredential()
@@ -338,10 +340,10 @@ the `UnstructuredLoader` to parse the local file and return `Document` objects
338340
on behalf of the Azure Storage document loader:
339341

340342
```python
341-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
343+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
342344
from langchain_unstructured import UnstructuredLoader
343345

344-
loader = AzureBlobStorageContainerLoader(
346+
loader = AzureBlobStorageLoader(
345347
"https://<account>.blob.core.windows.net",
346348
"<container>",
347349
# The UnstructuredLoader class accepts a string to the local file path to its constructor,
@@ -358,7 +360,7 @@ If a customer wants to provide additional configuration to the document loader,
358360
define a callable that returns an instantiated document loader. For example, to provide
359361
custom configuration to the `UnstructuredLoader`:
360362
```python
361-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
363+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
362364
from langchain_unstructured import UnstructuredLoader
363365

364366

@@ -370,7 +372,7 @@ def loader_factory(file_path: str) -> UnstructuredLoader:
370372
)
371373

372374

373-
loader = AzureBlobStorageContainerLoader(
375+
loader = AzureBlobStorageLoader(
374376
"https://<account>.blob.core.windows.net", "<container>",
375377
loader_factory=loader_factory
376378
)
@@ -385,11 +387,13 @@ customers will need to perform the following changes:
385387
1. Depend on the `langchain-azure-storage` package instead of `langchain-community`.
386388
2. Update import statements from `langchain_community.document_loaders` to
387389
`langchain_azure_storage.document_loaders`.
388-
3. Update document loader constructor calls to:
390+
3. Change class names from `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
391+
to `AzureBlobStorageLoader`.
392+
4. Update document loader constructor calls to:
389393
1. Use an account URL instead of a connection string.
390394
2. Specify `UnstructuredLoader` as the `loader_factory` if they continue to want to use
391395
Unstructured for parsing documents.
392-
4. Ensure environment has proper credentials (e.g., running `az login` command, setting up
396+
5. Ensure environment has proper credentials (e.g., running `az login` command, setting up
393397
managed identity, etc.) as the connection string would have previously contained the credentials.
394398

395399
Below shows code snippets of what usage patterns look like before and after the proposed migration:
@@ -414,16 +418,16 @@ file_loader = AzureBlobStorageFileLoader(
414418
**After migration:**
415419

416420
```python
417-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader
421+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
418422
from langchain_unstructured import UnstructuredLoader
419423

420-
container_loader = AzureBlobStorageContainerLoader(
424+
container_loader = AzureBlobStorageLoader(
421425
"https://<account>.blob.core.windows.net",
422426
"<container>",
423427
loader_factory=UnstructuredLoader
424428
)
425429

426-
file_loader = AzureBlobStorageFileLoader(
430+
file_loader = AzureBlobStorageLoader(
427431
"https://<account>.blob.core.windows.net",
428432
"<container>",
429433
"<blob>",
@@ -464,16 +468,16 @@ When a `credential` is provided, the credential will be:
464468
```python
465469
import azure.identity
466470
import azure.identity.aio
467-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
471+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
468472

469-
sync_doc_loader = AzureBlobStorageContainerLoader(
473+
sync_doc_loader = AzureBlobStorageLoader(
470474
"https://<account>.blob.core.windows.net",
471475
"<container>",
472476
credential=azure.identity.ManagedIdentityCredential()
473477
)
474478
sync_doc_loader.aload() # Raises ValueError because a sync credential was provided
475479

476-
async_doc_loader = AzureBlobStorageContainerLoader(
480+
async_doc_loader = AzureBlobStorageLoader(
477481
"https://<account>.blob.core.windows.net",
478482
"<container>",
479483
credential=azure.identity.aio.ManagedIdentityCredential()
@@ -490,9 +494,9 @@ When a `credential` is provided, the credential will be:
490494
By default, the document loaders will populate the `source` metadata field of each `Document`
491495
object with the URL of the blob (e.g., `https://<account>.blob.core.windows.net/<container>/<blob>`). For example:
492496
```python
493-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
497+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
494498

495-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
499+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
496500
for doc in loader.lazy_load():
497501
print(doc.metadata["source"]) # Prints URL of each blob in the container.
498502
```
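For illustration, the `source` value shown above could be assembled from the constructor arguments roughly as follows. This is a minimal sketch, not the package's implementation: the helper name `build_blob_url` is hypothetical, and percent-encoding the container and blob names is an assumption rather than something the proposal specifies.

```python
from urllib.parse import quote


def build_blob_url(account_url: str, container_name: str, blob_name: str) -> str:
    """Hypothetical helper: join account URL, container, and blob into a URL."""
    # Drop any trailing slash so the joined URL has no double slashes.
    base = account_url.rstrip("/")
    # quote() keeps "/" unescaped by default, so virtual directory
    # separators in blob names are preserved (assumed behavior).
    return f"{base}/{quote(container_name)}/{quote(blob_name)}"


print(build_blob_url("https://myaccount.blob.core.windows.net", "docs", "some/prefix/report.txt"))
```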
@@ -515,7 +519,7 @@ from langchain_core.documents import Document
515519
from typing import Iterator
516520

517521

518-
class AzureBlobStorageContainerLoader(BaseLoader):
522+
class AzureBlobStorageLoader(BaseLoader):
519523
...
520524
def _lazy_load_from_custom_loader(self, blob_name: str) -> Iterator[Document]:
521525
with tempfile.NamedTemporaryFile() as temp_file:
@@ -573,6 +577,27 @@ However, similar to why document loaders were chosen over blob loaders, blob par
573577
over libraries like Unstructured and take away from the batteries-included value proposition that LangChain document
574578
loaders provide.
575579

580+
It's important to note that this decision does not prevent us from exposing a `blob_parser` parameter in the future.
581+
Specifically, this would be useful if we see customers who want to customize loading behavior further but do not necessarily
582+
want to drop down to using a blob loader interface.
583+
584+
585+
#### Exposing document loaders as two classes, `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`, instead of a single `AzureBlobStorageLoader`
586+
Exposing the document loaders as these two classes would be beneficial in that they would match the existing community
587+
document loaders and reduce the number of changes needed to migrate. However, combining them into a single class
588+
has the following advantages:
589+
590+
* It simplifies the getting started experience. Customers will no longer have to make a decision on which Azure Storage
591+
document loader class to use as there will be only one document loader class to choose from.
592+
* It simplifies class names by removing the additional `File` and `Container` qualifiers, which could lead to
593+
misinterpretations of what the classes do.
594+
* It is easier to maintain as there is only one class that will need to be maintained and less code will likely need to
595+
be duplicated.
596+
597+
While this will introduce an additional step in migrating (i.e., change class names), the impact is limited
598+
as customers will still be providing the same positional parameters even after changing class names
599+
(i.e., use account + container for the container loader and account + container + blob for the file loader).
600+
576601

577602
#### Alternatives to default parsing to UTF-8 text
578603
The default parsing logic when no `loader_factory` is provided is to treat the blob content as UTF-8 text
@@ -638,10 +663,10 @@ customize how blobs are parsed to text. However, possible requested extension po
638663
* Wanting the blob data to be passed using an in-memory representation rather than a file on disk
639664

640665
If we ever plan to extend the interface, we should strongly consider exposing blob loaders
641-
instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
666+
and/or a `blob_parser` parameter instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
642667
section above.
643668

644-
If blob loaders do not suffice, we could consider expanding the `loader_factory` to:
669+
If neither blob loaders nor a `blob_parser` parameter suffice, we could consider expanding the `loader_factory` to:
645670

646671
* Inspect signature arguments of callable provided to `loader_factory` and call the callable with
647672
additional parameters if detected (e.g., detect if a `blob_properties` parameter is present and
@@ -666,7 +691,7 @@ Based on customer requests, in the future, we could consider exposing these prop
666691
## Future work
667692
Below are some possible future work ideas that could be considered after the initial implementation based on customer feedback:
668693

669-
* Expose blob loader integrations for Azure Blob Storage (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
694+
* Expose blob loader and/or blob parser integrations (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
670695
* Proxy additional blob properties as document metadata (see [FAQs](#q-why-is-the-blob-properties-not-exposed-in-the-document-metadata) section).
671696
* Support `async_credential` parameter to allow using both sync and async token credentials with a single document loader instance
672697
(see [FAQs](#q-why-not-support-synchronous-token-credentials-when-calling-asynchronous-methods-and-vice-versa) section).
