feat(doc-loaders): Add support for DirectoryLoader (#620)
Co-authored-by: David Miguel <[email protected]>
Nana-Kwame-bot and davidmigloz authored Dec 16, 2024
1 parent 8283c00 commit 4730f2a
Showing 12 changed files with 861 additions and 6 deletions.
1 change: 1 addition & 0 deletions docs/_sidebar.md
@@ -80,6 +80,7 @@
- [Text](/modules/retrieval/document_loaders/how_to/text.md)
- [JSON](/modules/retrieval/document_loaders/how_to/json.md)
- [Web page](/modules/retrieval/document_loaders/how_to/web.md)
- [Directory](/modules/retrieval/document_loaders/how_to/directory.md)
- [Document transformers](/modules/retrieval/document_transformers/document_transformers.md)
- Text splitters
- [Split by character](/modules/retrieval/document_transformers/text_splitters/character_text_splitter.md)
213 changes: 213 additions & 0 deletions docs/modules/retrieval/document_loaders/how_to/directory.md
@@ -0,0 +1,213 @@
# Directory

Use `DirectoryLoader` to load `Document`s from multiple files in a directory with extensive customization options.

## Overview

The `DirectoryLoader` loads documents from a directory, with options for glob filtering, exclusion patterns, sampling, per-extension loaders, and custom metadata. It supports `.txt`, `.json`, `.csv`, and `.tsv` files out of the box.

## Basic Usage

```dart
// Load all text files from a directory recursively
final loader = DirectoryLoader(
  '/path/to/documents',
  glob: '*.txt',
  recursive: true,
);
final documents = await loader.load();
```

## Constructor Parameters

### `filePath` (required)
- Type: `String`
- Description: The path to the directory containing documents to load.

### `glob`
- Type: `String`
- Default: `'*'` (all files)
- Description: A glob pattern to match files. Only files matching this pattern will be loaded.
- Examples:
```dart
// Load only JSON and text files
DirectoryLoader('/path', glob: '*.{txt,json}')
// Load files starting with 'report'
DirectoryLoader('/path', glob: 'report*')
```

### `recursive`
- Type: `bool`
- Default: `true`
- Description: Whether to search recursively in subdirectories.

### `exclude`
- Type: `List<String>`
- Default: `[]`
- Description: Glob patterns to exclude from loading.
- Example:
```dart
DirectoryLoader(
  '/path',
  exclude: ['*.tmp', 'draft*'],
)
```

### `loaderMap`
- Type: `Map<String, BaseDocumentLoader Function(String)>`
- Default: `DirectoryLoader.defaultLoaderMap`
- Description: A map to customize loaders for different file types.
- Default supported types:
  - `.txt`: `TextLoader`
  - `.json`: `JsonLoader` (with root schema)
  - `.csv` and `.tsv`: `CsvLoader`
- Example of extending loaders:
```dart
final loader = DirectoryLoader(
  '/path/to/docs',
  loaderMap: {
    // Start from the default loaders; later entries override earlier ones
    ...DirectoryLoader.defaultLoaderMap,
    // Add a custom loader for XML files (CustomXmlLoader is user-defined)
    '.xml': (path) => CustomXmlLoader(path),
  },
);
```
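
Because a Dart map literal keeps the last value written for a key, the position of the spread decides which loader wins when an extension appears in both maps. The following is a minimal sketch of that rule, using plain strings in place of loader factories. `defaultLoaders`, `pickLoader`, and the loader names are hypothetical stand-ins, and how `DirectoryLoader` itself resolves an extension is not shown in this document.

```dart
/// Hypothetical stand-in for the default loader map: extension -> loader name.
const defaultLoaders = {
  '.txt': 'TextLoader',
  '.json': 'JsonLoader',
};

/// Defaults spread first: the custom '.json' entry overrides the default one.
final customWins = {
  ...defaultLoaders,
  '.json': 'CustomJsonLoader',
};

/// Defaults spread last: the default '.json' entry overwrites the custom one.
final defaultWins = {
  '.json': 'CustomJsonLoader',
  ...defaultLoaders,
};

/// Returns the loader name registered for the extension of [path],
/// or null when the path has no extension or none is registered.
String? pickLoader(String path, Map<String, String> loaders) {
  final dot = path.lastIndexOf('.');
  if (dot < 0) return null;
  return loaders[path.substring(dot)];
}
```

With these maps, `pickLoader('data/report.json', customWins)` resolves to the custom entry, while the same lookup in `defaultWins` falls back to the default one.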

### `loadHidden`
- Type: `bool`
- Default: `false`
- Description: Whether to load hidden files.
- Platform-specific behavior:
  - On Unix-like systems (Linux, macOS), hidden files are identified by names starting with `.`.
  - On Windows, detection may not work as expected, because hidden files are marked by a file attribute rather than a naming convention.
  - For consistent handling across operating systems, add your own platform-specific check.
- Example of platform-aware hidden file checking:
```dart
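import 'dart:io' show File, Platform;

import 'package:path/path.dart' as path;

bool isHiddenFile(File file) {
  // Dot-prefixed names are hidden on Unix-like systems (Linux, macOS).
  final hasDotPrefix = path.basename(file.path).startsWith('.');
  if (Platform.isWindows) {
    // dart:io does not expose the Windows "hidden" attribute:
    // FileStat.modeString() only describes permissions (e.g. 'rw-r--r--').
    // A reliable check needs a platform API; this fallback only detects
    // dot-prefixed names.
    return hasDotPrefix;
  }
  return hasDotPrefix;
}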
```

### `sampleSize`
- Type: `int`
- Default: `0` (load all files)
- Description: Maximum number of files to load.
- Example:
```dart
// Load only 10 files
DirectoryLoader('/path', sampleSize: 10)
```

### `randomizeSample`
- Type: `bool`
- Default: `false`
- Description: Whether to randomize the sample of files.

### `sampleSeed`
- Type: `int?`
- Default: `null`
- Description: Seed for random sampling to ensure reproducibility.
- Example:
```dart
// Consistent random sampling
DirectoryLoader(
  '/path',
  sampleSize: 10,
  randomizeSample: true,
  sampleSeed: 42,
)
```
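
To illustrate how `sampleSize`, `randomizeSample`, and `sampleSeed` can interact, here is a minimal sketch of seeded sampling over a list of file paths. `sampleFiles` is a hypothetical helper, not the loader's actual implementation, which this document does not show; the sketch only assumes that a seeded `Random` drives a shuffle before the first `sampleSize` entries are taken.

```dart
import 'dart:math';

/// Hypothetical helper, not part of DirectoryLoader: picks at most
/// [sampleSize] paths, optionally shuffling with a seeded generator first.
List<String> sampleFiles(
  List<String> files, {
  int sampleSize = 0,
  bool randomizeSample = false,
  int? sampleSeed,
}) {
  // A sample size of 0 means "no limit", mirroring the documented default.
  if (sampleSize <= 0) return List.of(files);

  // Work on a copy so the caller's list keeps its order.
  final candidates = List.of(files);
  if (randomizeSample) {
    // The same seed yields the same shuffle, which makes the sample repeatable.
    candidates.shuffle(Random(sampleSeed));
  }
  return candidates.take(sampleSize).toList();
}
```

Calling `sampleFiles(paths, sampleSize: 10, randomizeSample: true, sampleSeed: 42)` twice on the same list returns the same ten paths; leaving `sampleSeed` as `null` gives a different selection on each run.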

### `metadataBuilder`
- Type: `Map<String, dynamic> Function(File file, Map<String, dynamic> defaultMetadata)?`
- Default: `null`
- Description: A custom function to build metadata for each document.
- Example:
```dart
final loader = DirectoryLoader(
  '/path',
  metadataBuilder: (file, defaultMetadata) {
    return {
      ...defaultMetadata,
      'custom_tag': 'important_document',
      'processing_date': DateTime.now().toIso8601String(),
    };
  },
);
```

## Default Metadata

By default, each document receives metadata including:
- `source`: Full file path
- `name`: Filename
- `extension`: File extension
- `size`: File size in bytes
- `lastModified`: Last modification timestamp (milliseconds since epoch)
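
As a rough illustration of where these values can come from, the sketch below builds a map with the same keys using only `dart:io`. `buildDefaultMetadata` is a hypothetical helper rather than the library's code, and two details are assumptions: that `source` is an absolute path and that `extension` keeps its leading dot (as the `loaderMap` keys do).

```dart
import 'dart:io';

/// Hypothetical helper, not the library's implementation: builds a metadata
/// map with the documented keys for [file].
Map<String, dynamic> buildDefaultMetadata(File file) {
  final stat = file.statSync();
  final name = file.uri.pathSegments.last;
  // A dot at index 0 (e.g. '.gitignore') is treated as part of the name.
  final dot = name.lastIndexOf('.');
  return {
    // Assumption: "full file path" means the absolute path.
    'source': file.absolute.path,
    'name': name,
    // Assumption: the extension keeps its leading dot, like the loaderMap keys.
    'extension': dot > 0 ? name.substring(dot) : '',
    'size': stat.size,
    'lastModified': stat.modified.millisecondsSinceEpoch,
  };
}
```

A `metadataBuilder` receives a map of this shape as `defaultMetadata`, so spreading it first and adding keys afterwards keeps the defaults intact.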

## Lazy Loading

The `DirectoryLoader` supports lazy loading through the `lazyLoad()` method, which returns a `Stream<Document>`. This is useful for processing large numbers of documents without loading everything into memory at once.

```dart
final loader = DirectoryLoader('/path/to/documents');
await for (final document in loader.lazyLoad()) {
  // Process each document as it's loaded
  print(document.pageContent);
}
```
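
When documents are handed to a downstream step that works on groups (for example, embedding requests), the stream can be consumed in fixed-size batches so that only one batch is held in memory at a time. The following is a minimal sketch of such a helper; `inBatches` is hypothetical, is not part of the library, and works on any `Stream`.

```dart
/// Hypothetical helper: groups the items of [source] into lists of at most
/// [batchSize] elements, emitting each list as soon as it is full.
Stream<List<T>> inBatches<T>(Stream<T> source, int batchSize) async* {
  if (batchSize <= 0) {
    throw ArgumentError.value(batchSize, 'batchSize', 'must be positive');
  }
  var batch = <T>[];
  await for (final item in source) {
    batch.add(item);
    if (batch.length == batchSize) {
      yield batch;
      // Start a fresh list so the emitted batch is not mutated afterwards.
      batch = <T>[];
    }
  }
  // Emit the remaining items when the stream ends on a partial batch.
  if (batch.isNotEmpty) yield batch;
}
```

It could be combined with the loader as `await for (final batch in inBatches(loader.lazyLoad(), 20)) { ... }`, where `20` is an arbitrary batch size.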

## Error Handling

- Throws an `ArgumentError` if the glob pattern is empty

## Advanced Example

```dart
final loader = DirectoryLoader(
  '/path/to/documents',
  glob: '*.{txt,json,csv}', // Multiple file types
  recursive: true, // Search subdirectories
  exclude: ['temp*', '*.backup'], // Exclude temp and backup files
  loadHidden: false, // Ignore hidden files
  sampleSize: 50, // Load at most 50 files
  randomizeSample: true, // Randomize the sample
  sampleSeed: 123, // Reproducible random sampling
  loaderMap: {
    // Keep the default loaders for .txt and .csv
    ...DirectoryLoader.defaultLoaderMap,
    // Override the loader for JSON files (CustomJsonLoader is user-defined)
    '.json': (path) => CustomJsonLoader(path),
  },
  metadataBuilder: (file, defaultMetadata) {
    // Add custom metadata (_categorizeFile is a user-defined helper)
    return {
      ...defaultMetadata,
      'category': _categorizeFile(file),
    };
  },
);
final documents = await loader.load();
```

## Best Practices

- Use `lazyLoad()` for large directories to manage memory efficiently
- Provide specific glob patterns to reduce unnecessary file processing
- Customize loaders for specialized file types
- Use `metadataBuilder` to add context-specific information to documents

## Limitations

- Requires file system access (`dart:io`), so it is not available on platforms without it, such as the web
- Performance may vary with large directories
12 changes: 10 additions & 2 deletions examples/browser_summarizer/pubspec.lock
@@ -171,6 +171,14 @@ packages:
url: "https://pub.dev"
source: hosted
version: "2.4.4"
glob:
dependency: transitive
description:
name: glob
sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
url: "https://pub.dev"
source: hosted
version: "2.1.2"
html:
dependency: transitive
description:
@@ -330,10 +338,10 @@ packages:
dependency: transitive
description:
name: path
- sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+ sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
url: "https://pub.dev"
source: hosted
- version: "1.9.0"
+ version: "1.9.1"
path_provider_linux:
dependency: transitive
description:
20 changes: 18 additions & 2 deletions examples/docs_examples/pubspec.lock
@@ -119,6 +119,14 @@ packages:
url: "https://pub.dev"
source: hosted
version: "2.1.3"
file:
dependency: transitive
description:
name: file
sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4
url: "https://pub.dev"
source: hosted
version: "7.0.1"
fixnum:
dependency: transitive
description:
@@ -151,6 +159,14 @@ packages:
url: "https://pub.dev"
source: hosted
version: "0.8.13"
glob:
dependency: transitive
description:
name: glob
sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
url: "https://pub.dev"
source: hosted
version: "2.1.2"
google_generative_ai:
dependency: transitive
description:
@@ -359,10 +375,10 @@ packages:
dependency: transitive
description:
name: path
- sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+ sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
url: "https://pub.dev"
source: hosted
- version: "1.9.0"
+ version: "1.9.1"
petitparser:
dependency: transitive
description:
20 changes: 18 additions & 2 deletions examples/wikivoyage_eu/pubspec.lock
@@ -89,6 +89,14 @@ packages:
url: "https://pub.dev"
source: hosted
version: "2.1.3"
file:
dependency: transitive
description:
name: file
sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4
url: "https://pub.dev"
source: hosted
version: "7.0.1"
fixnum:
dependency: transitive
description:
@@ -113,6 +121,14 @@ packages:
url: "https://pub.dev"
source: hosted
version: "2.4.4"
glob:
dependency: transitive
description:
name: glob
sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
url: "https://pub.dev"
source: hosted
version: "2.1.2"
html:
dependency: transitive
description:
@@ -240,10 +256,10 @@ packages:
dependency: transitive
description:
name: path
- sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+ sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
url: "https://pub.dev"
source: hosted
- version: "1.9.0"
+ version: "1.9.1"
petitparser:
dependency: transitive
description:
2 changes: 2 additions & 0 deletions melos.yaml
@@ -41,6 +41,7 @@ command:
flutter_markdown: ^0.7.3
freezed_annotation: ^2.4.2
gcloud: ^0.8.13
glob: ^2.1.2
google_generative_ai: ^0.4.6
googleapis: ^13.0.0
googleapis_auth: ^1.6.0
@@ -53,6 +54,7 @@ command:
math_expressions: ^2.6.0
meta: ^1.11.0
objectbox: ^4.0.3
path: ^1.9.1
pinecone: ^0.7.2
rxdart: ">=0.27.7 <0.29.0"
shared_preferences: ^2.3.0
@@ -0,0 +1 @@
export 'directory_io.dart' if (dart.library.js_interop) 'directory_stub.dart';