* Add custom text types and recursive
* Fix format
* Update qdrant, Add pdf to unstructured
* Use unstructured as the default text extractor if installed
* Add tests for unstructured
* Update tests env for unstructured
* Fix error if last message is a function call, issue #569
* Remove csv, md and tsv from UNSTRUCTURED_FORMATS
* Update docstring of docs_path
* Update test for get_files_from_dir
* Update docstring of custom_text_types
* Fix missing search_string in update_context
* Add custom_text_types to notebook example
      prompt will be different for different tasks. The default value is `default`, which supports both code and qa.
  - client (Optional, qdrant_client.QdrantClient(":memory:")): A QdrantClient instance. If not provided, an in-memory instance will be assigned. Not recommended for production.
      will be used. If you want to use other vector db, extend this class and override the `retrieve_docs` function.
- - docs_path (Optional, str): the path to the docs directory. It can also be the path to a single file,
-     or the url to a single file. Default is None, which works only if the collection is already created.
+ - docs_path (Optional, Union[str, List[str]]): the path to the docs directory. It can also be the path to a single file,
+     the url to a single file or a list of directories, files and urls. Default is None, which works only if the collection is already created.
  - collection_name (Optional, str): the name of the collection.
      If key not provided, a default name `autogen-docs` will be used.
  - model (Optional, str): the model to use for the retrieve chat.
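Review note: the new `docs_path` type accepts either a single string or a list mixing directories, files and urls. A minimal sketch of how such a value might be normalized is below. The helper name `classify_docs_path` and the rule "anything with an http(s) scheme is a url" are assumptions for illustration only, not code from this PR.

```python
import os
from typing import Dict, List, Union
from urllib.parse import urlparse


def classify_docs_path(docs_path: Union[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: group docs_path entries into dirs, files and urls."""
    # Accept a single string as well as a list, mirroring Union[str, List[str]].
    entries = [docs_path] if isinstance(docs_path, str) else list(docs_path)
    groups: Dict[str, List[str]] = {"dirs": [], "files": [], "urls": []}
    for entry in entries:
        if urlparse(entry).scheme in ("http", "https"):
            groups["urls"].append(entry)
        elif os.path.isdir(entry):
            groups["dirs"].append(entry)
        else:
            # Assumption: everything else is treated as a single file path.
            groups["files"].append(entry)
    return groups
```

Under the docstring added in this PR, only the entries that land in `dirs` would be subject to `custom_text_types` filtering; explicitly listed files and urls are chunked regardless of type.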
@@ -66,11 +66,14 @@ def __init__(
  - customized_answer_prefix (Optional, str): the customized answer prefix for the retrieve chat. Default is "".
      If not "" and the customized_answer_prefix is not in the answer, `Update Context` will be triggered.
  - update_context (Optional, bool): if False, will not apply `Update Context` for interactive retrieval. Default is True.
- - custom_token_count_function(Optional, Callable): a custom function to count the number of tokens in a string.
+ - custom_token_count_function(Optional, Callable): a custom function to count the number of tokens in a string.
      The function should take a string as input and return three integers (token_count, tokens_per_message, tokens_per_name).
      Default is None, tiktoken will be used and may not be accurate for non-OpenAI models.
- - custom_text_split_function(Optional, Callable): a custom function to split a string into a list of strings.
+ - custom_text_split_function(Optional, Callable): a custom function to split a string into a list of strings.
      Default is None, will use the default function in `autogen.retrieve_utils.split_text_to_chunks`.
+ - custom_text_types (Optional, List[str]): a list of file types to be processed. Default is `autogen.retrieve_utils.TEXT_FORMATS`.
+     This only applies to files under the directories in `docs_path`. Explicitly included files and urls will be chunked regardless of their types.
+ - recursive (Optional, bool): whether to search documents recursively in the docs_path. Default is True.
  - parallel (Optional, int): How many parallel workers to use for embedding. Defaults to the number of CPU cores.
  - on_disk (Optional, bool): Whether to store the collection on disk. Default is False.
  - quantization_config: Quantization configuration. If None, quantization will be disabled.
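Review note: the docstring above specifies the two callables only by their signatures. The sketch below shows one possible shape for each. The function names, the blank-line splitting rule, the whitespace tokenization, and the per-message and per-name overhead values are all placeholders chosen for illustration; none of them come from this PR.

```python
from typing import List, Tuple


def split_by_paragraph(text: str) -> List[str]:
    """Candidate custom_text_split_function: a string in, a list of strings out."""
    # Split on blank lines and drop chunks that are empty after stripping.
    chunks = [chunk.strip() for chunk in text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def count_tokens_by_whitespace(text: str) -> Tuple[int, int, int]:
    """Candidate custom_token_count_function.

    Returns (token_count, tokens_per_message, tokens_per_name), as the
    docstring requires. Whitespace splitting is a rough stand-in for a
    real tokenizer.
    """
    token_count = len(text.split())
    tokens_per_message = 3  # assumed overhead, not taken from the PR
    tokens_per_name = 1  # assumed overhead, not taken from the PR
    return token_count, tokens_per_message, tokens_per_name
```

Going by the parameter names in the docstring, these would presumably be supplied as `custom_text_split_function` and `custom_token_count_function`; the exact wiring is not visible in this diff.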
- """Create a Qdrant collection from all the files in a given directory, the directory can also be a single file or a url to
- a single file.
+ """Create a Qdrant collection from all the files in a given directory, the directory can also be a single file or a
+ url to a single file.

  Args:
      dir_path (str): the path to the directory, file or url.
@@ -156,24 +163,35 @@ def create_qdrant_from_dir(
      collection_name (Optional, str): the name of the collection. Default is "all-my-documents".
      chunk_mode (Optional, str): the chunk mode. Default is "multi_lines".
      must_break_at_empty_line (Optional, bool): Whether to break at empty line. Default is True.
-     embedding_model (Optional, str): the embedding model to use. Default is "BAAI/bge-small-en-v1.5". The list of all the available models can be found at https://qdrant.github.io/fastembed/examples/Supported_Models/.
+     embedding_model (Optional, str): the embedding model to use. Default is "BAAI/bge-small-en-v1.5".
+         The list of all the available models can be found at https://qdrant.github.io/fastembed/examples/Supported_Models/.
+     custom_text_split_function (Optional, Callable): a custom function to split a string into a list of strings.
+         Default is None, will use the default function in `autogen.retrieve_utils.split_text_to_chunks`.
+     custom_text_types (Optional, List[str]): a list of file types to be processed. Default is TEXT_FORMATS.
+     recursive (Optional, bool): whether to search documents recursively in the dir_path. Default is True.
      parallel (Optional, int): How many parallel workers to use for embedding. Defaults to the number of CPU cores
      on_disk (Optional, bool): Whether to store the collection on disk. Default is False.
-     quantization_config: Quantization configuration. If None, quantization will be disabled. Ref: https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection
-     hnsw_config: HNSW configuration. If None, default configuration will be used. Ref: https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection
+     quantization_config: Quantization configuration. If None, quantization will be disabled.
      payload_indexing: Whether to create a payload index for the document field. Default is False.
-     qdrant_client_options: (Optional, dict): the options for instantiating the qdrant client. Reference: https://github.com/qdrant/qdrant-client/blob/master/qdrant_client/qdrant_client.py#L36-L58.
+     qdrant_client_options: (Optional, dict): the options for instantiating the qdrant client.
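Review note: `custom_text_types` and `recursive` together decide which files under `dir_path` get picked up. A minimal sketch of that selection logic is below. `list_text_files` and `demo` are hypothetical names, and matching on the lowercase file extension is an assumption; the PR's actual implementation (its tests mention `get_files_from_dir`) is not shown in this diff and may differ.

```python
import os
import tempfile
from typing import List, Sequence, Tuple


def list_text_files(dir_path: str, types: Sequence[str], recursive: bool = True) -> List[str]:
    """Hypothetical helper: collect files under dir_path whose extension is in types."""
    # Normalize "txt" and ".TXT" alike to ".txt" for suffix matching.
    suffixes = tuple("." + t.lower().lstrip(".") for t in types)
    found: List[str] = []
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            if name.lower().endswith(suffixes):
                found.append(os.path.join(root, name))
        if not recursive:
            break  # only the top-level directory is inspected
    return sorted(found)


def demo() -> Tuple[List[str], List[str]]:
    """Build a small temporary tree and list it with and without recursion."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "sub"))
        for rel in ("a.txt", "b.pdf", os.path.join("sub", "c.md")):
            with open(os.path.join(tmp, rel), "w") as fh:
                fh.write("x")
        deep = [os.path.relpath(p, tmp) for p in list_text_files(tmp, ["txt", "md"], recursive=True)]
        flat = [os.path.relpath(p, tmp) for p in list_text_files(tmp, ["txt", "md"], recursive=False)]
    return deep, flat
```

In this sketch, `b.pdf` is skipped because `pdf` is not in the requested types, and `sub/c.md` is found only when `recursive=True`.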