From c05b692d69b6dae1ac5f518e84b17f32e7d94372 Mon Sep 17 00:00:00 2001
From: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Date: Fri, 27 Sep 2024 11:16:04 +0200
Subject: [PATCH] docs: document chunking (#111)

[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---
 README.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/README.md b/README.md
index e3907dbd..7bf65c6c 100644
--- a/README.md
+++ b/README.md
@@ -207,6 +207,28 @@ results = doc_converter.convert(conv_input)
 
 You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
+### Chunking
+
+You can perform a hierarchy-aware chunking of a Docling document as follows:
+
+```python
+from docling.document_converter import DocumentConverter
+from docling_core.transforms.chunker import HierarchicalChunker
+
+doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").output
+chunks = list(HierarchicalChunker().chunk(doc))
+# > [
+# >     ChunkWithMetadata(
+# >         path='$.main-text[0]',
+# >         text='DocLayNet: A Large Human-Annotated Dataset [...]',
+# >         page=1,
+# >         bbox=[107.30, 672.38, 505.19, 709.08]
+# >     ),
+# >     [...]
+# > ]
+```
+
+
 ## Technical report
 
 For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).