add splitter config and SplitterOpsFactory to ExtractPDFFlow #78

jojortz · 2023-12-29T06:17:58Z

Add a splitter field to ExtractPDFConfig.
Create SplitterOpsFactory, which provides two splitters:
- ParagraphSplitter
- MarkdownHeaderSplitter
Create a notebook, extract_pdf_no_split.ipynb, that shows PDF extraction without splitting.
Test the new config on the following notebooks:
- extract_pdf.ipynb
- pipeline_pdf.ipynb
- extract_pdf_no_split.ipynb
Update the following notebook so that its GuidedPrompt includes the necessary instruction:
- openai_pdf_source_10k_QA.ipynb
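
For readers unfamiliar with the pattern, here is a minimal sketch of a name-to-splitter lookup of the kind the description outlines. It is not the uniflow implementation: `SplitterRegistry`, the keys `"paragraph"` and `"markdown_header"`, and the plain-string inputs are all assumptions made for illustration (the real ops work on nodes).

```python
import re
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks (stand-in for ParagraphSplitter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_markdown_headers(text: str) -> List[str]:
    """Start a new chunk at each Markdown header line (stand-in for MarkdownHeaderSplitter)."""
    chunks = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    return [c.strip() for c in chunks if c.strip()]


class SplitterRegistry:
    """Hypothetical lookup, loosely modeled on the SplitterOpsFactory.get call in the diff."""

    _splitters: Dict[str, Callable[[str], List[str]]] = {
        "paragraph": split_paragraphs,  # assumed key name
        "markdown_header": split_markdown_headers,  # assumed key name
    }

    @classmethod
    def get(cls, name: str) -> Callable[[str], List[str]]:
        try:
            return cls._splitters[name]
        except KeyError:
            raise ValueError(f"Unknown splitter: {name!r}") from None
```

Raising on an unknown name, rather than returning None, surfaces a misconfigured splitter at construction time instead of at the first run.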

goldmermaid · 2023-12-29T07:32:19Z

uniflow/flow/extract/extract_pdf_flow.py

@@ -17,13 +18,13 @@ class ExtractPDFFlow(Flow):
    def __init__(
        self,
        model_config: Dict[str, Any],
+        splitter: str = "",


Nit: None is a better default value.
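
A small sketch of what the nit implies, assuming a hypothetical config class and a hypothetical fallback name (`PDFConfigSketch` and `"paragraph"` are not taken from the PR): with `None` as the default, "not set" is distinguishable from an explicitly passed string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDFConfigSketch:
    """Hypothetical stand-in for a config with an optional splitter name."""

    splitter: Optional[str] = None  # None means "not configured"


def resolve_splitter(config: PDFConfigSketch, fallback: str = "paragraph") -> str:
    """Return the configured splitter name, or the (assumed) fallback when unset."""
    # `is None` tests for "unset" only; an explicit value is passed through as given.
    if config.splitter is None:
        return fallback
    return config.splitter
```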

goldmermaid · 2023-12-29T07:35:57Z

uniflow/flow/extract/extract_pdf_flow.py

+        if splitter:
+            self._split_op = SplitterOpsFactory.get(splitter)
+        else:
+            self._split_op = None


I would suggest always assuming there is a splitter in the PDF flow for now. The if-else is nice to have, but it adds complexity and, in the long term, can cause the PDF extractor to behave inconsistently with and without a splitter. Instead, use a default splitter (for example, the paragraph splitter) and let people configure a different one, such as the markdown splitter.

In the future, if someone requests extraction without a splitter, we can think about bringing it back. I would like to keep the code as simple as possible for now.

goldmermaid · 2023-12-29T07:37:13Z

uniflow/flow/extract/extract_pdf_flow.py

+        if self._split_op:
+            nodes = self._split_op(nodes)


In this case, you can directly use nodes = self._split_op(nodes) here.
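
The two review comments above can be sketched together as follows. This is an illustration under stated assumptions, not the merged code: `PDFFlowSketch`, `SPLITTERS`, `DEFAULT_SPLITTER`, the key names, and the use of plain strings as nodes are all hypothetical.

```python
from typing import Callable, Dict, List


def split_paragraphs(nodes: List[str]) -> List[str]:
    """Split every node on blank lines and drop empty chunks."""
    return [p.strip() for text in nodes for p in text.split("\n\n") if p.strip()]


def split_markdown_headers(nodes: List[str]) -> List[str]:
    """Start a new chunk whenever a line begins with '#'."""
    chunks: List[str] = []
    for text in nodes:
        current: List[str] = []
        for line in text.splitlines():
            if line.startswith("#") and current:
                chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current))
    return chunks


# Assumed key names; the real factory may use different ones.
SPLITTERS: Dict[str, Callable[[List[str]], List[str]]] = {
    "paragraph": split_paragraphs,
    "markdown_header": split_markdown_headers,
}
DEFAULT_SPLITTER = "paragraph"


class PDFFlowSketch:
    """Hypothetical flow that always owns a splitter."""

    def __init__(self, splitter: str = DEFAULT_SPLITTER) -> None:
        self._split_op = SPLITTERS[splitter]

    def run(self, nodes: List[str]) -> List[str]:
        # A splitter is always present, so no `if self._split_op:` guard is needed.
        nodes = self._split_op(nodes)
        return nodes
```

With a default in place, `PDFFlowSketch().run(...)` and `PDFFlowSketch("markdown_header").run(...)` follow the same code path, which is the simplification the reviewer is asking for.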

CambioML

LGTM

add splitter config and SplitterOpsFactory to ExtractPDFFlow

46e36f8

jojortz requested review from CambioML and goldmermaid as code owners December 29, 2023 06:17

goldmermaid reviewed Dec 29, 2023


make ParagraphSplitter default, delete extract_pdf_no_split notebook

3d85675

CambioML approved these changes Dec 29, 2023


jojortz merged commit c275036 into main Dec 29, 2023
1 check passed
