add splitter config and SplitterOpsFactory to ExtractPDFFlow #78

Merged (2 commits), Dec 29, 2023

@jojortz (Contributor) commented Dec 29, 2023

  • Add splitter field to ExtractPDFConfig
  • Create SplitterOpsFactory, which has two splitters:
    • ParagraphSplitter
    • MarkdownHeaderSplitter
  • Create a notebook, extract_pdf_no_split.ipynb, showing PDF extraction without splitting
  • Test new config on the following notebooks:
    • extract_pdf.ipynb
    • pipeline_pdf.ipynb
    • extract_pdf_no_split.ipynb
  • Update the following notebooks to include the necessary instruction in GuidedPrompt:
    • openai_pdf_source_10k_QA.ipynb
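
For readers unfamiliar with the pattern, a factory that maps a configured name to a splitter, as the description above outlines, might look roughly like the following. This is a minimal sketch, not the code from this PR: the registry keys ("paragraph", "markdown_header"), the class name SplitterFactorySketch, the splitter bodies, and the use of plain strings in place of the library's node type are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in: a splitter maps a list of texts to smaller chunks.
Splitter = Callable[[List[str]], List[str]]


def paragraph_splitter(texts: List[str]) -> List[str]:
    """Split each text on blank lines, dropping empty pieces."""
    chunks: List[str] = []
    for text in texts:
        chunks.extend(p.strip() for p in re.split(r"\n\s*\n", text) if p.strip())
    return chunks


def markdown_header_splitter(texts: List[str]) -> List[str]:
    """Split each text so that every Markdown header line starts a new chunk."""
    chunks: List[str] = []
    for text in texts:
        # Zero-width split just before any line starting with 1-6 '#' characters
        # (splitting on empty matches requires Python 3.7+).
        parts = re.split(r"(?m)^(?=#{1,6}\s)", text)
        chunks.extend(p.strip() for p in parts if p.strip())
    return chunks


class SplitterFactorySketch:
    """Illustrative registry; the real SplitterOpsFactory may differ."""

    _registry: Dict[str, Splitter] = {
        "paragraph": paragraph_splitter,
        "markdown_header": markdown_header_splitter,
    }

    @classmethod
    def get(cls, name: str) -> Splitter:
        # Fail loudly on an unknown name instead of returning None.
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError(f"Unknown splitter: {name!r}") from None
```

Keeping the lookup in one place means a flow only needs the configured name; adding a new splitter is a one-line change to the registry.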

@@ -17,13 +18,13 @@ class ExtractPDFFlow(Flow):
    def __init__(
        self,
        model_config: Dict[str, Any],
        splitter: str = "",

Collaborator:

Nit: None is a better default value.
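
The nit can be illustrated with a small sketch. Using Optional[str] with a None default makes "no splitter specified" explicit, rather than overloading the empty string. The class name ExtractFlowSketch and the has_splitter helper are hypothetical and exist only for this example; they are not part of the PR.

```python
from typing import Any, Dict, Optional


class ExtractFlowSketch:
    """Hypothetical stand-in for the constructor under review."""

    def __init__(
        self,
        model_config: Dict[str, Any],
        splitter: Optional[str] = None,  # None means "not specified"
    ) -> None:
        self.model_config = model_config
        self.splitter = splitter

    def has_splitter(self) -> bool:
        # Explicit identity check: a missing value is distinguished
        # from any string a caller might pass.
        return self.splitter is not None
```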

Comment on lines 37 to 40
        if splitter:
            self._split_op = SplitterOpsFactory.get(splitter)
        else:
            self._split_op = None

Collaborator:

I suggest always assuming that the PDF flow has a splitter for now. The if-else case is nice to have, but it adds complexity and, in the long term, can cause inconsistent behavior between PDF extraction with and without a splitter. Instead, use a default splitter (for example, the paragraph splitter) and let people configure a different one, such as the Markdown splitter.

If someone later requests a no-splitter option, we can think about bringing it back. I would like to keep the code as simple as possible for now.
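
A rough sketch of this suggestion follows. It assumes a default splitter name of "paragraph" and uses a small dictionary as a stand-in for SplitterOpsFactory; the names _SPLITTERS, DEFAULT_SPLITTER, and PdfFlowSketch are hypothetical, and the splitter lambdas are crude placeholders rather than the project's implementations.

```python
from typing import Callable, Dict, List

Splitter = Callable[[List[str]], List[str]]

# Hypothetical registry standing in for SplitterOpsFactory.
_SPLITTERS: Dict[str, Splitter] = {
    # Crude placeholder: split on blank lines.
    "paragraph": lambda texts: [p for t in texts for p in t.split("\n\n") if p],
    # Crude placeholder: split before top-level Markdown headers.
    "markdown_header": lambda texts: [s for t in texts for s in t.split("\n# ") if s],
}

DEFAULT_SPLITTER = "paragraph"  # assumed default name


class PdfFlowSketch:
    def __init__(self, splitter: str = DEFAULT_SPLITTER) -> None:
        # A splitter always exists, so no None branch is needed later.
        self._split_op = _SPLITTERS[splitter]

    def run(self, nodes: List[str]) -> List[str]:
        # Unconditional call: the same code path for every configuration.
        nodes = self._split_op(nodes)
        return nodes
```

With a guaranteed default, callers who do not care about splitting get the paragraph behavior, and the flow body has a single path to test.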

Comment on lines 53 to 54
        if self._split_op:
            nodes = self._split_op(nodes)

Collaborator:

With a default splitter always set, you can call nodes = self._split_op(nodes) directly here, without the if check.

@CambioML (Owner) left a comment

LGTM

@jojortz merged commit c275036 into main on Dec 29, 2023
1 check passed
3 participants