Feature/make parsing chunking providers #820

Merged 6 commits on Aug 1, 2024

Conversation

emrgnt-cmplxty (Contributor) commented Aug 1, 2024

🚀 This description was created by Ellipsis for commit 9bfef06

Summary:

Introduced chunking and parsing providers, updated configurations, and modified pipelines and tests to integrate new providers.

Key points:

  • Added ChunkingProvider and ParsingProvider classes in r2r/base/providers/chunking.py and r2r/base/providers/parsing.py.
  • Updated r2r.toml and pyproject.toml to include new chunking and parsing configurations.
  • Modified r2r/base/__init__.py to import new providers.
  • Created ChunkingPipe and ParsingPipe in r2r/pipes/ingestion.
  • Updated IngestionPipeline in r2r/pipelines/ingestion_pipeline.py to use new pipes.
  • Added R2RChunkingProvider and UnstructuredChunkingProvider in r2r/providers/chunking.
  • Added R2RParsingProvider and UnstructuredParsingProvider in r2r/providers/parsing.
  • Updated R2RConfig in r2r/main/assembly/config.py to handle new configurations.
  • Updated R2RProviderFactory in r2r/main/assembly/factory.py to create new providers.
  • Updated tests in tests/test_config.py to cover new configurations.
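
The provider and pipe split listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ChunkingConfig`, its fields, `FixedSizeChunkingProvider`, and the `chunk`/`run` method names are all hypothetical, and the real `ChunkingProvider` and `ChunkingPipe` in R2R may have different signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ChunkingConfig:
    # Hypothetical fields; the configuration in the PR may differ.
    provider: str = "fixed"
    chunk_size: int = 20
    chunk_overlap: int = 5


class ChunkingProvider(ABC):
    """Abstract base: concrete providers decide how text is split."""

    def __init__(self, config: ChunkingConfig) -> None:
        self.config = config

    @abstractmethod
    def chunk(self, text: str) -> Iterator[str]:
        ...


class FixedSizeChunkingProvider(ChunkingProvider):
    """Illustrative provider: fixed-size character windows with overlap."""

    def chunk(self, text: str) -> Iterator[str]:
        step = self.config.chunk_size - self.config.chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        for start in range(0, len(text), step):
            yield text[start : start + self.config.chunk_size]


class ChunkingPipe:
    """Illustrative pipe: depends only on the abstract provider interface."""

    def __init__(self, provider: ChunkingProvider) -> None:
        self.provider = provider

    def run(self, documents: List[str]) -> List[str]:
        # Flatten the chunks of every document into one list.
        return [c for doc in documents for c in self.provider.chunk(doc)]
```

Because the pipe only sees the abstract interface, swapping one provider for another (for example a library-backed one) would not require changes to the pipeline itself, which appears to be the motivation for the refactor.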

Generated with ❤️ by ellipsis.dev

@emrgnt-cmplxty marked this pull request as ready for review August 1, 2024 22:12
@emrgnt-cmplxty merged commit 7d9b756 into dev Aug 1, 2024
2 of 3 checks passed
ellipsis-dev (bot, Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 9bfef06 in 54 seconds

More details
  • Looked at 1900 lines of code in 26 files
  • Skipped 1 file when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. r2r.toml:10
  • Draft comment:
    The chunking configuration in r2r.toml correctly specifies the unstructured provider and the by_title method, which aligns with the capabilities introduced in the PR. This is a crucial update ensuring that the new chunking functionality is configurable.
  • Reason this comment was not posted:
    Confidence changes required: 0%
    The PR introduces a new chunking provider using the unstructured library, which is added to the dependencies in pyproject.toml. The r2r.toml configuration file is updated to include settings for this new chunking provider, specifying methods like by_title which aligns with the capabilities of the unstructured library. The codebase includes new classes for handling chunking operations, specifically R2RChunkingProvider and UnstructuredChunkingProvider, which are correctly implemented to handle the respective chunking logic based on the configuration. The PR also refactors the parsing configuration into its own section in r2r.toml, moving away from the previous ingestion section, which is a logical change to separate concerns of parsing and chunking from other types of data ingestion.
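
To make the config-driven selection described in this draft comment concrete, here is a rough sketch of a factory that picks a chunking provider from separate parsing and chunking sections. Everything in it is an assumption for illustration: the key names (`provider`, `method`), the `"recursive"` default, the registry, and the stub classes are hypothetical and are not taken from `r2r.toml` or `R2RProviderFactory`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type


@dataclass
class StubChunkingProvider:
    """Stand-in for a real provider; it only records the configured method."""

    method: str


class StubR2RChunkingProvider(StubChunkingProvider):
    pass


class StubUnstructuredChunkingProvider(StubChunkingProvider):
    pass


# Hypothetical registry mapping config names to provider classes.
CHUNKING_REGISTRY: Dict[str, Type[StubChunkingProvider]] = {
    "r2r": StubR2RChunkingProvider,
    "unstructured": StubUnstructuredChunkingProvider,
}


def create_chunking_provider(config: Dict[str, Any]) -> StubChunkingProvider:
    """Build a provider from the 'chunking' section of an already-parsed config."""
    section = config.get("chunking", {})
    name = section.get("provider")
    if name not in CHUNKING_REGISTRY:
        raise ValueError(f"Unsupported chunking provider: {name!r}")
    # "recursive" is an assumed default, used only in this sketch.
    return CHUNKING_REGISTRY[name](method=section.get("method", "recursive"))


# Parsed form of a config with separate parsing and chunking sections
# (key names are assumptions).
raw_config = {
    "parsing": {"provider": "unstructured"},
    "chunking": {"provider": "unstructured", "method": "by_title"},
}
provider = create_chunking_provider(raw_config)
```

Keeping parsing and chunking in their own sections means each factory function reads only the section it needs, so an unknown provider name fails early with a clear error rather than deep inside the pipeline.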

Workflow ID: wflow_mGiWs0CZ0xfjNbW6


@emrgnt-cmplxty deleted the feature/make-parsing-chunking-providers branch August 7, 2024 19:53