feat: Adding Semantic Text Splitter Component (Text Splitters) #4254

Merged 3 commits on Nov 11, 2024

Conversation

joaoguilhermeS
Collaborator

Feat: Semantic Text Splitter with Advanced Threshold Controls

Overview
Introduces a new Semantic Text Splitter component that provides flexible text chunking with statistical threshold controls and regex support.

Features:

  • Multiple threshold control methods:
    • Percentile-based splitting
    • Standard deviation thresholds
    • Interquartile range
  • Configurable chunk size and count
  • Optional regex-based splitting
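
To make the three threshold methods above concrete, here is a minimal standard-library sketch of how a breakpoint threshold could be derived from the distances between consecutive sentence embeddings. It is not the component's code: the function names, the default amounts (95, 3, 1.5), and the sentence regex are assumptions chosen for illustration, and the real component delegates this work to langchain_experimental.

```python
import re
import statistics


def split_sentences(text, pattern=r"(?<=[.?!])\s+"):
    """Split text into sentences with a regex (the pattern is an assumed default)."""
    return [s for s in re.split(pattern, text) if s]


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoint_threshold(distances, method="percentile", amount=None):
    """Return the distance above which two neighbouring sentences are split."""
    if method == "percentile":
        # Split at the largest distances, e.g. the top 5% by default.
        return percentile(distances, 95 if amount is None else amount)
    if method == "standard_deviation":
        # Split where the distance exceeds mean + k standard deviations.
        k = 3 if amount is None else amount
        return statistics.mean(distances) + k * statistics.pstdev(distances)
    if method == "interquartile":
        # Split where the distance exceeds mean + k * (Q3 - Q1).
        k = 1.5 if amount is None else amount
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + k * iqr
    raise ValueError(f"unknown threshold method: {method}")


def split_points(distances, method="percentile", amount=None):
    """Indices i where a chunk boundary falls after sentence i."""
    threshold = breakpoint_threshold(distances, method, amount)
    return [i for i, d in enumerate(distances) if d > threshold]
```

In this sketch, a lower percentile or a smaller multiplier produces more boundaries and therefore smaller chunks, which is the trade-off the threshold controls expose.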

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request python Pull requests that update Python code labels Oct 23, 2024
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Oct 23, 2024
Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

@joaoguilhermeS Is langchain_experimental part of Langflow's pyproject? If not, @ogabrielluiz, can we add it to the pyproject?
@joaoguilhermeS Can you confirm whether uv has already added it to the pyproject file?

Also, I suggest setting beta to True in the component, since it is part of langchain_experimental.

@joaoguilhermeS
Collaborator Author

Thanks for your review, @edwinjosechittilappilly. I have added langchain_experimental to the pyproject and everything seems to be working. I also added the beta flag to the Semantic Text Splitter component.

@edwinjosechittilappilly
Collaborator

@joaoguilhermeS Great job.

https://github.com/langflow-ai/langflow/blob/jg/feat-semantic-text-splitter/src/backend/base/pyproject.toml
The base pyproject already includes langchain_experimental.

Can you test adding a different version of langchain_experimental in both?

I suggest updating the base pyproject so we won’t need to add it in other pyprojects outside of the base.

pyproject.toml (review thread: outdated, resolved)
@joaoguilhermeS
Collaborator Author

Hey @edwinjosechittilappilly, I tested the component with the current langchain-experimental dependency version and it works fine, so I think there is no need to update it. Leaving the version alone also avoids breaking other components. Thanks for the review.

Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

LGTM.
@Cristhianzl We might need to update uv.lock later if required.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2024
Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

@joaoguilhermeS If the dependencies have not changed, I suggest leaving uv.lock as it is.

Apart from that, it's all good.

@dosubot dosubot bot removed the lgtm This PR has been approved by a maintainer label Nov 5, 2024
@joaoguilhermeS joaoguilhermeS force-pushed the jg/feat-semantic-text-splitter branch 2 times, most recently from 8129155 to c4ba19e Compare November 5, 2024 20:21
codspeed-hq bot commented Nov 5, 2024

CodSpeed Performance Report

Merging #4254 will not alter performance

Comparing jg/feat-semantic-text-splitter (0db0ec3) with main (fa28541)

Summary

✅ 15 untouched benchmarks

Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

LGTM

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 5, 2024
@edwinjosechittilappilly
Collaborator

@ogabrielluiz I believe you need to approve this for it to be merged, since changes were requested. Can you take a look? It has been blocked from merging.

@edwinjosechittilappilly edwinjosechittilappilly merged commit 357c587 into main Nov 11, 2024
28 checks passed
@edwinjosechittilappilly edwinjosechittilappilly deleted the jg/feat-semantic-text-splitter branch November 11, 2024 11:17
Labels
enhancement New feature or request lgtm This PR has been approved by a maintainer python Pull requests that update Python code size:L This PR changes 100-499 lines, ignoring generated files.
3 participants