feat: add input chunking to summarize task #157

Merged: 5 commits from feat/chunk-size merged into main on Dec 19, 2024

Conversation

edward-ly (Contributor):

Closes #150. Lightly tested with gpt-4o, gpt-4o-mini, and gpt-3.5-turbo.

edward-ly requested a review from julien-nc on November 19, 2024.
julien-nc (Member) left a comment:

👍

The admin setting input could be moved to the "Text generation" section. Wdyt?

lib/TaskProcessing/SummaryProvider.php (outdated, resolved)
lib/TaskProcessing/SummaryProvider.php (outdated, resolved)
lib/Service/OpenAiSettingsService.php (outdated, resolved)
edward-ly (Contributor, Author):

> The admin setting input could be moved to the "Text generation" section. Wdyt?

Maybe the chunk size could also be considered as a usage limit, but since we only use it on the summarize task, I think it would make sense to move it there too.

marcelklehr (Member) left a comment:

The code looks really good. The algorithm deviates a little from what llm2 does, though: in llm2, the summaries of the chunks are concatenated and fed through the same algorithm again (i.e. chunked and summarized) until there is only one summary left.
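
For reference, a rough sketch of a loop along those lines, using a hypothetical splitIntoChunks() helper and a per-chunk summarizer callable; this is an illustration only, not the actual llm2 or SummaryProvider code:

```php
<?php
// Sketch only: hypothetical helpers, not the merged implementation.

/** Split $text into pieces of at most $chunkSize characters. */
function splitIntoChunks(string $text, int $chunkSize): array {
	return $text === '' ? [''] : mb_str_split($text, $chunkSize);
}

/**
 * Summarize each chunk (one LLM call per chunk via $summarizeChunk),
 * concatenate the partial summaries, and feed the result through the same
 * chunk-and-summarize pass again until only one chunk remains.
 */
function summarize(string $text, int $chunkSize, callable $summarizeChunk): string {
	$chunks = splitIntoChunks($text, $chunkSize);
	while (count($chunks) > 1) {
		$summaries = array_map($summarizeChunk, $chunks);
		$text = implode("\n\n", $summaries);
		$chunks = splitIntoChunks($text, $chunkSize);
	}
	return $text;
}
```

Note that with a top-tested loop like this, an input that already fits into one chunk bypasses the model entirely; that edge case comes up again below.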

marcelklehr (Member) left a comment:

Oops, clicked the wrong button.

edward-ly (Contributor, Author):

> The code looks really good. The algorithm deviates a little from what llm2 does, though: in llm2, the summaries of the chunks are concatenated and fed through the same algorithm again (i.e. chunked and summarized) until there is only one summary left.

Now that you mention it, I do see that loop in llm2 now; fixed.

marcelklehr (Member):

I'm about to change the algorithm in llm2 slightly, so we should change it here as well one last time: when the input is shorter than the chunk size, it should still go through the LLM summarizer once.
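
In code terms, that adjustment roughly amounts to testing the loop condition at the bottom rather than the top, so the input always makes at least one trip through the model. A sketch, reusing the hypothetical splitIntoChunks() helper from the snippet above:

```php
<?php
// Sketch only: the input is summarized at least once, even when it
// already fits into a single chunk.
function summarizeAtLeastOnce(string $text, int $chunkSize, callable $summarizeChunk): string {
	do {
		$chunks = splitIntoChunks($text, $chunkSize);
		$summaries = array_map($summarizeChunk, $chunks);
		// Concatenate the partial summaries; if they still span more than
		// one chunk, run them through the same pass again.
		$text = implode("\n\n", $summaries);
	} while (count($chunks) > 1);
	return $text;
}
```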

julien-nc (Member) left a comment:

👍 A few adjustments and good to go!

lib/TaskProcessing/SummaryProvider.php (outdated, resolved)
src/components/AdminSettings.vue (outdated, resolved)
src/components/AdminSettings.vue (outdated, resolved)
lib/TaskProcessing/SummaryProvider.php (resolved)
lib/TaskProcessing/SummaryProvider.php (outdated, resolved)
lib/TaskProcessing/SummaryProvider.php (outdated, resolved)
github-actions bot commented on Dec 4, 2024:

Hello there,
Thank you so much for taking the time and effort to create a pull request to our Nextcloud project.

We hope that the review process is going smoothly and is helpful for you. We want to ensure your pull request is reviewed to your satisfaction. If you have a moment, our community management team would very much appreciate your feedback on your experience with this PR review process.

Your feedback is valuable to us as we continuously strive to improve our community developer experience. Please take a moment to complete our short survey by clicking on the following link: https://cloud.nextcloud.com/apps/forms/s/i9Ago4EQRZ7TWxjfmeEpPkf6

Thank you for contributing to Nextcloud and we hope to hear from you soon!

(If you believe you should not receive this message, you can add yourself to the blocklist.)

edward-ly force-pushed the feat/chunk-size branch 2 times, most recently from d9f8184 to 344c92c, on December 5, 2024.
marcelklehr (Member):

Looks good and works in my tests. It seems we lost multilingual support, though: when summarizing a German text that is longer than the chunk size, the resulting summary was in English in my test.

edward-ly (Contributor, Author):

> Looks good and works in my tests. It seems we lost multilingual support, though: when summarizing a German text that is longer than the chunk size, the resulting summary was in English in my test.

In my testing, summarizing English text produced output in either Spanish or Italian, hence the change in prompt. If you want me to revert that change for now, we can do that.

marcelklehr (Member):

> In my testing, summarizing English text produced output in either Spanish or Italian

Oh, that's not intended, of course :/ Ideally we should find a prompt that works for all languages and produces the summary in the same language as the input.
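
The prompt that was eventually merged is not shown in this thread; purely as an illustration, an instruction along these lines is one common way to tie the output language to the input language (hypothetical wording, not the PR's actual prompt):

```php
<?php
// Illustrative only: NOT the prompt merged in this PR.
$systemPrompt = 'Summarize the following text in a few sentences. '
	. 'Write the summary in the same language as the original text. '
	. 'Output only the summary, with no introduction or explanation.';

// Each chunk is then sent together with this instruction, e.g. as
// chat messages for an OpenAI-style chat completion endpoint.
$chunk = '…'; // the current chunk of input text
$messages = [
	['role' => 'system', 'content' => $systemPrompt],
	['role' => 'user', 'content' => $chunk],
];
```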

edward-ly (Contributor, Author):

Changed the prompt again after some basic prompt engineering and experimenting. Hopefully, this will produce more reliable results.

edward-ly force-pushed the feat/chunk-size branch 2 times, most recently from 31940fb to 79189c6, on December 7, 2024.
marcelklehr (Member):

Thanks for these adjustments, works well for me now!

edward-ly requested a review from julien-nc on December 9, 2024.
edward-ly (Contributor, Author):

Rebased and adjusted after #167.

edward-ly merged commit 15b93f9 into main on Dec 19, 2024 (7 checks passed).
marcelklehr (Member):

Woop woop 🎉
Thanks @edward-ly !

edward-ly deleted the feat/chunk-size branch on December 20, 2024.
julien-nc mentioned this pull request on Jan 7, 2025.

Linked issue closed by this pull request: Chunking for summarize (#150)