limit num_proc when saving datasets to disk #2948
Walkthrough
The change updates the logic for determining the number of parallel processes used when saving a preprocessed dataset. Instead of always using the configured number of workers, the number of processes is now capped at one eighth of the dataset size or the number of workers, whichever is smaller, with a minimum of one process. A new utility function was introduced to centralize default process count determination from environment variables. Additionally, multiple test configurations across various test files were updated to include a new …
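The capping rule described in the walkthrough can be sketched as follows. This is a minimal illustration rather than the code from the PR: the function name `capped_num_proc` and the parameter `min_rows_per_proc` are hypothetical, and the value 8 is taken from the "one eighth of the dataset size" wording.

```python
def capped_num_proc(num_rows: int, num_workers: int, min_rows_per_proc: int = 8) -> int:
    """Return a process count for saving a dataset of `num_rows` rows.

    Hypothetical helper: the count is the smaller of the configured worker
    count and num_rows // min_rows_per_proc, and never drops below 1.
    """
    # Upper bound implied by requiring at least `min_rows_per_proc` rows per process.
    by_size = num_rows // min_rows_per_proc
    # Take the smaller of the two limits, then enforce the floor of one process.
    return max(1, min(num_workers, by_size))


if __name__ == "__main__":
    # A tiny dataset falls back to a single process.
    print(capped_num_proc(num_rows=3, num_workers=16))
    # A large dataset keeps the configured worker count.
    print(capped_num_proc(num_rows=1000, num_workers=4))
```

Under this sketch, the result would presumably be passed as the `num_proc` argument when the dataset is saved, so a dataset with fewer rows than workers is written by a single process.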
…t least 8 rows per worker to save
Documentation Preview: https://687d9823e8193750b14d9c3a--resonant-treacle-0fd729.netlify.app (deployed on Netlify from commit 36e95ba)
Description
#2918 has an edge case: when the dataset has fewer rows than the number of processes used for saving, the save fails with an indexing error.