Chunkify: A Python Script for Text Processing with Large Language Models

Overview

Chunkify began as a proof of concept for a chunking method that does not rely on a tokenizer. The text processing features were added because they are commonly needed. Aya Expanse is particularly good at these tasks; if you launch Chunkify with the batch file, the model is downloaded automatically on the first run.

Key Features

Document Chunking: Divides large documents into manageable chunks, intelligently identifying breaks based on chapters, headings, paragraphs, or sentences.
Automatic Template Selection: Adapts to the loaded model, ensuring the correct instruction template is used.
Real-time Monitoring: Provides continuous feedback on the generation process, allowing users to track progress.
Document Format Support: Reads multiple document formats, including PDF and HTML.
Multiple Processing Modes:
- Summary: Generates concise summaries of the content.
- Translation: Translates text into a language you specify (English by default).
- Distillation: Rewrites content for conciseness while retaining key information.
- Correction: Fixes grammar, spelling, and style issues.
Custom Instruction Templates: Enables users to create and utilize their own templates for specific models.
File Output Support: Saves results to specified output files.
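
The chunking idea described above (prefer chapter or heading breaks, then paragraphs, then sentences, all without a tokenizer) can be illustrated with a minimal sketch. This is not the code in chunker_regex.py; the patterns, the character-based limit, and the function names are simplified assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical, simplified break patterns, ordered from strongest to weakest.
BREAK_PATTERNS = [
    re.compile(r"\n(?=#{1,6} |Chapter \d+)"),  # newline before a heading or chapter
    re.compile(r"\n\s*\n"),                    # blank line between paragraphs
    re.compile(r"(?<=[.!?])\s+"),              # whitespace after a sentence end
]


def find_break(text: str, limit: int) -> int:
    """Return the cut index so that text[:cut] is at most `limit` characters."""
    if len(text) <= limit:
        return len(text)
    window = text[:limit]
    for pattern in BREAK_PATTERNS:
        matches = list(pattern.finditer(window))
        # Use the last match of the strongest pattern that occurs in the window.
        if matches and matches[-1].end() > 0:
            return matches[-1].end()
    return limit  # no natural break found, so cut at the limit


def chunk_text(text: str, limit: int) -> List[str]:
    """Split text into consecutive chunks of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    while text:
        cut = find_break(text, limit)
        chunks.append(text[:cut])
        text = text[cut:]
    return chunks
```

Because each chunk is a plain slice of the input, joining the chunks reproduces the original text exactly. A real chunker would likely size chunks against the model's context rather than a fixed character count.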

Requirements

Python 3.8 or later
KoboldCpp executable in the script directory
Essential Python packages:
- requests
- dataclasses (part of the standard library since Python 3.7, so no separate install is needed)
- extractous (for text extraction)
- PyQt5 (for GUI)

Installation

Windows Installation:

Clone the repository or download the ZIP file from GitHub.
Install Python 3.8 or later if not already present.
Download KoboldCPP.exe from the KoboldCPP releases page and place it in the project folder.
Run chunkify-run.bat. This script will install necessary dependencies and download the Aya Expanse 8b Q6_K model.
Upon completion, you should see a message: "Please connect to custom endpoint at http://localhost:5001".

macOS Installation:

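Follow the Windows installation steps, using the appropriate KoboldCPP binary for macOS. Because chunkify-run.bat is Windows-only, launch the GUI with python3 chunkify-gui.py as described under Usage.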

Linux Installation:

Similar to Windows, clone the repository, install Python 3.8 or later, and download the Linux KoboldCPP binary from the releases page.
Run the script using: ./chunkify-run.sh.

Usage

GUI Launch:
- Windows: Run chunkify-run.bat.
- macOS/Linux: Execute python3 chunkify-gui.py.
Ensure KoboldCPP is running and displaying the message: "Please connect to custom endpoint at http://localhost:5001".
Configure settings and API details through the GUI or a configuration JSON file.
Click "Process" to initiate the text processing task.
Monitor progress in the GUI's output area.
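
Before clicking "Process", it can help to confirm that the endpoint is actually reachable. The sketch below is one way to do that with only the standard library. It assumes the server exposes a KoboldAI-style GET /api/v1/model route that returns a JSON object with a "result" field; the function names are hypothetical and are not part of Chunkify.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def build_url(api_url: str, path: str) -> str:
    """Join the base API URL and a path without doubling slashes."""
    return api_url.rstrip("/") + "/" + path.lstrip("/")


def get_loaded_model(api_url: str = "http://localhost:5001",
                     timeout: float = 5.0) -> Optional[str]:
    """Return the model name reported by the server, or None if unreachable.

    Assumption: GET /api/v1/model responds with {"result": "<model name>"}.
    """
    url = build_url(api_url, "/api/v1/model")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # connection refused, timeout, or a non-JSON reply
    return data.get("result") if isinstance(data, dict) else None


if __name__ == "__main__":
    model = get_loaded_model()
    print(f"Loaded model: {model}" if model else "Endpoint not reachable")
```

If the check returns nothing, start KoboldCPP and wait for the connection message before retrying.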

Configuration

Configuration can be managed through:

Command-line arguments
chunkify_config.json file
GUI settings

Configuration File Format (JSON)

{
  "templates_directory": "./templates",
  "api_url": "http://localhost:5001",
  "api_password": "",
  "temperature": 0.2,
  "repetition_penalty": 1.0,
  "top_k": 0,
  "top_p": 1.0,
  "min_p": 0.02,
  "selected_template": "Auto",
  "translation_language": "English"
}
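
As an illustration of how such a file might be consumed, the sketch below mirrors the keys above in a dataclass and falls back to the listed values when a key or the file itself is missing. The class and function names are hypothetical; Chunkify's own loader may behave differently, for example in how it treats unknown keys.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any, Dict


@dataclass
class ChunkifyConfig:
    """Defaults copied from the example configuration file."""
    templates_directory: str = "./templates"
    api_url: str = "http://localhost:5001"
    api_password: str = ""
    temperature: float = 0.2
    repetition_penalty: float = 1.0
    top_k: int = 0
    top_p: float = 1.0
    min_p: float = 0.02
    selected_template: str = "Auto"
    translation_language: str = "English"


def config_from_dict(raw: Dict[str, Any]) -> ChunkifyConfig:
    """Build a config from a dict, ignoring keys the dataclass does not define."""
    known = {f.name for f in fields(ChunkifyConfig)}
    return ChunkifyConfig(**{k: v for k, v in raw.items() if k in known})


def load_config(path: str = "chunkify_config.json") -> ChunkifyConfig:
    """Read the JSON file if it exists; otherwise return the defaults."""
    file = Path(path)
    if not file.exists():
        return ChunkifyConfig()
    return config_from_dict(json.loads(file.read_text(encoding="utf-8")))
```

Keeping the defaults in one place means a partial file, such as one that only sets "translation_language", still yields a complete configuration.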

Command-Line Usage

python chunkify.py --content input.txt --task summary

or with a configuration file:

python chunkify.py --config config.json --content input.txt --task translate
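
To process several files in a row, the same commands can be driven from a small wrapper. The sketch below only uses the flags shown above (--content, --task, --config); the helper names are hypothetical, and it assumes chunkify.py is in the current working directory.

```python
import subprocess
import sys
from typing import Iterable, List, Optional


def build_command(content: str, task: str,
                  config: Optional[str] = None) -> List[str]:
    """Assemble the argument list for one chunkify.py run."""
    command = [sys.executable, "chunkify.py"]
    if config:
        command += ["--config", config]
    command += ["--content", content, "--task", task]
    return command


def run_batch(files: Iterable[str], task: str,
              config: Optional[str] = None) -> None:
    """Run chunkify.py once per file, stopping at the first failure."""
    for path in files:
        subprocess.run(build_command(path, task, config), check=True)


if __name__ == "__main__":
    run_batch(["input.txt"], task="summary")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.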

Output Format

When using the --file option, the script generates a Markdown-formatted output file containing:

Document metadata
Task-specific results

By default, the command-line script writes to output.txt in the script directory. The GUI instead saves files with a '_processed' suffix added to the name.

Template System

Templates are used to format LLM instructions. They are JSON files with a specific structure:

{
  "name": ["template_name"],
  "akas": ["template_alias"],
  "system_start": "### System:",
  "system_end": "\n",
  "user_start": "### Human:",
  "user_end": "\n",
  "assistant_start": "### Assistant:",
  "assistant_end": "\n"
}

Templates are located in the templates subdirectory by default.
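
A rough sketch of how a template like the one above could be applied is shown below. It assumes the fields are concatenated directly around the system and user text, that "assistant_end" is appended only after the model's reply, and that a template is matched to a model by checking whether any "name" or "akas" entry appears in the model name. These are illustrative assumptions, not a description of Chunkify's actual logic.

```python
from typing import Any, Dict, List, Optional

# The example template from this section.
EXAMPLE_TEMPLATE: Dict[str, Any] = {
    "name": ["template_name"],
    "akas": ["template_alias"],
    "system_start": "### System:",
    "system_end": "\n",
    "user_start": "### Human:",
    "user_end": "\n",
    "assistant_start": "### Assistant:",
    "assistant_end": "\n",
}


def build_prompt(template: Dict[str, Any], system: str, user: str) -> str:
    """Wrap system and user text in the template's markers.

    The prompt ends with assistant_start so the model continues from there.
    """
    return (
        f"{template['system_start']}{system}{template['system_end']}"
        f"{template['user_start']}{user}{template['user_end']}"
        f"{template['assistant_start']}"
    )


def select_template(model_name: str,
                    templates: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first template whose name or alias occurs in the model name."""
    lowered = model_name.lower()
    for template in templates:
        candidates = list(template.get("name", [])) + list(template.get("akas", []))
        if any(candidate.lower() in lowered for candidate in candidates):
            return template
    return None
```

With the example template, build_prompt(EXAMPLE_TEMPLATE, "Be brief.", "Summarize this.") yields three marker-prefixed segments ending in "### Assistant:".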

Limitations

Context length is model-dependent.
Chunking and generation length are set to half the context size.
Speed varies based on API response time.
Template formatting must match LLM expectations.
For good performance, use a GPU with 8 GB of VRAM or a powerful CPU with 16 GB of RAM.
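
The half-context rule above is simple arithmetic, sketched here. The sketch assumes the context size is measured in tokens and that odd sizes are rounded down; the names are hypothetical.

```python
from typing import NamedTuple


class Budget(NamedTuple):
    chunk_tokens: int       # maximum size of one input chunk
    generation_tokens: int  # maximum length of the generated output


def split_context(max_context: int) -> Budget:
    """Give half of the context to the input chunk and half to generation."""
    if max_context < 2:
        raise ValueError("max_context must be at least 2")
    half = max_context // 2
    return Budget(chunk_tokens=half, generation_tokens=half)


if __name__ == "__main__":
    # A model with an 8192-token context gets 4096-token chunks
    # and up to 4096 tokens of output per chunk.
    print(split_context(8192))
```

Splitting the context this way means a chunk and its full-length output can fit in the context together, which matters most for translation and correction, where the output is about as long as the input.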

Contribution and License

Feel free to contribute and submit issues or pull requests. The script is licensed under the MIT license.

