This repository captures our proof-of-concept work to compress Dolly-15k prompts, analyse the quality of the synthetic data, and prepare fine-tuning corpora aimed at short instructions. The project now ships with a tidy set of notebooks under src/ plus a small helper module (src/workflows/generation.py) so the workflow can be reproduced or extended without digging through exploratory scratchpads.

| Step | Notebook | Purpose | Main Outputs |
|---|---|---|---|
| 1 | `src/initial_synthetic_data_generation.ipynb` | Stream Dolly-15k prompts through gpt-5-nano using the concise compression prompt from our original run. The helper handles batching, retry logic, token counting, and resumable writes. | `src/training_data/dolly-prompt-compression.csv` |
| 2 | `src/initial_synthetic_data_generation_v2.ipynb` | Preserve the experimental second generation that produced two variants plus a compressed prompt. Results were noisier but remain available for future experimentation. | `src/training_data/dolly-prompt-compression-v2.csv` |
| 3 | `src/small_prompts_data_creation.ipynb` | Filter the synthetic dataset to ≤128- and ≤64-token prompts, apply light post-processing (article and punctuation trims), and materialise reproducible train/test splits (a sketch of this step follows the table). | `dolly-short-prompt-compression.csv`, `dolly-very-short-prompt-compression.csv`, `dsp-*.csv`, `dvsp-*.csv` |
| 4 | `src/small_prompts_fine_tuning.ipynb` | Minimal Hugging Face `Seq2SeqTrainer` setup used when fine-tuning on Colab. Configure it in place, swap in the short or very-short dataset, and run training on a GPU runtime. | Fine-tuned models (e.g. `dotslashderek/small-prompt-compression`) |
| 5 | `src/evaluations.ipynb` | Consolidated metrics: compression ratio, ROUGE overlap, and token-length distributions. Use these outputs when communicating the motivation for short-prompt models. | Console summaries for README / reports |
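
As referenced in step 3, here is a rough sketch of the filtering and trimming that notebook performs. The column names (`prompt`, `compressed`), the tokenizer choice, and the split size are assumptions for illustration; `small_prompts_data_creation.ipynb` is the source of truth.

```python
# Illustrative sketch of the step-3 curation; column names and tokenizer are assumptions.
import re
import pandas as pd
import tiktoken
from sklearn.model_selection import train_test_split

enc = tiktoken.get_encoding("cl100k_base")
df = pd.read_csv("src/training_data/dolly-prompt-compression.csv")

# Light post-processing: drop a leading article and trailing punctuation
# from the compressed prompts.
def trim(text: str) -> str:
    text = re.sub(r"^(a|an|the)\s+", "", text.strip(), flags=re.IGNORECASE)
    return text.rstrip(".!?")

df["compressed"] = df["compressed"].astype(str).map(trim)
df["prompt_tokens"] = df["prompt"].map(lambda t: len(enc.encode(str(t))))

# Short (<=128 tokens) and very-short (<=64 tokens) subsets.
short = df[df["prompt_tokens"] <= 128]
very_short = df[df["prompt_tokens"] <= 64]

# Reproducible train/test splits (fixed seed).
dsp_train, dsp_test = train_test_split(short, test_size=0.1, random_state=42)
dsp_train.to_csv("src/training_data/dsp-train.csv", index=False)
dsp_test.to_csv("src/training_data/dsp-test.csv", index=False)
```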
The shared utilities that power both generation notebooks live in `src/workflows/generation.py`. They provide:

- `primary_config(...)` / `variant_config(...)` – configuration builders for the two generations.
- `run_generation(...)` – resumable batching with automatic retries and progress writes.
- `summarize_dataset(...)` – a lightweight sanity check after each run.
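
A minimal usage sketch, assuming the config builder accepts an output path and the runner takes the resulting config (check the module's docstrings for the real signatures):

```python
# Hypothetical usage of the shared helpers; argument names are illustrative.
from workflows.generation import primary_config, run_generation, summarize_dataset

# Build the configuration for the first (concise-compression) generation pass.
config = primary_config(output_path="training_data/dolly-prompt-compression.csv")

# Resumable batching: rerunning the call picks up where the last run stopped.
run_generation(config)

# Lightweight sanity check of the resulting CSV after the run.
summarize_dataset("training_data/dolly-prompt-compression.csv")
```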
With the latest synthetic dataset in place:

- Full synthetic set (`dolly-prompt-compression.csv`) – 14,779 rows, 236,923 → 177,070 tokens (ratio 0.7474). Average ROUGE scores against the originals: ROUGE-1 0.722 / ROUGE-2 0.521 / ROUGE-L 0.675. The token distribution shows the skew toward short prompts:
  - 1–16 tokens: 10,807 (73.1 %)
  - 17–32 tokens: 2,934 (19.9 %)
  - 33–48 tokens: 663 (4.5 %)
- Short subset (≤128 tokens) – 14,739 usable rows, 247,170 → 176,591 tokens (ratio 0.7145).
  - 1–16 tokens: 9,766 (66.3 %)
  - 17–32 tokens: 3,641 (24.7 %)
- Very-short subset (≤64 tokens) – 14,514 rows, 228,794 → 162,432 tokens (ratio 0.7099).
  - 1–16 tokens: 9,766 (67.3 %)
  - 17–32 tokens: 3,641 (25.1 %)

These distributions reinforced the decision to concentrate fine-tuning experiments on instruction prompts under ~32 tokens.
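
For reference, the headline figures above (token totals, compression ratios, ROUGE) can be recomputed along the following lines. The column names and the tokenizer here are assumptions rather than the exact code in `evaluations.ipynb`.

```python
# Recompute dataset-level compression ratio and ROUGE (column names assumed).
import evaluate
import pandas as pd
import tiktoken

df = pd.read_csv("src/training_data/dolly-prompt-compression.csv")
enc = tiktoken.get_encoding("cl100k_base")

orig = df["prompt"].astype(str).tolist()
comp = df["compressed"].astype(str).tolist()

orig_tokens = sum(len(enc.encode(t)) for t in orig)
comp_tokens = sum(len(enc.encode(t)) for t in comp)
print(f"tokens: {orig_tokens} -> {comp_tokens} (ratio {comp_tokens / orig_tokens:.4f})")

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=comp, references=orig))  # rouge1 / rouge2 / rougeL / rougeLsum
```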
- Set credentials: export `OPENAI_API_KEY` before running any generation notebook.
- Generate synthetic data: run `initial_synthetic_data_generation.ipynb`. The notebook downloads Dolly-15k (once), resumes from previous runs, and writes to `src/training_data/dolly-prompt-compression.csv`.
- Optional variant pass: run `initial_synthetic_data_generation_v2.ipynb` if you want the variant-heavy dataset for comparison.
- Curate small prompt corpora: execute `small_prompts_data_creation.ipynb` to populate the filtered datasets and train/test splits.
- Fine-tune (optional): open `small_prompts_fine_tuning.ipynb` in Colab or another GPU environment, pick the subset you want (the `use_very_short` flag; a selection sketch follows this list), and run the trainer.
- Review metrics: `evaluations.ipynb` reports the compression ratios, ROUGE overlap, and token histograms that informed our modelling choices.
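
The subset selection mentioned in the fine-tune step might look like this; the paths and the Hugging Face `datasets` CSV loader are assumptions, and the notebook remains the source of truth.

```python
# Illustrative: how the use_very_short flag might pick between the two corpora.
from datasets import load_dataset

use_very_short = False  # True -> <=64-token corpus (dvsp-*), False -> <=128-token corpus (dsp-*)
prefix = "dvsp" if use_very_short else "dsp"

dataset = load_dataset(
    "csv",
    data_files={
        "train": f"src/training_data/{prefix}-train.csv",
        "test": f"src/training_data/{prefix}-test.csv",
    },
)
print(dataset)
```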
Below are the results from running our fine-tuning workflow in Google Colab:
📊 Baseline (no fine-tuning) ROUGE & compression: `{'rouge1': 0.6408, 'rouge2': 0.4432, 'rougeL': 0.5992, 'rougeLsum': 0.5991, 'comp_ratio_mean': 0.4979, 'comp_ratio_p90': 0.84, 'pct_violations': 0.0}`

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Comp Ratio Mean | Comp Ratio P90 | Pct Violations |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.695400 | 2.268063 | 0.787342 | 0.600898 | 0.745362 | 0.745457 | 0.872906 | 1.000000 | 0.018321 |
| 2 | 2.342700 | 2.186183 | 0.795900 | 0.616433 | 0.756211 | 0.756367 | 0.867175 | 1.000000 | 0.010179 |
| 3 | 2.282200 | 2.167105 | 0.801126 | 0.623248 | 0.762223 | 0.762250 | 0.862491 | 1.000000 | 0.008299 |
- The no-fine-tuning baseline already compresses aggressively (mean ratio ≈ 0.50) but shows only moderate ROUGE overlap with the reference compressions.
- After 3 epochs, validation ROUGE improves markedly and the violation rate drops below 1 %, with the mean compression ratio settling around 0.86.
- These results support the effectiveness of our short-prompt fine-tuning pipeline.
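
The custom columns in the table (comp ratio mean / p90, pct violations) can be produced by a `compute_metrics` hook on the `Seq2SeqTrainer`. The sketch below is an approximation of that hook rather than the exact Colab code: it assumes `predict_with_generate=True`, defines a "violation" as a generation longer than its input, and takes `eval_inputs` as the original prompts aligned with the eval split.

```python
# Hedged sketch of a Seq2SeqTrainer compute_metrics hook producing the columns above.
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def build_compute_metrics(tokenizer, eval_inputs):
    """eval_inputs: original prompts, in the same order as the eval dataset."""
    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        # Replace padding sentinels (-100) so the tokenizer can decode both sides.
        preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        pred_text = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_text = tokenizer.batch_decode(labels, skip_special_tokens=True)

        metrics = rouge.compute(predictions=pred_text, references=label_text)

        # Compression stats: generated length relative to the original prompt length.
        in_lens = np.array([len(tokenizer.encode(t)) for t in eval_inputs])
        out_lens = np.array([len(tokenizer.encode(t)) for t in pred_text])
        ratios = out_lens / np.maximum(in_lens, 1)

        metrics["comp_ratio_mean"] = float(ratios.mean())
        metrics["comp_ratio_p90"] = float(np.percentile(ratios, 90))
        metrics["pct_violations"] = float((ratios > 1.0).mean())  # output longer than input
        return metrics
    return compute_metrics
```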
- `src/`
  - `workflows/generation.py` – shared generation helpers
  - `initial_synthetic_data_generation.ipynb`
  - `initial_synthetic_data_generation_v2.ipynb`
  - `small_prompts_data_creation.ipynb`
  - `small_prompts_fine_tuning.ipynb`
  - `evaluations.ipynb`
  - `training_data/` – cleaned + derived datasets
- `training_data/` – raw and intermediate CSVs from the original POC
Older exploratory notebooks (e.g. `build_better_data.ipynb`, `test_bart_compression.ipynb`) are preserved for reference but no longer drive the main workflow.
- Revisit the variant-heavy generation prompt (see the v2 notebook) once we have more tolerance for stylistic drift.
- Explore model-specific tokenisation strategies when compressing prompts for non-T5 architectures.
- Extend the evaluation notebook with task-specific scoring once downstream datasets are identified.
With everything in src/ now self-contained, you can follow the notebooks top to bottom to reproduce the synthetic dataset, carve out small prompt subsets, and fine-tune models tailored to the most common prompt lengths we observed.
The most recent evaluation of the fine-tuned model (`small-prompt-compression-model`) on a random sample from `dsp-train.csv`:

- Prompts processed: 200 (`max_to_process=200`)
- Total input tokens: [fill in from your run]
- Total generated tokens: [fill in from your run]
- Compression ratio (generated/input): [fill in]
- ROUGE scores: `{'rouge1': [fill in], 'rouge2': [fill in], 'rougeL': [fill in], 'rougeLsum': [fill in]}`
- Model load time: [fill in] sec
- Total generation time: [fill in] sec
- Avg generation time per prompt: [fill in] sec
Note: This model has not been quantized or optimized for CPU inference yet. Generation times and throughput will improve with quantization or distillation.
- ROUGE scores here are slightly lower than those observed during fine-tuning in Colab. This may be due to differences in evaluation set sampling, Hugging Face’s tokenization, or the lack of GPU acceleration and optimization for CPU inference.
- The model achieves strong compression ratios, but some prompts are not compressed as aggressively as the reference targets. This is expected, as the model balances brevity with semantic fidelity.
- For production or large-scale inference, consider quantizing the model or using optimized inference runtimes to improve speed and efficiency.
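
For anyone reproducing this local check, here is a minimal sketch under stated assumptions: the checkpoint name, the CSV column name, and the generation settings are placeholders, and the timings you measure will differ from ours.

```python
# Rough local CPU evaluation sketch; model path and column names are assumptions.
import time
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained("small-prompt-compression-model")
model = AutoModelForSeq2SeqLM.from_pretrained("small-prompt-compression-model")
load_time = time.time() - t0

# Random sample of 200 prompts from the short-prompt training split.
sample = pd.read_csv("src/training_data/dsp-train.csv").sample(200, random_state=0)

in_tokens, out_tokens, gen_time = 0, 0, 0.0
for prompt in sample["prompt"].astype(str):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=64)
    gen_time += time.time() - start
    in_tokens += inputs["input_ids"].shape[1]
    out_tokens += output.shape[1]

print(f"model load time: {load_time:.1f}s")
print(f"compression ratio (generated/input): {out_tokens / in_tokens:.3f}")
print(f"avg generation time per prompt: {gen_time / len(sample):.2f}s")
```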