
Dolly Prompt Compression Pipeline

This repository captures our proof-of-concept work to compress Dolly-15k prompts, analyse the quality of the synthetic data, and prepare fine-tuning corpora aimed at short instructions. The project now ships with a tidy set of notebooks under src/ plus a small helper module (src/workflows/generation.py) so the workflow can be reproduced or extended without digging through exploratory scratchpads.

Project Flow

| Step | Notebook | Purpose | Main Outputs |
| --- | --- | --- | --- |
| 1 | src/initial_synthetic_data_generation.ipynb | Stream Dolly-15k prompts through gpt-5-nano using the concise compression prompt from our original run. The helper handles batching, retry logic, token counting, and resumable writes. | src/training_data/dolly-prompt-compression.csv |
| 2 | src/initial_synthetic_data_generation_v2.ipynb | Preserve the experimental second generation that produced two variants plus a compressed prompt. Results were noisier but remain available for future experimentation. | src/training_data/dolly-prompt-compression-v2.csv |
| 3 | src/small_prompts_data_creation.ipynb | Filter the synthetic dataset to ≤128- and ≤64-token prompts, apply light post-processing (article and punctuation trims), and materialise reproducible train/test splits (a curation sketch follows the table). | dolly-short-prompt-compression.csv, dolly-very-short-prompt-compression.csv, dsp-*.csv, dvsp-*.csv |
| 4 | src/small_prompts_fine_tuning.ipynb | Minimal Hugging Face Seq2SeqTrainer setup used when fine-tuning on Colab. Configure it in place, swap in the short or very-short dataset, and run training on a GPU runtime. | Fine-tuned models (e.g. dotslashderek/small-prompt-compression) |
| 5 | src/evaluations.ipynb | Consolidated metrics: compression ratio, ROUGE overlap, and token-length distributions. Use these outputs when communicating the motivation for short-prompt models. | Console summaries for README / reports |
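The curation in step 3 boils down to token-length filtering plus a reproducible split. The sketch below is illustrative only: the tokenizer, the column names (original_prompt / compressed_prompt), and the split settings are assumptions, and the article/punctuation trims are omitted; small_prompts_data_creation.ipynb holds the actual implementation.

```python
# Illustrative sketch of the step-3 curation pass. The tokenizer, column
# names, and split settings are assumptions; small_prompts_data_creation.ipynb
# contains the real code, including the light post-processing trims.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed tokenizer

df = pd.read_csv("src/training_data/dolly-prompt-compression.csv")
df["prompt_tokens"] = df["original_prompt"].map(lambda t: len(tokenizer.encode(str(t))))

short = df[df["prompt_tokens"] <= 128]       # -> dolly-short-prompt-compression.csv
very_short = df[df["prompt_tokens"] <= 64]   # -> dolly-very-short-prompt-compression.csv

# Reproducible train/test splits (the dsp-* / dvsp-* files).
dsp_train, dsp_test = train_test_split(short, test_size=0.1, random_state=42)
dsp_train.to_csv("dsp-train.csv", index=False)
dsp_test.to_csv("dsp-test.csv", index=False)
```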

The shared utilities that power both generation notebooks live in src/workflows/generation.py (a usage sketch follows the list below). They provide:

  • primary_config(...) / variant_config(...) – configuration builders for the two generations.
  • run_generation(...) – resumable batching with automatic retries and progress writes.
  • summarize_dataset(...) – lightweight sanity check after each run.
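
The exact signatures live in the module itself; the sketch below only conveys the call pattern, and every argument name shown here is an assumption rather than the module's documented API.

```python
# Rough call pattern for the generation helpers. Argument names are
# assumptions; consult src/workflows/generation.py for the real signatures.
# Run from src/ so that the workflows package is importable.
from workflows.generation import primary_config, run_generation, summarize_dataset

config = primary_config(
    model="gpt-5-nano",                                        # model used in the first pass
    output_path="training_data/dolly-prompt-compression.csv",  # resumable output file
    batch_size=25,                                             # assumed batch size
)

# Resumable batching with automatic retries; an interrupted run picks up
# from the rows already written to the output CSV.
run_generation(config)

# Lightweight sanity check after the run (row counts, token totals, ratio).
summarize_dataset("training_data/dolly-prompt-compression.csv")
```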

Key Numbers

With the latest synthetic dataset in place:

  • Full synthetic set (dolly-prompt-compression.csv) – 14,779 rows, 236,923 → 177,070 tokens (ratio 0.7474). Average ROUGE scores against the originals: ROUGE-1 0.722 / ROUGE-2 0.521 / ROUGE-L 0.675.
    Token distribution shows the skew toward short prompts:

    • 1–16 tokens: 10,807 (73.1 %)
    • 17–32 tokens: 2,934 (19.9 %)
    • 33–48 tokens: 663 (4.5 %)
  • Short subset (≤128 tokens) – 14,739 usable rows, 247,170 → 176,591 tokens (ratio 0.7145).

    • 1–16 tokens: 9,766 (66.3 %)
    • 17–32 tokens: 3,641 (24.7 %)
  • Very-short subset (≤64 tokens) – 14,514 rows, 228,794 → 162,432 tokens (ratio 0.7099).

    • 1–16 tokens: 9,766 (67.3 %)
    • 17–32 tokens: 3,641 (25.1 %)

These distributions reinforced the decision to concentrate fine-tuning experiments on instruction prompts under ~32 tokens.
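
The headline numbers above (row counts, token totals, compression ratio, bucketed length distribution) are cheap to recompute. The sketch below is a minimal version, assuming hypothetical column names original_prompt / compressed_prompt and a T5 tokenizer for counting; evaluations.ipynb contains the real computation.

```python
# Sketch of the dataset-level numbers reported above. Column names and the
# tokenizer are assumptions; evaluations.ipynb holds the actual code.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def n_tokens(text):
    return len(tokenizer.encode(str(text)))

df = pd.read_csv("src/training_data/dolly-prompt-compression.csv")
original = df["original_prompt"].map(n_tokens)
compressed = df["compressed_prompt"].map(n_tokens)

print("rows:", len(df))
print("tokens:", original.sum(), "->", compressed.sum())
print("compression ratio:", round(compressed.sum() / original.sum(), 4))

# Bucketed length distribution of the original prompts (1-16, 17-32, ...).
buckets = pd.cut(original, bins=[0, 16, 32, 48, 64, 128, float("inf")])
print(buckets.value_counts(normalize=True).sort_index())
```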

Running the Workflow

  1. Set credentials: export OPENAI_API_KEY before running any generation notebook.
  2. Generate synthetic data: run initial_synthetic_data_generation.ipynb. The notebook downloads Dolly-15k (once), resumes from previous runs, and writes to src/training_data/dolly-prompt-compression.csv.
  3. Optional variant pass: run initial_synthetic_data_generation_v2.ipynb if you want the variant-heavy dataset for comparison.
  4. Curate small prompt corpora: execute small_prompts_data_creation.ipynb to populate the filtered datasets and train/test splits.
  5. Fine-tune (optional): open small_prompts_fine_tuning.ipynb in Colab or another GPU environment, pick the subset you want (use_very_short flag), and run the trainer; a configuration sketch follows this list.
  6. Review metrics: evaluations.ipynb reports the compression ratios, ROUGE overlap, and token histograms that informed our modelling choices.
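
For step 5, the sketch below illustrates the general shape of a minimal Seq2SeqTrainer run. The t5-small base checkpoint, the dsp-test.csv file name, the column names, the "compress:" task prefix, and the hyperparameters are all assumptions, not the notebook's exact configuration.

```python
# Illustrative Seq2SeqTrainer setup. Checkpoint, file names, column names,
# task prefix, and hyperparameters are assumptions; see
# small_prompts_fine_tuning.ipynb for the configuration actually used on Colab.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# File names assumed from the dsp-* pattern.
data = load_dataset("csv", data_files={"train": "dsp-train.csv", "test": "dsp-test.csv"})

def preprocess(batch):
    inputs = tokenizer(
        ["compress: " + p for p in batch["original_prompt"]],  # assumed prefix
        max_length=128,
        truncation=True,
    )
    labels = tokenizer(text_target=batch["compressed_prompt"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="small-prompt-compression",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # older transformers versions call this evaluation_strategy
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```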

Fine-Tuning Results (Colab)

Below are the results from running our fine-tuning workflow in Google Colab:

📊 Baseline (no FT) — ROUGE & compression: {'rouge1': 0.6408, 'rouge2': 0.4432, 'rougeL': 0.5992, 'rougeLsum': 0.5991, 'comp_ratio_mean': 0.4979, 'comp_ratio_p90': 0.84, 'pct_violations': 0.0}

| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Comp Ratio Mean | Comp Ratio P90 | Pct Violations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.695400 | 2.268063 | 0.787342 | 0.600898 | 0.745362 | 0.745457 | 0.872906 | 1.000000 | 0.018321 |
| 2 | 2.342700 | 2.186183 | 0.795900 | 0.616433 | 0.756211 | 0.756367 | 0.867175 | 1.000000 | 0.010179 |
| 3 | 2.282200 | 2.167105 | 0.801126 | 0.623248 | 0.762223 | 0.762250 | 0.862491 | 1.000000 | 0.008299 |
  • The zero-shot baseline already compresses aggressively (mean ratio ≈ 0.50) but overlaps less with the reference compressions (ROUGE-1 ≈ 0.64).
  • After 3 epochs, validation ROUGE improves steadily (ROUGE-1 ≈ 0.80) and the violation rate falls below 1%; the mean compression ratio settles near 0.86, trading some brevity for closer fidelity to the reference targets.
  • These results support the effectiveness of the short-prompt fine-tuning pipeline. A hedged sketch of how such metrics might be computed follows.
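
The metric names in the table map onto a small evaluation function. The sketch below is a hedged reconstruction only: ROUGE via the evaluate library, compression ratio as generated tokens over input tokens, and pct_violations guessed to mean outputs at least as long as their inputs; the notebooks hold the actual definitions.

```python
# Sketch of a metrics function in the spirit of the table above. The
# "violation" definition below (output at least as long as its input) is a
# guess at the notebook's definition, not a confirmed one.
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compression_metrics(predictions, inputs, references, tokenizer):
    scores = rouge.compute(predictions=predictions, references=references)
    ratios = np.array([
        len(tokenizer.encode(pred)) / max(len(tokenizer.encode(src)), 1)
        for pred, src in zip(predictions, inputs)
    ])
    scores["comp_ratio_mean"] = float(ratios.mean())
    scores["comp_ratio_p90"] = float(np.percentile(ratios, 90))
    scores["pct_violations"] = float((ratios >= 1.0).mean())  # assumed definition
    return scores
```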

Repository Layout Highlights

src/
  workflows/
    generation.py        # shared generation helpers
  initial_synthetic_data_generation.ipynb
  initial_synthetic_data_generation_v2.ipynb
  small_prompts_data_creation.ipynb
  small_prompts_fine_tuning.ipynb
  evaluations.ipynb
src/training_data/       # cleaned + derived datasets
training_data/           # raw and intermediate CSVs from the original POC

Older exploratory notebooks (e.g. build_better_data.ipynb, test_bart_compression.ipynb) are preserved for reference but no longer drive the main workflow.

Future Directions

  • Revisit the variant-heavy generation prompt (see the v2 notebook) once we have more tolerance for stylistic drift.
  • Explore model-specific tokenisation strategies when compressing prompts for non-T5 architectures.
  • Extend the evaluation notebook with task-specific scoring once downstream datasets are identified.

With everything in src/ now self-contained, you can follow the notebooks top to bottom to reproduce the synthetic dataset, carve out small prompt subsets, and fine-tune models tailored to the most common prompt lengths we observed.

Latest Fine-Tuned Model Evaluation

The most recent evaluation of the fine-tuned model (small-prompt-compression-model) on a random sample from dsp-train.csv:

Prompts processed: 200 (max_to_process=200)
Total input tokens: [fill in from your run]
Total generated tokens: [fill in from your run]
Compression ratio (generated/input): [fill in]
ROUGE scores: {'rouge1': [fill in], 'rouge2': [fill in], 'rougeL': [fill in], 'rougeLsum': [fill in]}
Model load time: [fill in] sec
Total generation time: [fill in] sec
Avg generation time per prompt: [fill in] sec

Note: This model has not been quantized or optimized for CPU inference yet. Generation times and throughput will improve with quantization or distillation.

Observations

  • ROUGE scores here are slightly lower than those observed during fine-tuning in Colab. This may be due to differences in evaluation set sampling, Hugging Face’s tokenization, or the lack of GPU acceleration and optimization for CPU inference.
  • The model achieves strong compression ratios, but some prompts are not compressed as aggressively as the reference targets. This is expected, as the model balances brevity with semantic fidelity.
  • For production or large-scale inference, consider quantizing the model or using optimized inference runtimes to improve speed and efficiency; a dynamic-quantization sketch follows.
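
For the CPU case flagged above, one low-effort option is PyTorch dynamic quantization of the linear layers. The sketch below is a generic recipe rather than something the repository ships; the Hugging Face model id and the "compress:" task prefix are assumptions.

```python
# Generic dynamic-quantization recipe for CPU inference; not part of the
# repository yet. The model id and the task prefix are assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "dotslashderek/small-prompt-compression"  # assumed published checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval()

# Quantize only the Linear layers to int8; activations stay in float.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("compress: Summarize the plot of Moby Dick in one sentence.", return_tensors="pt")
output_ids = quantized.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```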
