If you're working with GPUs of compute capability < 8.0, you can skip the flash-attn installation.
pip install -U git+https://github.com/yaya-sy/lillama.git flash-attn
Here is an example of how you can compress Phi-2 3B by 20%.
HF_DATASETS_TRUST_REMOTE_CODE=True lillama-sharegpt \
--tokenizer microsoft/phi-2 \
--dataset Open-Orca/SlimOrca \
--subset 13_000_00 \
--output-folder distillation-data
This will prepare SlimOrca (ShareGPT format) for distillation.
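If you want to sanity-check the prepared data before distilling, here is a minimal sketch. It assumes lillama-sharegpt writes a Hugging Face datasets folder (an assumption; adjust if the on-disk format differs):
# Sketch: inspect the prepared distillation data.
# Assumption: the output folder was saved with datasets.save_to_disk.
from datasets import load_from_disk

data = load_from_disk("distillation-data")
print(data)  # splits/columns and number of rows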
HF_DATASETS_TRUST_REMOTE_CODE=True lillama-distill \
--llm microsoft/phi-2 \
--train-data distillation-data/ \
--output-folder distilled-phi2/ \
--batch-size 8 \
--log-interval 256
For big models (for example Mixtral-47B), you should use the --no-evaluate argument so that the whole model is not loaded onto the GPU.
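For example (a sketch reusing the flags shown above; the Mixtral model id, output folder, and batch size are illustrative):
HF_DATASETS_TRUST_REMOTE_CODE=True lillama-distill \
--llm mistralai/Mixtral-8x7B-v0.1 \
--train-data distillation-data/ \
--output-folder distilled-mixtral/ \
--batch-size 1 \
--no-evaluate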
The distilled weights will be saved in distilled-phi2/checkpoints.
You can evaluate the compressed model as follows:
HF_DATASETS_TRUST_REMOTE_CODE=True python -m lillama.evaluation.lm_eval \
--llm microsoft/phi-2 \
--distill-path distilled-phi2/checkpoints/ \
--output-folder distilled-phi2-eval/
This will save two .json files: full_results_0_shot.json contains the detailed results, while 0_shot_task_results.json contains the summarized evaluation results.
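To take a quick look at the summarized scores, you can load the JSON with the standard library (a minimal sketch; it assumes the file is written directly under the --output-folder):
import json

# Assumption: the summary file sits directly in the evaluation output folder.
with open("distilled-phi2-eval/0_shot_task_results.json") as f:
    summary = json.load(f)
print(json.dumps(summary, indent=2))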
You can also manually inspect the model's generations using Hugging Face Transformers:
from lillama.utils import load_lr_llm
import torch
from transformers import pipeline
from transformers import AutoTokenizer
# load the model
_, lr_llm, _ = load_lr_llm(checkpoint="microsoft/phi-2",
distill_path="distilled-phi2/checkpoints")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
print(f"Number of parameters of the low-rank LLM: {lr_llm.num_parameters():,}")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# use text-generation pipeline from Huggingface Transformers
lr_pipe = pipeline("text-generation", model=lr_llm, do_sample=True, tokenizer=tokenizer, temperature=0.3, device=device)
output = lr_pipe("What is the cause of the Civil War? Here is the story:",
max_new_tokens=256,
min_new_tokens=32,
top_p=0.9,
top_k=10)[0]["generated_text"]
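# finally, print the generated text to inspect it
print(output)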
At the moment, the model can only be loaded with lillama. To share the model and use it independently, you have to manually modify the model file.
First, save the compressed model as a Hugging Face model:
from lillama.utils import save_lr_llm
save_lr_llm(distill_path="distilled-phi2/checkpoints",
checkpoint="microsoft/phi-2",
output_path="hf_compressed_model")
This will save the compressed model and its config.
Then you have to modify the model file by replacing the linear layers torch.nn.Linear(input_features, output_features) with the low-rank ones: torch.nn.Sequential(torch.nn.Linear(input_features, rank), torch.nn.Linear(rank, output_features)). I haven't automated this, but I've done it for Mixtral, so you can use it as a template: https://huggingface.co/yaya-sy/minixtral/blob/main/modeling_mixtral.py
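As a minimal, self-contained sketch of that replacement (the layer sizes and rank below are illustrative, not the ones chosen by lillama), factoring one linear layer into two smaller ones looks like this and cuts the parameter count:
import torch

# Illustrative sizes; the actual dimensions and ranks come from the distilled checkpoints.
input_features, output_features, rank = 2560, 10240, 1024

dense = torch.nn.Linear(input_features, output_features)
low_rank = torch.nn.Sequential(
    torch.nn.Linear(input_features, rank),
    torch.nn.Linear(rank, output_features),
)

n_dense = sum(p.numel() for p in dense.parameters())
n_low_rank = sum(p.numel() for p in low_rank.parameters())
print(f"dense: {n_dense:,} params, low-rank: {n_low_rank:,} params")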
You also have to modify the config.json for the Transformers auto_map. Please see how I achieved this here: https://huggingface.co/yaya-sy/minixtral/blob/main/config.json
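As a rough sketch of that edit (the module and class names below are placeholders; use the ones from your own modified model file), you can patch the config programmatically:
import json

# Placeholder names: point auto_map at the modified modeling file shipped with the weights.
with open("hf_compressed_model/config.json") as f:
    config = json.load(f)

config["auto_map"] = {
    "AutoModelForCausalLM": "modeling_mixtral.MixtralForCausalLM",
}

with open("hf_compressed_model/config.json", "w") as f:
    json.dump(config, f, indent=2)
Provided the modified modeling file sits in the same folder, the model should then be loadable without lillama via AutoModelForCausalLM.from_pretrained("hf_compressed_model", trust_remote_code=True).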