diff --git a/PARAMETERS.md b/PARAMETERS.md
deleted file mode 100644
index 94d6379897..0000000000
--- a/PARAMETERS.md
+++ /dev/null
@@ -1,87 +0,0 @@
-## LoraConfig Parameters
-
-Adjusting the `LoraConfig` parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:
-
-**r**
-- **Description**: Rank of the low-rank decomposition for factorizing weight matrices.
-- **Impact**:
-  - **Higher**: Retains more information, increases computational load.
-  - **Lower**: Fewer parameters, more efficient training, potential performance drop if too small.
-
-
-**lora_alpha**
-- **Description**: Scaling factor for the low-rank matrices' contribution.
-- **Impact**:
-  - **Higher**: Increases influence, speeds up convergence, risks instability or overfitting.
-  - **Lower**: Subtler effect, may require more training steps.
-
-**lora_dropout**
-- **Description**: Probability of zeroing out elements in low-rank matrices for regularization.
-- **Impact**:
-  - **Higher**: More regularization, prevents overfitting, may slow training and degrade performance.
-  - **Lower**: Less regularization, may speed up training, risks overfitting.
-
-**loftq_config**
-- **Description**: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
-- **Impact**:
-  - **Not None**: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting `init_lora_weights='loftq'`.
-  - **None**: LoftQ quantization is not applied.
-  - **Note**: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself.
-
-
-**use_rslora**
-- **Description**: Enables Rank-Stabilized LoRA (RSLora).
-- **Impact**:
-  - **True**: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to `lora_alpha/math.sqrt(r)`, which has been proven to work better as per the [Rank-Stabilized LoRA paper](https://doi.org/10.48550/arXiv.2312.03732).
-  - **False**: Uses the original default scaling factor `lora_alpha/r`.
-
-**gradient_accumulation_steps**
-- **Default**: 1
-- **Description**: The number of steps to accumulate gradients before performing a backpropagation update.
-- **Impact**:
-  - **Higher**: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
-  - **Lower**: Faster updates but may require more memory per step and can be less stable.
-
-**weight_decay**
-- **Default**: 0.01
-- **Description**: Regularization technique that applies a small penalty to the weights during training.
-- **Impact**:
-  - **Non-zero Value (e.g., 0.01)**: Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
-  - **Zero**: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
-
-**learning_rate**
-- **Default**: 2e-4
-- **Description**: The rate at which the model updates its parameters during training.
-- **Impact**:
-  - **Higher**: Faster convergence but risks overshooting optimal parameters and causing instability in training.
-  - **Lower**: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.
-
-## Target Modules
-
-**q_proj (query projection)**
-- **Description**: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
-- **Impact**: Transforms the input into query vectors that are used to compute attention scores.
-
-**k_proj (key projection)**
-- **Description**: Projects the input into the key space in the attention mechanism.
-- **Impact**: Produces key vectors that are compared with query vectors to determine attention weights.
-
-**v_proj (value projection)**
-- **Description**: Projects the input into the value space in the attention mechanism.
-- **Impact**: Produces value vectors that are weighted by the attention scores and combined to form the output.
-
-**o_proj (output projection)**
-- **Description**: Projects the output of the attention mechanism back into the original space.
-- **Impact**: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
-
-**gate_proj (gate projection)**
-- **Description**: Typically used in gated mechanisms within neural networks, such as gating units in gated recurrent units (GRUs) or other gating mechanisms.
-- **Impact**: Controls the flow of information through the gate, allowing selective information passage based on learned weights.
-
-**up_proj (up projection)**
-- **Description**: Used for up-projection, typically increasing the dimensionality of the input.
-- **Impact**: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities.
-
-**down_proj (down projection)**
-- **Description**: Used for down-projection, typically reducing the dimensionality of the input.
-- **Impact**: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
diff --git a/README.md b/README.md
index 2c50f45763..534079ed49 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and
 - Run [Llama 3 conversational notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) and [Mistral 7B v3 ChatML](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing)
 - This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for continued pretraining / raw text
 - This [continued pretraining notebook](https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing) is for learning another language
-
+- Click [here](https://github.com/unslothai/unsloth/wiki) for detailed documentation for Unsloth.
 
 ## 🦥 Unsloth.ai News
 - 📣 NEW! Continued Pretraining [notebook](https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing) for other languages like Korean!
@@ -76,7 +76,7 @@ model = FastLanguageModel.get_peft_model(
 
 
 ## 🥇 Performance Benchmarking
-- For the full list of **reproducable** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables)
+- For the full list of **reproducible** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables)
 
 | 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥[Unsloth Pro](https://unsloth.ai/pricing) |
 |--------------|--------------|-----------------|---------------------|-----------------|
@@ -100,14 +100,16 @@ model = FastLanguageModel.get_peft_model(
 ### Conda Installation
 Select either `pytorch-cuda=11.8` for CUDA 11.8 or `pytorch-cuda=12.1` for CUDA 12.1. If you have `mamba`, use `mamba` instead of `conda` for faster solving. See this [Github issue](https://github.com/unslothai/unsloth/issues/73) for help on debugging Conda installs.
 ```bash
-conda create --name unsloth_env python=3.10
+conda create --name unsloth_env \
+    python=3.10 \
+    pytorch-cuda=<11.8/12.1> \
+    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
+    -y
 conda activate unsloth_env
 
-conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers
-
 pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
 
-pip install --no-deps trl peft accelerate bitsandbytes
+pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
 ```
 
 ### Pip Installation
@@ -162,7 +164,7 @@ pip install --no-deps packaging ninja einops flash-attn xformers trl peft accele
 
 # Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
 pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
-pip install --no-deps xformers trl peft accelerate bitsandbytes
+pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
 ```
 7. For Pytorch 2.3.0: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
 ```bash
@@ -257,7 +259,7 @@ trainer.train()
 # (1) Saving to GGUF / merging to 16bit for vLLM
 # (2) Continued training from a saved LoRA adapter
 # (3) Adding an evaluation loop / OOMs
-# (4) Cutomized chat templates
+# (4) Customized chat templates
 ```
 
 
diff --git a/unsloth/models/mistral.py b/unsloth/models/mistral.py
index fc2e1a9fb0..ff2e909fb9 100644
--- a/unsloth/models/mistral.py
+++ b/unsloth/models/mistral.py
@@ -512,7 +512,7 @@ def from_pretrained(
         if "n_total_devices >" not in inner_training_loop:
             raise RuntimeError(
                 "Our OSS was designed for people with few GPU resources to level the playing field.\n"
-                "The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
+                "The OSS Apache 2 license only supports one GPU - please obtain a commercial license.\n"
                 "We're a 2 person team, so we still have to fund our development costs - thanks!\n"
                 "If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
             )
@@ -521,6 +521,7 @@ def from_pretrained(
             "is_sagemaker_mp_enabled()",
             "False",
         )
+        exec(inner_training_loop, globals())
         Trainer._inner_training_loop = _fast_inner_training_loop
 
         # Save max_seq_length
@@ -560,6 +561,7 @@ def from_pretrained(
 
         # Add save modules
         patch_saving_functions(model)
+        Trainer._inner_training_loop = _fast_inner_training_loop
 
         # Save tokenizer for inference purposes
         tokenizer.padding_side = "left" # Force inference
diff --git a/unsloth/models/qwen2.py b/unsloth/models/qwen2.py
index 115bf3e090..47327280b9 100644
--- a/unsloth/models/qwen2.py
+++ b/unsloth/models/qwen2.py
@@ -12,9 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from .llama import *
-import os
-from ._utils import __version__
+from .mistral import *
 
 from transformers.models.qwen2.modeling_qwen2 import (
     Qwen2Attention,
@@ -34,7 +32,7 @@ pass
 
 
 
-class FastQwen2Model(FastLlamaModel):
+class FastQwen2Model(FastMistralModel):
 
     @staticmethod
     def pre_patch():
@@ -72,7 +70,7 @@ def from_pretrained(
         trust_remote_code = False,
         **kwargs,
     ):
-        return FastLlamaModel.from_pretrained(
+        return FastMistralModel.from_pretrained(
             model_name = model_name,
             max_seq_length = max_seq_length,
             dtype = dtype,
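
Review note on the removed `PARAMETERS.md`: its `use_rslora` entry contrasts two adapter scaling factors, `lora_alpha/r` and `lora_alpha/math.sqrt(r)`. A minimal sketch of that arithmetic follows, in case it helps when moving the text to the wiki. The helper name `lora_scaling` is hypothetical and is not part of PEFT or Unsloth; it only evaluates the two formulas quoted in the removed file.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Return the adapter scaling factor for the given alpha and rank.

    Hypothetical helper for illustration only: it evaluates the two
    formulas quoted in the removed PARAMETERS.md, nothing more.
    """
    if r <= 0:
        raise ValueError("r must be a positive integer")
    if use_rslora:
        # Rank-stabilized variant: divide by sqrt(r).
        return lora_alpha / math.sqrt(r)
    # Original LoRA default: divide by r.
    return lora_alpha / r


if __name__ == "__main__":
    # Compare both factors for a fixed alpha across a few ranks.
    for rank in (8, 16, 64):
        print(rank, lora_scaling(16, rank), lora_scaling(16, rank, use_rslora=True))
```

With `lora_alpha` held fixed, the default factor shrinks in proportion to `1/r`, while the rank-stabilized factor shrinks only with `1/sqrt(r)`, so the two diverge as the rank grows.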