diff --git a/PARAMETERS.md b/PARAMETERS.md new file mode 100644 index 0000000000..94d6379897 --- /dev/null +++ b/PARAMETERS.md @@ -0,0 +1,87 @@ +## LoraConfig Parameters + +Adjusting the `LoraConfig` parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters: + +**r** +- **Description**: Rank of the low-rank decomposition for factorizing weight matrices. +- **Impact**: + - **Higher**: Retains more information, increases computational load. + - **Lower**: Fewer parameters, more efficient training, potential performance drop if too small. + + +**lora_alpha** +- **Description**: Scaling factor for the low-rank matrices' contribution. +- **Impact**: + - **Higher**: Increases influence, speeds up convergence, risks instability or overfitting. + - **Lower**: Subtler effect, may require more training steps. + +**lora_dropout** +- **Description**: Probability of zeroing out elements in low-rank matrices for regularization. +- **Impact**: + - **Higher**: More regularization, prevents overfitting, may slow training and degrade performance. + - **Lower**: Less regularization, may speed up training, risks overfitting. + +**loftq_config** +- **Description**: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers. +- **Impact**: + - **Not None**: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting `init_lora_weights='loftq'`. + - **None**: LoftQ quantization is not applied. + - **Note**: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself. + + +**use_rslora** +- **Description**: Enables Rank-Stabilized LoRA (RSLora). +- **Impact**: + - **True**: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to `lora_alpha/math.sqrt(r)`, which has been proven to work better as per the [Rank-Stabilized LoRA paper](https://doi.org/10.48550/arXiv.2312.03732). + - **False**: Uses the original default scaling factor `lora_alpha/r`. + +**gradient_accumulation_steps** +- **Default**: 1 +- **Description**: The number of steps to accumulate gradients before performing a backpropagation update. +- **Impact**: + - **Higher**: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware. + - **Lower**: Faster updates but may require more memory per step and can be less stable. + +**weight_decay** +- **Default**: 0.01 +- **Description**: Regularization technique that applies a small penalty to the weights during training. +- **Impact**: + - **Non-zero Value (e.g., 0.01)**: Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights. + - **Zero**: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets. + +**learning_rate** +- **Default**: 2e-4 +- **Description**: The rate at which the model updates its parameters during training. +- **Impact**: + - **Higher**: Faster convergence but risks overshooting optimal parameters and causing instability in training. + - **Lower**: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance. + +## Target Modules + +**q_proj (query projection)** +- **Description**: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space. +- **Impact**: Transforms the input into query vectors that are used to compute attention scores. + +**k_proj (key projection)** +- **Description**: Projects the input into the key space in the attention mechanism. +- **Impact**: Produces key vectors that are compared with query vectors to determine attention weights. + +**v_proj (value projection)** +- **Description**: Projects the input into the value space in the attention mechanism. +- **Impact**: Produces value vectors that are weighted by the attention scores and combined to form the output. + +**o_proj (output projection)** +- **Description**: Projects the output of the attention mechanism back into the original space. +- **Impact**: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model. + +**gate_proj (gate projection)** +- **Description**: Typically used in gated mechanisms within neural networks, such as gating units in gated recurrent units (GRUs) or other gating mechanisms. +- **Impact**: Controls the flow of information through the gate, allowing selective information passage based on learned weights. + +**up_proj (up projection)** +- **Description**: Used for up-projection, typically increasing the dimensionality of the input. +- **Impact**: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities. + +**down_proj (down projection)** +- **Description**: Used for down-projection, typically reducing the dimensionality of the input. +- **Impact**: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.