Implement backward passes for llama with small training llama from scratch example #1360
Commits on May 1, 2023
- 73ac18d: implement 8 of 14 missing backward pass operations used by llama
  Implemented: GGML_OP_ADD_AT, GGML_OP_CPY, GGML_OP_MUL_MAT (src0.grad), GGML_OP_PERMUTE, GGML_OP_RESHAPE, GGML_OP_SCALE, GGML_OP_TRANSPOSE, GGML_OP_VIEW.
  Adds the new operation GGML_OP_ADD_AT, which is necessary for the backward pass of GGML_OP_VIEW: it adds src1 to src0 at a data offset, i.e. to view(src0, ..., offset). The result is a tensor with the size of src0; values outside [data+offset : data+offset+nbytes(src1)] are simply the original values from src0.
  Backward passes still missing for llama: GGML_OP_DIAG_MASK_INF, GGML_OP_GET_ROWS, GGML_OP_RMS_NORM, GGML_OP_ROPE, GGML_OP_SILU, GGML_OP_SOFT_MAX.
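A minimal sketch of these GGML_OP_ADD_AT semantics for the contiguous 1-D float case (illustrative names and layout, not the actual ggml implementation):

```c
#include <stdint.h>
#include <string.h>

// dst keeps src0's values everywhere; src1 is added only into the region
// starting at the byte offset, i.e. into view(src0, ..., offset).
void add_at_f32(float * dst, const float * src0, int64_t n0,
                const float * src1, int64_t n1, size_t offset_bytes) {
    const int64_t start = offset_bytes / sizeof(float);
    memcpy(dst, src0, n0 * sizeof(float));  // values outside the view stay as src0
    for (int64_t i = 0; i < n1; ++i) {
        dst[start + i] += src1[i];          // accumulate src1 inside the view
    }
}
```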
- b164343: implement 5 of 6 missing backward pass operations used by llama
  Implemented: GGML_OP_DIAG_MASK_INF, GGML_OP_GET_ROWS, GGML_OP_RMS_NORM, GGML_OP_SILU, GGML_OP_SOFT_MAX.
  Adds the necessary operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK.
  GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX. It could be replaced by GGML_OP_ADD plus GGML_OP_REPEAT, but performance would be worse. Additionally, GGML_OP_REPEAT returns an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar: in that case GGML_OP_REPEAT returns not the value that should be repeated (src1) but the value whose shape the result should take (src0), so it cannot replace GGML_OP_ADD1 there.
  GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM, and GGML_OP_ROPE, which cannot easily be composed from existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of the required backward operations; sounds a bit confusing at first, I know.
  GGML_OP_DIAG_MASK_ZERO is necessary for the backward pass of GGML_OP_DIAG_MASK_INF.
  Some operations were previously inplace-only; the backward pass needs non-inplace variants. Staying consistent with other operations that have both, these operations are changed to non-inplace, and "_inplace" functions are added for the inplace behavior. llama calls the inplace variants so it behaves as before, while the llama backward pass uses the non-inplace variants.
  Backward passes still incomplete for llama: GGML_OP_ROPE (needs a forward pass for GGML_OP_ROPE_BACK) and GGML_OP_GET_ROWS (only necessary for the tokenizer).
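As an example of why the backward pass needs its own forward operation (standard calculus, restated here; the arrangement inside ggml may differ): the gradient of SILU is itself an elementwise function that must be evaluated in the forward direction of the backward graph. With $\sigma$ the logistic sigmoid:

$$
\mathrm{silu}(x) = x\,\sigma(x), \qquad
\frac{d}{dx}\,\mathrm{silu}(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr),
$$

so an operation like GGML_OP_SILU_BACK can compute $g \cdot \sigma(x)\bigl(1 + x\,(1-\sigma(x))\bigr)$ from $x$ and the upstream gradient $g$ in a single pass.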
- b908007: norm & rms_norm can not be threaded
  After investigating rms_norm for quite some time, the conclusion is that neither norm nor rms_norm can be threaded, because we need the mean over all items, not just over the slice each thread sees.
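The obstruction is visible in the definition: every output element depends on a statistic of the entire row, so no thread can finish any element from its slice alone. For rms_norm, with row length $n$ and a small constant $\varepsilon$:

$$
y_i = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^2 + \varepsilon}} .
$$

Per-thread partial sums would require a cross-thread reduction before any division can happen.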
- 36d8a05
- 488decf
- 4e1f81d
- 0da2675
- 20e3c1d
- 9345f4c
- 9d6fc28
- 6fb08b4
- 671e592
- a367eb9: bug fix for scale backward pass
  Use sum instead of mean for the gradient of the scalar scale parameter.
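Why sum is correct (a standard derivation, not quoted from the commit): the scalar $s$ multiplies every element, so its gradient accumulates contributions from all of them. With upstream gradient $g$:

$$
y_i = s\,x_i \quad\Longrightarrow\quad
\frac{\partial L}{\partial s} = \sum_i g_i\,x_i \quad \text{(a sum, not a mean)} .
$$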
- 0197bcb
- bfe5072: improve performance of sum backward pass
  Use add1(x, y) instead of add(x, repeat(y, x)).
- b583136: improve performance of sqr backward pass
  Use scale(x, y) instead of mul(x, repeat(y, x)).
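The identity behind the sum rewrite (standard calculus): with scalar upstream gradient $g = \partial L / \partial y$,

$$
y = \sum_i x_i \quad\Longrightarrow\quad \frac{\partial L}{\partial x_i} = g \ \text{ for all } i,
$$

so the backward pass adds the same scalar to every element of the accumulated gradient, which is exactly what add1 does; materializing repeat(y, x) first only costs memory and bandwidth. The sqr case is analogous, with scale replacing a mul against a repeated scalar.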
- 7571147
- 0ea8201
- b2bd822
- c483a7d
- ecf949b
- 54ab300: use ggml_opt to train a, b for minimal e = sum(sqr(c - a*b)), for random initial a, b, c
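A minimal sketch of such a test, assuming the ggml API of this period (dimensions, the random fill, and return-code handling are illustrative, not the test's actual code):

```c
#include "ggml.h"
#include <stdlib.h>

int main(void) {
    struct ggml_init_params iparams = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(iparams);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    struct ggml_tensor * c = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    for (int i = 0; i < 8*8; ++i) {   // random initial a, b, c
        ((float *) a->data)[i] = (float) rand() / RAND_MAX;
        ((float *) b->data)[i] = (float) rand() / RAND_MAX;
        ((float *) c->data)[i] = (float) rand() / RAND_MAX;
    }

    ggml_set_param(ctx, a);           // mark a and b as trainable parameters
    ggml_set_param(ctx, b);

    // e = sum(sqr(c - a*b)) is the scalar loss to minimize
    struct ggml_tensor * e = ggml_sum(ctx, ggml_sqr(ctx,
                                 ggml_sub(ctx, c, ggml_mul_mat(ctx, a, b))));

    enum ggml_opt_result res = ggml_opt(ctx, ggml_opt_default_params(GGML_OPT_ADAM), e);

    ggml_free(ctx);
    return res == GGML_OPT_OK ? 0 : 1;
}
```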
- 1a80e9a: correctly implement softmax backward pass using new operation ggml_diag
  ggml_diag constructs diagonal matrices from its entries: ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d].
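With ggml_diag available, the standard softmax Jacobian can be expressed directly as graph operations. For $s = \mathrm{softmax}(x)$ and upstream gradient $g$:

$$
\frac{\partial s_i}{\partial x_j} = s_i(\delta_{ij} - s_j)
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x} = \bigl(\operatorname{diag}(s) - s\,s^{\top}\bigr)\,g .
$$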
- fea42be
- 9310650
- 38675e5
- c1a8893: de-duplicate ggml_forward_dup code taking care of contiguous tensors of the same type
  With this we can duplicate tensors of any type as long as they are contiguous.
- 83fa6b3: fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
  When more threads were used than elements exist, ie1 was less than ie0, resulting in an invalid negative byte count argument to memcpy.
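A self-contained sketch of the failure mode and the kind of clamping that fixes it (names follow the commit message; the real code differs in details):

```c
#include <string.h>

// Copy this thread's share of a contiguous tensor. ith/nth: thread index/count.
static void dup_same_cont_slice(char * dst, const char * src,
                                int ne, size_t type_size, int ith, int nth) {
    const int dr  = (ne + nth - 1) / nth;           // elements per thread, rounded up
    const int ie0 = dr * ith;                       // first element for this thread
    const int ie1 = ie0 + dr < ne ? ie0 + dr : ne;  // clamp: for the excess threads
                                                    // when ne < nth, ie0 can be >= ne
    if (ie0 < ie1) {                                // guard: unclamped, (ie1 - ie0) < 0
                                                    // underflows to a huge size_t in memcpy
        memcpy(dst + ie0 * type_size, src + ie0 * type_size, (ie1 - ie0) * type_size);
    }
}
```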
- cecd6c7: required for the view backward pass: src0 values must be copied to dst, because during the addition we don't touch all dst elements, in contrast to the normal add function.
- 124fdca
- 410a47a
- b9416d7: fix ggml_forward_add functions to work correctly with transposed tensors
  Uses the same logic as in ggml_compute_forward_add_q_f32, made consistent across all ggml_compute_forward_add_... functions. This also slightly changes the memory access pattern of the different threads to work as in ggml_compute_forward_add_q_f32.
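A sketch of the row-based threading pattern being described, with ggml-style names (illustrative, not the actual function): each thread takes a contiguous range of rows, and every access goes through byte strides, so a transposed src1 is read correctly element by element:

```c
#include <stddef.h>
#include <stdint.h>

// Add src1 to src0 into dst. ne[] holds element counts; nb/nb0/nb1 hold the
// byte strides per dimension for dst/src0/src1. ith/nth: thread index/count.
static void add_f32_sketch(char * dst, const int64_t * ne, const size_t * nb,
                           const char * src0, const size_t * nb0,
                           const char * src1, const size_t * nb1,
                           int ith, int nth) {
    const int64_t nr  = ne[1]*ne[2]*ne[3];          // total number of rows
    const int64_t dr  = (nr + nth - 1)/nth;         // rows per thread
    const int64_t ir0 = dr*ith;                     // this thread's row range
    const int64_t ir1 = ir0 + dr < nr ? ir0 + dr : nr;

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        // unflatten the row index into (i1, i2, i3)
        const int64_t i3 = ir/(ne[2]*ne[1]);
        const int64_t i2 = (ir - i3*ne[2]*ne[1])/ne[1];
        const int64_t i1 = ir - i3*ne[2]*ne[1] - i2*ne[1];

        float       * d  = (float *)      (dst  + i1*nb[1]  + i2*nb[2]  + i3*nb[3]);
        const float * s0 = (const float *)(src0 + i1*nb0[1] + i2*nb0[2] + i3*nb0[3]);

        for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
            // src1 is addressed per element through its own strides, so a
            // transposed (non-contiguous) src1 still reads the right value
            const float * s1 = (const float *)(src1 + i0*nb1[0] + i1*nb1[1]
                                                    + i2*nb1[2] + i3*nb1[3]);
            d[i0] = s0[i0] + *s1;
        }
    }
}
```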
- 339b2ad: fix ggml_forward_add1 functions to work correctly with transposed tensors
  Uses the same logic as in ggml_compute_forward_add1_q_f32, made consistent across all ggml_compute_forward_add1_... functions. This also slightly changes the memory access pattern of the different threads to work as in ggml_compute_forward_add1_q_f32.
- 86b44a0
- a7a8370
- b0555fc
- 02d3fd0: fix sub, mul and div functions to work correctly with transposed tensors
  Uses the same logic as in add.
- 3d21f26
- c601df9: successfully test transpose backward and permute for all permutations
  Also tests sub, mul and div up to max n_dims.
- 1997152: test-grad0.c: add TODO for view_2d and view_3d
  add_at (required for the view backward pass) is a bit tricky for n_dims > 1.
- d42531f
- 19f5159
- b9920e5: nargs and ndims were swapped, corrupting the stack
- 3dbd649
- 7281f60
- 96e773b
- f0302fa
- 8443638: add nb parameters to add_at, like in view
  Together with the offset, they define how dst and src0 are viewed during the add_at operation.
- b18b72d
- 84a4b39: fix backward pass for rms_norm
  I would have used formulas from other frameworks, but they differed from each other, so I could not decide which was correct. Instead, the gradient was derived in a code comment here, using manual forward-backward automatic differentiation of rms_norm and simplification.
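The result of such a derivation, restated here for reference (standard calculus, not quoted from the code comment): with row length $n$, $r = \sqrt{\tfrac{1}{n}\sum_j x_j^2 + \varepsilon}$, $y_i = x_i / r$, and upstream gradient $g$:

$$
\frac{\partial L}{\partial x_k}
= \frac{1}{r}\left( g_k - \frac{x_k}{n\,r^2} \sum_j g_j\, x_j \right),
$$

since $\partial r / \partial x_k = x_k/(n\,r)$, so each $y_i$ contributes both a direct term and a term through the shared denominator.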
- 2ecc690: successfully test backward pass of rms_norm
  Some tests may fail when gradients are large. Could not find a satisfying configuration of absolute-error and relative-error thresholds that passes all tests while still checking the results with tight enough bounds. Looking at the values, the "failed" tests actually look ok. For example:
  rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
  They fail only because of the test logic in check_gradients.
- 2277053: add todos for llama backward pass
  - The implementation of the ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required).
  - repeat is not yet tested and looks like it only works for single-element src0 inputs.
- c4539ed: ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
- ba62c79
- 8b5b2f0
- 72bcfb5
- 1c4dc1e
- 8fde656: add baby-llama example training a very small llama model from scratch to output a sinusoidal wave
  Had to increase the maximum number of optimization parameters to train from scratch.
- 29a0f8b
- 5f23052: switching from training with adam to lbfgs produces much better results in the baby-llama example
- bc1c13b
Commits on May 6, 2023
- 83ee1cd: fix bug when using ggml_opt to optimize params in one context while using a renewable context for eval and opt
  When not keeping gradients of the model parameters, they are overwritten by tensors created by opt, which may become invalid after the opt context is renewed. So we need to keep the original gradients and make dups for opt.
- f1d51d1: train on multiple examples, generate & print tokens with the trained model afterwards
  ctx0 for evaluation and optimization is renewed for each sample.
- b4c273f
- 8cf04fe
- 65d9f73
- 5724628
- 7a15a83
- e6186d9
- 80223d9
- 73fd66e: fix training get_example_targets
  Predict the next token, not the current token!
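The off-by-one in a nutshell (a hypothetical helper, not the example's actual code):

```c
// For a token sequence t[0..n-1], the model at position i is trained to
// predict the token that FOLLOWS: target[i] = t[i+1], never t[i] itself.
void get_example_targets_sketch(const int * tokens, int n_tokens, int * targets) {
    for (int i = 0; i + 1 < n_tokens; ++i) {
        targets[i] = tokens[i + 1];  // next token, not the current one
    }
}
```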
- 7a5dec2
- 226521a: optimize loss over multiple samples
  This increases the computation graph; parallel batched forward is needed for more efficiency.
- 48bcc4d
- 47561de: add ggml_set(ctx, a, b) to set b in a view of a and return the modified a
  Necessary to set values into the kv_self cache while properly propagating the gradients.
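A sketch of the intended change in llama.cpp's cache update (the 1-d variant and its exact call shape are assumptions based on the message and ggml naming conventions):

```c
// Before: ggml_cpy writes k_cur into a raw view of the cache, so the gradient
// chain through kv_self.k is broken (only the copy carries gradients):
//   ggml_cpy(ctx0, k_cur, ggml_view_1d(ctx0, kv_self.k, n, offset));
//
// After (assumed API shape): ggml_set returns a tensor that is kv_self.k with
// k_cur written at the offset, so gradients propagate to both operands:
struct ggml_tensor * k = ggml_set_1d(ctx0, kv_self.k, k_cur, offset);
```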
- 956511b: fix kv_self gradients for training
  Use ggml_set instead of ggml_cpy to write the kv_self cache, properly propagating the gradients.
- 561fbe0: replace inplace operations for training with copying operations to allow gradient propagation
- e91b83b
Commits on May 7, 2023
- 93201ab: add trainable lora-only model with all big matrices C split into A,B with A*B=C
  This is not a lora-finetune; the whole model is changed to have only low-rank "lora" matrices. Training this instead of the normal model resulted in much worse results, though...
- 49d6daa: vastly improve training results
  Instead of logit targets 0 and 1, use -1 and +1.
- e0de09d
- 4764842
- ee565f3: Merge branch 'master' into train-example
  Conflicts: ggml.c, llama.cpp
- e643fa1
- d20ba6f
- 5d9fed7
- 47ad186
- 9dd8e40
- 660836f
- 7c8768f
- 2936dd6
- 4997bc5: reduce number of test-grad0 iterations
  Avoids exceeding the timeout of automated tests.
- f530106
Commits on May 8, 2023
- 1ecbece
- dea9c93: use C++ includes instead of C includes; use std::min, std::max instead of MIN, MAX macros
- 0d72207: use C++ includes instead of C includes; use std::min, std::max instead of MIN, MAX macros
- 78af3e9
- 6cc42de
- cafbb78: swap arguments to vDSP_vdiv call
  Documentation for vDSP_vdiv states: "Note that B comes before A!"
- 9c3fe4e: swap arguments to vDSP_vdiv call
  Documentation for vDSP_vdiv states: "Note that B comes before A!"
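The Accelerate call in question computes an elementwise quotient but takes the divisor first; a minimal check:

```c
#include <Accelerate/Accelerate.h>

int main(void) {
    float A[4] = {2, 4, 6, 8};
    float B[4] = {1, 2, 3, 4};
    float C[4];
    // vDSP_vdiv(B, strideB, A, strideA, C, strideC, n) computes C[i] = A[i] / B[i];
    // the divisor B is the FIRST argument, as the commit message warns.
    vDSP_vdiv(B, 1, A, 1, C, 1, 4);  // C = {2, 2, 2, 2}
    return 0;
}
```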
- 6ca682b
Commits on May 11, 2023
- 3e3ed95
- 581e5eb
- b9ef08c
Commits on May 13, 2023
- f977243
- 33034cf
- 092913e
- 95a487a
- ef3d42a
- dae6ba2