Implement backward passes for llama with small training llama from scratch example #1360
Conversation
Implement backward passes for:
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

Implement the additional ggml operation GGML_OP_ADD_AT, which is necessary for the backward pass of GGML_OP_VIEW. This operation adds src1 to src0 at a data offset, i.e. to view(src0, ..., offset). The values are returned in a tensor the size of src0; values outside of [data+offset : data+offset+nbytes(src1)] are just the original values from src0.

Still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
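A minimal sketch of the intended GGML_OP_ADD_AT semantics on raw float buffers (names and layout are illustrative, not the actual ggml kernel):

```c
#include <string.h>

// dst receives all of src0; src1 is added only inside the viewed region,
// mirroring view(src0, ..., offset) from the forward pass. Elements outside
// [offset, offset + n1) keep the original src0 values.
static void add_at_f32(float * dst, const float * src0, size_t n0,
                       const float * src1, size_t n1, size_t offset) {
    memcpy(dst, src0, n0*sizeof(float));
    for (size_t i = 0; i < n1; ++i) {
        dst[offset + i] += src1[i];
    }
}
```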
Implement backward passes for:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

Add the necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK.

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX. It could also be replaced by a combination of GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. Additionally, GGML_OP_REPEAT returns an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar: in that case GGML_OP_REPEAT does not return the value that should be repeated (src1) but the value whose shape the result should take (src0), so it cannot replace GGML_OP_ADD1 here.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM, and GGML_OP_ROPE. The backward passes of these functions cannot easily be composed from existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of the required backward operations. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for the backward pass of GGML_OP_DIAG_MASK_INF.

Some operations were previously inplace-only; for the backward pass there need to be non-inplace variants. To stay consistent with other operations that have both variants, these operations were changed to be non-inplace, and functions with "_inplace" were added for the inplace behavior. In llama we call the inplace variants so that inference works as before; for the llama backward pass we use the non-inplace variants, as sketched below.

Still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
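A minimal sketch of the difference, assuming `ctx`, `kq`, and `n_past` as in the llama attention code (only one of the two calls would be used in a given graph):

```c
// Inference keeps the previous behavior by calling the "_inplace" variant.
struct ggml_tensor * masked_inplace = ggml_diag_mask_inf_inplace(ctx, kq, n_past);

// Graphs that need gradients use the non-inplace variant, so the source tensor
// is preserved for the backward pass.
struct ggml_tensor * masked = ggml_diag_mask_inf(ctx, kq, n_past);
```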
After investigating rms_norm for quite some time, I came to the conclusion that neither norm nor rms_norm can be threaded, because we need the mean over all items, not just over the slice each thread sees.
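A minimal sketch of the per-row RMS normalization (plain C, illustrative names; `eps` and the row length `ne0` are assumed) shows why: the mean of squares runs over the entire row, so splitting one row across threads would require a cross-thread reduction.

```c
#include <math.h>

// Normalize one contiguous row of ne0 floats by its root mean square.
static void rms_norm_row_f32(float * y, const float * x, int ne0, float eps) {
    float sum = 0.0f;
    for (int i = 0; i < ne0; ++i) {
        sum += x[i]*x[i];                 // the mean of squares needs every element of the row
    }
    const float mean  = sum/ne0;
    const float scale = 1.0f/sqrtf(mean + eps);
    for (int i = 0; i < ne0; ++i) {
        y[i] = x[i]*scale;
    }
}
```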
use sum instead of mean for gradient of scalar scale parameter
use add1(x,y) instead of add(x,repeat(y,x))
use scale(x,y) instead of mul(x,repeat(y,x))
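A minimal sketch of these substitutions, assuming `x` is an arbitrary tensor and `y` is a 1-element (scalar) tensor in an existing ggml context `ctx`:

```c
// Broadcast-add and broadcast-multiply a scalar tensor without materializing a repeat.
struct ggml_tensor * a = ggml_add1 (ctx, x, y); // instead of ggml_add(ctx, x, ggml_repeat(ctx, y, x))
struct ggml_tensor * s = ggml_scale(ctx, x, y); // instead of ggml_mul(ctx, x, ggml_repeat(ctx, y, x))
```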
This uses ggml_opt to train a and b for minimal e = sum(sqr(c - a*b)), starting from random initial a, b, c.
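A minimal sketch of how such a test can be set up with the ggml API, assuming an already initialized context `ctx`, square n×n f32 matrices, and `ggml_mul_mat` for the product (the actual test file may differ):

```c
// Hypothetical setup; a, b, c would be filled with random values before optimizing.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
struct ggml_tensor * c = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);

ggml_set_param(ctx, a);   // mark a and b as trainable parameters
ggml_set_param(ctx, b);

struct ggml_tensor * ab = ggml_mul_mat(ctx, a, b);
struct ggml_tensor * e  = ggml_sum(ctx, ggml_sqr(ctx, ggml_sub(ctx, c, ab)));

struct ggml_opt_params params = ggml_opt_default_params(GGML_OPT_ADAM);
enum ggml_opt_result   result = ggml_opt(ctx, params, e);   // minimizes e w.r.t. a and b
```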
ggml_diag constructs diagonal matrices from the given entries: ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d].
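Illustrative example of the shape rule, assuming `a` is a tensor of shape [a0, 1, c, d] in context `ctx`:

```c
// The entries of each length-a0 row end up on the diagonal of an a0 x a0 matrix,
// with zeros elsewhere; the higher dimensions c and d are carried through.
// e.g. [x0, x1, x2] -> [[x0, 0, 0], [0, x1, 0], [0, 0, x2]]
struct ggml_tensor * m = ggml_diag(ctx, a);   // result shape: [a0, a0, c, d]
```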
…of same type. With this we can duplicate tensors of any type as long as they are contiguous.
When more threads are used than elements exist, ie1 was less than ie0, resulting in an invalid negative byte count argument in memcpy.
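A minimal sketch of the kind of guard needed (plain C; variable names are illustrative, not the actual ggml code): clamp the per-thread element range and skip the copy entirely when the range is empty.

```c
#include <string.h>

// Copy the slice of ne elements assigned to thread ith out of nth threads.
static void copy_slice_f32(float * dst, const float * src, int ne, int ith, int nth) {
    const int dr  = (ne + nth - 1)/nth;             // elements per thread, rounded up
    const int ie0 = dr*ith;                         // first element for this thread
    const int ie1 = ie0 + dr < ne ? ie0 + dr : ne;  // last element, clamped to ne
    if (ie1 > ie0) {                                // empty range when more threads than elements
        memcpy(dst + ie0, src + ie0, (size_t)(ie1 - ie0)*sizeof(float));
    }
}
```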
Required for the view backward pass: src0 values must be copied to dst, because during the addition we don't touch all dst elements, in contrast to the normal add function.
documentation for vDSP_vdiv states: "Note that B comes before A!"
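For reference, a minimal usage sketch of the Accelerate call (the wrapper function is illustrative; the argument order follows Apple's documentation):

```c
#include <Accelerate/Accelerate.h>

// vDSP_vdiv computes C[i] = A[i] / B[i], but the divisor B is the first argument.
static void div_f32(const float * a, const float * b, float * c, vDSP_Length n) {
    vDSP_vdiv(b, 1, a, 1, c, 1, n);   // note: B (divisor) comes before A (dividend)
}
```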
I got a batched forward function working. With this, a greater number of parallel batches can be trained with ease. This should also be useful for implementing beam search sampling.
Ha, that is interesting. I somehow thought that to support batched inference some changes in
Yes, this is needed for the beam-search decoding in
Edit: now that #1405 has been merged, this PR is now the highest priority for merging. Will look into more details in the following days.
```c
struct ggml_tensor* F08 = ggml_transpose (ctx, F07);
struct ggml_tensor* F09 = ggml_cont      (ctx, F08);
struct ggml_tensor* F10 = ggml_reshape   (ctx, F09, src0->grad);
```
Wow!
Btw, would it make sense to have a `GGML_OP_REPEAT_BACK` that implements this in a kernel?
Maybe if this is some sort of bottleneck. Otherwise, keep it like this.
Yeah, I shuddered myself at this ^^ I think one of the conts is not even necessary.
I also think it would make sense to implement this as an extra operation; it should not be that difficult.
Most of the reshaping etc. is only necessary to use sum_rows, which could be done better in a special operation.
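For the simplest 1-D case, such a kernel could look roughly like this (an illustrative sketch, not an actual `GGML_OP_REPEAT_BACK` implementation): the gradient of the repeated tensor is summed back into the shape of the original tensor.

```c
// grad has n_grad elements, a whole multiple of n_dst; each original element
// receives the sum of the gradients of all of its repeated copies.
static void repeat_back_1d_f32(float * dst, size_t n_dst, const float * grad, size_t n_grad) {
    for (size_t i = 0; i < n_dst; ++i) {
        dst[i] = 0.0f;
    }
    for (size_t i = 0; i < n_grad; ++i) {
        dst[i % n_dst] += grad[i];
    }
}
```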
Hmm, what's wrong with the CI?
Did it check out the wrong branch 🤔
Edit: I see now - it is actually smart enough to merge
@xaedes Is it ok if I merged latest
@ggerganov Sure
clang-tidy made some suggestions
This is a pretty big addition to ggml
- I hope it will open up some applications for training in the future
- The `tests` will be moved to the `ggml` repo after we merge this PR
- The batched forward example is very valuable. Have to apply it to `whisper.cpp` and `bert.cpp`
- After merging this, it would be a good time for some refactoring passes in `ggml.c` to try and reduce code duplication and overall code size

Thanks @xaedes - excellent effort!
@xaedes has there been further progress in this direction since June? In particular, how far are things from being able to train LoRA adapters?
@RonanKMcGovern |
Training a llama directly with ggml would be really nice.
For this I implemented the backward passes required for the llama model, tested them with test_grad0 from the ggml repo, and trained a small llama from scratch to output a sine wave.
Also see the more detailed discussion in ggerganov/ggml#8 (comment)
List of all new operations that I had to add:
Notable other changes:
The performance of various parts of the training could be improved; especially a fast ggml_out_prod could help speed up the matrix multiplication backward pass.
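For reference, a naive sketch of the outer-product accumulation such a kernel performs (illustrative only; the real ggml_out_prod operates on ggml tensors and may differ):

```c
// Accumulate the outer product of vectors a (length m) and b (length n)
// into an m x n row-major matrix: dst[i][j] += a[i] * b[j].
static void out_prod_acc_f32(float * dst, const float * a, int m, const float * b, int n) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            dst[i*n + j] += a[i]*b[j];
        }
    }
}
```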
There are two additional test files: one for testing gradients, taken from the ggml repo, and one small test for testing optimization in general.
Exemplary training of a small llama model is demonstrated in the self-contained baby-llama example, where it is trained to output a sine wave.
Some notes on first training tests:
A parallel batched forward function would probably be a good improvement. Training on multiple examples in a (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since the backward pass is necessary as well.
The batches could be stored in another dimension of the tensors. It would probably just require some reshapes and view operations to make it work.
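A minimal sketch of the idea, assuming token IDs are stored as a 2-D i32 tensor with the batch index in the second dimension (`n_tokens`, `n_batch`, `n_embd`, and `model->tok_embeddings` are hypothetical names, not code from this PR):

```c
// n_tokens tokens per example, n_batch examples processed in one forward pass.
struct ggml_tensor * tokens = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_tokens, n_batch);

// The embedding lookup works on the flattened token tensor ...
struct ggml_tensor * inp = ggml_get_rows(ctx, model->tok_embeddings,
                                         ggml_reshape_1d(ctx, tokens, n_tokens*n_batch));

// ... and the result can be viewed as [n_embd, n_tokens, n_batch] for the rest of the graph.
struct ggml_tensor * inp3d = ggml_reshape_3d(ctx, inp, n_embd, n_tokens, n_batch);
```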
I did not look into training a LoRA finetune yet, but all the necessary machinery for that seems to be working.