
Implement backward passes for llama with small training llama from scratch example #1360

Merged (110 commits, May 13, 2023)

Conversation

@xaedes (Collaborator) commented May 7, 2023

Training a llama directly with ggml would be really nice.

For this I implemented the backward passes required for the llama model, tested them with test_grad0 from the ggml repo, and trained a small llama from scratch to output a sine wave.

Also see the more detailed discussion in ggerganov/ggml#8 (comment)

List of all new operations I had to add (a small usage sketch follows the list):

  • GGML_OP_ADD1 : Could be replaced with add(X, repeat(Y, X)), but it is faster when the repeat can be avoided.
  • GGML_OP_ACC : Necessary for the view backward pass. This adds src1 to view(src0, nb, offset) and returns a tensor with the shape of src0.
  • GGML_OP_SET : (this is new) Necessary for propagating gradients through the kv cache. Instead of copying into the kv cache, this function sets the values in the kv cache viewed with offsets and strides and returns a tensor representing the modified kv cache. It can also operate inplace, returning a view of the modified kv cache.
  • GGML_OP_LOG : (this is new) Necessary for the cross entropy loss.
  • GGML_OP_SUM_ROWS : Necessary for the repeat backward pass: reduces rows by summing them. shape[a,b,c,d] -> shape[1,b,c,d]
  • GGML_OP_SILU_BACK : Necessary for the silu backward pass.
  • GGML_OP_RMS_NORM_BACK : Could also be implemented using primitives, at the cost of performance.
  • GGML_OP_GET_ROWS_BACK : Necessary for the get_rows backward pass: adds src0[i] rows to opt0[src1[i]] rows, returning a tensor of shape opt0.
  • GGML_OP_DIAG : Necessary for the softmax backward pass. The alternative would have been to implement SOFTMAX_BACK directly, but DIAG is at least usable for other things. It turns rows into diagonal matrices.
  • GGML_OP_DIAG_MASK_ZERO : Necessary for the diag_mask_inf backward pass.
  • GGML_OP_ROPE_BACK : Necessary for the rope backward pass.
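
A small usage sketch of a few of these operations, assuming an existing struct ggml_context * ctx and f32 tensors; the shapes follow the descriptions above, but treat the actual signatures in ggml.h as the source of truth:

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // shape [4,3]
    struct ggml_tensor * c = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);    // scalar

    struct ggml_tensor * t0 = ggml_add1    (ctx, a, c); // add the scalar c to every element of a
    struct ggml_tensor * t1 = ggml_sum_rows(ctx, a);    // [4,3] -> [1,3]: each row reduced to its sum
    struct ggml_tensor * t2 = ggml_log     (ctx, a);    // element-wise log, used for cross entropy loss

    struct ggml_tensor * r  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 1); // a single row
    struct ggml_tensor * t3 = ggml_diag    (ctx, r);    // the row becomes a 4x4 diagonal matrix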

Notable other changes:

  • add inplace and non-inplace variants for scale, diag_mask_inf, soft_max and rope
  • fix the sub, mul and div functions to work correctly with transposed tensors, using the same logic as in add
  • fix the ggml_forward_add functions to work correctly with transposed tensors, using the same logic as in ggml_compute_forward_add_q_f32, but made consistent across all ggml_compute_forward_add_... functions. This also slightly changes the memory access pattern of the different threads to match ggml_compute_forward_add_q_f32.
  • de-duplicate the ggml_forward_dup code that takes care of contiguous tensors of the same type. With this we can duplicate tensors of any type as long as they are contiguous. The function is used in dup, get_rows_back and diag_mask (when not inplace).
  • there are some maybe too verbose comments, including step-by-step derivations of the gradients, that could (or should?) be cleaned up.
  • (this is new) I added 1d and 4d functions for ggml_reshape and ggml_view.

The performance of various parts of the training could be improved; in particular, a fast ggml_out_prod could help speed up the matrix multiplication backward pass.
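
For context, the math behind that remark: for a linear map y = W*x, the loss gradient with respect to W is dL/dW = (dL/dy) * x^T, i.e. an outer product of the output gradient with the input, summed over the tokens in a batch. A dedicated, fast out_prod kernel computes exactly this accumulation, which is why it would speed up the matrix multiplication backward pass.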

There are two additional test files: one for testing gradients (taken from the ggml repo) and one small test for testing optimization in general.

Exemplary training of a small llama model is demonstrated in the self-contained baby-llama example, where it is trained to output a sine wave.

Some notes on first training tests:

  • the L-BFGS optimizer is faster and trains better than Adam
  • target logits should represent the NEXT token, not the current one
  • target logits of -1 & +1 work much better than 0 & +1
  • I think adding a BOS token also helped improve the training
  • training with a cross entropy loss gave worse generation results than training with a summed squared logit difference loss (see the loss sketch after this list)
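
A hedged sketch of the two losses from the last point, built from existing ggml primitives; logits, target_logits and target_probs are illustrative names, and the actual example may compose its losses differently:

    // summed squared logit difference (the loss that gave better generations)
    struct ggml_tensor * e_sq = ggml_sum(ctx, ggml_sqr(ctx,
            ggml_sub(ctx, logits, target_logits)));

    // cross entropy against target probabilities
    struct ggml_tensor * e_ce = ggml_neg(ctx, ggml_sum(ctx,
            ggml_mul(ctx, target_probs, ggml_log(ctx, ggml_soft_max(ctx, logits)))));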

A parallel batched forward function would probably be a good improvement. Training on multiple examples in a (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since the backward pass is needed as well.
The batches could be stored in another dimension of the tensors; it would probably just require some reshape and view operations to make it work.
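
A rough sketch of that idea; the names (n_embd, n_tokens, n_batch, n_vocab, embd, logits3d) are illustrative, not taken from the code:

    // one tensor holds all examples, with the batch as the third dimension
    struct ggml_tensor * embd = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, n_tokens, n_batch);

    // a single example can still be addressed as a 2d view into the batched tensor
    struct ggml_tensor * embd0 = ggml_view_2d(ctx, embd,
            n_embd, n_tokens, embd->nb[1], 0*embd->nb[2]);

    // for the loss, the batched logits can be flattened so the existing 2d code applies
    struct ggml_tensor * flat_logits = ggml_reshape_2d(ctx, logits3d, n_vocab, n_tokens*n_batch);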

I did not look into training a LoRA finetune yet, but all the necessary machinery for that seems to be working.

xaedes added 30 commits May 1, 2023 02:41
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement the additional ggml operation GGML_OP_ADD_AT, which is necessary for the backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 at a data offset, i.e. to view(src0, ..., offset).
the values are returned in a tensor the size of src0; values outside of [data+offset : data+offset+nbytes(src1)] are just the original values from src0.
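
a semantic sketch of the operation, assuming contiguous f32 tensors, with offset being the byte offset of the view (illustrative, not the actual implementation):

    // dst takes the shape of src0; only the viewed region receives the addition
    memcpy(dst->data, src0->data, ggml_nbytes(src0));
    float       * dv = (float *) ((char *) dst->data + offset);
    const float * s1 = (const float *) src1->data;
    for (int64_t i = 0; i < ggml_nelements(src1); ++i) {
        dv[i] += s1[i];
    }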

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX.
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. Additionally, GGML_OP_REPEAT returns an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar: in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value whose shape the result should take (src0), so in this case it cannot replace GGML_OP_ADD1.
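
for illustration, the two graph-building forms (x is any tensor, b a 1-element tensor; the names are illustrative):

    // equivalent result, but add1 avoids materializing the repeated scalar:
    struct ggml_tensor * y0 = ggml_add (ctx, x, ggml_repeat(ctx, b, x));
    struct ggml_tensor * y1 = ggml_add1(ctx, x, b);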

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward passes of these functions cannot easily be composed of existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of the required backward operations. Sounds a bit confusing at first, I know...
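
as an example of why a dedicated op is needed: the silu derivative mixes the input and its sigmoid in a way that would take several graph nodes to express with existing primitives. a per-element sketch of what GGML_OP_SILU_BACK has to compute (standard calculus, not the actual kernel code):

    #include <math.h>

    // silu(x) = x * sigmoid(x);  d/dx silu(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    static inline float silu_backward(float x, float dy) {
        const float s = 1.0f/(1.0f + expf(-x));
        return dy * s * (1.0f + x*(1.0f - s));
    }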

GGML_OP_DIAG_MASK_ZERO is necessary for the backward pass of GGML_OP_DIAG_MASK_INF.

Some operations were previously inplace-only; for the backward pass there need to be non-inplace variants.
Staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace, and functions with "_inplace" are added for the inplace versions.
In llama we need to call the inplace variants so that it behaves as before.
For the llama backward pass we need to use the non-inplace variants.
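
roughly (illustrative; cur and n_past as in the llama attention code):

    // inference graph: keeps the old in-place behaviour
    cur = ggml_diag_mask_inf_inplace(ctx, cur, n_past);
    cur = ggml_soft_max_inplace(ctx, cur);

    // training graph: non-inplace, so the inputs survive for the backward pass
    cur = ggml_diag_mask_inf(ctx, cur, n_past);
    cur = ggml_soft_max(ctx, cur);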

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
after investigating rms norm for quite some time I came to the conclusion that neither norm nor rms_norm can be threaded, because we need the mean over all items, not just over the slices each thread sees.
use sum instead of mean for gradient of scalar scale parameter
use add1(x,y) instead of add(x,repeat(y,x))
use scale(x,y) instead of mul(x,repeat(y,x))
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
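
a hedged sketch of what that test does, assuming the ggml_opt API of the time (ggml_opt_default_params / GGML_OPT_LBFGS); whether a*b is element-wise or a matrix product here is my assumption, and n is an illustrative size:

    // train a and b so that a*b reproduces c: e = sum(sqr(c - a*b))
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    struct ggml_tensor * c = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    // ... fill a, b, c with random values ...

    ggml_set_param(ctx, a); // mark a and b as trainable parameters
    ggml_set_param(ctx, b);

    struct ggml_tensor * ab = ggml_mul_mat(ctx, a, b);
    struct ggml_tensor * e  = ggml_sum(ctx, ggml_sqr(ctx, ggml_sub(ctx, c, ab)));

    ggml_opt(ctx, ggml_opt_default_params(GGML_OPT_LBFGS), e);
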
ggml_diag constructs diagonal matrices from the row entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
…of same type.

with this we can duplicate tensors of any type as long as they are contiguous.
when more threads are used than elements exist, ie1 was less than ie0, resulting in an invalid negative byte count argument in memcpy
required for view backward pass

src0 values must be copied to dst, because, in contrast to the normal add function, we don't touch all dst elements during the addition.
documentation for vDSP_vdiv states: "Note that B comes before A!"
xaedes and others added 2 commits May 8, 2023 21:16
documentation for vDSP_vdiv states: "Note that B comes before A!"
@ggerganov (Owner) commented:

After fixing the vDSP_vsub argument order, it works now

[image]

@xaedes (Collaborator, Author) commented May 11, 2023

I got a batched forward function working. With this, a greater number of parallel batches can be trained with ease.

This should also be useful to implement beam search sampling.

@ggerganov (Owner) commented May 11, 2023

Ha, that is interesting. I somehow thought that to support batched inference some changes in ggml.c would be needed.
Well done!

This should also be useful to implement beam search sampling.

Yes, this is needed for the beam-search decoding in whisper.cpp
Also, another project that can greatly benefit from batched inference is bert.cpp

Edit: now that #1405 has been merged, this PR is now the highest priority for merging. Will look into more details in the following days.

// part of the repeat backward pass, composed from existing primitives:
struct ggml_tensor* F08 = ggml_transpose (ctx, F07);             // swap dims 0 and 1
struct ggml_tensor* F09 = ggml_cont      (ctx, F08);             // make the transposed view contiguous
struct ggml_tensor* F10 = ggml_reshape   (ctx, F09, src0->grad); // reshape to the shape of src0->grad

@ggerganov (Owner):

Wow!

Btw, would it make sense to have GGML_OP_REPEAT_BACK that implements this in a kernel?
Maybe if this is some sort of bottleneck. Otherwise, keep it like this.

@xaedes (Collaborator, Author):

Yeah, I shuddered at this myself ^^ I think one of the conts is not even necessary.
I also think it would make sense to implement this as an extra operation; it should not be that difficult.
Most of the reshaping etc. is only necessary to use sum_rows, which could be done better in a special operation.
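
For what it's worth, a minimal sketch of what such a dedicated kernel would compute, assuming repetition along a single dimension and contiguous f32 data (illustrative only, not the implementation discussed here):

    // gradient of repeat(): each repeated copy contributes its gradient back to src
    static void repeat_back_f32(
            const float * grad_rep, // gradient of repeat(src): nr contiguous copies of src
            float       * grad_src, // gradient of src
            int n, int nr) {        // n = elements in src, nr = number of repetitions
        for (int i = 0; i < n; ++i) {
            grad_src[i] = 0.0f;
        }
        for (int r = 0; r < nr; ++r) {
            for (int i = 0; i < n; ++i) {
                grad_src[i] += grad_rep[r*n + i];
            }
        }
    }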


@ggerganov (Owner) commented May 13, 2023

Hmm, what's wrong with the CI?
It was successful before my changes, then I made some minor edits (fix warnings, indentation) and now it is failing.

/home/runner/work/llama.cpp/llama.cpp/ggml.c: In function ‘ggml_compute_forward_add1’:
/home/runner/work/llama.cpp/llama.cpp/ggml.c:7398:14: error: ‘GGML_TYPE_Q4_2’ undeclared (first use in this function); did you mean ‘GGML_TYPE_Q4_1’?
 7398 |         case GGML_TYPE_Q4_2:
      |              ^~~~~~~~~~~~~~
      |              GGML_TYPE_Q4_1

Did it check out the wrong branch? 🤔

Edit: I see now - it is actually smart enough to merge origin/master into the branch

@ggerganov (Owner) commented May 13, 2023

@xaedes Is it OK if I merge the latest master into your branch? In case you have some pending changes, I can wait for you to do it yourself so I don't mess up your flow.

@xaedes (Collaborator, Author) commented May 13, 2023

Is it OK if I merge the latest master into your branch? In case you have some pending changes, I can wait for you to do it yourself so I don't mess up your flow.

@ggerganov Sure

@github-actions (bot, Contributor) left a comment:

clang-tidy made some suggestions

@ggerganov (Owner) left a comment:

This is a pretty big addition to ggml - I hope it will open up some applications for training in the future

  • The tests will be moved to the ggml repo after we merge this PR
  • The batched forward example is very valuable. Have to apply it to whisper.cpp and bert.cpp
  • After merging this, it would be a good time for some refactoring passes in ggml.c to try and reduce code duplication and overall code size

Thanks @xaedes - excellent effort!

@RonanKMcGovern:

@xaedes has there been further progress in this direction since June? In particular, how far are things from being able to train LoRA adapters?

@xaedes (Collaborator, Author) commented Sep 22, 2023

@RonanKMcGovern
Yep, finetuning LoRA adapters on LLaMA models works: #2632

Labels: high priority (Very important issue)