Implement backward passes for llama with small training llama from scratch example #1360
Conversation
Implement backward passes for:
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

Implement the additional ggml operation GGML_OP_ADD_AT, which is necessary for the backward pass of GGML_OP_VIEW. This operation adds src1 to src0 at a data offset, i.e. to view(src0, ..., offset). The values are returned in a tensor the size of src0; values outside of [data+offset : data+offset+nbytes(src1)] are just the original values from src0.

Still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
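A minimal sketch of the intended GGML_OP_ADD_AT semantics on raw float buffers (names and layout are illustrative, not the actual ggml kernel):

```c
#include <string.h>

// dst receives all of src0; src1 is added only inside the viewed region,
// mirroring view(src0, ..., offset) from the forward pass. Elements outside
// [offset, offset + n1) keep the original src0 values.
static void add_at_f32(float * dst, const float * src0, size_t n0,
                       const float * src1, size_t n1, size_t offset) {
    memcpy(dst, src0, n0*sizeof(float));
    for (size_t i = 0; i < n1; ++i) {
        dst[offset + i] += src1[i];
    }
}
```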
Implement backward passes for:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

Add the necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK.

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX. It could also be replaced by a combination of GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. Additionally, GGML_OP_REPEAT returns an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar: in that case GGML_OP_REPEAT does not return the value that should be repeated (src1) but the value whose shape the result should take (src0), so it cannot replace GGML_OP_ADD1 here.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, and GGML_OP_ROPE_BACK are necessary for the backward passes of GGML_OP_SILU, GGML_OP_RMS_NORM, and GGML_OP_ROPE. The backward passes of these functions cannot easily be composed from existing operations. Since the backward pass itself builds a computation graph, we need forward-pass implementations of the required backward operations. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for the backward pass of GGML_OP_DIAG_MASK_INF.

Some operations were previously inplace-only; for the backward pass there need to be non-inplace variants. To stay consistent with other operations that have both variants, these operations were changed to be non-inplace, and functions with "_inplace" were added for the inplace behavior. In llama we call the inplace variants so that inference works as before; for the llama backward pass we use the non-inplace variants, as sketched below.

Still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
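A minimal sketch of the difference, assuming `ctx`, `kq`, and `n_past` as in the llama attention code (only one of the two calls would be used in a given graph):

```c
// Inference keeps the previous behavior by calling the "_inplace" variant.
struct ggml_tensor * masked_inplace = ggml_diag_mask_inf_inplace(ctx, kq, n_past);

// Graphs that need gradients use the non-inplace variant, so the source tensor
// is preserved for the backward pass.
struct ggml_tensor * masked = ggml_diag_mask_inf(ctx, kq, n_past);
```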
After investigating rms_norm for quite some time, I came to the conclusion that neither norm nor rms_norm can be threaded, because we need the mean over all items, not just over the slice each thread sees.
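A minimal sketch of the per-row RMS normalization (plain C, illustrative names; `eps` and the row length `ne0` are assumed) shows why: the mean of squares runs over the entire row, so splitting one row across threads would require a cross-thread reduction.

```c
#include <math.h>

// Normalize one contiguous row of ne0 floats by its root mean square.
static void rms_norm_row_f32(float * y, const float * x, int ne0, float eps) {
    float sum = 0.0f;
    for (int i = 0; i < ne0; ++i) {
        sum += x[i]*x[i];                 // the mean of squares needs every element of the row
    }
    const float mean  = sum/ne0;
    const float scale = 1.0f/sqrtf(mean + eps);
    for (int i = 0; i < ne0; ++i) {
        y[i] = x[i]*scale;
    }
}
```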
use sum instead of mean for gradient of scalar scale parameter
use add1(x,y) instead of add(x,repeat(y,x))
use scale(x,y) instead of mul(x,repeat(y,x))
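A minimal sketch of these substitutions, assuming `x` is an arbitrary tensor and `y` is a 1-element (scalar) tensor in an existing ggml context `ctx`:

```c
// Broadcast-add and broadcast-multiply a scalar tensor without materializing a repeat.
struct ggml_tensor * a = ggml_add1 (ctx, x, y); // instead of ggml_add(ctx, x, ggml_repeat(ctx, y, x))
struct ggml_tensor * s = ggml_scale(ctx, x, y); // instead of ggml_mul(ctx, x, ggml_repeat(ctx, y, x))
```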
This uses ggml_opt to train a and b for minimal e = sum(sqr(c - a*b)), starting from random initial a, b, c.
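A minimal sketch of how such a test can be set up with the ggml API, assuming an already initialized context `ctx`, square n×n f32 matrices, and `ggml_mul_mat` for the product (the actual test file may differ):

```c
// Hypothetical setup; a, b, c would be filled with random values before optimizing.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
struct ggml_tensor * c = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);

ggml_set_param(ctx, a);   // mark a and b as trainable parameters
ggml_set_param(ctx, b);

struct ggml_tensor * ab = ggml_mul_mat(ctx, a, b);
struct ggml_tensor * e  = ggml_sum(ctx, ggml_sqr(ctx, ggml_sub(ctx, c, ab)));

struct ggml_opt_params params = ggml_opt_default_params(GGML_OPT_ADAM);
enum ggml_opt_result   result = ggml_opt(ctx, params, e);   // minimizes e w.r.t. a and b
```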
ggml_diag constructs diagonal matrices from the given entries: ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d].
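Illustrative example of the shape rule, assuming `a` is a tensor of shape [a0, 1, c, d] in context `ctx`:

```c
// The entries of each length-a0 row end up on the diagonal of an a0 x a0 matrix,
// with zeros elsewhere; the higher dimensions c and d are carried through.
// e.g. [x0, x1, x2] -> [[x0, 0, 0], [0, x1, 0], [0, 0, x2]]
struct ggml_tensor * m = ggml_diag(ctx, a);   // result shape: [a0, a0, c, d]
```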
…of same type. With this we can duplicate tensors of any type as long as they are contiguous.
When more threads are used than elements exist, ie1 was less than ie0, resulting in an invalid negative byte count argument in memcpy.
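A minimal sketch of the kind of guard needed (plain C; variable names are illustrative, not the actual ggml code): clamp the per-thread element range and skip the copy entirely when the range is empty.

```c
#include <string.h>

// Copy the slice of ne elements assigned to thread ith out of nth threads.
static void copy_slice_f32(float * dst, const float * src, int ne, int ith, int nth) {
    const int dr  = (ne + nth - 1)/nth;             // elements per thread, rounded up
    const int ie0 = dr*ith;                         // first element for this thread
    const int ie1 = ie0 + dr < ne ? ie0 + dr : ne;  // last element, clamped to ne
    if (ie1 > ie0) {                                // empty range when more threads than elements
        memcpy(dst + ie0, src + ie0, (size_t)(ie1 - ie0)*sizeof(float));
    }
}
```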
Required for the view backward pass: src0 values must be copied to dst, because during the addition we don't touch all dst elements, in contrast to the normal add function.
documentation for vDSP_vdiv states: "Note that B comes before A!"
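For reference, a minimal usage sketch of the Accelerate call (the wrapper function is illustrative; the argument order follows Apple's documentation):

```c
#include <Accelerate/Accelerate.h>

// vDSP_vdiv computes C[i] = A[i] / B[i], but the divisor B is the first argument.
static void div_f32(const float * a, const float * b, float * c, vDSP_Length n) {
    vDSP_vdiv(b, 1, a, 1, c, 1, n);   // note: B (divisor) comes before A (dividend)
}
```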
I got a batched forward function working. With this, a greater number of parallel batches can be trained with ease. This should also be useful for implementing beam search sampling.
Ha, that is interesting. I somehow thought that to support batched inference some changes in
Yes, this is needed for the beam-search decoding in
Edit: now that #1405 has been merged, this PR is now the highest priority for merging. Will look into more details in the following days.
```c
struct ggml_tensor* F08 = ggml_transpose (ctx, F07);
struct ggml_tensor* F09 = ggml_cont      (ctx, F08);
struct ggml_tensor* F10 = ggml_reshape   (ctx, F09, src0->grad);
```
Wow!
Btw, would it make sense to have a `GGML_OP_REPEAT_BACK` that implements this in a kernel?
Maybe if this is some sort of bottleneck. Otherwise, keep it like this.
Yeah, I shuddered myself at this ^^ I think one of the conts is not even necessary.
I also think it would make sense to implement this as an extra operation; it should not be that difficult.
Most of the reshaping etc. is only necessary to use sum_rows, which could be done better in a special operation.
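For the simplest 1-D case, such a kernel could look roughly like this (an illustrative sketch, not an actual `GGML_OP_REPEAT_BACK` implementation): the gradient of the repeated tensor is summed back into the shape of the original tensor.

```c
// grad has n_grad elements, a whole multiple of n_dst; each original element
// receives the sum of the gradients of all of its repeated copies.
static void repeat_back_1d_f32(float * dst, size_t n_dst, const float * grad, size_t n_grad) {
    for (size_t i = 0; i < n_dst; ++i) {
        dst[i] = 0.0f;
    }
    for (size_t i = 0; i < n_grad; ++i) {
        dst[i % n_dst] += grad[i];
    }
}
```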
Hmm, what's wrong with the CI?
Did it check out the wrong branch 🤔
Edit: I see now - it is actually smart enough to merge
@xaedes Is it ok if I merged latest
@ggerganov Sure
clang-tidy made some suggestions
This is a pretty big addition to ggml
- I hope it will open up some applications for training in the future
- The `tests` will be moved to the `ggml` repo after we merge this PR
- The batched forward example is very valuable. Have to apply it to `whisper.cpp` and `bert.cpp`
- After merging this, it would be a good time for some refactoring passes in `ggml.c` to try and reduce code duplication and overall code size

Thanks @xaedes - excellent effort!
@xaedes has there been further progress in this direction since June? In particular, how far are things from being able to train LoRA adapters?
@RonanKMcGovern |
Training a llama directly with ggml would be really nice.
For this I implemented the backward passes required for the llama model, tested them with test_grad0 from the ggml repo, and trained a small llama from scratch to output a sine wave.
Also see the more detailed discussion in ggerganov/ggml#8 (comment)
List of all new operations that I had to add:
Notable other changes:
The performance of various parts of the training could be improved; especially a fast ggml_out_prod could help speed up the matrix multiplication backward pass.
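For reference, a naive sketch of the outer-product accumulation such a kernel performs (illustrative only; the real ggml_out_prod operates on ggml tensors and may differ):

```c
// Accumulate the outer product of vectors a (length m) and b (length n)
// into an m x n row-major matrix: dst[i][j] += a[i] * b[j].
static void out_prod_acc_f32(float * dst, const float * a, int m, const float * b, int n) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            dst[i*n + j] += a[i]*b[j];
        }
    }
}
```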
There are two additional test files: one for testing gradients, taken from the ggml repo, and one small test for testing optimization in general.
Exemplary training of a small llama model is demonstrated in the self-contained baby-llama example, where it is trained to output a sine wave.
Some notes on first training tests:
A parallel batched forward function would probably be a good improvement. Training on multiple examples in a (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since the backward pass is necessary as well.
The batches could be stored in another dimension of the tensors. It would probably just require some reshapes and view operations to make it work.
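A minimal sketch of the idea, assuming token IDs are stored as a 2-D i32 tensor with the batch index in the second dimension (`n_tokens`, `n_batch`, `n_embd`, and `model->tok_embeddings` are hypothetical names, not code from this PR):

```c
// n_tokens tokens per example, n_batch examples processed in one forward pass.
struct ggml_tensor * tokens = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_tokens, n_batch);

// The embedding lookup works on the flattened token tensor ...
struct ggml_tensor * inp = ggml_get_rows(ctx, model->tok_embeddings,
                                         ggml_reshape_1d(ctx, tokens, n_tokens*n_batch));

// ... and the result can be viewed as [n_embd, n_tokens, n_batch] for the rest of the graph.
struct ggml_tensor * inp3d = ggml_reshape_3d(ctx, inp, n_embd, n_tokens, n_batch);
```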
I did not look into training a LoRA finetune yet, but all the necessary machinery for that seems to be working.