Description
Coming from the original PyTorch implementations, people find it increasingly cumbersome to type ggml_ and ctx over and over again. One line of Python can turn into 20 lines of C++. This creates too much friction, and we get lost in the boilerplate instead of seeing the big picture.
I would like to create a header that takes advantage of C++ operator overloading. Eventually, it will include PyTorch and NumPy aliases so that code can be copied and pasted from Python into ggml C++ with only minor fixup.
The new struct will simply wrap the tensor together with its context:

struct ggml_tensor_wrapper { ggml_tensor * data; ggml_context * ctx; };

The goal is to leave the resulting binary unchanged. This will be done by keeping everything header-only and inline, ultimately still calling the ggml_ series of functions.
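As a rough illustration, a minimal sketch of such a header might look like the following. The ggml_ calls are the real C API and the wrapper layout matches the struct above, but everything else, including the choice of operator^ for ggml_mul_mat (taken from the examples below), is illustrative rather than a final design:

#pragma once
#include "ggml.h"

struct ggml_tensor_wrapper {
    ggml_tensor  * data;
    ggml_context * ctx;

    // unary ops forward to the C API, carrying the context along
    ggml_tensor_wrapper norm(float eps) const { return { ggml_norm(ctx, data, eps), ctx }; }
    ggml_tensor_wrapper gelu()          const { return { ggml_gelu(ctx, data),      ctx }; }
};

inline ggml_tensor_wrapper operator+(ggml_tensor_wrapper a, ggml_tensor_wrapper b) {
    return { ggml_add(a.ctx, a.data, b.data), a.ctx };
}

inline ggml_tensor_wrapper operator*(ggml_tensor_wrapper a, ggml_tensor_wrapper b) {
    return { ggml_mul(a.ctx, a.data, b.data), a.ctx };
}

// operator^ repurposed for matrix multiplication
inline ggml_tensor_wrapper operator^(ggml_tensor_wrapper a, ggml_tensor_wrapper b) {
    return { ggml_mul_mat(a.ctx, a.data, b.data), a.ctx };
}

Since every function is inline and merely forwards to the corresponding ggml_ call, an optimizing compiler should emit the same code as the hand-written version. One caveat: operator^ binds more loosely than + and * in C++, which is why the matrix products in the examples below are parenthesized.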
Example 1
Before
// feed-forward network
{
    // norm
    {
        cur = ggml_norm(ctx0, inpFF, hparams.eps);

        // cur = mlp_ln_w*cur + mlp_ln_b
        cur = ggml_add(ctx0,
                ggml_mul(ctx0, cur, layer.mlp_ln_w),
                layer.mlp_ln_b);
    }

    // fully connected
    cur = ggml_mul_mat(ctx0,
            layer.mlp_0_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_0_b);

    // GELU activation
    cur = ggml_gelu(ctx0, cur);

    // projection
    cur = ggml_mul_mat(ctx0,
            layer.mlp_1_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_1_b);
}
After
// feed-forward network
{
    // norm
    {
        cur = inpFF.norm(hparams.eps);
        cur = cur * layer.mlp_ln_w + layer.mlp_ln_b;
    }

    cur = (layer.mlp_0_w ^ cur) + layer.mlp_0_b; // fully connected
    cur = cur.gelu();
    cur = (layer.mlp_1_w ^ cur) + layer.mlp_1_b; // projection
}
Example 2
Before
struct ggml_tensor * aheads_KQs = ggml_reshape_2d(ctx0, KQ_soft_max, KQ_soft_max->ne[0] * KQ_soft_max->ne[1], KQ_soft_max->ne[2]);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_mul_mat(ctx0, wstate.aheads_masks.m[il], aheads_KQs);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_reshape_3d(ctx0, aheads_KQs, KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);
if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = ggml_concat(ctx0, aheads_cross_QKs, aheads_KQs, 2);
}
After
// typedef ggml_tensor_wrapper gg
// .flatten is from PyTorch
// .T is from NumPy. For convenience, tensor.T() = tensor.transpose().cont()
gg aheads_KQs{KQ_soft_max.flatten(0, 1).T()};
aheads_KQs = (wstate.aheads_masks.m[il] ^ aheads_KQs).T();
aheads_KQs = aheads_KQs.reshape(KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);
if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = aheads_cross_QKs.concat(aheads_KQs, 2);
}
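The aliases used above could be sketched as additional methods inside struct ggml_tensor_wrapper. The ggml_ calls are the ones from the Before block; the method names mimic PyTorch/NumPy, and the signatures are assumptions (flatten in particular only handles the (0, 1) case used here):

    // hypothetical alias methods, added inside struct ggml_tensor_wrapper
    ggml_tensor_wrapper transpose() const { return { ggml_transpose(ctx, data), ctx }; }
    ggml_tensor_wrapper cont()      const { return { ggml_cont(ctx, data),      ctx }; }

    // NumPy-style .T, made contiguous for convenience
    ggml_tensor_wrapper T() const { return transpose().cont(); }

    // PyTorch-style flatten(start, end); this sketch assumes (0, 1):
    // merge the two fastest dims of a 3D tensor into one
    ggml_tensor_wrapper flatten(int start, int end) const {
        (void) start; (void) end;
        return { ggml_reshape_2d(ctx, data, data->ne[0]*data->ne[1], data->ne[2]), ctx };
    }

    ggml_tensor_wrapper reshape(int64_t ne0, int64_t ne1, int64_t ne2) const {
        return { ggml_reshape_3d(ctx, data, ne0, ne1, ne2), ctx };
    }

    ggml_tensor_wrapper concat(ggml_tensor_wrapper other, int dim) const {
        return { ggml_concat(ctx, data, other.data, dim), ctx };
    }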
Example 3
I am drowning in boilerplate at https://github.com/mmwillet/TTS.cpp/blob/0b420102d53c16f36ea75e626a3a3d40d7b26a4d/src/kokoro_model.cpp#L1141