fp8 GGUF and GPU accelerated inference support #8780

jim-plus · 2024-07-30T16:46:03Z

How difficult would it be to support conversion to fp8 for GGUF, and to add accelerated GPU support? I have a 4060ti 16gb Lovelace GPU and am interested in leveraging its fp8 support.

charyang-ai · 2024-11-26T02:15:21Z

@jim-plus is there any progress on FP8 support? I am working with an AMD GPU and also need this FP8 feature.

0 replies

teleprint-me · 2024-11-26T04:05:06Z

It's straightforward enough, but the accuracy is worse than int8.

#include <math.h>   // fmaxf, fminf, ldexpf
#include <stdint.h> // uint8_t, uint32_t, int32_t

typedef union {
    float value; /**< Floating-point value */
    uint32_t bits; /**< Raw bit representation */
} Float32;

// Encode a float into an 8-bit floating-point representation
uint8_t encode_float8(float value) {
    if (value == 0.0f) {
        return 0; // Encoded as all zeros
    }

    Float32 encoder = {.value = value};

    // Extract IEEE-754 components
    uint32_t sign = (encoder.bits >> 31) & 0x1;
    uint32_t exponent = (encoder.bits >> 23) & 0xff;
    uint32_t mantissa = encoder.bits & 0x7fffff;

    // Define bias parameters
    uint32_t e_bias_32 = 127;
    uint32_t e_bias_8 = 3;

    // Define exponent limits
    uint32_t e_max = 7;
    uint32_t e_min = 0;

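    // Calculate compressed exponent in signed arithmetic, so that small
    // exponents clamp to e_min instead of wrapping around to e_max
    int32_t e_rebased = (int32_t) exponent - (int32_t) e_bias_32 + (int32_t) e_bias_8;
    if (e_rebased > (int32_t) e_max) e_rebased = (int32_t) e_max;
    if (e_rebased < (int32_t) e_min) e_rebased = (int32_t) e_min;
    int8_t e_compressed = (int8_t) e_rebased;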

    // Calculate compressed mantissa (top 4 bits of the 23-bit mantissa)
    uint8_t m_compressed = (mantissa >> 19) & 0xf;

    // Pack into an 8-bit integer
    return (uint8_t) ((sign << 7) | (e_compressed << 4) | m_compressed);
}

// Decode an 8-bit floating-point representation back to a float
float decode_float8(uint8_t bits) {
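    // All-zero bits are reserved for zero, matching encode_float8
    // (this means 0x00 can no longer represent 2^-3)
    if (bits == 0) {
        return 0.0f;
    }

    // Extract fields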
    uint8_t sign = (bits >> 7) & 0x01;
    uint8_t exponent = (bits >> 4) & 0x07;
    uint8_t mantissa = bits & 0x0F;

    // Define parameters
    uint32_t e_bias_32 = 127;
    uint32_t e_bias_8 = 3;

    // Expand exponent
    int32_t e_expanded = exponent - e_bias_8 + e_bias_32;

    // Expand mantissa with implicit leading 1
    float m_expanded = 1.0f + (mantissa / 16.0f);

    // Reconstruct float
    float result = ldexpf(m_expanded, e_expanded - e_bias_32);
    return sign ? -result : result;
}

I've run multiple experiments comparing int8 and fp8, and int8 always gives the smaller margin of error. int8 is also much faster.

19 replies

teleprint-me Nov 28, 2024

@netrunnereve

You might find this interesting and helpful.

Applied Digital Signal Processing

It helped clarify the concepts for me. I've always been enamored with music, so the audio concepts really helped visualize the math and gave it a more concrete and grounded form of intuition.

Episodes 3, 4, 5 really drive home the concepts we've been discussing here.

Djip007 Nov 28, 2024

> Yes, it's because GGML does not account for the residual (bias) or alpha (squeezing factor).

I don't know what you think the bias and squeezing factor are, but they are widely used in ggml.

#define QK8_0 32
typedef struct {
    ggml_half d;       // delta
    int8_t  qs[QK8_0]; // quants
} block_q8_0;

#define QK8_1 32
typedef struct {
    union {
        struct {
            ggml_half d; // delta
            ggml_half s; // d * sum(qs[i])
        };
        ggml_half2 ds;
    };
    int8_t qs[QK8_1]; // quants
} block_q8_1;

Q8_0 quants use a scale.
Q8_1 quants use bias + scale. Correction: not this one (its extra field s stores d * sum(qs[i])),
but it is the case for the other QN_1 types:

#define QK4_1 32
typedef struct {
    union {
        struct {
            ggml_half d; // delta
            ggml_half m; // min
        };
        ggml_half2 dm;
    };
    uint8_t qs[QK4_1 / 2]; // nibbles / quants
} block_q4_1;

Now keep in mind that most quantized tensors are used in large matmuls. On Mistral_Nemo, for example, these come down to dot products over vectors of up to 14336 elements.
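To make the dot-product point concrete, here is a simplified sketch of a blockwise dot product over Q8_0-style blocks. It is not ggml's actual kernel: `block_q8_sketch` and `dot_q8_sketch` are hypothetical names, and the scale is stored as a plain float rather than `ggml_half`. A vector of 14336 elements would be 448 such blocks.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 32 // block size, matching QK8_0 above

// Simplified stand-in for block_q8_0: a float scale instead of ggml_half.
typedef struct {
    float  d;         // delta (scale)
    int8_t qs[BLOCK]; // quants
} block_q8_sketch;

// Dot product of two quantized rows made of nb blocks each.
// Per block: sum_j (da*qa[j]) * (db*qb[j]) = da*db * sum_j qa[j]*qb[j],
// so the inner loop runs on integers and the scales are applied once.
float dot_q8_sketch(const block_q8_sketch *a, const block_q8_sketch *b, int nb) {
    assert(nb >= 0);

    float sum = 0.0f;
    for (int i = 0; i < nb; i++) {
        int32_t isum = 0;
        for (int j = 0; j < BLOCK; j++) {
            isum += (int32_t) a[i].qs[j] * (int32_t) b[i].qs[j];
        }
        sum += a[i].d * b[i].d * (float) isum;
    }
    return sum;
}
```

Because the scale factors out of each block, the quantization error of a long dot product is an accumulation of many small per-element errors rather than the error of any single weight.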

teleprint-me Nov 28, 2024

It's not complicated.

// reference implementation for deterministic creation of model files
void quantize_row_q8_0_ref(const float * restrict x, block_q8_0 * restrict y, int64_t k) {
    assert(k % QK8_0 == 0); // k must be evenly divisible by the block size
    const int nb = k / QK8_0; // number of blocks in the row

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max

        for (int j = 0; j < QK8_0; j++) {
            const float v = x[i*QK8_0 + j];
            amax = MAX(amax, fabsf(v)); // get the absolute max from the block
        }

        const float d = amax / ((1 << 7) - 1); // delta: absolute max divided by 127, the largest int8 magnitude used
        const float id = d ? 1.0f/d : 0.0f; // inverse of delta, or 0 for an all-zero block

        y[i].d = GGML_FP32_TO_FP16(d); // store delta as a 16-bit half float

        for (int j = 0; j < QK8_0; ++j) {
            const float x0 = x[i*QK8_0 + j]*id; // scales the float down to integer

            y[i].qs[j] = roundf(x0); // sets the quantized value
        }
    }
}

Not once does it account for the bias or the squeezing factor.

FP8 is not capable of accounting for the squeezing factor. You can only do this with an INT8.

If you want more details on what I mean by this, you can look here.

The math proves that the scaling and bias are insufficient for properly approximating the original value. That's why another variable is required.

FP8 is hobbled by allocating bits to the sign, exponent, and mantissa. But you can "bake" that information into an INT8. This increases the range of the integer domain as a result. This in turn outperforms FP8.

If you look at DSP methods, it becomes clear why they use 16-bit and can reduce so much noise from the output signal. That's why the input values are normalized to $[-1, 1]$. It normalizes the range proportionally from 32-bits to 16-bits.

The same concept can be applied to any other range, but introduces a higher standard deviation as a result. That's why you can hear the noise in audio output signals as the bit-width decreases.

Djip007 Nov 28, 2024

void quantize_row_q4_1_ref(const float * restrict x, block_q4_1 * restrict y, int64_t k) {
    const int qk = QK4_1;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        float min = FLT_MAX;
        float max = -FLT_MAX;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];

            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d  = (max - min) / ((1 << 4) - 1);
        // - max-min is what you call the real domain
        // - (1 << 4) - 1 = 15 is what you call the integer domain
        // => d is what you call the squeezing factor
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = GGML_FP32_TO_FP16(d);
        y[i].m = GGML_FP32_TO_FP16(min);  // this is the bias! (which is not defined in your link)

        for (int j = 0; j < qk/2; ++j) {
            const float x0 = (x[i*qk + 0    + j] - min)*id;
            const float x1 = (x[i*qk + qk/2 + j] - min)*id;

            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 0.5f));
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 0.5f));

            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 4;
        }
    }
}
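The reverse direction shows how the scale and the bias are used at reconstruction time. The following is a simplified sketch rather than ggml's own dequantizer: `block_q4_1_sketch` and `dequantize_block_q4_1_sketch` are hypothetical names, and `d` and `m` are plain floats instead of `ggml_half`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32 // block size, matching QK4_1 above

// Simplified stand-in for block_q4_1 with float fields instead of ggml_half.
typedef struct {
    float   d;          // delta (scale)
    float   m;          // min (bias)
    uint8_t qs[QK / 2]; // two 4-bit quants per byte
} block_q4_1_sketch;

// Reconstruct one block: low nibbles hold elements 0..15, high nibbles
// hold elements 16..31, mirroring the packing in the quantizer above.
void dequantize_block_q4_1_sketch(const block_q4_1_sketch *b, float *y) {
    assert(b != NULL && y != NULL);

    for (int j = 0; j < QK / 2; j++) {
        const int q0 = b->qs[j] & 0x0F;
        const int q1 = b->qs[j] >> 4;

        y[j]          = (float) q0 * b->d + b->m; // x ~ q * d + min
        y[j + QK / 2] = (float) q1 * b->d + b->m;
    }
}
```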

(And yes, it looks like Q8_1 does not compute the bias.)

> FP8 is not capable of accounting for the squeezing factor. You can only do this with an INT8.

I don't know why you think it is not possible.

template <int E, int QK>
static inline void conv(const float* x, bloc_fp8<E, QK>* y, int64_t size) {
    const auto qk_size = size / QK;
    for (int64_t q=0; q<qk_size; ++q) {
        float m = 0;
        for (int64_t i=0; i<QK; i++) {
            m = std::max(std::abs(x[q*QK+i]),m); 
        }
        const float D = FP8<E>::MAX()/m;
        y[q].d = m/FP8<E>::MAX();  // what you call squeezing factor
        for (int64_t i=0; i<QK; i++) {
            y[q].qs[i] = x[q*QK+i]*D;
        }
    }
}

teleprint-me Nov 29, 2024

Here are the sources I used to derive the math. I also worked through the applications with GPT while providing GPT the necessary information and context.

If I use 3 bits for the exponent and 4 bits for the mantissa, I've immediately compromised half the bit width and reserved the remaining bit for the sign.

This creates a domain of $2^4$ bits. That's why FP8 is capped at around $\pm 450$. It's a balancing act between range and precision. The upside is that you use a single byte, at the cost of increased complexity (both in hardware and in application). This last part is the efficiency loss. The gain is that the reconstruction process is fairly straightforward and requires no extra metadata. This cost puts it on par with Q4 while operating at near-Q8 precision, depending on the context.

With INT8, you get the full bit width, but at the cost of requiring a scalar and a sign. This leaves you with about $2^7$ bits. The scalar adds an extra 2 bytes, but allows you to store the metadata required for reconstruction. Reconstruction is fairly simple, and the arithmetic is fairly cheap in comparison to floating-point operations. The downside to this gained efficiency is the added 2 bytes, for a total of 1 byte less than simply using a float. The upside is that I can scale back to near single-precision accuracy well past the boundary of the function's domain while experiencing mild clipping in quadrant I.
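The storage arithmetic can be checked with a small sketch. The helper name `bits_per_weight` is hypothetical. It assumes 8-bit values that share one scale per block, as in the `block_q8_0` layout quoted earlier in the thread (32 quants plus a 2-byte delta).

```c
#include <assert.h>
#include <stddef.h>

// Bits of storage per weight when `block` 8-bit values share one scale of
// `scale_bytes` bytes. Use scale_bytes = 0 for a format with no metadata.
double bits_per_weight(size_t block, size_t scale_bytes) {
    assert(block > 0);
    return 8.0 * (double) (block + scale_bytes) / (double) block;
}
```

With `block = 1` and `scale_bytes = 2` this gives 24 bits, the "1 byte less than a float" figure. With `block = 32` the scale is amortized and the cost drops to 8.5 bits per weight, against 8 bits for a byte format with no metadata.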

Int8 is often used in other applications due to requirements for manually adjusting required parameters. But I suppose we'll see a slight shift if hardware manufacturers find value in this added complexity.

No matter which we choose, we create a trade-off between efficiency, precision, and storage.

If we look at the line you claimed was the squeezing factor and inspect it more closely, we'll see that the inverse of delta is actually the step size of the domain.

const float id = d ? 1.0f/d : 0.0f; // 1 / d steps

The squeezing factor "squeezes" the real number into the integer domain. This is not what that is.
