-
@jim-plus, is there any progress on FP8 support? I am working on an AMD GPU and also need this feature.
-
It's straightforward enough, but the accuracy is worse than int8.

    #include <math.h>   // ldexpf
    #include <stdint.h>

    typedef union {
        float value;    /**< Floating-point value */
        uint32_t bits;  /**< Raw bit representation */
    } Float32;

    // Encode a float into an 8-bit floating-point representation
    // (1 sign bit, 3 exponent bits with bias 3, 4 mantissa bits)
    uint8_t encode_float8(float value) {
        if (value == 0.0f) {
            return 0; // Encoded as all zeros
        }
        Float32 encoder = {.value = value};
        // Extract IEEE-754 components
        uint32_t sign = (encoder.bits >> 31) & 0x1;
        int32_t exponent = (int32_t) ((encoder.bits >> 23) & 0xff);
        uint32_t mantissa = encoder.bits & 0x7fffff;
        // Define bias parameters
        const int32_t e_bias_32 = 127;
        const int32_t e_bias_8 = 3;
        // Define exponent limits
        const int32_t e_max = 7;
        const int32_t e_min = 0;
        // Calculate compressed exponent in signed arithmetic, so that small
        // inputs clamp to e_min instead of wrapping around and clamping to e_max
        int32_t e_compressed = exponent - e_bias_32 + e_bias_8;
        if (e_compressed > e_max) e_compressed = e_max;
        if (e_compressed < e_min) e_compressed = e_min;
        // Calculate compressed mantissa (top 4 bits of the 23-bit mantissa, truncated)
        uint8_t m_compressed = (mantissa >> 19) & 0xf;
        // Pack into an 8-bit integer
        return (uint8_t) ((sign << 7) | ((uint32_t) e_compressed << 4) | m_compressed);
    }

    // Decode an 8-bit floating-point representation back to a float
    float decode_float8(uint8_t bits) {
        if (bits == 0) {
            // Matches the all-zeros encoding of 0.0f above; inputs that
            // truncate to exponent 0 and mantissa 0 (such as 0.125f) share it
            return 0.0f;
        }
        // Extract fields
        uint8_t sign = (bits >> 7) & 0x01;
        uint8_t exponent = (bits >> 4) & 0x07;
        uint8_t mantissa = bits & 0x0F;
        // Define parameters (signed, so the subtractions below cannot wrap)
        const int32_t e_bias_32 = 127;
        const int32_t e_bias_8 = 3;
        // Expand exponent
        int32_t e_expanded = exponent - e_bias_8 + e_bias_32;
        // Expand mantissa with implicit leading 1
        float m_expanded = 1.0f + (mantissa / 16.0f);
        // Reconstruct float
        float result = ldexpf(m_expanded, e_expanded - e_bias_32);
        return sign ? -result : result;
    }

I've run multiple experiments comparing int8 and fp8, and int8 always gives the smaller margin of error. int8 is also much faster.
-
How difficult would it be to support conversion to fp8 for GGUF, and to add accelerated GPU support? I have a 4060 Ti 16 GB Lovelace GPU and am interested in leveraging its fp8 support.