Float16 type #3467

rwgardner · 2013-06-20T12:34:42Z

This is a request for support for half-precision floating point numbers (Float16s).

(If there has been any discussion about adding support for these, which I would expect there was, I did not find it.)

Although the precision is low, Float16s are still useful when you have a very large quantity of floating point numbers (which is what we have) and want to reduce memory footprint, cache impact, or disk storage. (Currently, we manually convert our half precision floats with bit manipulations and reinterpretation, but the code would be cleaner if Julia supported them natively.)

Thanks.

nolta · 2013-06-20T17:36:58Z

LLVM 3.1 added support for half floats, so this should be doable. Marking as 'up for grabs'.

StefanKarpinski · 2013-06-20T17:57:47Z

Since this is strictly a storage type, very few operations are needed – mostly conversion to and from larger float types.
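
A minimal sketch of the storage-only usage described here, assuming a current Julia release where `Float16` is a built-in type (it did not exist when this comment was written). The array size and variable names are illustrative only.

```julia
# Assumes modern Julia (1.x) with a built-in Float16; sizes and names are illustrative.
n = 1_000_000
full = rand(Float32, n)        # working-precision data
half = Float16.(full)          # narrow once, for storage only

# Element data takes half the bytes of the Float32 copy.
bytes_half = sizeof(half)
bytes_full = sizeof(full)

# Widen each element back to Float32 before doing arithmetic.
total = sum(Float32, half)
```

The only operations exercised are the narrowing conversion when storing and the widening conversion when reading, which is the small surface this comment refers to.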

timholy · 2013-06-20T21:53:24Z

@rwgardner, my guess is that this will happen sooner if you submit a pull request. ("Up for grabs" is a good choice here, and it basically means "waiting for someone to do it." Since you want the feature...) It's good that you first submitted it as an issue, however, in case there were strong objections; since that doesn't seem to be the case, it looks like the way is clear for you to add this feature.

Some time in the not-too-distant past, support for Int128 was added. Perhaps a good start might be browsing the commit history (with git log and git show) to find out exactly how that was done---it might be a great model for this case.

StefanKarpinski · 2013-06-21T00:05:06Z

Float16 should be substantially easier than Int128. Up for grabs is more like "waiting for someone to do it and pretty nicely isolated and doable by a determined newcomer."

ViralBShah · 2013-06-21T07:44:41Z

The cool thing about Int128 was that it was done fully in Julia. I believe that to get a fast Float16 implementation, one may need to leverage LLVM's Float16 capabilities in intrinsics.cpp and codegen.cpp.

I believe a first cut implementation can be done by leveraging bitshifts and such the way @rwgardner has already done, and it would be nice to receive that as a pull request as a starting point.

rwgardner · 2013-06-21T12:17:17Z

Sounds good. I'm not "grabbing" this yet, but I will if I really want it done. (Unfortunately, I don't get paid to work on Julia for the most part, which means I need to do this in my free time. That's something I'd love to do, but in short, a new first baby due any day has been and will be dominating that for a while.)

ViralBShah · 2013-06-21T12:34:39Z

Is it possible for you to isolate the code that you have already written for Float16 and submit that?

StefanKarpinski · 2013-06-22T15:11:34Z

Outline of what needs to be done:

- Add intrinsics for floating point truncation and extension to and from 16-bit floats
  - can be done either by adding specific intrinsics or generalizing the existing ones
- Add convert methods to/from Float16 and other numeric types
- Add promotion rules for Float16 and other numeric types

@JeffBezanson, any thoughts on whether it's better to add new specific intrinsics (fptrunc16 and fpext32) or generalize the existing ones? I was leaning towards generalizing the existing ones and renaming fptrunc32 => fptrunc and fpext64 => fpext.
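
The convert and promotion items in the outline can be sketched as follows. This assumes modern Julia syntax rather than the 2013 syntax used in this thread, and `MyHalf` is a hypothetical stand-in for the type being added. For brevity the bit-level work is delegated to today's built-in `Float16`, so this shows the shape of the methods and not how the conversions themselves would be implemented.

```julia
# Assumes modern Julia (1.x); `MyHalf` is a hypothetical illustrative type.
primitive type MyHalf <: AbstractFloat 16 end

# Conversions, delegating the bit-level work to the built-in Float16.
MyHalf(x::Float32) = reinterpret(MyHalf, Float16(x))
Base.Float32(x::MyHalf) = Float32(reinterpret(Float16, x))

# Promotion rule: mixing MyHalf with Float32 computes in Float32.
Base.promote_rule(::Type{MyHalf}, ::Type{Float32}) = Float32

x = MyHalf(1.5f0)
y = x + 2.0f0      # both operands are promoted to Float32 before the addition
```

With only these three definitions, mixed arithmetic falls through to the generic `Number` fallback, which promotes and then adds in `Float32`. No arithmetic is defined on `MyHalf` itself.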

ghost · 2013-06-22T15:19:06Z

If rwgardner is alright with it, I can try implementing Float16. I've wanted to find a way to get my hands dirty in Julia.

ViralBShah · 2013-06-22T15:22:07Z

@mattgallivan Please jump in. The more the merrier. @StefanKarpinski's outline is basically what needs to be done, and one can follow the Float32 implementation in src and base.

StefanKarpinski · 2013-06-22T17:37:48Z

Just to expand on what I mean by "generalizing the existing ones", this means turning the fptrunc and fpext intrinsics into versions that aren't specific to bit sizes but use type info to figure out the appropriate sizes and call the corresponding LLVM instructions. We've gradually been moving from specific versions with bit sizes in their names to more generic ones.
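
To illustrate the idea at the user level only, here is a sketch in modern Julia of size-agnostic wrappers where the target type is passed as an argument. The names `HardwareFloat`, `fp_truncate`, and `fp_extend` are hypothetical; these are not the compiler intrinsics, which live in the C++ sources.

```julia
# Hypothetical helpers; names are illustrative and these are not the actual intrinsics.
const HardwareFloat = Union{Float16, Float32, Float64}

# Narrowing: the target type T carries the size, so no bit width appears in the name.
function fp_truncate(::Type{T}, x::HardwareFloat) where {T<:HardwareFloat}
    sizeof(T) < sizeof(x) || throw(ArgumentError("$T is not narrower than $(typeof(x))"))
    return convert(T, x)
end

# Widening: same idea in the other direction.
function fp_extend(::Type{T}, x::HardwareFloat) where {T<:HardwareFloat}
    sizeof(T) > sizeof(x) || throw(ArgumentError("$T is not wider than $(typeof(x))"))
    return convert(T, x)
end
```

A call such as `fp_truncate(Float16, 1.5f0)` or `fp_extend(Float64, Float16(1.5))` then covers every pair of sizes with one name per direction.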

ViralBShah · 2013-06-22T17:47:33Z

The Int stuff already does that and it would be nice to do so with FloatingPoint too. I wonder if we should take this opportunity to also add Float128 at the same time, assuming LLVM supports it.

Keno · 2013-06-22T18:11:08Z

Since there is no hardware support for quad-precision arithmetic, adding Float128 is quite a bit more complicated.

StefanKarpinski · 2013-06-22T19:22:53Z

Yeah, that's a whole different can of worms. You actually want to compute with Float128 or it's completely useless. For Float16, it's fine to just be able to store them.

rwgardner · 2013-06-24T14:00:04Z

@mattgallivan all sounds good. I would love to contribute and would have a lot of fun doing it, but my life is about as insane as it's ever been right now. Hopefully I can contribute in other ways in the future.

You may not want this (I'm sure it could be written more efficiently, etc., and you may want to do it in Fortran or C), but here's what I have. It also hasn't been heavily validated yet, but you might use it for validation by comparing it to your code. I haven't done any conversion back to Float16.

bitstype 16 MyFloat16

function convert(::Type{Float32}, val::MyFloat16)
    val = uint32(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    ret::Uint32

    if exp == 0
        if sig == 0
            sign = sign << 31
            ret = sign | exp | sig
        else
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 31
            exp = (-14 - n_bit + 127) << 23
            sig = ((sig & (~bit)) << n_bit) << (23 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            if sign == 0
                ret = 0x7f800000
            else
                ret = 0xff800000
            end
        else
            ret = 0xffffffff
        end
    else
        sign = sign << 31
        exp  = (exp - 15 + 127) << 23
        sig  = sig << (23 - 10)
        ret = sign | exp | sig
    end
    return reinterpret(Float32, ret)
end

function convert(::Type{Float64}, val::MyFloat16)
    val = uint64(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    ret::Uint64

    if exp == 0
        if sig == 0
            sign = sign << 63
            ret = sign | exp | sig
        else
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 63
            exp = (-14 - n_bit + 1023) << 52
            sig = ((sig & (~bit)) << n_bit) << (52 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            if sign == 0
                ret = 0x7ff0000000000000
            else
                ret = 0xfff0000000000000
            end
        else
            ret = 0xffffffffffffffff
        end
    else
        sign = sign << 63
        exp  = (exp - 15 + 1023) << 52
        sig  = sig << (52 - 10)
        ret = sign | exp | sig
    end

    return reinterpret(Float64, ret)
end

We could convert to only Float32 or Float64 and then use existing code to convert between those. It seems more efficient to convert to/from both directly in most cases, but it may not be on some architectures, partly depending on whether there is hardware support for converting between Float32 and Float64. (I don't know if that's something floating point units typically support or not.)
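
The validation idea mentioned above can be sketched in modern Julia by checking a bit-manipulation widening against the built-in conversion for every 16-bit pattern. This is a port under assumptions, not the code as posted: `half_bits_to_float32` is a hypothetical name, the built-in `Float16` used as the reference did not exist at the time, the subnormal loop is replaced by `leading_zeros`, and NaNs keep their sign and payload instead of all mapping to `0xffffffff`.

```julia
# Port to modern Julia (1.x); `half_bits_to_float32` is a hypothetical name.
function half_bits_to_float32(h::UInt16)
    sign = UInt32(h & 0x8000) << 16      # sign bit moved from bit 15 to bit 31
    exp  = UInt32(h & 0x7c00) >> 10      # 5-bit biased exponent
    sig  = UInt32(h & 0x03ff)            # 10-bit significand
    if exp == 0 && sig == 0
        bits = sign                                      # signed zero
    elseif exp == 0
        # Subnormal: shift the leading 1 up to the implicit-bit position (bit 10).
        shift = leading_zeros(sig) - 21
        bits = sign | (UInt32(113 - shift) << 23) | (((sig << shift) & 0x03ff) << 13)
    elseif exp == 0x1f
        bits = sign | 0x7f800000 | (sig << 13)           # Inf (sig == 0) or NaN
    else
        bits = sign | ((exp + UInt32(112)) << 23) | (sig << 13)   # rebias 15 -> 127
    end
    return reinterpret(Float32, bits)
end

# Exhaustive comparison against the built-in conversion over all 65536 bit patterns.
# isequal treats any two NaNs as equal and distinguishes -0.0 from 0.0.
all_match = all(0x0000:0xffff) do h
    isequal(half_bits_to_float32(h), Float32(reinterpret(Float16, h)))
end
```

Because the input space is only 16 bits wide, an exhaustive check like this is cheap, and the same loop could be pointed at a narrowing conversion once one exists.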

ViralBShah · 2013-06-27T17:30:58Z

@StefanKarpinski Would it be good to start off with this as a pure Julia implementation and get it in base to begin with?

ViralBShah · 2013-07-16T09:41:41Z

Until the LLVM bug is sorted out, it may be worthwhile to put @rwgardner's Julia implementation in Base. That way, at least the storage format can be used, and the conversions could be potentially faster when the LLVM issue is fixed.

@loladiro Does LLVM 3.3 fix the Float16 bugs?

StefanKarpinski · 2013-07-16T14:41:21Z

Even using @rwgardner's conversions, the following patch unfortunately still causes LLVM failures:

https://gist.github.com/StefanKarpinski/9092d04bc24c44493d08

julia> float16(1.5)
LLVM ERROR: Cannot select: 0x104151b10: ch = store 0x102070910, 0x10421df10, 0x104231d10, 0x10434d410<ST2[%14]> [ORD=77165] [ID=35]
  0x10421df10: f16,ch = load 0x10434dc10, 0x102070010, 0x10434d410<LD2[FixedStack0]> [ORD=77156] [ID=27]
    0x102070010: i64 = FrameIndex<0> [ORD=77155] [ID=4]
    0x10434d410: i64 = undef [ORD=77150] [ID=2]
  0x104231d10: i64 = add 0x104233910, 0x1041a7810 [ORD=77163] [ID=33]
    0x104233910: i64,ch,glue = CopyFromReg 0x104087a10, 0x104088010, 0x104087a10:1 [ORD=77157] [ID=32]
      0x104088010: i64 = Register %RAX [ORD=77157] [ID=10]
      0x104087a10: ch,glue = callseq_end 0x10434da10, 0x104264310, 0x104264310, 0x10434da10:1 [ORD=77157] [ID=31]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x10434da10: ch,glue = X86ISD::CALL 0x104279410, 0x104232910, 0x104085410, 0x10417a710, 0x104279410:1 [ORD=77157] [ID=30]
          0x104232910: i64 = X86ISD::Wrapper 0x104085310 [ID=16]

Keno · 2013-07-16T15:31:07Z

You'll still want to leave in the disable in the compiler, otherwise LLVM will generate bad code. LLVM 3.3 does not fix this.

JeffBezanson · 2013-07-16T16:01:59Z

Yes, with this implementation no compiler changes are needed; it's just a 16-bit bitstype.

StefanKarpinski · 2013-07-16T17:51:31Z

Ok, if someone wants to finish this, I'm away for the day.

ViralBShah · 2013-08-02T18:41:05Z

Bump.

Keno · 2013-08-02T18:42:17Z

@StefanKarpinski do you just want to apply your patch?

StefanKarpinski · 2013-08-03T20:00:34Z

I don't think just applying the patch works. There were a bunch of changes it needed in order to work.

ViralBShah · 2013-08-14T07:02:52Z

It would be nice to have a nicer show() method for float16. Asking the question here in case this was done by design.

julia> float16(100.25)
Float16(0x5644)

StefanKarpinski · 2013-08-14T14:30:48Z

Printing 16-bit floats correctly and minimally is quite non-trivial. Our 32-bit and 64-bit float printing are handled by the double-conversion library which does not support 16-bit floats. It might be possible to figure out a hack that approximates correct minimal Float16 printing using the printing routines for Float32, but it's not obvious how.
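
One brute-force way to approximate "minimal" output, sketched in modern Julia: try increasing numbers of significant decimal digits until the rounded value converts back to the same Float16. This is not the approach taken in this thread and not what the double-conversion library does; `shortest_sigdigits` is a hypothetical helper, and the upper bound of 5 digits is an assumption based on binary16 having 11 bits of precision.

```julia
# Brute-force sketch, assuming modern Julia (1.x); `shortest_sigdigits` is hypothetical.
# Returns the smallest number of significant decimal digits that still
# round-trips to the same Float16, together with the rounded value.
function shortest_sigdigits(x::Float16)
    isfinite(x) || return (0, Float64(x))    # NaN and Inf have no digits to choose
    for d in 1:5                             # 5 digits are assumed to suffice for binary16
        r = round(Float64(x); sigdigits = d)
        Float16(r) == x && return (d, r)
    end
    return (5, Float64(x))                   # fallback; not expected to be reached
end
```

The rounded value could then be handed to an existing wider-float printing routine. Whether that always yields the truly shortest correct string is not established here.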

ViralBShah · 2013-08-14T17:00:24Z

I wonder what is going on here:

julia> a = float16(rand(5,5))
5x5 Float16 Array:
 0.445801  0.154785  0.431641   0.384521  0.188354 
 0.4646    0.281006  0.766602   0.563965  0.0402222
 0.685059  0.92627   0.921875   0.933594  0.468994 
 0.841797  0.582031  0.0185242  0.481934  0.151367 
 0.348877  0.952637  0.672852   0.864746  0.166138

JeffBezanson · 2013-08-14T21:16:26Z

Float16 printing has several problems right now, e.g.

julia> print_shortest(STDOUT,NaN16)
NaN32

(plus NaN16 does not work properly)
I'm about to commit some fixes.

showcompact has a fallback definition that is printing the Float16s in that array by converting them to Float64. The question is whether we should print the f0 suffix. For now I'll say that is specific to Float32, and leave it off.

Keno mentioned this issue Jun 29, 2013

WIP: Generalize float intrinsics #3580

Closed

Keno closed this as completed in ea1f3b2 Aug 7, 2013

jiahao mentioned this issue Nov 28, 2013

Full Support for IEEE 754-2008, ISO/IEC TR 18037, ISO/IEC 10967, ISO/IEC 11404 compliance and implementation #4965

Closed

jiahao mentioned this issue Dec 11, 2014

support fp128 and maybe quad-double #757

Closed
