Grisu printing for floating-point numbers is now the standard.
This means that what we print for Float64s will *always* eval
to the same floating-point value we originally had. Moreover,
the printed form uses the least number of digits possible to
accomplish this for any floating-point value.

There are still some unresolved issues around 32-bit floats.
See issue #335.
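
As a rough illustration of the two properties claimed above (round-tripping and minimal digit count), here is a minimal Python sketch. It is not the Julia code from this commit. It relies on the fact that Python's `repr` for floats also produces a shortest round-tripping string, though with a different algorithm than Grisu. The helper names `fixed17` and `fewest_digits` are made up for this example, and the brute-force search is only a demonstration for ordinary finite values.

```python
def fixed17(x: float) -> str:
    """Print with a fixed 17 significant digits (always enough for a Float64)."""
    return f"{x:.17g}"


def fewest_digits(x: float) -> int:
    """Brute-force search for the smallest digit count that still round-trips.

    Illustration only: tries 1..17 significant digits and returns the first
    count whose printed form parses back to exactly x (finite x assumed).
    """
    for ndig in range(1, 18):
        if float(f"{x:.{ndig}g}") == x:
            return ndig
    return 17


x = 0.1
print(fixed17(x))          # fixed-width form carries digits nobody asked for
print(repr(x))             # shortest form that still evaluates to the same value
print(float(repr(x)) == x)
```

The fixed 17-digit form and the shortest form both evaluate to the same value; the shortest form simply drops digits that are not needed to identify it.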
StefanKarpinski committed Jan 12, 2012
1 parent 1cfeaf6 commit d37b9df
Showing 3 changed files with 54 additions and 20 deletions.
58 changes: 52 additions & 6 deletions j/grisu.j
@@ -30,15 +30,17 @@ grisu(x::Real, n::Integer) = n >= 0 ? grisu(float64(x), GRISU_PRECISION, int32(n
grisu(float64(x), GRISU_FIXED, -int32(n))

# normal:
# 0 <= pt < n ####.#### n+1
# pt < 0 .000######## n-pt+1
# 0 < pt < n ####.#### n+1
# pt <= 0 .000######## n-pt+1
# n <= pt (dot) ########000. pt+1
# n <= pt (no dot) ########000 pt
# exponential:
# pt < 0 ########e-### n+k+2
# 0 <= pt ########e### n+k+1
# pt <= 0 ########e-### n+k+2
# 0 < pt ########e### n+k+1

function print_shortest(x::Real, dot::Bool)
if isnan(x); return print("NaN"); end
if isinf(x); return print(x < 0 ? "-Inf" : "Inf"); end
sign, digits, pt = grisu(x)
n = length(digits)
if sign
@@ -51,7 +53,7 @@ function print_shortest(x::Real, dot::Bool)
print(digits)
print('e')
print(e)
elseif pt < 0
elseif pt <= 0
# => .000########
print('.')
while pt < 0
@@ -69,10 +71,54 @@ function print_shortest(x::Real, dot::Bool)
if dot
print('.')
end
else # 0 <= pt <= n
else # => ####.####
print(digits[1:pt])
print('.')
print(digits[pt+1:])
end
end
print_shortest(x::Real) = print_shortest(x, false)

function show(x::Float)
if isnan(x); return print("NaN"); end
if isinf(x); return print(x < 0 ? "-Inf" : "Inf"); end
sign, digits, pt = grisu(x)
n = length(digits)
if sign
print('-')
end
if pt <= -4 || pt > 6 # .00001 to 100000.
# => #.#######e###
print(digits[1])
print('.')
if n > 1
print(digits[2:])
else
print('0')
end
print('e')
print(pt-1)
elseif pt <= 0
# => 0.00########
print("0.")
while pt < 0
print('0')
pt += 1
end
print(digits)
elseif pt >= n
# => ########00.0
print(digits)
while pt > n
print('0')
n += 1
end
print(".0")
else # => ####.####
print(digits[1:pt])
print('.')
print(digits[pt+1:])
end
end

showcompact(x::Float) = print(float32(x))
3 changes: 2 additions & 1 deletion j/libc.j
@@ -2,7 +2,8 @@

## time-related functions ##

sleep(s::Real) = ccall(dlsym(libc, :usleep), Uint32, (Uint32,), uint32(iround(s*1e6)))
# TODO: check for usleep errors?
sleep(s::Real) = ccall(dlsym(libc, :usleep), Void, (Uint32,), uint32(iround(s*1e6)))

strftime(t) = strftime("%c", t)
function strftime(fmt::ByteString, t)
13 changes: 0 additions & 13 deletions j/show.j
@@ -17,19 +17,6 @@ function show_trailing_hex(n::Uint64, ndig::Integer)
end
show(n::Unsigned) = (print("0x"); show_trailing_hex(uint64(n),sizeof(n)<<1))

show_float64(f::Float64, ndig) =
ccall(:jl_show_float, Void, (Float64, Int32), f, int32(ndig))

show(f::Float64) = show_float64(f, 17)
show(f::Float32) = show_float64(float64(f), 9)

num2str(f::Float, ndig) = print_to_string(show_float64, float64(f), ndig)
num2str(f::Float) = show_to_string(f)
num2str(n::Integer) = dec(n)

showcompact(f::Float64) = show_float64(f, 8)
showcompact(f::Float32) = show_float64(float64(f), 8)

show{T}(p::Ptr{T}) =
print(is(T,None) ? "Ptr{Void}" : typeof(p), " @0x$(hex(unsigned(p), WORD_SIZE>>2))")


10 comments on commit d37b9df

@JeffBezanson
Member

Coolness.
Why both print_shortest and show(Float)? Looks like there is duplicated code between the two. There should be one core routine that takes all options and show calls that with defaults.

@StefanKarpinski
Member Author

I thought I replied to this, but the reply appears to have gone missing. Basically, the two methods are similar, but not similar enough to make it easy to merge the two. Note, in particular, that print_shortest does not use proper scientific notation, but rather always uses an integer mantissa followed by an exponent. That's because this "improper" form takes one less character on account of not having a . in it. Also, we want these methods to be very fast for printing lots and lots of numbers, especially print_shortest, since it's likely to be used to print huge volumes of data into CSV or TSV files. For that purpose, I suspect the grisu algorithm will be very effective at giving good compression. I'm guessing that grisu-formatted, bzipped CSV/TSV files will actually be a very compact data representation for a lot of real-world data, which tends to be more "special" when written in decimal than in binary.
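
To make the difference between the two layouts concrete, here is a minimal Python sketch. It is not the Julia code from the diff. `decompose` is a hypothetical stand-in for the `grisu(x)` call: it derives the `(sign, digits, pt)` triple from Python's `repr` via `decimal.Decimal`, for finite inputs only. `show_style` mirrors the branch structure of `show(x::Float)` in the diff. `integer_mantissa_exp` shows only the integer-mantissa exponential form; the exponent `pt - n` is an assumption inferred from that form, since the relevant lines of `print_shortest` are not visible in the diff.

```python
from decimal import Decimal


def decompose(x: float):
    """Stand-in for grisu(x): returns (negative, digits, pt) for finite x.

    The value equals 0.<digits> * 10**pt, with trailing zeros dropped.
    """
    sign, digs, exp = Decimal(repr(x)).as_tuple()
    digits = "".join(map(str, digs))
    pt = len(digits) + exp          # position of the decimal point
    digits = digits.rstrip("0")
    if not digits:                  # zero: an arbitrary choice for this sketch
        return bool(sign), "0", 1
    return bool(sign), digits, pt


def show_style(x: float) -> str:
    """Layout following the branches of show(x::Float) in the diff."""
    neg, digits, pt = decompose(x)
    n = len(digits)
    out = "-" if neg else ""
    if pt <= -4 or pt > 6:          # proper scientific notation: #.####e###
        frac = digits[1:] if n > 1 else "0"
        return out + digits[0] + "." + frac + "e" + str(pt - 1)
    elif pt <= 0:                   # 0.00####
        return out + "0." + "0" * (-pt) + digits
    elif pt >= n:                   # ####00.0
        return out + digits + "0" * (pt - n) + ".0"
    else:                           # ####.####
        return out + digits[:pt] + "." + digits[pt:]


def integer_mantissa_exp(x: float) -> str:
    """Integer mantissa followed by an exponent, with no '.' (assumed exponent pt - n)."""
    neg, digits, pt = decompose(x)
    return ("-" if neg else "") + digits + "e" + str(pt - len(digits))


print(show_style(1.5e-7))            # proper scientific notation
print(integer_mantissa_exp(1.5e-7))  # one character shorter, same value
```

Both strings parse back to the same float; the integer-mantissa form is shorter only because it omits the decimal point.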

@JeffBezanson
Member

This already saves so many digits that I don't feel the need to play games to save an extra byte.

We need to find out if performance tweaking on our end is helpful, or if all the time is already inside the grisu library. By merging the routines, we can avoid the ASCIIString and tuple allocation. There is also some array copying in the print(digits[1:pt]) types of operations.

It would also be nice to have the num2str function back.

@JeffBezanson
Member

I did a quick test and this is only about 2.6x slower than printing an integer. I'm quite pleased with that, considering the extra objects and indexing and such involved. Still might make sense to try to max out performance though.

In related news, it may be time for the built-in integer printing methods (based on printf) to die:

julia> @time for i=1:1000000; string(1); end
elapsed time: 1.094998836517334 seconds

julia> @time for i=1:1000000; dec(1); end
elapsed time: 0.2381420135498047 seconds
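
For readers unfamiliar with what a natively implemented integer printer does, here is a minimal Python sketch of the usual divide-by-ten approach. How Julia's `dec` was implemented at the time is not shown in this thread, so this is an assumption about the general technique rather than a port, and the name `dec` is reused only for illustration.

```python
def dec(n: int) -> str:
    """Convert an integer to its decimal string by repeated division by 10."""
    if n == 0:
        return "0"
    neg = n < 0
    n = abs(n)
    out = bytearray()
    while n:
        n, d = divmod(n, 10)
        out.append(48 + d)      # 48 is ASCII '0'; digits arrive least significant first
    if neg:
        out.append(ord("-"))
    out.reverse()               # put the most significant digit first
    return out.decode("ascii")


print(dec(1234))
```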

@StefanKarpinski
Member Author

I think there's some value to being able to generate the absolute tersest human-readable ASCII representation of data. Saving a byte for every value adds up for large amounts of data. I'm going to leave it. Are you talking about merging all of these into a single function that contains the call to the grisu library? If so, that's a non-starter because I'm going to have to write a fair number of these to implement printf formatting and there's no way they're all going to be one huge function with a bazillion options. Once it all works, I can figure out how to avoid the ASCIIString allocation and substring copy. The most obvious thing would be for all of them to share a byte buffer and just have the core call return how many digits there are (which is what it actually returns). On the printing side, I need to expose a write method for printing part of a byte array without having to create a subarray copy.
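
The shared-buffer idea described above can be sketched in Python as follows. This is only an illustration under assumptions: `fill_digits` is a hypothetical stand-in for a core call that writes digits into a caller-supplied buffer and returns how many there are, and `write_plain` handles just the `####.####` case. The `memoryview` slices play the role of writing part of a byte array without creating a copy.

```python
import io
from decimal import Decimal


def fill_digits(x: float, buf: bytearray):
    """Hypothetical core call: write the digits of finite x into buf.

    Returns (negative, n, pt), where n is the number of digits written and
    the value equals 0.<digits> * 10**pt.
    """
    sign, digs, exp = Decimal(repr(x)).as_tuple()
    n = len(digs)
    pt = n + exp
    while n > 1 and digs[n - 1] == 0:   # drop trailing zeros
        n -= 1
    for i in range(n):
        buf[i] = 48 + digs[i]           # ASCII digit
    return bool(sign), n, pt


def write_plain(out, x: float, buf: bytearray) -> None:
    """Write x as ####.#### using slices of the shared buffer, without copying."""
    neg, n, pt = fill_digits(x, buf)
    if not 0 < pt < n:
        raise ValueError("this sketch only handles the ####.#### case")
    view = memoryview(buf)              # slicing a memoryview does not copy bytes
    if neg:
        out.write(b"-")
    out.write(view[:pt])
    out.write(b".")
    out.write(view[pt:n])


scratch = bytearray(32)                 # one buffer reused for every number
stream = io.BytesIO()
for value in (1234.5, -3.25):
    write_plain(stream, value, scratch)
    stream.write(b"\n")
print(stream.getvalue())
```

The formatter allocates no per-number string or tuple of digits; each call overwrites the same scratch buffer.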

@StefanKarpinski
Member Author

Woah. That's awesome. So the Julia-implemented integer printing is that much faster? Very cool. That only leaves bootstrapping concerns. It would really, really suck to not be able to print integers or floats before bootstrapping. Floats, sure, but integers, that's brutal. It's already hard enough to debug changes in, say, range.j.

@JeffBezanson
Member

We can deal with that using the debug code block at the top of sysimg.j. We can put in a definition that does a ccall to print an integer. Then when our bootstrapping process improves we can just delete that whole block, and its functionality will be provided by a prior version of the julia library.

@StefanKarpinski
Member Author

Ok, that's fair enough. Moving more things from C to Julia (and gaining performance) is always good.

@JeffBezanson
Member

Oh, we have write(stream, pointer(arr, start), len). But this reminds me we need a write method for SubArray.

@StefanKarpinski
Member Author

Perfect, I can use that.
