Unhandled sentinel value for len in compression causes invalid Array dimensions #435
are they readable by the
They are indeed!
can you share a small sample file? if not, can you tell us what
I don't know what causes the writer to select the uncompressed option, and it did not happen in the simple sample files I created. I can keep trying if it is important. I don't understand what is meant by "what pyarrow report in terms of tyoe". If you give me the command that produces the report, I will see if I can send it.
I managed to produce a file which triggers the problem:
julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow")
ERROR: TaskFailedException
nested task error: ArgumentError: invalid Array dimensions
Stacktrace:
[1] Array
@ .\boot.jl:477 [inlined]
[2] uncompress(ptr::Ptr{UInt8}, buffer::Arrow.Flatbuf.Buffer, compression::Arrow.Flatbuf.BodyCompression)
@ Arrow \.julia\dev\Arrow\src\table.jl:529
[3] buildbitmap(batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, nodeidx::Int64, bufferidx::Int64)
@ Arrow \.julia\dev\Arrow\src\table.jl:512
[4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Int, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:683
[5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:498
[6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
@ Arrow \.julia\dev\Arrow\src\table.jl:474
[7] iterate
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:471 [inlined]
[8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
@ Base .\abstractarray.jl:946
[9] _collect
@ .\array.jl:713 [inlined]
[10] collect
@ .\array.jl:707 [inlined]
[11] macro expansion
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:376 [inlined]
[12] (::Arrow.var"#108#114"{Bool, Channel{Any}, WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch, Int64})()
@ Arrow .\threadingconstructs.jl:341
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base .\task.jl:445
[2] macro expansion
@ .\task.jl:477 [inlined]
[3] Arrow.Table(blobs::Vector{Arrow.ArrowBlob}; convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:321
[4] Table
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:295 [inlined]
[5] #Table#98
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
[6] Table
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
[7] Arrow.Table(input::String)
@ Arrow \.julia\dev\Arrow\src\table.jl:290
[8] top-level scope
@ REPL[27]:1
With #436:
julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow") |> DataFrame
102×15 DataFrame
Row │ isA intkey primitiveIntkey doublekey booleanKey numberkey primitiveNumberkey stringkey objectkey arrayKey NrofSamples Max Min Sum SqrSum
│ Int32 Int32 Int32 Float64 Bool Float64 Float64 String String String Int32 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 1 2 3.0 true 4.0 5.0 6 StringObject{string='7'} [I@4dd6fd0a 2 100.0 10.0 110.0 10100.0
2 │ 0 10 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
3 │ 0 11 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
4 │ 0 12 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
5 │ 0 13 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
6 │ 0 14 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
7 │ 0 15 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
8 │ 0 16 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
9 │ 0 17 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
10 │ 0 18 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
11 │ 0 19 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
12 │ 0 20 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
13 │ 0 21 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
14 │ 0 22 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
Loads with pyarrow out of the box:
julia> pywith(pyarrow.ipc.open_file("c:/temp\\arrowtest\\test/test.arrow")) do reader
reader.read_pandas()
end
Python DataFrame:
isA intkey primitiveIntkey doublekey booleanKey ... NrofSamples Max Min Sum SqrSum
0 0 1 2 3.0 True ... 2 100.0 10.0 110.0 10100.0
1 0 10 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
2 0 11 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
3 0 12 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
4 0 13 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
.. ... ... ... ... ... ... ... ... ... ... ...
97 0 106 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
98 0 107 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
99 0 108 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
100 0 109 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
101 1 10 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
[102 rows x 15 columns]
Fix #435
Hipshot PR from the GitHub API. Not sure how to add tests for this if needed. It reads my files correctly at least :)
I'm generating a bunch of Arrow files from the apache java implementation and many of them are not readable by Arrow.jl (but they are readable by the java implementation).
When following the java decoding process in the debugger, it seems that both implementations agree up to the following line in the java implementation:
https://github.com/apache/arrow/blob/febd0ff144cfb8b2baffb1cb0be57ca40dc7cc77/java/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L72-L75
It seems like length == -1 is some kind of sentinel value for no compression (maybe the compressor gave up or something?) which does not seem to be handled in the corresponding function in Arrow.jl:
arrow-julia/src/table.jl
Lines 521 to 524 in e893c32
I have verified that Arrow.jl does indeed read len = -1, which in turn causes the "invalid Array dimensions" error when creating the decodedbytes vector.
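
For illustration, here is a minimal sketch of the length-prefix handling described above, written in Python so that it is self-contained. It assumes each compressed buffer begins with an 8-byte little-endian signed length, and that a value of -1 means the remaining bytes are stored uncompressed, as the linked Java code suggests. The names `decode_buffer` and `decompress` are hypothetical, and `zlib` stands in for the real LZ4/ZSTD codec only so the example runs with the standard library. This is not the code from Arrow.jl or from #436.

```python
import struct
import zlib  # stand-in codec for this demo only; not what Arrow IPC actually uses

UNCOMPRESSED_SENTINEL = -1  # length prefix meaning "the body follows as-is"


def decode_buffer(raw: bytes, decompress=zlib.decompress) -> bytes:
    """Decode one body buffer that carries an 8-byte length prefix.

    `decompress` is a hypothetical hook: any callable mapping compressed
    bytes to uncompressed bytes.
    """
    if len(raw) == 0:
        # An empty buffer has no prefix and nothing to decode.
        return b""
    if len(raw) < 8:
        raise ValueError("buffer too short to hold the 8-byte length prefix")

    # Signed 64-bit little-endian: reading it as unsigned would hide the -1.
    (length,) = struct.unpack_from("<q", raw, 0)
    body = raw[8:]

    if length == UNCOMPRESSED_SENTINEL:
        # The writer skipped compression for this buffer; return the body
        # instead of trying to allocate an array of size -1.
        return bytes(body)
    if length < 0:
        raise ValueError(f"unexpected negative length prefix: {length}")

    decoded = decompress(body)
    if len(decoded) != length:
        raise ValueError("decoded size does not match the length prefix")
    return decoded
```

The point of the sketch is the order of the checks: the sentinel is tested before the prefix is used as an allocation size, which is the step that appears to be missing in the failing code path.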