Unhandled sentinel value for len in compression causes invalid Array dimensions #435

Closed
DrChainsaw opened this issue May 4, 2023 · 5 comments · Fixed by #436

@DrChainsaw
Contributor

I'm generating a bunch of Arrow files with the Apache Java implementation, and many of them are not readable by Arrow.jl (although they are readable by the Java implementation).

When following the Java decoding process in the debugger, both implementations seem to agree up to the following lines in the Java implementation:
https://github.com/apache/arrow/blob/febd0ff144cfb8b2baffb1cb0be57ca40dc7cc77/java/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L72-L75

It seems that length == -1 is a sentinel value meaning the buffer was not compressed (maybe the compressor gave up or something?). This case does not seem to be handled in the corresponding function in Arrow.jl:

arrow-julia/src/table.jl

Lines 521 to 524 in e893c32

len = unsafe_load(convert(Ptr{Int64}, ptr))
ptr += 8 # skip past uncompressed length as Int64
encodedbytes = unsafe_wrap(Array, ptr, buffer.length - 8)
decodedbytes = Vector{UInt8}(undef, len)

I have verified that Arrow.jl does indeed read len = -1, which in turn causes the "invalid Array dimensions" error when creating the decodedbytes vector.
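For illustration, the missing branch can be sketched outside Arrow.jl. The Arrow IPC format documents an uncompressed length of -1 as meaning that the bytes following the 8-byte prefix are stored uncompressed. The Python sketch below is not the Arrow.jl code and not the change made in #436: `decode_buffer` and its `decompress` parameter are hypothetical names, and zlib is only a stand-in for the LZ4 frame and ZSTD codecs that Arrow actually uses.

```python
import struct
import zlib
from typing import Callable

# Length prefix value that marks a buffer body as stored without compression.
UNCOMPRESSED_SENTINEL = -1


def decode_buffer(raw: bytes, decompress: Callable[[bytes], bytes]) -> bytes:
    """Decode one buffer laid out as <int64 little-endian length><body>.

    `decompress` is a hypothetical hook for the codec (zlib in the demo below).
    """
    if len(raw) < 8:
        raise ValueError("buffer too short to hold the 8-byte length prefix")
    (length,) = struct.unpack_from("<q", raw, 0)
    body = raw[8:]  # skip past the uncompressed length
    if length == UNCOMPRESSED_SENTINEL:
        # The writer stored the raw bytes, so there is nothing to decompress.
        return body
    if length < 0:
        raise ValueError(f"invalid uncompressed length: {length}")
    decoded = decompress(body)
    if len(decoded) != length:
        raise ValueError("decoded size does not match the length prefix")
    return decoded


# Demo with zlib as a stand-in codec (an assumption, not an Arrow codec).
payload = b"arrow" * 10
compressed_buffer = struct.pack("<q", len(payload)) + zlib.compress(payload)
raw_buffer = struct.pack("<q", UNCOMPRESSED_SENTINEL) + payload
print(decode_buffer(compressed_buffer, zlib.decompress) == payload)
print(decode_buffer(raw_buffer, zlib.decompress) == payload)
```

Checking the sentinel before allocating the output is the key point: allocating a vector of length -1 is exactly what fails in the snippet quoted above.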

@Moelf
Contributor

Moelf commented May 4, 2023

but they are readable by the java implementation

Are they readable by pyarrow?

@DrChainsaw
Contributor Author

Are they readable by pyarrow?

They are indeed!

@Moelf
Contributor

Moelf commented May 4, 2023

Can you share a small sample file? If not, can you tell us what pyarrow reports in terms of type?

@DrChainsaw
Contributor Author

I don't know what causes the writer to select the uncompressed option, and it did not happen in the simple sample files I created. I can keep trying if it is important.

I don't understand what is meant by "what pyarrow reports in terms of type". If you give me the command that produces the report, I'll see if I can send it.
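On what makes a writer pick the uncompressed option: one plausible trigger, stated here as an assumption rather than something confirmed in this thread, is that the compressed bytes come out larger than the input, in which case the raw bytes are stored behind a -1 length prefix. The Python sketch below illustrates that rule only; `encode_buffer` is a hypothetical name and zlib is a stand-in codec, not one that Arrow uses.

```python
import struct
import zlib
from typing import Callable


def encode_buffer(data: bytes, compress: Callable[[bytes], bytes]) -> bytes:
    """Write <int64 little-endian length><body>, falling back to raw bytes.

    Assumed rule: if compression makes the buffer larger, store the data
    uncompressed and signal that with a length prefix of -1.
    """
    compressed = compress(data)
    if len(compressed) > len(data):
        # Compression did not help: keep the raw bytes behind the sentinel.
        return struct.pack("<q", -1) + data
    return struct.pack("<q", len(data)) + compressed


# Tiny inputs tend to grow under compression; repetitive ones shrink.
tiny = encode_buffer(b"ab", zlib.compress)
repetitive = encode_buffer(b"a" * 1000, zlib.compress)
print(struct.unpack_from("<q", tiny)[0], struct.unpack_from("<q", repetitive)[0])
```

Under this assumption, whether a given file contains a -1 prefix depends on how well each individual buffer compresses, not on any writer setting.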

@DrChainsaw
Contributor Author

I managed to produce a file which triggers the problem:
test.zip

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow")
ERROR: TaskFailedException

    nested task error: ArgumentError: invalid Array dimensions
    Stacktrace:
      [1] Array
        @ .\boot.jl:477 [inlined]
      [2] uncompress(ptr::Ptr{UInt8}, buffer::Arrow.Flatbuf.Buffer, compression::Arrow.Flatbuf.BodyCompression)
        @ Arrow \.julia\dev\Arrow\src\table.jl:529
      [3] buildbitmap(batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, nodeidx::Int64, bufferidx::Int64)
        @ Arrow \.julia\dev\Arrow\src\table.jl:512
      [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Int, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
        @ Arrow \.julia\dev\Arrow\src\table.jl:683
      [5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
        @ Arrow \.julia\dev\Arrow\src\table.jl:498
      [6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
        @ Arrow \.julia\dev\Arrow\src\table.jl:474
      [7] iterate
        @ \.julia\packages\Arrow\rYdxZ\src\table.jl:471 [inlined]
      [8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
        @ Base .\abstractarray.jl:946
      [9] _collect
        @ .\array.jl:713 [inlined]
     [10] collect
        @ .\array.jl:707 [inlined]
     [11] macro expansion
        @ \.julia\packages\Arrow\rYdxZ\src\table.jl:376 [inlined]
     [12] (::Arrow.var"#108#114"{Bool, Channel{Any}, WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch, Int64})()
        @ Arrow .\threadingconstructs.jl:341
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:445
 [2] macro expansion
   @ .\task.jl:477 [inlined]
 [3] Arrow.Table(blobs::Vector{Arrow.ArrowBlob}; convert::Bool)
   @ Arrow \.julia\dev\Arrow\src\table.jl:321
 [4] Table
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:295 [inlined]
 [5] #Table#98
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
 [6] Table
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
 [7] Arrow.Table(input::String)
   @ Arrow \.julia\dev\Arrow\src\table.jl:290
 [8] top-level scope
   @ REPL[27]:1

With #436 applied, the file loads:

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow") |> DataFrame
102×15 DataFrame
 Row │ isA    intkey  primitiveIntkey  doublekey  booleanKey  numberkey  primitiveNumberkey  stringkey  objectkey                  arrayKey     NrofSamples  Max      Min      Sum      SqrSum  
     │ Int32  Int32   Int32            Float64    Bool        Float64    Float64             String     String                     String       Int32        Float64  Float64  Float64  Float64 
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     0       1                2        3.0        true        4.0                 5.0  6          StringObject{string='7'}   [I@4dd6fd0a            2    100.0     10.0    110.0  10100.0
   2 │     0      10               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   3 │     0      11               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   4 │     0      12               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   5 │     0      13               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   6 │     0      14               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   7 │     0      15               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   8 │     0      16               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   9 │     0      17               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  10 │     0      18               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  11 │     0      19               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  12 │     0      20               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  13 │     0      21               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  14 │     0      22               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0

It loads with pyarrow out of the box:

julia> pywith(pyarrow.ipc.open_file("c:/temp\\arrowtest\\test/test.arrow")) do reader
       reader.read_pandas()
       end
Python DataFrame:
     isA  intkey  primitiveIntkey  doublekey  booleanKey  ...  NrofSamples    Max    Min    Sum   SqrSum
0      0       1                2        3.0        True  ...            2  100.0   10.0  110.0  10100.0
1      0      10               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
2      0      11               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
3      0      12               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
4      0      13               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
..   ...     ...              ...        ...         ...  ...          ...    ...    ...    ...      ...
97     0     106               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
98     0     107               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
99     0     108               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
100    0     109               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
101    1      10               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0

[102 rows x 15 columns]

quinnj pushed a commit that referenced this issue May 22, 2023
Fix #435 

Hipshot PR from the github API. Not sure how to add tests for this if
needed. It reads my files correctly at least :)