Skip to content

avoid defining a one arg hash since it has some invalidation issues#3516

Merged
bkamins merged 3 commits intomainfrom
kc/hash
Oct 17, 2025
Merged

avoid defining a one arg hash since it has some invalidation issues#3516
bkamins merged 3 commits intomainfrom
kc/hash

Conversation

@KristofferC
Copy link
Copy Markdown
Member

On 1.12 the following script:

❯ cat repro.jl 
using DataFrames
using CSV
write("temp.csv", """
Username; Identifier;First name;Last name
booker12;9012;Rachel;Booker
grey07;2070;Laura;Grey
johnson81;4081;Craig;Johnson
jenkins46;9346;Mary;Jenkins
smith79;5079;Jamie;Smith
""")
@time @eval CSV.File("temp.csv")

takes

  1.516691 seconds (10.49 M allocations: 551.201 MiB, 8.69% gc time, 99.93% compilation time: 92% of which was recompilation)

and with this PR it takes

  0.136307 seconds (396.16 k allocations: 20.289 MiB, 14.90% gc time, 99.19% compilation time)

@bkamins
Copy link
Copy Markdown
Member

bkamins commented Oct 16, 2025

If I remember the idea correctly (@nalimilan worked on it) - we wante to avoid the ⊻ h for speed (and probably this is the reason why tests fail). Maybe just removing ⊻ h is fine (we should never call this method passing h explicitely as OnRowCol is internal).

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Oct 16, 2025
@bkamins bkamins added this to the patch milestone Oct 16, 2025
@KristofferC
Copy link
Copy Markdown
Member Author

KristofferC commented Oct 16, 2025

and probably this is the reason why tests fail

The reason was just that the tests had an explicit method error check for that method.

Edit: Okay there are other test failures on nightly

@KristofferC
Copy link
Copy Markdown
Member Author

we wante to avoid the ⊻ h for speed

The zero default for h in hash will be constant propagated and the ⊻ h is optimized away:

julia> code_llvm(Base.hash, Tuple{DataFrames.OnColRow})
; Function Signature: hash(DataFrames.OnColRow{T} where T)
;  @ hashing.jl:28 within `hash`
define i64 @julia_hash_8877(ptr noundef nonnull %"x::<unknown type>") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
;  @ hashing.jl:28 within `hash` @ /Users/kc/CSV_slow/dev/DataFrames/src/join/core.jl:36
; ┌ @ Base_compiler.jl:54 within `getproperty`
   store ptr %"x::<unknown type>", ptr %jlcallframe1, align 8
   %0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
   store ptr @"jl_sym#h#8879.jit", ptr %0, align 8
   %jl_f_getfield_ret = call nonnull ptr @jl_f_getfield(ptr null, ptr nonnull %jlcallframe1, i32 2)
   store ptr %"x::<unknown type>", ptr %jlcallframe1, align 8
   store ptr @"jl_sym#row#8880.jit", ptr %0, align 8
   %jl_f_getfield_ret1 = call nonnull ptr @jl_f_getfield(ptr null, ptr nonnull %jlcallframe1, i32 2)
; └
; ┌ @ essentials.jl:920 within `getindex`
   %memoryref_data = load ptr, ptr %jl_f_getfield_ret, align 8
   %jl_f_getfield_ret1.unbox2 = load i64, ptr %jl_f_getfield_ret1, align 8
   %memoryref_offset = shl i64 %jl_f_getfield_ret1.unbox2, 3
   %1 = getelementptr i8, ptr %memoryref_data, i64 %memoryref_offset
   %memoryref_data3 = getelementptr i8, ptr %1, i64 -8
   %2 = load i64, ptr %memoryref_data3, align 8
; └
; ┌ @ int.jl:379 within `xor`
   ret i64 %2
; └
}

even if it wasn't optimized away I am 99% sure it's performance impact would be unmeasurable.

@KristofferC
Copy link
Copy Markdown
Member Author

Based on the nightly failures I removed the xor with h (as suggested).

@adienes
Copy link
Copy Markdown
Contributor

adienes commented Oct 16, 2025

The zero default for h in hash will be constant propagated

(the default is no longer zero) but should still be constant propagated

@KristofferC
Copy link
Copy Markdown
Member Author

There is some weird assumption made in this hashing code because there shouldn't be any real reason why you couldn't mix in the h in there like normally done. I just want to get the invalidation fixed first and might look into that later.

@KristofferC
Copy link
Copy Markdown
Member Author

Nightly error is JuliaLang/julia#59857

@bkamins
Copy link
Copy Markdown
Member

bkamins commented Oct 16, 2025

The idea of @nalimilan was that ocr1.h is a vector that holds precomputed hashes, see https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/core.jl#L40.

@adienes
Copy link
Copy Markdown
Contributor

adienes commented Oct 16, 2025

I kind of feel like this should not be extending Base.hash at all.

Base.hash(ocr1::OnColRow, h::UInt) = throw(MethodError(hash, (ocr1, h)))
@inline Base.hash(ocr1::OnColRow) = @inbounds ocr1.h[ocr1.row]

as here hash(::OnColRow) is UB if _prehash has not been called. maybe it can be just a local _hash ?

@bkamins
Copy link
Copy Markdown
Member

bkamins commented Oct 16, 2025

maybe it can be just a local _hash ?

This was my first thought how we should refactor this.
Then we need to define _hash(x) = x as a default and have special _hash for this object. And next change the code where we call hash when doing joining.

@KristofferC
Copy link
Copy Markdown
Member Author

I suggest getting this in, in order to fix the immediate (quite severe) invalidation, and then further discussing if this needs to be a Base.hash method at all.

@bkamins bkamins requested a review from nalimilan October 17, 2025 05:02
Copy link
Copy Markdown
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can you please bump the patch version of DataFrames.jl in this PR, so that we can make a patch release? Thank you. (I will also wait some time for @nalimilan to have a look at this)

@KristofferC
Copy link
Copy Markdown
Member Author

Done

Copy link
Copy Markdown
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this @KristofferC. Looks OK as a quick fix, though refactoring to call a custom _hash function is probably a good idea to avoid defining a hash method which ignores its second argument.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@KristofferC
Copy link
Copy Markdown
Member Author

Two approvals here and CI is ok (nightly is a parsing regression). Good to go?

@bkamins bkamins merged commit 00cd7d0 into main Oct 17, 2025
7 of 8 checks passed
@bkamins bkamins deleted the kc/hash branch October 17, 2025 14:57
@bkamins
Copy link
Copy Markdown
Member

bkamins commented Oct 17, 2025

Thank you!

@nalimilan
Copy link
Copy Markdown
Member

I had a quick look at the code and I'm actually not sure we can avoid using hash, as we work with Dicts (is that right @bkamins?). Maybe the current version is the best we can do. It's not super problematic as these are internal types anyway. If @assert was disabled in production we could add @assert h == 0, but currently that's not the case (we need something like JuliaLang/julia#53404).

@adienes
Copy link
Copy Markdown
Contributor

adienes commented Oct 19, 2025

# should give the same hash as AbstractVector{T}
function hashrows_col!(h::Vector{UInt},

is that an important invariant? because I don't think it is satisfied

@nalimilan
Copy link
Copy Markdown
Member

is that an important invariant? because I don't think it is satisfied

That comment isn't written in the most explicit way. What it means is that the result of this method (used for PooledArray and similar) must be equivalent to that of the fallback function above it. Is that how you understood it? This is extensively tested so the invariant most likely holds. The logic is relatively simple anyway.

@bkamins
Copy link
Copy Markdown
Member

bkamins commented Oct 19, 2025

Maybe the current version is the best we can do.

I was also checking this and yes - it seems we would need to replace Dict with some custom alternative. The problem with h=0 assert is that this changes across Julia versions, see https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/core.jl#L43 (and that is why we have this condition in _prehash).

@adienes
Copy link
Copy Markdown
Contributor

adienes commented Oct 19, 2025

the performance difference of mixing in one more hash value seems really very minimal. something like

function Base.hash(ocr1::OnColRow, h::UInt)
    if isempty(ocr1.h)
        result = Base.tuplehash_seed
        for col in reverse(ocr1.cols)
            result = hash(@inbounds(col[ocr1.row]), result)
        end
        return hash(result, h)
    else
        return hash(@inbounds(ocr1.h[ocr1.row]), h)
    end
end

both avoids safety concerns of calling hash before prehash and allows any second h argument. some local / small benchmarks suggest performance parity with the status quo

Is that how you understood it?

ah my bad, I thought it was referring to Base.hash(::AbstractVector)

@nalimilan
Copy link
Copy Markdown
Member

Definitely work trying if performance seems acceptable. The function could almost call _prehash when isempty(ocr1.h). But that requires systematic benchmarking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ecosystem Issues in DataFrames.jl ecosystem

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants