avoid defining a one arg hash since it has some invalidation issues by KristofferC · Pull Request #3516 · JuliaData/DataFrames.jl

KristofferC · 2025-10-16T14:28:48Z

On 1.12 the following script:

❯ cat repro.jl 
using DataFrames
using CSV
write("temp.csv", """
Username; Identifier;First name;Last name
booker12;9012;Rachel;Booker
grey07;2070;Laura;Grey
johnson81;4081;Craig;Johnson
jenkins46;9346;Mary;Jenkins
smith79;5079;Jamie;Smith
""")
@time @eval CSV.File("temp.csv")

takes

  1.516691 seconds (10.49 M allocations: 551.201 MiB, 8.69% gc time, 99.93% compilation time: 92% of which was recompilation)

and with this PR it takes

  0.136307 seconds (396.16 k allocations: 20.289 MiB, 14.90% gc time, 99.19% compilation time)

bkamins · 2025-10-16T17:53:02Z

If I remember the idea correctly (@nalimilan worked on it) - we wanted to avoid the ⊻ h for speed (and probably this is the reason why tests fail). Maybe just removing ⊻ h is fine (we should never call this method passing h explicitly, as OnColRow is internal).

KristofferC · 2025-10-16T17:59:10Z

and probably this is the reason why tests fail

The reason was just that the tests had an explicit method error check for that method.

Edit: Okay there are other test failures on nightly

KristofferC · 2025-10-16T18:03:06Z

we wanted to avoid the ⊻ h for speed

The zero default for h in hash will be constant propagated and the ⊻ h is optimized away:

julia> code_llvm(Base.hash, Tuple{DataFrames.OnColRow})
; Function Signature: hash(DataFrames.OnColRow{T} where T)
;  @ hashing.jl:28 within `hash`
define i64 @julia_hash_8877(ptr noundef nonnull %"x::<unknown type>") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
;  @ hashing.jl:28 within `hash` @ /Users/kc/CSV_slow/dev/DataFrames/src/join/core.jl:36
; ┌ @ Base_compiler.jl:54 within `getproperty`
   store ptr %"x::<unknown type>", ptr %jlcallframe1, align 8
   %0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
   store ptr @"jl_sym#h#8879.jit", ptr %0, align 8
   %jl_f_getfield_ret = call nonnull ptr @jl_f_getfield(ptr null, ptr nonnull %jlcallframe1, i32 2)
   store ptr %"x::<unknown type>", ptr %jlcallframe1, align 8
   store ptr @"jl_sym#row#8880.jit", ptr %0, align 8
   %jl_f_getfield_ret1 = call nonnull ptr @jl_f_getfield(ptr null, ptr nonnull %jlcallframe1, i32 2)
; └
; ┌ @ essentials.jl:920 within `getindex`
   %memoryref_data = load ptr, ptr %jl_f_getfield_ret, align 8
   %jl_f_getfield_ret1.unbox2 = load i64, ptr %jl_f_getfield_ret1, align 8
   %memoryref_offset = shl i64 %jl_f_getfield_ret1.unbox2, 3
   %1 = getelementptr i8, ptr %memoryref_data, i64 %memoryref_offset
   %memoryref_data3 = getelementptr i8, ptr %1, i64 -8
   %2 = load i64, ptr %memoryref_data3, align 8
; └
; ┌ @ int.jl:379 within `xor`
   ret i64 %2
; └
}

Even if it weren't optimized away, I am 99% sure its performance impact would be unmeasurable.

KristofferC · 2025-10-16T18:39:05Z

Based on the nightly failures I removed the xor with h (as suggested).

adienes · 2025-10-16T19:31:10Z

The zero default for h in hash will be constant propagated

(the default is no longer zero) but should still be constant propagated

KristofferC · 2025-10-16T19:32:43Z

There is some weird assumption made in this hashing code because there shouldn't be any real reason why you couldn't mix in the h in there like normally done. I just want to get the invalidation fixed first and might look into that later.

KristofferC · 2025-10-16T19:33:11Z

Nightly error is JuliaLang/julia#59857

bkamins · 2025-10-16T19:36:35Z

The idea of @nalimilan was that ocr1.h is a vector that holds precomputed hashes, see https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/core.jl#L40.

adienes · 2025-10-16T19:46:21Z

I kind of feel like this should not be extending Base.hash at all.

Base.hash(ocr1::OnColRow, h::UInt) = throw(MethodError(hash, (ocr1, h)))
@inline Base.hash(ocr1::OnColRow) = @inbounds ocr1.h[ocr1.row]

as here hash(::OnColRow) is UB if _prehash has not been called. maybe it can be just a local _hash ?

bkamins · 2025-10-16T19:49:14Z

maybe it can be just a local _hash ?

This was my first thought on how we should refactor this.
Then we need to define _hash(x) = x as a default and have a special _hash method for this object. After that, we change the code that calls hash when joining.
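
The refactor described above could be sketched roughly as follows. This is a toy stand-in, not the DataFrames.jl code: `ToyRowKey`, its fields, and the example values are hypothetical. It also assumes a fallback that forwards to `hash(x)` rather than returning `x` as written above, so that ordinary values still yield a `UInt`.

```julia
# Toy stand-in for a row key that carries precomputed hashes (hypothetical type).
struct ToyRowKey
    h::Vector{UInt}   # one precomputed hash per row
    row::Int          # the row this key refers to
end

# Package-local hashing entry point. No method is added to Base.hash,
# so nothing in Base's method table changes when this code is loaded.
_hash(x) = hash(x)                  # assumed fallback for ordinary values
_hash(k::ToyRowKey) = k.h[k.row]    # special case: reuse the precomputed hash

hashes = UInt[hash("a"), hash("b")]
key = ToyRowKey(hashes, 2)
```

Callers inside the package would then use `_hash(key)` wherever they currently call `hash`. Whether that is possible depends on who calls `hash`, for instance a `Dict` always goes through `Base.hash`.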

KristofferC · 2025-10-16T21:02:13Z

I suggest getting this in, in order to fix the immediate (quite severe) invalidation, and then further discussing if this needs to be a Base.hash method at all.

bkamins

OK. Can you please bump the patch version of DataFrames.jl in this PR, so that we can make a patch release? Thank you. (I will also wait some time for @nalimilan to have a look at this)

KristofferC · 2025-10-17T07:10:33Z

Done

nalimilan

Thanks for spotting this @KristofferC. Looks OK as a quick fix, though refactoring to call a custom _hash function is probably a good idea to avoid defining a hash method which ignores its second argument.

src/join/core.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

KristofferC · 2025-10-17T14:52:33Z

Two approvals here and CI is ok (nightly is a parsing regression). Good to go?

bkamins · 2025-10-17T14:57:44Z

Thank you!

nalimilan · 2025-10-19T10:38:19Z

I had a quick look at the code and I'm actually not sure we can avoid using hash, as we work with Dicts (is that right @bkamins?). Maybe the current version is the best we can do. It's not super problematic as these are internal types anyway. If @assert was disabled in production we could add @assert h == 0, but currently that's not the case (we need something like JuliaLang/julia#53404).
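
To illustrate why a `Dict` ties this to `Base.hash`, here is a minimal sketch with a toy key type. `ToyKey` and its fields are hypothetical, and the two-argument method that ignores its seed is only an assumption about the general shape of the quick fix, not the actual DataFrames.jl source.

```julia
# Toy key: points at one row of a column and carries precomputed hashes.
struct ToyKey
    vals::Vector{String}    # column values
    hashes::Vector{UInt}    # precomputed hash of each value
    row::Int
end

# Only the two-argument method is defined; the seed `h` is deliberately ignored.
# Dict calls the one-argument `hash(key)`, which Base forwards to this method.
Base.hash(k::ToyKey, h::UInt) = k.hashes[k.row]
Base.isequal(a::ToyKey, b::ToyKey) = isequal(a.vals[a.row], b.vals[b.row])

vals = ["x", "y", "x"]
pre = map(hash, vals)

# Count how many rows share each value, using the toy keys in a Dict.
d = Dict{ToyKey,Int}()
for row in eachindex(vals)
    key = ToyKey(vals, pre, row)
    d[key] = get(d, key, 0) + 1
end
```

Rows 1 and 3 hold equal values, so they collapse into a single entry. The lookup works without any one-argument `hash` method being defined for the key type.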

adienes · 2025-10-19T13:26:16Z

# should give the same hash as AbstractVector{T}
function hashrows_col!(h::Vector{UInt},

is that an important invariant? because I don't think it is satisfied

nalimilan · 2025-10-19T13:52:22Z

is that an important invariant? because I don't think it is satisfied

That comment isn't written in the most explicit way. What it means is that the result of this method (used for PooledArray and similar) must be equivalent to that of the fallback function above it. Is that how you understood it? This is extensively tested so the invariant most likely holds. The logic is relatively simple anyway.

bkamins · 2025-10-19T13:59:54Z

Maybe the current version is the best we can do.

I was also checking this and yes - it seems we would need to replace Dict with some custom alternative. The problem with h=0 assert is that this changes across Julia versions, see https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/core.jl#L43 (and that is why we have this condition in _prehash).

adienes · 2025-10-19T15:01:39Z

the performance difference of mixing in one more hash value seems really very minimal. something like

function Base.hash(ocr1::OnColRow, h::UInt)
    if isempty(ocr1.h)
        result = Base.tuplehash_seed
        for col in reverse(ocr1.cols)
            result = hash(@inbounds(col[ocr1.row]), result)
        end
        return hash(result, h)
    else
        return hash(@inbounds(ocr1.h[ocr1.row]), h)
    end
end

both avoids safety concerns of calling hash before prehash and allows any second h argument. some local / small benchmarks suggest performance parity with the status quo
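
The property claimed here can be checked on a toy type. The sketch below is a simplified, single-column variant of the idea with hypothetical names (`MixKey`, `vals`, `hashes`); it is not the proposed DataFrames.jl method and says nothing about its performance.

```julia
# Toy key whose precomputed hashes may be missing (empty vector).
struct MixKey
    vals::Vector{String}
    hashes::Vector{UInt}   # empty if nothing was precomputed
    row::Int
end

function Base.hash(k::MixKey, h::UInt)
    # Use the precomputed hash when available, otherwise compute it on the fly.
    base = isempty(k.hashes) ? hash(k.vals[k.row]) : k.hashes[k.row]
    return hash(base, h)   # mix in the caller's seed, as hash methods usually do
end

vals = ["x", "y"]
with_pre = MixKey(vals, map(hash, vals), 1)
without_pre = MixKey(vals, UInt[], 1)
```

With this shape, a key hashes to the same value for a given seed whether or not the precomputation step has run, so calling `hash` early is safe rather than undefined.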

Is that how you understood it?

ah my bad, I thought it was referring to Base.hash(::AbstractVector)

nalimilan · 2025-10-19T15:26:14Z

Definitely worth trying if performance seems acceptable. The function could almost call _prehash when isempty(ocr1.h). But that requires systematic benchmarking.

KristofferC mentioned this pull request Oct 16, 2025

1.12 - 2x+ TTFX Regressions JuliaLang/julia#59856

Closed

bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Oct 16, 2025

KristofferC force-pushed the kc/hash branch from c6c1175 to d27bc69 Compare October 16, 2025 17:53

bkamins added this to the patch milestone Oct 16, 2025

avoid defining a one arg hash since it has some invalidation issues

04ee4ac

KristofferC force-pushed the kc/hash branch from d27bc69 to 04ee4ac Compare October 16, 2025 18:38

bkamins requested a review from nalimilan October 17, 2025 05:02

bkamins approved these changes Oct 17, 2025

bump patch version

973ae75

nalimilan approved these changes Oct 17, 2025

Update src/join/core.jl

6e421b1

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit 00cd7d0 into main Oct 17, 2025
7 of 8 checks passed

bkamins deleted the kc/hash branch October 17, 2025 14:57

adienes mentioned this pull request Nov 22, 2025

Eliminate (almost) all invalidations by disabling world splitting JuliaLang/julia#59091

Draft
