Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Nov 3, 2025

Fix SourceTextComparer to use content-based equality consistently

Problem

SourceTextComparer violates the IEqualityComparer contract:

  • Equals uses content-based comparison (via ContentEquals)
  • GetHashCode uses checksum which includes encoding and original bytes
  • This means two SourceText instances with identical content but different encodings/BOMs can be equal but have different hash codes

Solution

Change GetHashCode to use SourceText.GetContentHash() instead of the checksum-based approach. This makes both methods strictly content-based.

Changes

  • SourceTextComparer.cs: Changed GetHashCode() to use GetContentHash() with MemoryMarshal.Read<int> to efficiently read the first 4 bytes of the 16-byte hash as the hash code
  • SourceTextTests.cs: Added 7 tests validating the contract with different encodings, BOMs, and creation methods (all with WorkItem attributes referencing issue SourceTextComparer has odd IEqualityComparer semantics #71006)

Testing Results

  • ✅ All 7 comparer tests pass
  • ✅ All 228 text-related tests continue to pass
  • ✅ No security issues detected by CodeQL

This also fixes SyntaxTreeComparer and improves SourceTextValueProvider caching behavior in analyzers and incremental generators.

Original prompt

This section details on the original issue you should resolve

<issue_title>SourceTextComparer has odd IEqualityComparer semantics</issue_title>
<issue_description>The Equals method is using the content, whereas the GetHashCode method is using the contents and encoding. This mismatch seems like it could cause issues downstream.</issue_description>

Comments on the Issue (you are @copilot in this section)

@CyrusNajmabadi Yes. THere are definite issues here. This means you can generate two source-text instances that have the following behavior:
  1. Each produces a different hash code for SourceTextComparer.GetHashCode
  2. They return .Equals for SourceTextComparer.Equals

This violates the contract of IEqualityComparer.

The way to do this is to create the source-texts from byte[]s or streams, where one as a BOM and the other doesn't (but otherwise the same content). For both, just pass the Encoding.Utf8 encoding. This will produce source-texts with equal content, but different hashes (since the hashes use the 'checksum' of the ST, which uses the original bytes to compute itself).

There are two solutions here:

  1. SourceTextComparer.GetHashCode should use SourceText.GetContentHash to produce the final hashcode (and that alone). THis will being .GetHashCode in line with .Equals. This would make it so that SourceTextComparer has a strictly content based view of source-texts. Two source texts with the same content would be considered the same, regardless of encoding, hash-algorithm, or originating bytes.
  2. SourceTextComparer.Equals should use the Encoding and Checksum to determine equality. This will bring .Equals in line with .GetHashCode. This will make it so that SourceTextComparer has a strictly originating based view of source-texts. So for two source texts to be the same, they would have had to come from the same original bytes as well.

I don't acually know which is preferred for the use cases where SOurceTextComparer is used (looks like it's in incremental generators for example).

--

My preference is that we take teh former view of things (a content-based view). It seems to me that if something is producing a source-text with the same content that that should be considered a no-op change for something like incremental-generators.

Anyways, this is onto compiler team to decide what to do. I'm happy to share insights here though if there are any questions.</comment_new>
<comment_new>@CyrusNajmabadi
Actually, just realized that for incremental-generators, we won't use GetHashCode at all. It's just going to use .Equals. So that shouldn't change.

The only place i can see using the GetHashCode part of this is SourceTextValueProvider (which is used to cache data against a source-text within an analyzer).

But here's the crazy thing. The STVP uses a CWT to cache data against the source-text (which is my expectation). So this IEqualityComaprer is not used for that purpose either.

So i legit cannot tell what purpose this serves at all. I think we likely can remove this type.</comment_new>
<comment_new>@CyrusNajmabadi
@mavasani @chsienki to look at this. I'm fairly certain this guy can be removed. And in the case of SOurceTExts, we should be using identity semantics for analyzers, and for incremental-generators, we use content-equality and we throw on GetHashCode (since that should never be called). </comment_new>
<comment_new>@CyrusNajmabadi
Works for me. Going with content-based seems like the sensible approach.</comment_new>
<comment_new>@chsienki
Content-based seems sensible. FWIW I don't think this is used at all for generators. A couple of the fuzz tests use it but that's it. </comment_new>


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 2 commits November 3, 2025 20:38
Changed GetHashCode to use GetContentHash() instead of GetChecksum(),
aligning it with the content-based Equals method. This ensures the
IEqualityComparer contract is satisfied: two SourceText instances with
the same content but different encodings/BOMs will now have the same
hash code.

Added comprehensive tests to validate the fix.

Co-authored-by: CyrusNajmabadi <[email protected]>
Replaced LINQ Concat with manual byte array creation to avoid
potential ambiguity with custom extension methods.

Co-authored-by: CyrusNajmabadi <[email protected]>
Copilot AI changed the title [WIP] Fix IEqualityComparer semantics in SourceTextComparer Fix SourceTextComparer IEqualityComparer contract violation Nov 3, 2025
Copilot AI requested a review from CyrusNajmabadi November 3, 2025 20:45
Copilot finished work on behalf of CyrusNajmabadi November 3, 2025 20:45

namespace Microsoft.CodeAnalysis.UnitTests.Text
{
public class SourceTextComparerTests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot just move these into SourceTextTests.cs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 38aa364. Moved all tests into SourceTextTests.cs and deleted the separate file.

Hash.Combine(contentsHash,
Hash.Combine(encodingHash, ((int)obj.ChecksumAlgorithm).GetHashCode())));
var contentHash = obj.GetContentHash();
return !contentHash.IsDefault ? Hash.CombineValues(contentHash) : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot You won't need to check for .IsDefault. Also, just use MemoryMarshal.TryRead(contentHash.AsSpan(), ...) to just read out the first 4 bytes of the already strongly hashed bytes into the int you need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 97acc72. Removed the IsDefault check and switched to MemoryMarshal.TryRead to read the first 4 bytes directly as an int.

Moved all SourceTextComparer tests from the separate
SourceTextComparerTests.cs file into SourceTextTests.cs and deleted
the separate test file. Test method names prefixed with "Comparer_"
to distinguish them as comparer-specific tests.

Co-authored-by: CyrusNajmabadi <[email protected]>
Simplified GetHashCode to use MemoryMarshal.TryRead to directly read
the first 4 bytes of the content hash as an int, removing the
unnecessary IsDefault check and Hash.CombineValues call.

Co-authored-by: CyrusNajmabadi <[email protected]>
Copilot finished work on behalf of CyrusNajmabadi November 3, 2025 21:05
Assert.Equal("span", ex.ParamName);
}

[Fact]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot update this to [Fact, WorkItem("")] with the write url in the workitem for what is being fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fc7e4c0. Added WorkItem attributes with the issue URL to all 7 comparer tests.

Hash.Combine(encodingHash, ((int)obj.ChecksumAlgorithm).GetHashCode())));
var contentHash = obj.GetContentHash();
MemoryMarshal.TryRead(contentHash.AsSpan(), out int hashCode);
return hashCode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot actually, can just be MemoryMarshal.Read<int>(... Add comment stating that the content hash is 16 bytes, so reading off the first 4 bytes is always safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c8af757. Changed to MemoryMarshal.Read<int> and added a comment explaining that the 16-byte hash makes reading the first 4 bytes safe.

Added WorkItem attributes referencing issue #71006 to all 7
SourceTextComparer tests to document which issue they validate.

Co-authored-by: CyrusNajmabadi <[email protected]>
Changed from MemoryMarshal.TryRead to MemoryMarshal.Read<int> and
added a comment explaining that GetContentHash returns a 16-byte hash,
making it safe to read the first 4 bytes as an int.

Co-authored-by: CyrusNajmabadi <[email protected]>
Copilot finished work on behalf of CyrusNajmabadi November 3, 2025 21:18
@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review November 3, 2025 21:28
Copy link
Member

@RikkiGibson RikkiGibson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, comment is not blocking.

// Create source texts from byte arrays where one has a BOM and the other doesn't
const string content = "Test content";

var utf8WithBOM = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this flag only controls the return value of GetPreamble(). You could consider deleting the UTF8Encoding locals and just using Encoding.UTF8 in this test.

@CyrusNajmabadi CyrusNajmabadi merged commit 3898dec into main Nov 6, 2025
25 checks passed
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Nov 6, 2025
@CyrusNajmabadi CyrusNajmabadi deleted the copilot/fix-source-text-comparer-issue branch November 6, 2025 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SourceTextComparer has odd IEqualityComparer semantics

4 participants