Split SourceLocation into LineColumn and SourceLocation#17587

Merged
MichaReiser merged 3 commits into main from micha/line-character-column
Apr 27, 2025

Conversation

@MichaReiser
Member

@MichaReiser MichaReiser commented Apr 23, 2025

Summary

This PR splits SourceLocation into LineColumn and SourceLocation and moves the TextSize to LSP position conversion logic into LineIndex.

Before this PR, LineIndex had two methods:

  • source_location: Converts a TextSize to a SourceLocation
  • offset: Converts a SourceLocation back to a TextSize

The problem with the current implementation is that source_location trims a leading BOM offset, whereas offset doesn't have any custom BOM handling.
That means mapping the offset of the first character right after a BOM with the old source_location would give row: 0, column: 0, but mapping that position back to an offset would point before the BOM instead of after it.
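The asymmetry can be reproduced with a minimal sketch. The two functions below are hypothetical stand-ins, not the actual LineIndex methods: to_row_column mimics the old source_location by not counting a leading BOM, and to_offset mimics the old offset by doing no BOM handling. Rows and columns are zero-based and columns are counted in characters, which is an assumption made for illustration.

```rust
/// The byte order mark U+FEFF, which takes three bytes in UTF-8.
const BOM: char = '\u{feff}';

/// Hypothetical stand-in for the old `source_location`: maps a byte offset to
/// a zero-based (row, column) pair and does not count a leading BOM.
/// `offset` must lie on a character boundary.
fn to_row_column(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    let row = before.matches('\n').count();
    let line_start = before.rfind('\n').map_or(0, |index| index + 1);
    let mut line = &before[line_start..];
    if row == 0 {
        // Only the first row can start with a BOM.
        line = line.strip_prefix(BOM).unwrap_or(line);
    }
    (row, line.chars().count())
}

/// Hypothetical stand-in for the old `offset`: maps (row, column) back to a
/// byte offset with no BOM handling at all.
fn to_offset(source: &str, row: usize, column: usize) -> usize {
    // Byte offset at which the requested row starts.
    let line_start: usize = source.split_inclusive('\n').take(row).map(str::len).sum();
    // Advance `column` characters from the start of the row.
    source[line_start..]
        .char_indices()
        .nth(column)
        .map_or(source.len(), |(index, _)| line_start + index)
}
```

For the source "\u{feff}x = 1\n", the character x sits at byte offset 3. to_row_column maps that offset to (0, 0), but to_offset maps (0, 0) back to byte offset 0, which is in front of the BOM. Rows after the first round-trip correctly in this sketch because no BOM is involved.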

This PR fixes this inconsistency by removing the offset method for the old SourceLocation (now LineColumn). The only place where we need to map a column back to an offset is the formatter, and the special BOM handling doesn't matter there.

However, we don't want to skip the BOM for LSP operations because LSP operations don't return line/column information; instead, they map a position to a line and the nth character on that line.
This is why this PR introduces a new pair of source_location and offset methods to map between a TextSize and a line plus character_offset, where character_offset is a UTF-8, UTF-16, or UTF-32 offset (bytes, code units, or Unicode scalar values, respectively).
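To illustrate what a character offset means in each encoding, here is a minimal sketch that operates on a single line. The Encoding enum and both function names are hypothetical and are not the API added in this PR; the sketch only shows how the same byte position yields different offsets depending on the unit being counted.

```rust
/// Hypothetical encoding selector; the names are illustrative only.
#[derive(Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16,
    Utf32,
}

/// Converts a byte offset within `line` to a character offset measured in
/// bytes (UTF-8), code units (UTF-16), or Unicode scalar values (UTF-32).
/// `byte_offset` must lie on a character boundary.
fn character_offset(line: &str, byte_offset: usize, encoding: Encoding) -> usize {
    let prefix = &line[..byte_offset];
    match encoding {
        Encoding::Utf8 => prefix.len(),
        Encoding::Utf16 => prefix.encode_utf16().count(),
        Encoding::Utf32 => prefix.chars().count(),
    }
}

/// Converts a character offset in the given encoding back to a byte offset
/// within `line`. Offsets past the end of the line clamp to the line length.
fn byte_offset(line: &str, character_offset: usize, encoding: Encoding) -> usize {
    let mut units = 0;
    for (index, c) in line.char_indices() {
        if units >= character_offset {
            return index;
        }
        units += match encoding {
            Encoding::Utf8 => c.len_utf8(),
            Encoding::Utf16 => c.len_utf16(),
            Encoding::Utf32 => 1,
        };
    }
    line.len()
}
```

For the line "a😀b", the character b starts at byte 5. Its character offset is 5 in UTF-8, 3 in UTF-16 (the emoji takes two code units), and 2 in UTF-32, and byte_offset maps each of those values back to byte 5.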

The reason I dove into all this is that the playground needs to convert the ranges to UTF-16, and I wanted to avoid copying the whole conversion logic a third time (ruff server, red knot server, wasm).

Test Plan

  • Tested Ruff and the VS Code extension with Unicode content
  • Tested that the line numbers in the CLI are correct
  • Tested notebooks
  • cargo test

@MichaReiser MichaReiser force-pushed the micha/line-character-column branch from fccdd82 to 714a2bc Compare April 23, 2025 17:25
@MichaReiser MichaReiser added the internal An internal refactor or improvement label Apr 23, 2025

impl From<(SourceLocation, SourceLocation)> for Range {
fn from((start, end): (SourceLocation, SourceLocation)) -> Self {
impl From<(LineColumn, LineColumn)> for Range {
Member Author

I'll update the playground to use LineCharacter in a separate PR.

@MichaReiser MichaReiser force-pushed the micha/line-character-column branch 2 times, most recently from d82d4c5 to c2ef93f Compare April 23, 2025 17:37
@github-actions
Contributor

github-actions bot commented Apr 23, 2025

mypy_primer results

No ecosystem changes detected ✅

@github-actions
Contributor

github-actions bot commented Apr 23, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser changed the title from "Split SourceLocation into LineColumn and LineCharacter" to "Split SourceLocation into LineColumn and SourceLocation" Apr 23, 2025
@MichaReiser MichaReiser force-pushed the micha/line-character-column branch 5 times, most recently from a2cc634 to 94c6b83 Compare April 24, 2025 07:31
@MichaReiser MichaReiser marked this pull request as ready for review April 24, 2025 07:31
Member

@dhruvmanila dhruvmanila left a comment

This is great, much better API, thanks for taking a look at this!

I've a few comments but otherwise this looks like a pretty straightforward and logical change to me.

@dhruvmanila
Member

Also, apologies for the late review, falling a bit behind on notifications.

@MichaReiser
Member Author

MichaReiser commented Apr 27, 2025

> Also, apologies for the late review, falling a bit behind on notifications.

No worries. I think you prioritized correctly. This PR isn't urgent.

@MichaReiser MichaReiser force-pushed the micha/line-character-column branch from 94c6b83 to b868451 Compare April 27, 2025 10:21
@MichaReiser MichaReiser merged commit 1c65e0a into main Apr 27, 2025
34 checks passed
@MichaReiser MichaReiser deleted the micha/line-character-column branch April 27, 2025 10:27
dylwil3 pushed a commit to dylwil3/ruff that referenced this pull request Apr 27, 2025
facebook-github-bot pushed a commit to facebook/pyrefly that referenced this pull request Jun 6, 2025
Summary:
I'm trying to pull in the latest changes from the upstream `ruff_python` libraries to get a sense of what an upgrade would look like (and potentially pick up more upstream utilities that we can use).

The change I'm pulling from is 0.11.12 (released last week). There's a new release 0.11.13 this week which contains T-string support for Python 3.14 (astral-sh/ruff#17851), but the changes there introduced nontrivial downstream breakage (i.e. expr structure gets shuffled around) so I think it makes more sense to do a separate upgrade just for that one feature.

The main backward-incompatible change in this upgrade comes from this PR: astral-sh/ruff#17587. The main consequence is that `SourceLocation` no longer directly contains line and column info for user-visible text. A new structure, `LineColumn`, is now used for that purpose, and `SourceLocation` now represents the "raw" line and character offset data in the original string. The reason the "raw" numbers and "user-visible" numbers differ seems to be Unicode's byte order mark (BOM) character (I'm not super familiar with it; read the original PR if you are interested in the details). For us, I think the main response should be to rely on `LineColumn` instead of `SourceLocation` from now on. That's mostly trivial, except that there are cases where we want to convert a line and column number back to a byte offset in the string, and there we have to use `SourceLocation`. Technically speaking, that conversion can't be made lossless, so we need to be careful about where it happens. I think we perform that kind of conversion mostly in tests, so we are fine. But I'll mark the place where we do it in prod to raise awareness that it might be an issue in certain cases.

There are also big changes in how the "semantic syntax checker" behaves. The good news is that a bunch of new checks were added, so we can reliably detect more issues. The bad news is that many of the added checks require us to implement an AST visitor to track context, and I don't think that's a trivial thing to do. Right now I'm just returning some dummy values to get the very basic checks working, but in the future we could come back and do more of the visiting properly.

Reviewed By: ndmitchell

Differential Revision: D76156394

fbshipit-source-id: 0f55b5888259948d67400389a5efdff69c727dab