Clarify PREG_OFFSET_CAPTURE returns byte-offset #4543

Open
russtanner opened this issue Mar 17, 2025 · 3 comments
Labels
bug Documentation contains incorrect information Extension: pcre Status: Needs Triage

russtanner commented Mar 17, 2025

Description

The following code:

<?php
function test() {

  print "\n\nBEGIN ===";

  // Each multibyte character (…) is treated like 3 characters even though multibyte support is enabled (/u). The "…" character is a 3-byte unicode char (0xe2 0x80 0xa6).
  $str = <<<'HTML'
    Text 1 without multibyte character
    {{Token1}}
    Text 2 with multibyte character…
    {{Token2}}
    Text 3 with multibyte character……
    {{Token3}}
  HTML;

  // Get an array of all delimiters and their offsets.
  $Count = preg_match_all(
    '/('
    . '{{' // Token delimiter to find.
    . ')/su', // s=dot also matches newlines (DOTALL), u=UTF-8 mode
    $str,
    $Matches, // Captures found token delimiters and the index of each one.
    PREG_OFFSET_CAPTURE
  );

  // We only need the first array (index pointers) returned by preg_match_all().
  $Matches = $Matches[0];

  foreach ($Matches as $Match) {
    print "\n---\n#Index:{$Match[1]} - Needle:\"{$Match[0]}\" - Match:\n" . mb_substr($str, $Match[1], 30) . "\n";
  }

  print "\n";

}

test();
?>

Resulted in this output:

BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
  Text 2 with multi

---
#Index:89 - Needle:"{{" - Match:
Token2}}
  Text 3 with multiby

---
#Index:142 - Needle:"{{" - Match:
n3}}

But I expected this output instead:

BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
  Text 2 with multi

---
#Index:87 - Needle:"{{" - Match:
{{Token2}}
  Text 3 with multi

---
#Index:136 - Needle:"{{" - Match:
{{Token3}}

PHP Version

PHP 8.3.18

Operating System

Alma Linux 9.5

@russtanner russtanner added bug Documentation contains incorrect information Status: Needs Triage labels Mar 17, 2025
@iluuu1994
Member

See https://www.php.net/manual/en/function.preg-match.php:

flags

PREG_OFFSET_CAPTURE

If this flag is passed, for every occurring match the appendant string offset (in bytes) will also be returned.

The offset encodes the byte-offset, not character-offset. Hence, you can use substr rather than mb_substr and you'll get the correct result. Changing this is of course not backwards compatible.
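To illustrate the point above, here is a minimal sketch with a hypothetical sample string (not the reporter's exact input). It pairs the byte offset with `substr`, and shows one possible way to derive a character offset for `mb_substr` by counting the characters in the prefix. The variable names are made up for this example, and the `mbstring` extension is assumed to be available.

```php
<?php
// Hypothetical sample; "\u{2026}" is the 3-byte UTF-8 character "…".
$str = "Text with multibyte character\u{2026}\n{{Token}}\n";

preg_match('/\{\{/u', $str, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // Offset in bytes, even with the /u modifier.

// The byte offset pairs with the byte-oriented substr().
$viaSubstr = substr($str, $byteOffset, 9);

// If a character offset is needed, count the characters before the match.
$charOffset  = mb_strlen(substr($str, 0, $byteOffset), 'UTF-8');
$viaMbSubstr = mb_substr($str, $charOffset, 9, 'UTF-8');

print "byte=$byteOffset char=$charOffset\n$viaSubstr\n$viaMbSubstr\n";
```

Here the two offsets differ by 2 because the single "…" before the match occupies three bytes but counts as one character.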

@russtanner
Author

See https://www.php.net/manual/en/function.preg-match.php:

flags
PREG_OFFSET_CAPTURE
If this flag is passed, for every occurring match the appendant string offset (in bytes) will also be returned.

The offset encodes the byte-offset, not character-offset. Hence, you can use substr rather than mb_substr and you'll get the correct result. Changing this is of course not backwards compatible.

OK, I spent a day working on this and completely missed that in the documentation. My mistake. Thanks for the clarification.

@iluuu1994
Member

@russtanner No worries! I see it's not documented for https://www.php.net/manual/en/pcre.constants.php#constant.preg-split-offset-capture, maybe we can improve it there.
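A small sketch of the `preg_split` case mentioned here, assuming `PREG_SPLIT_OFFSET_CAPTURE` reports offsets the same way (in bytes). The input string is a made-up example.

```php
<?php
// Hypothetical input: "a", "…" (3 bytes), "b", ",", "c".
$str = "a\u{2026}b,c";

$parts = preg_split('/,/u', $str, -1, PREG_SPLIT_OFFSET_CAPTURE);

// Each element is [substring, offset]. "c" starts at byte 6,
// although it is the fifth character (character offset 4).
foreach ($parts as [$piece, $offset]) {
    print "$offset: $piece\n";
}
```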

@iluuu1994 iluuu1994 changed the title from "preg_match_all() with /u (unicode mode) returns inaccurate index in the Matches parameter" to "Clarify PREG_OFFSET_CAPTURE returns byte-offset" Mar 17, 2025
@iluuu1994 iluuu1994 transferred this issue from php/php-src Mar 17, 2025