Clarify PREG_OFFSET_CAPTURE returns byte-offset #4543

russtanner · 2025-03-17T06:14:22Z

Description

The following code:

<?php
function test() {

  print "\n\nBEGIN ===";

  // Each multibyte character (…) is treated like 3 characters even though multibyte support is enabled (/u). The "…" character is a 3-byte unicode char (0xe2 0x80 0xa6).
  $str = <<<'HTML'
    Text 1 without multibyte character
    {{Token1}}
    Text 2 with multibyte character…
    {{Token2}}
    Text 3 with multibyte character……
    {{Token3}}
  HTML;

  // Get an array of all delimiters and their offsets.
  $Count = preg_match_all(
    '/('
    . '{{' // Token delimiter to find.
    . ')/su', // s=dotall (. also matches newlines), u=UTF-8 mode
    $str,
    $Matches, // Captures found token delimiters and the index of each one.
    PREG_OFFSET_CAPTURE
  );

  // We only need the first array (index pointers) returned by preg_match_all().
  $Matches = $Matches[0];

  foreach ($Matches as $Match) {
    print "\n---\n#Index:{$Match[1]} - Needle:\"{$Match[0]}\" - Match:\n" . mb_substr($str, $Match[1], 30) . "\n";
  }

  print "\n";

}

test();
?>

Resulted in this output:

BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
  Text 2 with multi

---
#Index:89 - Needle:"{{" - Match:
Token2}}
  Text 3 with multiby

---
#Index:142 - Needle:"{{" - Match:
n3}}

But I expected this output instead:

BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
  Text 2 with multi

---
#Index:87 - Needle:"{{" - Match:
{{Token2}}
  Text 3 with multi

---
#Index:136 - Needle:"{{" - Match:
{{Token3}}

PHP Version

PHP 8.3.18

Operating System

Alma Linux 9.5

iluuu1994 · 2025-03-17T12:58:08Z

See https://www.php.net/manual/en/function.preg-match.php:

flags

PREG_OFFSET_CAPTURE

If this flag is passed, for every occurring match the appendant string offset (in bytes) will also be returned.

The offset is a byte offset, not a character offset. Hence, you can use substr() rather than mb_substr() and you'll get the correct result. Changing this would, of course, not be backwards compatible.
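
A minimal sketch of that suggestion, using a shortened stand-in for the string from the report (the sample text and variable names here are illustrative assumptions, not the reporter's code). It slices the same subject at the reported offset with both functions so the difference is visible:

```php
<?php
// Stand-in subject: "\u{2026}" is the 3-byte ellipsis (0xE2 0x80 0xA6).
$str = "Text\u{2026}\n{{Token}}";

// PREG_OFFSET_CAPTURE reports the match position in bytes, even with /u.
preg_match('/\{\{/u', $str, $m, PREG_OFFSET_CAPTURE);
[$needle, $byteOffset] = $m[0];

// substr() counts bytes, so it lines up with the reported offset.
$viaSubstr = substr($str, $byteOffset, 9);

// mb_substr() counts characters, so the same number lands 2 characters late
// (the ellipsis is 3 bytes but only 1 character).
$viaMbSubstr = mb_substr($str, $byteOffset, 9, 'UTF-8');

print "offset: {$byteOffset}\n";       // offset: 8
print "substr: {$viaSubstr}\n";        // substr: {{Token}}
print "mb_substr: {$viaMbSubstr}\n";   // mb_substr: Token}}
```

The two-character drift here matches the first drift in the report (89 bytes versus 87 characters): each ellipsis before the match adds two bytes beyond its one character.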

russtanner · 2025-03-17T14:32:53Z

OK, I spent a day working on this and completely missed that in the documentation. My mistake. Thanks for the clarification.
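
For code that really does need character offsets (for example, to keep using mb_substr()), one possible approach is to count the characters in the byte prefix that precedes the match. The helper name below is hypothetical, not a PHP built-in, and the sketch assumes the subject is valid UTF-8 and that the offset falls on a character boundary (which holds for offsets returned under /u):

```php
<?php
// Hypothetical helper: convert a byte offset from PREG_OFFSET_CAPTURE into a
// character offset by measuring the UTF-8 length of the preceding bytes.
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Illustrative subject: one ASCII letter, two 3-byte ellipses, then a token.
$str = "a\u{2026}\u{2026}{{Token}}";
preg_match_all('/\{\{/u', $str, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$needle, $byteOffset]) {
    $charOffset = byteOffsetToCharOffset($str, $byteOffset);
    // The character offset is now safe to pass to mb_substr().
    print "bytes: {$byteOffset}, chars: {$charOffset}, text: "
        . mb_substr($str, $charOffset, 9, 'UTF-8') . "\n";
}
```

Calling the helper once per match rescans the prefix each time, which is fine for short strings; for many matches in a large subject, accumulating the character count between consecutive matches would avoid the repeated work.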

iluuu1994 · 2025-03-17T14:34:58Z

@russtanner No worries! I see it's not documented for https://www.php.net/manual/en/pcre.constants.php#constant.preg-split-offset-capture, maybe we can improve it there.
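
As a quick check of the preg_split() case, the sketch below (sample string chosen for illustration) suggests that PREG_SPLIT_OFFSET_CAPTURE reports byte offsets in the same way:

```php
<?php
// "\u{00E9}" (é) is 2 bytes in UTF-8, so the comma sits at byte 2 and the
// second piece starts at byte 3, although it is character 2.
$parts = preg_split('/,/u', "\u{00E9},b", -1, PREG_SPLIT_OFFSET_CAPTURE);

foreach ($parts as [$piece, $byteOffset]) {
    print "{$piece} starts at byte {$byteOffset}\n";
}
```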

russtanner added bug Documentation contains incorrect information Status: Needs Triage labels Mar 17, 2025

iluuu1994 added Extension: pcre and removed Status: Needs Triage labels Mar 17, 2025

github-actions bot added the Status: Needs Triage label Mar 17, 2025

iluuu1994 changed the title from "preg_match_all() with /u (unicode mode) returns inaccurate index in the Matches parameter" to "Clarify PREG_OFFSET_CAPTURE returns byte-offset" Mar 17, 2025

iluuu1994 transferred this issue from php/php-src Mar 17, 2025
