The following code:
<?php
function test() {
    print "\n\nBEGIN ===";

    // Each multibyte character (…) is treated like 3 characters even though multibyte support is enabled (/u).
    // The "…" character is a 3-byte unicode char (0xe2 0x80 0xa6).
    $str = <<<'HTML'
    Text 1 without multibyte character {{Token1}}
    Text 2 with multibyte character… {{Token2}}
    Text 3 with multibyte character…… {{Token3}}
HTML;

    // Get an array of all delimiters and their offsets.
    $Count = preg_match_all(
        '/('
        . '{{' // Token delimiter to find.
        . ')/su', // s=dotall (. also matches newlines), u=multibyte (utf8)
        $str,
        $Matches, // Captures found token delimiters and the index of each one.
        PREG_OFFSET_CAPTURE
    );

    // We only need the first array (index pointers) returned by preg_match_all().
    $Matches = $Matches[0];

    foreach ($Matches as $Match) {
        print "\n---\n#Index:{$Match[1]} - Needle:\"{$Match[0]}\" - Match:\n" . mb_substr($str, $Match[1], 30) . "\n";
    }
    print "\n";
}
test();
?>
Resulted in this output:
BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
Text 2 with multi
---
#Index:89 - Needle:"{{" - Match:
Token2}}
Text 3 with multiby
---
#Index:142 - Needle:"{{" - Match:
n3}}
But I expected this output instead:
BEGIN ===
---
#Index:39 - Needle:"{{" - Match:
{{Token1}}
Text 2 with multi
---
#Index:87 - Needle:"{{" - Match:
{{Token2}}
Text 3 with multi
---
#Index:136 - Needle:"{{" - Match:
{{Token3}}
PHP Version
PHP 8.3.18
Operating System
Alma Linux 9.5
From the documentation of the flags parameter, PREG_OFFSET_CAPTURE:
"If this flag is passed, for every occurring match the appendant string offset (in bytes) will also be returned."
The offset is a byte offset, not a character offset. Hence, you can use substr() rather than mb_substr() and you'll get the correct result. Changing this would of course not be backwards compatible.
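To illustrate the suggestion above, here is a minimal sketch that slices the subject at the byte offsets reported by PREG_OFFSET_CAPTURE using substr(). The helper name snippets_at_tokens() and the short sample string are assumptions made for illustration; they are not part of the original report. The file is assumed to be saved as UTF-8.

```php
<?php
// Sketch only: slice at the byte offsets reported by PREG_OFFSET_CAPTURE.
// snippets_at_tokens() is a hypothetical helper, not part of PHP.

/**
 * Returns [byteOffset, snippet] pairs for every "{{" found in $str.
 */
function snippets_at_tokens(string $str, int $bytes): array
{
    preg_match_all('/\{\{/u', $str, $matches, PREG_OFFSET_CAPTURE);

    $result = [];
    foreach ($matches[0] as [$needle, $byteOffset]) {
        // substr() counts bytes, which is the same unit as $byteOffset.
        $result[] = [$byteOffset, substr($str, $byteOffset, $bytes)];
    }
    return $result;
}

// "…" occupies 3 bytes in UTF-8, so the tokens start at bytes 5 and 24.
$sample = 'a… {{Token1}} b…… {{Token2}}';
foreach (snippets_at_tokens($sample, 10) as [$offset, $snippet]) {
    print "#Index:{$offset} - Match: {$snippet}\n";
}
```

Note that the length passed to substr() is also in bytes, so a snippet can end in the middle of a multibyte character if the cut falls inside one.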
OK, I spent a day working on this and completely missed that in the documentation. My mistake. Thanks for the clarification.
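For anyone who does need character offsets (for example, to keep using mb_substr()), a byte offset can be converted by counting the characters in the prefix that ends at that byte. The sketch below assumes UTF-8 input and the mbstring extension; byte_to_char_offset() is a hypothetical helper name, not an existing PHP function.

```php
<?php
// Sketch only: convert a byte offset (as reported by PREG_OFFSET_CAPTURE)
// into a character offset. byte_to_char_offset() is a hypothetical helper.
function byte_to_char_offset(string $str, int $byteOffset): int
{
    // Count the UTF-8 characters in the prefix that ends at the byte offset.
    return mb_strlen(substr($str, 0, $byteOffset), 'UTF-8');
}

$str = 'a… {{Token1}} b…… {{Token2}}';
preg_match_all('/\{\{/u', $str, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$needle, $byteOffset]) {
    $charOffset = byte_to_char_offset($str, $byteOffset);
    // mb_substr() counts characters, so it needs the converted offset.
    print "bytes:{$byteOffset} chars:{$charOffset} - "
        . mb_substr($str, $charOffset, 10, 'UTF-8') . "\n";
}
```

Each conversion rescans the prefix, so for subjects with many matches it may be cheaper to walk the string once and accumulate the character count between consecutive matches.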
iluuu1994 changed the title from "preg_match_all() with /u (unicode mode) returns inaccurate index in the Matches parameter" to "Clarify PREG_OFFSET_CAPTURE returns byte-offset" on Mar 17, 2025.