
**AHSauge** · 31 March 2019, 09:06 AM

Originally posted by xnor

So you are absolutely clueless.

Code:

substr("😂", 0, 3)

Results in an invalid UTF-8 sequence.

Thanks for the example, and thanks for proving my point. As I said, length is the only operation that doesn't work. That said, this will still work:

Code:

$str = "Iñtërnâtiôn😂àlizætiøn";
$pos = strpos($str, "😂");
echo substr($str, 0, $pos);
echo substr($str, $pos);
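
For comparison, here is a minimal Rust sketch of the same two cases. It is not from the thread, and the helper names `prefix_is_valid_utf8` and `split_at_needle` are made up for illustration. It shows that cutting at an arbitrary byte count can split a multi-byte sequence, while cutting at an offset returned by a substring search cannot:

```rust
/// Returns true if the first `n` bytes of `s` are valid UTF-8 on their own.
/// (Hypothetical helper; panics if `n` is larger than the byte length of `s`.)
fn prefix_is_valid_utf8(s: &str, n: usize) -> bool {
    std::str::from_utf8(&s.as_bytes()[..n]).is_ok()
}

/// Splits `haystack` at the byte offset where `needle` first starts.
/// (Hypothetical helper; returns `None` if `needle` does not occur.)
fn split_at_needle<'a>(haystack: &'a str, needle: &str) -> Option<(&'a str, &'a str)> {
    // `find` returns a byte offset that lies on a character boundary,
    // so splitting there never cuts a multi-byte sequence in half.
    let pos = haystack.find(needle)?;
    Some(haystack.split_at(pos))
}
```

With these, `prefix_is_valid_utf8("😂", 3)` is false because the emoji takes four bytes in UTF-8, whereas `split_at_needle` on the string above yields two valid halves.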

Originally posted by xnor

You're an ignorant troll and with that you're done.
You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ..) just again demonstrate this. What a colossal waste of time.

Oh, I'm clueless. Non-shortest form is just fiction of course. No security issues to be found. I mean, Microsoft have never ever been exposed to this issue.

**AJenbo** · 31 March 2019, 11:04 AM

uid313 BOM is for UTF-16, don't put it in UTF-8

**uid313** · 31 March 2019, 11:13 AM

Originally posted by AJenbo

uid313 BOM is for UTF-16, don't put it in UTF-8

Many people don't know the difference between ANSI, UTF-8, UTF-16, BOM, Unicode, UTF-32, UTF-7 and all that. They don't know or care about encodings.
Then they save a file in their text editor with no idea which encoding the editor writes.
Then PHP chokes on the script with a cryptic error saying the namespace declaration has to be the very first statement, which is very confusing.

**Delgarde** · 01 April 2019, 03:18 AM

Originally posted by AHSauge

Can you name a programming language or framework that internally uses UTF-8 for string handling?

I'm not 100% certain, but I think Rust does. If I remember correctly, they'd looked at how all the other common languages were dealing with Unicode, and concluded that there were only two practical choices – either UCS-4 (with the benefits of fixed width, albeit at the cost of high memory use) or UTF-8 (variable width, but the simplest and most compatible form of it). And so they went with the latter, figuring that using four times as much memory as needed for common strings wouldn't be appreciated by users... and also that even fixed-width encodings don't allow you to ignore all of the complexities of Unicode, so the benefits of fixed width were often overstated.

Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:

https://www.reddit.com/r/rust/comments/2b08l5/uft8_and_string_why_vecu8/
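
To put rough numbers on the memory point, here is a small Rust sketch. It is not from the original post, and `utf8_size` and `utf32_size` are hypothetical helper names. It assumes UCS-4/UTF-32 stores every code point in exactly four bytes:

```rust
/// Bytes needed to store `s` as UTF-8, which is how Rust's `str` is encoded.
fn utf8_size(s: &str) -> usize {
    s.len()
}

/// Bytes needed to store `s` as UCS-4/UTF-32: four bytes per code point.
fn utf32_size(s: &str) -> usize {
    s.chars().count() * 4
}
```

For an ASCII string such as "hello" this gives 5 bytes versus 20, which is the "four times as much memory" case. For a string made only of characters outside the Basic Multilingual Plane, such as "😂", the two sizes are equal.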

**Delgarde** · 01 April 2019, 06:14 AM

Originally posted by AHSauge

You lose random access. If you think scanning the string every time you want to use a length is not a PITA, then... yeah, I don't know what to say really...

How often do you actually need random access, out of curiosity? Knowing the length is a common requirement, but that's unrelated to random access... plenty of string implementations store the length instead of relying on zero-termination. And iterating over a stream of characters to tokenize a string, that's pretty common. But how often do you actually need a high-performance operation to get the 23rd character of a string? In my experience, it's actually pretty rare, and usually a sign of missing higher-level APIs...

**ssokolow** · 01 April 2019, 09:45 AM

Originally posted by Delgarde

I'm not 100% certain, but I think Rust does. If I remember correctly, they'd looked at how all the other common languages were dealing with Unicode, and concluded that there were only two practical choices – either UCS-4 (with the benefits of fixed width, albeit at the cost of high memory use) or UTF-8 (variable width, but the simplest and most compatible form of it). And so they went with the latter, figuring that using four times as much memory as needed for common strings wouldn't be appreciated by users... and also that even fixed-width encodings don't allow you to ignore all of the complexities of Unicode, so the benefits of fixed width were often overstated.

Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:

https://www.reddit.com/r/rust/commen...ing_why_vecu8/

Yeah. String is a single-element struct wrapping a Vec<u8>, which enforces that it contains valid UTF-8, i.e. a well-formed sequence of Unicode code points. (Unless you break its invariants using an unsafe block and an unchecked constructor.)

(The jargon for that is the "newtype pattern" and it shows up in various languages when you want the same in-memory representation but a different API with different invariants.)
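
A minimal sketch of that newtype idea, for illustration only: `Utf8Text` is a hypothetical type, not the standard library's actual `String` implementation. It wraps a `Vec<u8>` and only lets valid UTF-8 in through its checked constructor:

```rust
/// Hypothetical newtype: same in-memory representation as `Vec<u8>`,
/// but the only constructor checks that the bytes are valid UTF-8.
struct Utf8Text(Vec<u8>);

impl Utf8Text {
    /// Checked constructor: rejects byte sequences that are not valid UTF-8.
    fn new(bytes: Vec<u8>) -> Option<Utf8Text> {
        if std::str::from_utf8(&bytes).is_ok() {
            Some(Utf8Text(bytes))
        } else {
            None
        }
    }

    /// Borrows the contents as `&str`.
    fn as_str(&self) -> &str {
        // Validity was established in `new`. This sketch re-checks instead of
        // using an `unsafe` unchecked conversion to skip the second pass.
        std::str::from_utf8(&self.0).expect("validated in Utf8Text::new")
    }
}
```

The real `String` offers the same split between a checked path (`String::from_utf8`) and an unsafe unchecked one (`String::from_utf8_unchecked`).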

OsString is a single-element struct wrapping either an unconstrained Vec<u8> (POSIX) or a Wtf8Buf, which is a single-element struct which enforces "UTF-8 plus unpaired surrogates" on a Vec<u8> (Windows).

WTF-8 (Wobbly Transformation Format - 8-bit) is a relaxed UTF-8 that allows unpaired surrogates from Windows APIs to round-trip properly. (Same reason Unicode includes all of those precomposed character code points. They're so you can reliably round-trip strings from other encodings.)

(The reason they use WTF-8 for OsString on Windows is so that, regardless of platform, converting a String to an OsString within the program (as opposed to at the edge of the program where it interacts with OS APIs) is just a typecast and converting the other way is just a well-formedness check (if not using unchecked APIs) followed by a typecast.)

Originally posted by Delgarde

How often do you actually need random access, out of curiosity? Knowing the length is a common requirement, but that's unrelated to random access... plenty of string implementations store the length instead of relying on zero-termination. And iterating over a stream of characters to tokenize a string, that's pretty common. But how often do you actually need a high-performance operation to get the 23rd character of a string? In my experience, it's actually pretty rare, and usually a sign of missing higher-level APIs...

Not to mention that even UCS-4 can't get you the 23rd "character" in constant time. It can get you the 23rd code point, but expecting code points to equal what we intuitively think of as "characters" is a very Western thing. You have to iterate over grapheme clusters to find the 23rd "character" properly.
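
A small Rust sketch of the code point versus "character" gap, using only the standard library. The helper names `code_point_count` and `nth_code_point` are hypothetical. The standard library iterates code points but does not segment grapheme clusters; that needs an external crate such as `unicode-segmentation`, which is left out here:

```rust
/// Number of Unicode code points (scalar values) in `s`.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

/// The n-th code point (0-based), found by scanning from the start.
/// On a UTF-8 string this is O(n), not constant time.
fn nth_code_point(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}
```

Both "\u{e9}" (precomposed é) and "e\u{301}" (e followed by a combining acute accent) render as one character, yet `code_point_count` returns 1 for the first and 2 for the second, so indexing by code point does not land on the n-th user-perceived character even with a fixed-width encoding.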

One of the Rust guys did a great blog post on the topic (which links to two other excellent related ones): Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.

**F.Ultra** · 01 April 2019, 04:40 PM

Originally posted by AJenbo

uid313 BOM is for UTF-16, don't put it in UTF-8

Oh, there exists a BOM that indicates that the file is in UTF-8 (0xEF, 0xBB, 0xBF). The damn Notepad application in Windows keeps storing one every time you save a file in UTF-8 format (or at least it did some years ago).
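
A minimal sketch of how a tool could drop that prefix before handing the file to something that trips over it. It is written in Rust to match the rest of the thread's examples; `UTF8_BOM` and `strip_utf8_bom` are hypothetical names:

```rust
/// The three bytes of the UTF-8 byte order mark (U+FEFF encoded as UTF-8).
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Returns `bytes` without a leading UTF-8 BOM, if one is present.
fn strip_utf8_bom(bytes: &[u8]) -> &[u8] {
    if bytes.starts_with(&UTF8_BOM) {
        &bytes[UTF8_BOM.len()..]
    } else {
        bytes
    }
}
```

Input without a BOM is returned unchanged, so the function is safe to apply to every file.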

**Delgarde** · 02 April 2019, 05:28 AM

Originally posted by ssokolow

One of the Rust guys did a great blog post on the topic (which links to two other excellent related ones): Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.

Oh yes, I think that's the one I remember reading. Yeah, this stuff is way more complicated than a lot of developers appreciate... you can mostly get by with a simplified model if you're only dealing with Western languages, but even then you're going to be surprised from time to time. And once you start dealing with non-Latin scripts, all bets are off...


JIT Is Approved For PHP 8 To Open Up Faster CPU Performance
