JIT Is Approved For PHP 8 To Open Up Faster CPU Performance

Delgarde replied

02 April 2019, 05:28 AM
Originally posted by ssokolow View Post

One of the Rust guys did a great blog post on the topic (which links to two other excellent related ones): Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.

Oh yes, I think that's the one I remember reading. Yeah, this stuff is way more complicated than a lot of developers appreciate... you can mostly get by with a simplified model if you're only dealing with western languages, but even then you're going to be surprised from time to time. And once you start dealing with non-roman alphabets, all bets are off...
F.Ultra replied

01 April 2019, 04:40 PM
Originally posted by AJenbo View Post

uid313 BOM is for UTF-16, don't put it in UTF-8

Oh, there is a BOM that indicates that a file is in UTF-8 (0xEF, 0xBB, 0xBF); the damn Notepad application in Windows keeps storing one every time you save a file in UTF-8 format (or at least it did some years ago).
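A minimal Python sketch of checking for those three bytes and dropping them before a parser sees the file. The helper name `strip_utf8_bom` and the sample file contents are made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical file contents, as an editor might save them with a BOM.
saved = codecs.BOM_UTF8 + "<?php namespace App;".encode("utf-8")

print(strip_utf8_bom(saved))

# The "utf-8-sig" codec also drops a leading BOM while decoding.
print(saved.decode("utf-8-sig"))
```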
ssokolow replied

01 April 2019, 09:45 AM
Originally posted by Delgarde View Post

I'm not 100% certain, but I think Rust does. If I remember correctly, they'd looked at how all the other common languages were dealing with Unicode, and concluded that there were only two practical choices – either UCS-4 (with the benefits of fixed width, albeit the costs of high memory use) or UTF-8 (variable width, but the simplest and most compatible form of it). And so they went with the latter, figuring that using four times as much memory as needed for common strings wouldn't be appreciated by users... and also that even fixed width encodings don't allow you to ignore all of the complexities of Unicode, so the benefits of fixed width were often overstated.

Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:

https://www.reddit.com/r/rust/commen...ing_why_vecu8/

Yeah. String is a single-element struct wrapping a Vec<u8>, which enforces that the bytes are valid UTF-8, i.e. a sequence of Unicode scalar values. (Unless you break its invariants using an unsafe block and an unchecked constructor.)

(The jargon for that is the "newtype pattern" and it shows up in various languages when you want the same in-memory representation but a different API with different invariants.)

OsString is a single-element struct wrapping either an unconstrained Vec<u8> (POSIX) or a Wtf8Buf, which is a single-element struct which enforces "UTF-8 plus unpaired surrogates" on a Vec<u8> (Windows).

WTF-8 (Wobbly Transformation Format - 8-bit) is a relaxed UTF-8 that allows unpaired surrogates from Windows APIs to round-trip properly. (Same reason Unicode includes all of those precomposed character code points. They're so you can reliably round-trip strings from other encodings.)

(The reason they use WTF-8 for OsString on Windows is so that, regardless of platform, converting a String to an OsString within the program (as opposed to at the edge of the program where it interacts with OS APIs) is just a typecast and converting the other way is just a well-formedness check (if not using unchecked APIs) followed by a typecast.)
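The "UTF-8 plus unpaired surrogates" idea can be approximated in Python, whose `surrogatepass` error handler lets a lone surrogate through the UTF-8 codec. This is only an analogy to WTF-8, not Rust's Wtf8Buf, and the variable names are illustrative.

```python
lone = "\ud800"  # an unpaired high surrogate, as a Windows API might hand back

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The relaxed handler writes it as a three-byte sequence and reads it back.
relaxed = lone.encode("utf-8", "surrogatepass")
round_tripped = relaxed.decode("utf-8", "surrogatepass")

print(strict_ok, relaxed, round_tripped == lone)
```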

Originally posted by Delgarde View Post

How often do you actually need random access, out of curiosity? Knowing the length is a common requirement, but that's unrelated to random access... plenty of string implementations store length instead of relying on zero-termination. And iterating over a stream of characters to tokenize a string, that's pretty common. But how often do you actually need a high-performance operation to get the 23rd character of a string? In my experience, it's actually pretty rare, and usually a sign of missing higher-level APIs...

Not to mention that even UCS-4 can't get you the 23rd "character" in constant time. It can get you the 23rd code point, but expecting code points to equal what we intuitively think of as "characters" is a very western thing. You have to iterate over grapheme clusters to find the 23rd "character" properly.

One of the Rust guys did a great blog post on the topic (which links to two other excellent related ones): Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.
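The gap between code points and perceived "characters" can be sketched in Python, where `len()` counts code points. The sample strings are illustrative; proper grapheme cluster segmentation needs a library outside the standard library, so the sketch only shows where the counts diverge.

```python
import unicodedata

decomposed = "e\u0301"                               # "e" + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00E9
flag = "\U0001F1F3\U0001F1F4"                        # two regional indicators, one flag

# One perceived character each, but the code point counts differ.
print(len(decomposed), len(composed), len(flag))
```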

Last edited by ssokolow; 01 April 2019, 09:53 AM.
Delgarde replied

01 April 2019, 06:14 AM
Originally posted by AHSauge View Post

You lose random access. If you think scanning the string every time you want to use a length is not a PITA, then... yeah, I don't know what to say really...

How often do you actually need random access, out of curiosity? Knowing the length is a common requirement, but that's unrelated to random access... plenty of string implementations store length instead of relying on zero-termination. And iterating over a stream of characters to tokenize a string, that's pretty common. But how often do you actually need a high-performance operation to get the 23rd character of a string? In my experience, it's actually pretty rare, and usually a sign of missing higher-level APIs...
Delgarde replied

01 April 2019, 03:18 AM
Originally posted by AHSauge View Post

Can you name a programming language or framework that internally uses UTF-8 for string handling?

I'm not 100% certain, but I think Rust does. If I remember correctly, they'd looked at how all the other common languages were dealing with Unicode, and concluded that there were only two practical choices – either UCS-4 (with the benefits of fixed width, albeit the costs of high memory use) or UTF-8 (variable width, but the simplest and most compatible form of it). And so they went with the latter, figuring that using four times as much memory as needed for common strings wouldn't be appreciated by users... and also that even fixed width encodings don't allow you to ignore all of the complexities of Unicode, so the benefits of fixed width were often overstated.

Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:

https://www.reddit.com/r/rust/comments/2b08l5/uft8_and_string_why_vecu8/

Last edited by Delgarde; 01 April 2019, 03:26 AM.
uid313 replied

31 March 2019, 11:13 AM
Originally posted by AJenbo View Post

uid313 BOM is for UTF-16, don't put it in UTF-8

Many people don't know the difference between ANSI, UTF-8, UTF-16, BOM, Unicode, UTF-32, UTF-7 and all that. They don't know or care about encodings.
Then they save a file in their text editor and have no idea what encoding the editor saves it as.
Then PHP chokes on the script with a cryptic error saying the namespace declaration has to be the very first statement, which is very confusing when nothing visible precedes it.
AJenbo replied

31 March 2019, 11:04 AM
uid313 BOM is for UTF-16, don't put it in UTF-8
AHSauge replied

31 March 2019, 09:06 AM
Originally posted by xnor View Post

So you are absolutely clueless.

Code:

substr("😂", 0, 3)

Results in an invalid UTF-8 sequence.

Thanks for the example, and thanks for proving my point. As I said, length is the only operation that doesn't work. That said, this will still work:

Code:

$str = "Iñtërnâtiôn😂àlizætiøn";
$pos = strpos($str, "😂");   // byte offset of the emoji
echo substr($str, 0, $pos);  // everything before it
echo substr($str, $pos);     // the emoji and everything after it
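Both sides of this exchange can be reproduced with byte strings in Python, used here as a stand-in for PHP's byte-oriented `substr()` and `strpos()`. The helper `is_valid_utf8` is made up for the sketch. Cutting at an offset found by searching for a complete sequence stays valid, while cutting at an arbitrary byte does not.

```python
text = "Iñtërnâtiôn😂àlizætiøn".encode("utf-8")
emoji = "😂".encode("utf-8")  # four bytes in UTF-8

def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Splitting at the offset of a whole encoded sequence keeps both halves valid.
pos = text.find(emoji)
head = text[:pos].decode("utf-8")
tail = text[pos:].decode("utf-8")

# Keeping only the first three of the emoji's four bytes does not.
print(head, tail, is_valid_utf8(emoji[:3]))
```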

Originally posted by xnor View Post

You're an ignorant troll and with that you're done.
You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ..) just again demonstrate this. What a colossal waste of time.

Oh, I'm clueless. Non-shortest form is just fiction of course. No security issues to be found. I mean, Microsoft have never ever been exposed to this issue.
Last edited by AHSauge; 31 March 2019, 09:12 AM. Reason: Added link to IIS security issue
xnor replied

31 March 2019, 08:40 AM
Originally posted by AHSauge View Post

You are wrong here, and it shows how blatantly ignorant you are about how UTF-8 works. Seeing as I'm supposedly wrong here, could you give me an example? Where does it break?

So you are absolutely clueless.

Code:

substr("😂", 0, 3)

Results in an invalid UTF-8 sequence.

Originally posted by AHSauge View Post

The answer is obviously counting bananas

You're an ignorant troll and with that you're done.
You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ..) just again demonstrate this. What a colossal waste of time.
AHSauge replied

31 March 2019, 07:58 AM
Originally posted by xnor View Post

LOL, followed by a bunch of wrong statements. Dunning-Kruger, eh? xD

Riight, because you're obviously an expert...

Originally posted by xnor View Post

Wrong.
Use a single-byte substr() function on a UTF-8 encoded string and it will happily chop off your string at any byte. You do understand why that is a problem, or do YOU not understand what UTF-8 is?

You are wrong here, and it shows how blatantly ignorant you are about how UTF-8 works. Seeing as I'm supposedly wrong here, could you give me an example? Where does it break?

Originally posted by xnor View Post

You mean like strlen() does anyway, that is scanning the entire string counting the number of bytes?
Here's what you should say to yourself: "I'm clueless".

What do you want to count? The number of code points? Code units? Abstract characters? Encoded characters? User-perceived characters? Grapheme clusters? Glyphs?
The thing is that if you know what you're doing then this is not a problem. And UTF-16 doesn't improve anything here.

Also, in my many years working as a software developer I've very rarely needed to count string length (that is any of the above and not number of bytes), but that's probably because I'm not painting websites or doing boring input validation.

The answer is obviously counting bananas

Seriously though, if you say you have rarely needed to count string length, then I guess you've stayed away from all functions depending on the length as well? I've been developing for well over 10 years now. Needing the length of a string is quite common. You might not be directly exposed to it, but the functions you are using sure need to know it.

Originally posted by xnor View Post

It is in fact a huge trouble. You again don't know what you're talking about.
Even today there are still many applications that do not handle BOMs properly. It's one of the reasons why UTF-16 should be considered dangerous on the web ... and it's also one of the reasons why the majority of the web is UTF-8.

Riiiiight. UTF-16...dangerous...suuure...and I suppose you don't see any security issues with UTF-8 whatsoever, right? Security issues due to non-shortest form in UTF-8 encoded strings do not exist, right?
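A non-shortest-form ("overlong") sequence can be sketched in Python. The bytes C0 AF spell "/" using two bytes instead of one. A deliberately naive decoder, written here only for illustration, accepts them, which is how a "/" can slip past a filter that inspected the raw bytes, while Python's strict decoder rejects them.

```python
def naive_decode_2byte(b0: int, b1: int) -> str:
    """Decode a two-byte UTF-8 sequence WITHOUT rejecting overlong forms (unsafe)."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))

overlong_slash = bytes([0xC0, 0xAF])

# The naive decoder turns the overlong bytes into a plain "/".
naive = naive_decode_2byte(*overlong_slash)

# A conforming decoder must refuse the same bytes.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(repr(naive), strict_accepts)
```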

As for the web, don't kid yourself. The web uses UTF-8 because it's ASCII compatible and consequently also offers Unicode for western languages without requiring significantly more bandwidth. Why would an English-only website double the size of its pages for absolutely no gain?

Originally posted by xnor View Post

😂

Some trivia for you: the CPython implementation of Unicode strings (since Python 3.3, PEP 393) switches between ISO 8859-1 / Latin-1, UCS-2 and UCS-4 depending on the string. The benefit? You always have constant-time random access and know the number of Unicode code points.
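That storage switch can be observed, as a rough sketch, by measuring how much memory each extra code point costs. This relies on `sys.getsizeof` and on CPython implementation details, so other interpreters may report different numbers; the helper name is made up for the example.

```python
import sys

def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate storage per code point from the size growth of a repeated string."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

# Latin-1 range, Basic Multilingual Plane, and beyond the BMP.
for ch in ("a", "Ω", "😂"):
    print(repr(ch), bytes_per_code_point(ch))
```

Indexing such a string stays constant-time because every code point in it has the same width.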

Originally posted by xnor View Post

That's quite an ignorant statement.
They consciously went for an encoding that not only completely broke backwards compatibility, they also knew that it was limited to 65,536 characters.
It just adds to the tragedy that UTF-8 had been developed only 2 years later, still a few years before Java was published.

So they should have had a magic crystal ball so they could see that the Unicode consortium would add stuff exceeding 16 bits in the future. Is that your point here?

Seeing as you're so pro-UTF-8 here. Can you name a programming language or framework that internally uses UTF-8 for string handling?