JIT Is Approved For PHP 8 To Open Up Faster CPU Performance
Oh, there exists a BOM that indicates that the file is in UTF-8 (0xEF, 0xBB, 0xBF); the damn Notepad application in Windows keeps storing one every time you save a file in UTF-8 format (or at least it did some years ago).
-
Originally posted by Delgarde View Post
I'm not 100% certain, but I think Rust does. If I remember correctly, they'd looked at how all the other common languages were dealing with Unicode, and concluded that there were only two practical choices – either UCS-4 (with the benefits of fixed width, albeit the costs of high memory use) or UTF-8 (variable width, but the simplest and most compatible form of it). And so they went with the latter, figuring that using four times as much memory as needed for common strings wouldn't be appreciated by users... and also that even fixed width encodings don't allow you to ignore all of the complexities of Unicode, so the benefits of fixed width were often overstated.
Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:
https://www.reddit.com/r/rust/commen...ing_why_vecu8/
(The jargon for that is the "newtype pattern" and it shows up in various languages when you want the same in-memory representation but a different API with different invariants.)
OsString is a single-element struct wrapping either an unconstrained Vec<u8> (POSIX) or a Wtf8Buf, which is a single-element struct which enforces "UTF-8 plus unpaired surrogates" on a Vec<u8> (Windows).
WTF-8 (Wobbly Transformation Format - 8-bit) is a relaxed UTF-8 that allows unpaired surrogates from Windows APIs to round-trip properly. (It's the same reason Unicode includes all of those precomposed character codepoints: they're there so you can reliably round-trip strings from other encodings.)
(The reason they use WTF-8 for OsString on Windows is so that, regardless of platform, converting a String to an OsString within the program (as opposed to at the edge of the program where it interacts with OS APIs) is just a typecast and converting the other way is just a well-formedness check (if not using unchecked APIs) followed by a typecast.)
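To make the round-trip idea concrete, here is a minimal Python sketch. It does not use Rust's actual implementation; it leans on Python's `surrogatepass` error handler, which encodes a lone surrogate as a three-byte sequence in the same way WTF-8 does. (Real WTF-8 has additional rules, such as never encoding a valid surrogate pair as two three-byte sequences, which this sketch does not enforce.) The helper names are made up for illustration.

```python
# A string ending in a lone (unpaired) high surrogate, the kind of thing
# a Windows filename API can legally return.
lone = "caf\ud83d"


def to_wobbly(text: str) -> bytes:
    """Encode like UTF-8, but let lone surrogates through (WTF-8-like)."""
    return text.encode("utf-8", "surrogatepass")


def from_wobbly(data: bytes) -> str:
    """Inverse of to_wobbly()."""
    return data.decode("utf-8", "surrogatepass")


# Strict UTF-8 refuses the lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

wobbly = to_wobbly(lone)
round_tripped = from_wobbly(wobbly)

print(strict_ok)              # False
print(round_tripped == lone)  # True
```

For well-formed text the relaxed encoding yields exactly the same bytes as strict UTF-8, which is why converting in that direction needs no work.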
Originally posted by Delgarde View Post
How often do you actually need random access, out of curiosity? Knowing the length is a common requirement, but that's unrelated to random access... plenty of string implementations store length instead of relying on zero-termination. And iterating over a stream of characters to tokenize a string, that's pretty common. But how often do you actually need a high-performant operation to get the 23rd character of a string? In my experience, it's actually pretty rare, and usually a sign of missing higher APIs...
One of the Rust guys did a great blog post on the topic (which links to two other excellent related ones): Let’s Stop Ascribing Meaning to Code Points by Manish Goregaokar.
Last edited by ssokolow; 01 April 2019, 09:53 AM.
-
Originally posted by AHSauge View Post
You lose random access. If you think scanning the string every time you want to use a length is not a PITA, then... yeah, I don't know what to say really...
-
Originally posted by AHSauge View Post
Can you name a programming language or framework that internally uses UTF-8 for string handling?
Edit: I don't think this is the article I was remembering, but there's a useful discussion of the subject here:
Last edited by Delgarde; 01 April 2019, 03:26 AM.
-
Many people don't know the difference between ANSI, UTF-8, UTF-16, BOM, Unicode, UTF-32, UTF-7 and all that. They don't know or care about encodings.
Then they save a file in their text editor and they have no idea what encoding their text editor saves it as.
Then PHP chokes on the script with a cryptic error saying that the namespace statement must be the first statement in the file, which is very confusing.
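As an illustration of what is going on, the sketch below (in Python, purely for demonstration) shows the three invisible BOM bytes sitting in front of a script and one way to strip them. The likely cause of the PHP error is that those bytes come before the opening `<?php` tag and so count as content preceding the namespace statement; treat that explanation as an assumption. The sample script bytes and the helper name are made up.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file contents as saved by an editor that prepends a BOM.
script = codecs.BOM_UTF8 + b"<?php\nnamespace App;\n"

print(script[:3].hex())        # efbbbf
print(strip_utf8_bom(script))  # b'<?php\nnamespace App;\n'

# When decoding to text, the "utf-8-sig" codec drops the BOM as well.
text = script.decode("utf-8-sig")
```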
-
Originally posted by xnor View Post
So you are absolutely clueless.
Code:
substr("😂", 0, 3)
Code:
$str = "Iñtërnâtiôn😂àlizætiøn";
$pos = strpos($str, "😂");
echo substr($str, 0, $pos);
echo substr($str, $pos);
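The second PHP snippet above can be mirrored in Python on raw bytes, as a rough sketch of why it is safe: a byte offset obtained by searching for a complete, valid UTF-8 substring always falls on a character boundary, so slicing there never splits a character. The variable names are illustrative only.

```python
text = "Iñtërnâtiôn😂àlizætiøn"
data = text.encode("utf-8")        # work on bytes, like PHP's substr()/strpos()
needle = "😂".encode("utf-8")

pos = data.find(needle)            # byte offset, not a character index
head, tail = data[:pos], data[pos:]

# Both halves decode cleanly because the cut is on a character boundary.
print(head.decode("utf-8"))        # Iñtërnâtiôn
print(tail.decode("utf-8"))        # 😂àlizætiøn
```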
Originally posted by xnor View Post
You're an ignorant troll and with that you're done.
You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ..) just again demonstrate this. What a colossal waste of time.
-
Originally posted by AHSauge View Post
You are wrong here, and it shows how blatantly ignorant you are about how UTF-8 works. Seeing as I'm supposedly wrong here, could you give me an example? Where does it break?
Code:
substr("😂", 0, 3)
Originally posted by AHSauge View Post
The answer is obviously counting bananas
You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ..) just again demonstrate this. What a colossal waste of time.
-
Originally posted by xnor View Post
LOL, followed by a bunch of wrong statements. Dunning-Kruger, eh? xD
Originally posted by xnor View Post
Wrong.
Use a single-byte substr() function on a UTF-8 encoded string and it will happily chop off your string at any byte. You do understand why that is a problem, or do YOU not understand what UTF-8 is?
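The failure mode described here is easy to reproduce. The sketch below is a Python analogue of `substr("😂", 0, 3)`, not PHP itself: it slices the encoded bytes at an arbitrary offset and checks whether the result is still valid UTF-8. The helper name is made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = "😂".encode("utf-8")   # one code point, four bytes: f0 9f 98 82
cut = data[:3]                # byte-wise cut in the middle of the character

print(len(data))              # 4
print(is_valid_utf8(data))    # True
print(is_valid_utf8(cut))     # False
```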
Originally posted by xnor View Post
You mean like strlen() does anyway, that is scanning the entire string counting the number of bytes?
Here's what you should say to yourself: "I'm clueless".
What do you want to count? The number of code points? Code units? Abstract characters? Encoded characters? User-perceived characters? Grapheme clusters? Glyphs?
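To show how far apart these answers can be, here is a small Python sketch that measures one short string three ways. The sample string is an arbitrary choice, and counting grapheme clusters (user-perceived characters) is left out because the standard library has no function for it.

```python
# 'e' followed by a combining acute accent, then an emoji.
# A reader sees two characters here.
s = "e\u0301\U0001F602"

code_points = len(s)                           # Python counts code points
utf8_bytes = len(s.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2  # UTF-16 code units

print(code_points)  # 3
print(utf8_bytes)   # 7
print(utf16_units)  # 4
```

None of the three numbers matches the two characters a user perceives, which is the point of the question.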
The thing is that if you know what you're doing then this is not a problem. And UTF-16 doesn't improve anything here.
Also, in my many years working as a software developer I've very rarely needed to count string length (that is any of the above and not number of bytes), but that's probably because I'm not painting websites or doing boring input validation.
Seriously though, if you say you have rarely needed to count string length, then I guess you've stayed away from all functions depending on the length as well? I've been developing for well over 10 years now. Knowing the length of a string is quite a common need. You might not be directly exposed to it, but the functions you are using sure need to know it.
Originally posted by xnor View Post
It is in fact a huge trouble. You again don't know what you're talking about.
Even today there are still many applications that do not handle BOMs properly. It's one of the reasons why UTF-16 should be considered dangerous on the web ... and it's also one of the reasons why the majority of the web is UTF-8.
As for the web, don't kid yourself. The web uses UTF-8 because it's ASCII compatible and consequently also offers Unicode for western languages without requiring significantly more bandwidth. Why would an English-only website double the size of its pages for absolutely no gain?
Some trivia for you: The Python implementation of Unicode switches between ISO 8859-1 / Latin-1, UCS-2 and UCS-4 depending on the string. The benefit? You always have constant-time random access and know the number of Unicode code points.
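That switching can be observed indirectly. The following sketch assumes CPython 3.3 or later (the flexible string representation); other Python implementations may report different numbers. It estimates the storage per character by comparing the sizes of two strings of different lengths, so the fixed object header cancels out. The helper name is made up.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate how many bytes CPython stores per character of ch."""
    small = sys.getsizeof(ch * 1000)
    large = sys.getsizeof(ch * 2000)
    return (large - small) // 1000   # header size cancels out


print(bytes_per_char("a"))           # 1 on CPython (ASCII / Latin-1 range)
print(bytes_per_char("\u0100"))      # 2 on CPython (needs UCS-2)
print(bytes_per_char("\U0001F602"))  # 4 on CPython (needs UCS-4)
```

The width is chosen per string from its widest character, so a single emoji in a long ASCII string makes every character take four bytes.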
Originally posted by xnor View Post
That's quite an ignorant statement.
They consciously went for an encoding that not only completely broke backwards compatibility, they also knew that it was limited to 65,536 characters.
It just adds to the tragedy that UTF-8 had been developed only 2 years later, still a few years before Java was published.
Seeing as you're so pro-UTF-8 here, can you name a programming language or framework that internally uses UTF-8 for string handling?