JIT Is Approved For PHP 8 To Open Up Faster CPU Performance

  • #11
    At almost the same time, https://v8.dev/blog/jitless is taking the reverse approach.

    • #12
      The PHP team have been pretty good about not breaking things. Some of the stuff I've seen is stuff you shouldn't have been doing in the first place, once allowed, now turned off. That, and fixing long-standing issues. As for security, I would say the language itself is pretty locked down. Now, what you do with that language, that's on you; it can only hold your hand so much. Badly written code shouldn't reflect on the language itself. I think a lot of that comes from having to still support 5.x.

      For most workloads the JIT isn't really going to do much; most of the time, tuning your OPcache will really speed things up. Where it's really going to make a difference is big-dataset kinds of work: processing tens of thousands of rows or elements.
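
      For reference, a minimal php.ini sketch of the knobs involved, assuming PHP 8 with the opcache extension loaded; the numbers are illustrative starting points, not tuned recommendations:
      Code:
      ; Cache compiled scripts so requests skip parsing and compilation entirely.
      opcache.enable=1
      opcache.memory_consumption=256        ; MB of shared memory for opcodes
      opcache.max_accelerated_files=20000   ; raise for large frameworks
      opcache.validate_timestamps=0         ; skip per-request stat() calls (reload on deploy)

      ; The JIT sits on top of OPcache; it mostly pays off for long, CPU-bound loops.
      opcache.jit_buffer_size=128M
      opcache.jit=tracing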

      I've started to use Server Timing to pass along how long things take and how many SQL calls I'm making per request. It's not like some of the other tooling out there, but it's enough for me to get a sense of what's going on, even in production.
      The Server-Timing header communicates one or more metrics and descriptions for a given request-response cycle. It is used to surface any backend server timing metrics (e.g. database read/write, CPU time, file system access, etc.) in the developer tools in the user's browser or in the PerformanceServerTiming interface.
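
      As a rough sketch of how that can look in PHP (the metric names and the usleep() stand-in for real SQL work are made up for illustration):
      Code:
      <?php
      // Time the request and report backend metrics via the Server-Timing header.
      $appStart = hrtime(true);

      $sqlCalls = 0;
      $dbStart  = hrtime(true);
      usleep(5000);                                   // stand-in for actual queries
      $sqlCalls++;
      $dbMs = (hrtime(true) - $dbStart) / 1e6;

      $totalMs = (hrtime(true) - $appStart) / 1e6;

      // Shows up in the browser's dev tools and in the PerformanceServerTiming API.
      header(sprintf(
          'Server-Timing: db;dur=%.1f;desc="%d SQL calls", app;dur=%.1f;desc="total"',
          $dbMs,
          $sqlCalls,
          $totalMs
      ));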

      • #13
        Originally posted by xnor View Post
        I don't think you know what you're talking about. UTF-16 is the worst possible choice. It's still variable length but also byte order dependent.
        Yeah, its predecessor UCS-2 seemed like a good choice at the time, a fixed-width encoding that could cope with all of Unicode. But UTF-16 has never been a particularly good idea in its own right - it's convenient as a mostly-compatible extension of the established UCS-2 and it's more compact for certain languages... but as you say, it has all the disadvantages of a variable-width encoding _and_ of byte-order issues. Essentially it's a historic anomaly...

        Of course, dealing with Unicode text has so many headaches that the actual encoding is just the starter. Don't get me started on combining marks, and whether an accented character is one character or two...

        • #14
          Originally posted by hreindl View Post

          that I am tired of people like you?

          surely PHP breaks BC here and there, but there is no code that I have written in my whole life that wasn't easy to port to native 7.1 with type hints, return types and nullable return types in strict mode
          So because it's easy for you, it cannot possibly ever be a problem at all, right?
          In PHP 8, mbstring.func_overload will be dropped completely. It's not out of the blue, and I'd say it's a good move. However, do you think it will be easy for people relying on it to port their code to PHP 8?
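
          A small sketch of what that porting work looks like; the string is just an example and the byte counts assume UTF-8:
          Code:
          <?php
          $name = "Grüße";

          // Under mbstring.func_overload=2 (removed in PHP 8), strlen() was silently
          // rerouted to mb_strlen() and returned 5. Without the overload it counts bytes:
          var_dump(strlen($name));              // int(7) - "ü" and "ß" are 2 bytes each
          // Porting means spelling the intent out explicitly:
          var_dump(mb_strlen($name, 'UTF-8'));  // int(5)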

          Originally posted by hreindl View Post
          which is bullshit, especially now that there is a native uppercase ß
          Bullshit? Do you know what you're talking about? Case folding is very much a thing, and it's not the same as lowercasing.
          The ß was just an example. There are plenty of others.
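
          To illustrate the difference, a quick sketch assuming PHP 7.3+ with mbstring; exact mappings depend on the Unicode tables your build ships with:
          Code:
          <?php
          // Lowercasing and case folding are not the same operation.
          var_dump(mb_strtolower('ẞ'));                      // "ß" - U+1E9E lowercases to U+00DF
          var_dump(mb_convert_case('ß', MB_CASE_FOLD));      // "ss" - full case folding expands ß
          // A case-insensitive comparison therefore folds both sides instead of lowercasing:
          var_dump(mb_convert_case('STRASSE', MB_CASE_FOLD)
               === mb_convert_case('Straße', MB_CASE_FOLD)); // bool(true)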

          Originally posted by hreindl View Post
          and because it's a minefield, it's not as easy as you think to "just support native UTF-8 and be done" without either large and often subtle BC breaks or a dramatic performance drop. So again: unlike the Python 2/3 transition, it was a no-brainer to bring some hundreds of thousands of lines of code to PHP 7.3 in full strict mode without a single warning in production across a ton of running instances. If you don't like PHP, just use something else, but creep away with your "let me guess" bullshit - PHP 6 didn't work out that well, for good reasons
          For starters, I've never said PHP needs to or should support UTF-8 natively. What I've said is that PHP should have had Unicode support ages ago.
          Second, PHP 6 failed due to a lack of competence and interest. When you have a single-digit number of people who fully understand Unicode, i18n and l10n, you're not exactly set up for success when you're trying to make massive changes to a language all at once. All the people used to ISO 8859-1 / Latin-1 working well weren't going to have any particular interest in working on it. As one of the developers put it, there was a lack of mindshare. Basically, the ones who understood got bored of trying to explain to and push the people who didn't understand into doing the necessary work. It all ground to a halt.

          • #15
            Originally posted by xnor View Post
            That statement makes no sense. Either string functions support the encoding used or not. And PHP's normal string functions do not. The mb_ functions however do.
            It makes every bit of sense. It's just that you don't understand what UTF-8 is.
            UTF-8 was designed to be fully backwards compatible with ASCII. Anything that can handle ASCII can also handle UTF-8. There are no extra null bytes that will break your program, and every character consists of a byte sequence that is unique. If you look for a byte that is 0x41, you will always and only find the letter 'A'; no other character in UTF-8 contains that byte. So as long as you know how to handle ASCII, you can handle UTF-8 as well. If your ASCII program tries to split a string at the first letter 'A', it will do so successfully, and both strings you're left with will still be valid UTF-8. The only operation that won't give you a valid result is strlen, as it will give you the number of bytes, not characters. Aside from that, any ASCII string is automatically valid UTF-8, and any UTF-8 string consisting only of single-byte characters (0x00-0x7F) is automatically valid ASCII.
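
            A quick PHP sketch of both points, assuming UTF-8 as the encoding; the expected results are noted in comments:
            Code:
            <?php
            $s = "Bjørn and Anna";                  // valid UTF-8, "ø" takes 2 bytes

            // 0x41 ('A') can only ever be the letter A, never part of a multi-byte
            // sequence, so a byte-oriented split still yields two valid UTF-8 strings.
            var_dump(explode('A', $s));             // ["Bjørn and ", "nna"]

            // Length is where byte-oriented functions diverge:
            var_dump(strlen($s));                   // int(15) - bytes
            var_dump(mb_strlen($s, 'UTF-8'));       // int(14) - characters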

            Extended ASCII, on the other hand, is where the trouble starts. UTF-8 is certainly not compatible with that, as its multi-byte characters rely on having the 8th bit set.
            Originally posted by xnor View Post
            Variable length encoding is not a PITA at all. It's easy to deal with, needs no special BOM, is byte order agnostic.
            You lose random access. If you think scanning the string every time you need its length is not a PITA, then... yeah, I don't know what to say really...
            A BOM is no trouble at all. You read it and convert the data into your native byte order. Depending on the platform you're on, there's even hardware support for that operation, so you don't even lose any significant performance.

            Originally posted by xnor View Post
            I don't think you know what you're talking about. UTF-16 is the worst possible choice. It's still variable length but also byte order dependent.
            No, I get the feeling you're the one who doesn't know what they're talking about. Yes, UTF-16 is variable length, but that doesn't mean you always have to deal with it that way. If you know your string doesn't contain surrogates, you can assume constant length and regain constant-time random access. For the vast majority of the processing any UTF-16 code will ever do, it will be dealing with strings without surrogates.
            Another approach is to be ignorant and implement UTF-16 as of Unicode 3.0. At that point, Unicode didn't contain enough characters to need more than 16 bits.
            Originally posted by xnor View Post
            Luckily in Linux land we didn't make the same mistake as Microsoft did with their Windows APIs, C# or Sun did with Java.
            That's quite an ignorant statement. When Microsoft and Sun did their work, Unicode was 16 bits and thus UTF-16 could falsely be assumed to be a fixed-size encoding. I mean, technically they didn't even go for UTF-16, but UCS-2.
            Last edited by AHSauge; 31 March 2019, 05:34 AM. Reason: Stupid typos

            • #16
              Originally posted by AHSauge View Post
              It makes every bit of sense. It's just that you don't understand what UTF-8 is.
              LOL, followed by a bunch of wrong statements. Dunning-Kruger, eh? xD

              Originally posted by AHSauge View Post
              Anything that can handle ASCII can also handle UTF-8.
              Wrong.
              Use a single-byte substr() function on a UTF-8 encoded string and it will happily chop off your string at any byte. You do understand why that is a problem, or do YOU not understand what UTF-8 is?

              Originally posted by AHSauge View Post
              There are no extra null bytes that will break your program, and every character consists of a byte sequence that is unique. If you look for a byte that is 0x41, you will always and only find the letter 'A'; no other character in UTF-8 contains that byte. So as long as you know how to handle ASCII, you can handle UTF-8 as well. If your ASCII program tries to split a string at the first letter 'A', it will do so successfully, and both strings you're left with will still be valid UTF-8. The only operation that won't give you a valid result is strlen, as it will give you the number of bytes, not characters. Aside from that, any ASCII string is automatically valid UTF-8, and any UTF-8 string consisting only of single-byte characters (0x00-0x7F) is automatically valid ASCII.
              Sure, there is some compatibility. Thank you for making my case for UTF-8.

              Originally posted by AHSauge View Post
              Extended ASCII, on the other hand, is where the trouble starts. UTF-8 is certainly not compatible with that, as its multi-byte characters rely on having the 8th bit set.
              Thank you for teaching grandma how to suck eggs.

              Originally posted by AHSauge View Post
              If you think scanning the string every time you need its length is not a PITA, then... yeah, I don't know what to say really...
              You mean like strlen() does anyway, that is scanning the entire string counting the number of bytes?
              Here's what you should say to yourself: "I'm clueless".

              What do you want to count? The number of code points? Code units? Abstract characters? Encoded characters? User-perceived characters? Grapheme clusters? Glyphs?
              The thing is that if you know what you're doing then this is not a problem. And UTF-16 doesn't improve anything here.
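
              In PHP terms, those layers give different answers for the same string; this sketch assumes the mbstring and intl extensions are available:
              Code:
              <?php
              // "é" written as 'e' plus a combining acute accent (U+0301).
              $s = "e\u{0301}";

              var_dump(strlen($s));              // int(3) - bytes
              var_dump(mb_strlen($s, 'UTF-8'));  // int(2) - code points
              var_dump(grapheme_strlen($s));     // int(1) - user-perceived character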

              Also, in my many years working as a software developer I've very rarely needed to count string length (that is any of the above and not number of bytes), but that's probably because I'm not wasting my time on painting websites or doing boring input validation.

              Originally posted by AHSauge View Post
              BOM is not trouble at all.
              It is in fact a huge source of trouble. You again don't know what you're talking about.
              Even today there are still many applications that do not handle BOMs properly. It's one of the reasons why UTF-16 should be considered dangerous on the web ... and it's also one of the reasons why the majority of the web is UTF-8.

              Originally posted by AHSauge View Post
              No, I get the feeling you're the one not knowing what you're talking about. Yes, UTF-16 is variable length, but that doesn't mean you always have to deal with it that way.
              😂

              Originally posted by AHSauge View Post
              That's quite an ignorant statement. When Microsoft and Sun did their work, Unicode was 16 bits and thus UTF-16 could falsely be assumed to be a fixed sized encoding. I mean, technically they didn't even go for UTF-16, but UCS-2.
              That's quite an ignorant statement.
              They consciously went for an encoding that not only completely broke backwards compatibility, they also knew that it was limited to 65,536 characters.
              It just adds to the tragedy that UTF-8 had been developed only 2 years after UCS-2, still a few years before Java was published.
              Last edited by xnor; 31 March 2019, 07:29 AM.

              • #17
                Originally posted by linner View Post
                As a security researcher this is going to be awesome. Don't get me wrong, I write PHP all the time for my own sites. It's way better than the Perl stuff everything was originally written in (huge Perl fan, BTW). However, PHP is not known for correctness and security, and JIT adds a whole new dimension of exploits.

                If you're wondering, on sites I need to be correct, fast, and secure I use C/C++ and sometimes Lua(JIT). For everything else, general use, I use PHP. I think Go/Rust are OK but they don't offer anything I can't get from traditional, very well proven methods.
                If you like Perl, you might want to look into Ruby; it is inspired by Perl, and it is my understanding that it is popular among Perl and ex-Perl programmers. Personally, I like neither Perl nor Ruby, though.

                C and C++ are not very suitable languages for web sites, and you have to bother with a lot of low-level details such as freeing memory. It is much better to use a modern garbage-collected language suited for web development, such as Python, C# or Go.

                Originally posted by AHSauge View Post
                Let me guess: Still no native support for Unicode?
                It's beginning to be a bit too stupid. This is a standard from the early 90's, and this joke of a language hasn't succeeded in including support.
                PHP does have Unicode support, but I've had problems with it because there is UTF-8 and UTF-8 with BOM. If you happen to save a file as UTF-8 with BOM, PHP chokes, saying that the first statement must be a namespace declaration.

                Originally posted by AJenbo View Post

                PHP defaults to UTF-8 and as long as you use the mb_* extension you should be pretty set.
                Yes, but it has problems handling UTF-8 with BOM.
                So if you accidentally save the file as UTF-8 with BOM, PHP chokes and outputs a confusing error message saying that the first statement must be a namespace declaration.
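
                One hedged workaround is to strip the three BOM bytes (EF BB BF) from the offending file so PHP really sees the namespace declaration first; the file name here is just an example:
                Code:
                <?php
                // Remove a UTF-8 BOM so the file genuinely starts with the opening PHP tag.
                $path = 'Example.php';
                $code = file_get_contents($path);

                if (strncmp($code, "\xEF\xBB\xBF", 3) === 0) {
                    file_put_contents($path, substr($code, 3));
                }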

                Originally posted by Chugworth View Post
                People tend to think of PHP as a language for web development, but I use PHP for a lot of scripting work in Linux and it works very well for that. I tend to prefer its syntax over Python, and their website has very good documentation.
                Yes, you can use PHP for CLI and scripting too. I've done so.
                Python has the advantage of being installed by default, though. Python also has a cleaner API that is more consistent and doesn't return -1 and the like. Python is also arguably more object-oriented, as even strings are objects with methods on them.
                Python has things I don't like too, such as class methods having to declare self as an argument, and the confusing distinction between class instance variables/properties and static properties.

                Originally posted by hreindl View Post

                Well, you can write everything from CLI tools up to honeypot servers in PHP and it just works

                people likely don't know anything except what they read 10 years ago, and that's it in their stubborn minds
                Well you can write most things in most languages. But some languages are less suited for some things.
                PHP can't do asynchronous operations, and is not suited for things like WebSockets and Server-Sent Events.

                • #18
                  Originally posted by DoMiNeLa10 View Post
                  Considering that PHP is pretty nasty about backwards compatibility, I think it would be fine for them to announce a major change in how the language works.
                  WordPress keeps PHP afloat, so it is very important that any change in PHP does not break backwards compatibility with WordPress.

                  • #19
                    Originally posted by xnor View Post
                    LOL, followed by a bunch of wrong statements. Dunning-Kruger, eh? xD
                    Riight, because you're obviously an expert...

                    Originally posted by xnor View Post
                    Wrong.
                    Use a single-byte substr() function on an UTF-8 encoded string and it will happily chop off your string at any byte. You do understand why that is a problem, or do YOU not understand what UTF-8 is?
                    You are wrong here, and it shows how blatantly ignorant you are about how UTF-8 works. Seeing as I'm supposedly wrong here, could you give me an example? Where does it break?

                    Originally posted by xnor View Post
                    You mean like strlen() does anyway, that is scanning the entire string counting the number of bytes?
                    Here's what you should say to yourself: "I'm clueless".

                    What do you want to count? The number of code points? Code units? Abstract characters? Encoded characters? User-perceived characters? Grapheme clusters? Glyphs?
                    The thing is that if you know what you're doing then this is not a problem. And UTF-16 doesn't improve anything here.

                    Also, in my many years working as a software developer I've very rarely needed to count string length (that is any of the above and not number of bytes), but that's probably because I'm not painting websites or doing boring input validation.
                    The answer is obviously counting bananas

                    Seriously though, if you say you've rarely needed to count string length, then I guess you've stayed away from all functions depending on the length as well? I've been developing for well over 10 years now. Knowing the length of a string is quite a common need. You might not be directly exposed to it, but the functions you're using sure need to know it.

                    Originally posted by xnor View Post
                    It is in fact a huge trouble. You again don't know what you're talking about.
                    Even today there are still many applications that do not handle BOMs properly. It's one of the reasons why UTF-16 should be considered dangerous on the web ... and it's also one of the reasons why the majority of the web is UTF-8.
                    Riiiiight. UTF-16... dangerous... suuure... and I suppose you don't see any security issues with UTF-8 whatsoever, right? Security issues due to non-shortest-form UTF-8 encoded strings don't exist, right?

                    As for the web, don't kid yourself. The web uses UTF-8 because it's ASCII compatible and consequently also offers Unicode for Western languages without requiring significantly more bandwidth. Why would an English-only website double the size of its pages for absolutely no gain?
                    Originally posted by xnor View Post
                    Some trivia for you: the Python implementation of Unicode strings switches between ISO 8859-1 / Latin-1, UCS-2 and UCS-4 depending on the string. The benefit? You always have constant-time random access and know the number of Unicode code points.

                    Originally posted by xnor View Post
                    That's quite an ignorant statement.
                    They consciously went for an encoding that not only completely broke backwards compatibility, they also knew that it was limited to 65,536 characters.
                    It just adds to the tragedy that UTF-8 had been developed only 2 years later, still a few years before Java was published.
                    So they should have had a magic crystal ball so they could see that the Unicode Consortium would add characters exceeding 16 bits in the future. Is that your point here?

                    Seeing as you're so pro-UTF-8 here. Can you name a programming language or framework that internally uses UTF-8 for string handling?

                    • #20
                      Originally posted by AHSauge View Post
                      You are wrong here, and it shows how blatantly ignorant you are about how UTF-8 works. Seeing as I'm supposedly wrong here, could you give me an example? Where does it break?
                      So you are absolutely clueless.
                      Code:
                      substr("😂", 0, 3)
                      Results in an invalid UTF-8 sequence.
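
                      For contrast, a sketch of the multi-byte variant, which counts characters instead of bytes and leaves the sequence intact (UTF-8 assumed):
                      Code:
                      <?php
                      $emoji = "😂";                               // one code point, 4 bytes in UTF-8
                      var_dump(substr($emoji, 0, 3));              // 3 raw bytes - broken, invalid UTF-8
                      var_dump(mb_substr($emoji, 0, 3, 'UTF-8'));  // "😂" - offsets and lengths are in characters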


                      Originally posted by AHSauge View Post
                      The answer is obviously counting bananas
                      You're an ignorant troll, and with that you're done.
                      You're absolutely clueless about any of this. All your further comments (on string length and programming with strings in general, security, the web, ...) just demonstrate this again. What a colossal waste of time.
