Announcement

**caligula** · 22 March 2022, 08:02 PM

Originally posted by Sin2x

Having UTF-8 by default only in 2022 is just mind-blowing.

Mainly affects Windows users. Linux and Mac were already UTF-8. On Windows, cmd.exe still assumes 8-bit code pages. Many programmers, e.g. in France, the eastern European genocide country, or China, assume local code pages.

**Delgarde** · 23 March 2022, 02:54 AM

Originally posted by Sin2x

Having UTF-8 by default only in 2022 is just mind-blowing.

It's also completely misleading. This isn't a new capability for Java, which has supported UTF-8 for twenty years. All this means is that if you do anything that encodes or decodes between characters and bytes and you don't specify an encoding, it's now guaranteed to be UTF-8... whereas previously, no such guarantee existed. In practice, most JVMs were already defaulting to UTF-8 in such cases, so nothing has changed... all it means is that failing to specify an encoding is no longer undefined behaviour.
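
To illustrate the point about unspecified encodings, here is a minimal sketch of the difference between relying on the default charset and naming one explicitly. The class and method names (`ExplicitCharsetDemo`, `encodeUtf8`, `decodeUtf8`) are made up for this example; only `String.getBytes`, the `String(byte[], Charset)` constructor, `Charset.defaultCharset()` and `StandardCharsets.UTF_8` are standard library calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Encode with an explicitly named charset, so the result
    // does not depend on whatever the JVM default happens to be.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicitly named charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // two non-ASCII letters

        byte[] explicit = encodeUtf8(text);
        byte[] implicit = text.getBytes(); // uses Charset.defaultCharset()

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit UTF-8 bytes: " + explicit.length);
        System.out.println("implicit bytes: " + implicit.length);
        System.out.println("round trip ok: " + decodeUtf8(explicit).equals(text));
    }
}
```

On a JVM whose default charset is UTF-8 the two byte counts match; on one that defaults to a single-byte code page, the implicit count would be smaller. Code that always passes a charset behaves the same either way.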

**Sin2x** · 23 March 2022, 04:16 AM

Thanks for the explanations, folks. Faith in Java restored.

**uxmkt** · 23 March 2022, 06:46 AM

Originally posted by paulpach

Strings in Java are UTF-16, not UTF-8. (nothing wrong with that)

Nothing wrong?! Surrogate pairs, anyone?
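
For anyone who has not run into surrogate pairs, here is a small sketch of what they look like from Java. The class name `SurrogatePairDemo` and its helper methods are invented for illustration; `String.length`, `codePointCount`, `codePointAt` and `Character.isHighSurrogate` are standard.

```java
final class SurrogatePairDemo {
    // U+1F4A9 lies outside the Basic Multilingual Plane,
    // so UTF-16 needs two 16-bit code units (a surrogate pair) for it.
    static final String PILE_OF_POO = "\uD83D\uDCA9";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a" + PILE_OF_POO + "b";
        System.out.println(codeUnits(s));                            // 4
        System.out.println(codePoints(s));                           // 3
        System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));   // 1f4a9
    }
}
```

Anything that indexes by `char` (such as `charAt` or `substring`) can land in the middle of the pair, which is the usual source of bugs.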

**Weasel** · 23 March 2022, 09:02 AM

Originally posted by paulpach

Strings in Java are UTF-16, not UTF-8. (nothing wrong with that)

What? UTF-16 is an atrocity. It is a pathetic encoding that complicates everything and doesn't provide ANY benefit compared to UTF-8 in most normal conditions. It makes code a fucking mess (especially when dealing with converting between it and UTF-8/ASCII) and on top of it, it's a waste of space as well, because it uses 2 bytes for each normal character and 4 bytes for surrogate pairs, rendering its only redeeming benefits almost nil compared to UTF-8.

Only a select few languages benefit from UTF-16 space-wise, but who gives a shit about them when the most important ones (like English) do not, and it in fact bloats text in other languages too?
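
The space argument can be checked directly. This is a rough sketch with a made-up class name (`EncodedSizeDemo`) and three sample strings chosen for illustration; it only measures encoded byte counts and says nothing about how any particular runtime stores strings in memory.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizeDemo {
    // Size of the string when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of the string when encoded as UTF-16.
    // UTF_16LE is used so that no byte-order mark is added to the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String japanese = "\u65E5\u672C\u8A9E"; // three BMP characters
        String emoji = "\uD83D\uDCA9";          // one character outside the BMP

        System.out.println(utf8Bytes(ascii) + " vs " + utf16Bytes(ascii));       // 5 vs 10
        System.out.println(utf8Bytes(japanese) + " vs " + utf16Bytes(japanese)); // 9 vs 6
        System.out.println(utf8Bytes(emoji) + " vs " + utf16Bytes(emoji));       // 4 vs 4
    }
}
```

ASCII text doubles in UTF-16, text made of BMP characters above U+07FF is a third smaller, and characters outside the BMP cost 4 bytes in both.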

**paulpach** · 23 March 2022, 09:20 AM

Originally posted by uxmkt

Nothing wrong?! Surrogate pairs, anyone?

Well, if Java switched to UTF-8, then we would have to deal with multibyte characters, which are even more obnoxious to deal with than surrogate pairs.
If you don't want these pairs/multibyte characters, you would have to go with UTF-32, but then strings would take a huge amount of space in memory.

So I don't see how UTF-16 is the wrong choice here; every encoding has pros and cons. Note that .NET also uses UTF-16 for strings.

**uxmkt** · 23 March 2022, 07:01 PM

Originally posted by paulpach

Well, if Java switched to UTF-8, then we would have to deal with multibyte characters, which are even more obnoxious to deal with than surrogate pairs.
[...] So I don't see how UTF-16 is the wrong choice here; every encoding has pros and cons. Note that .NET also uses UTF-16 for strings.

A lot of formats/protocols are defined to be in UTF-8. If you already have UTF-8, read and write operations need no extra conversions.

Character substitution, e.g. [C] `for (i=0; i<codepoints; i++) if (u32str[i] == 0x1F4A9) u32str[i] = 0x2603;` (substitute one emoji with another), works best with UTF-32. If you already have UTF-32, no extra conversion will be incurred.

UTF-16 is not practical, today, considering the proliferation of UTF-8: You need to convert from/to UTF-32 for the emoji substitution, and you need to convert from/to UTF-8 for formats/protocols. The picture (or vision) may certainly have been different around 1995 when Windows gained the *W APIs and Emojis were still plaintext and nobody needed surrogate pairs for them.
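
As a rough illustration of the conversion described above, this is one way the same emoji substitution could look on a UTF-16 Java string: widen to an array of code points (effectively UTF-32), substitute, then narrow back. The class and method names (`CodePointReplaceDemo`, `replaceCodePoint`) are hypothetical; `String.codePoints()` and the `String(int[], int, int)` constructor are standard.

```java
final class CodePointReplaceDemo {
    // Replace every occurrence of one code point with another.
    // The string is widened to int code points, mapped, and re-encoded as UTF-16.
    static String replaceCodePoint(String s, int from, int to) {
        int[] codePoints = s.codePoints()
                .map(cp -> cp == from ? to : cp)
                .toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String in = "x\uD83D\uDCA9y"; // 'x', U+1F4A9, 'y'
        String out = replaceCodePoint(in, 0x1F4A9, 0x2603);

        System.out.println(in.length());              // 4
        System.out.println(out.length());             // 3
        System.out.println(out.equals("x\u2603y"));   // true
    }
}
```

The string gets shorter in code units because U+2603 fits in a single `char`, which is why an in-place loop over `char` values would not work here.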

**Jabberwocky** · 24 March 2022, 11:07 AM

I really liked Java at one stage. I used it to learn algorithms, data structures, cryptography, netcode, physics, basic GUI programming and even 3D game development. Many powerful and popular IDEs were built using Java. That was a very long time ago. I'm old and fat now.

These days even the people who use Java isolate to specific JRE/JDK versions. Totally the opposite of what it was like back in the day.

I'm happy to see this brewing over the years; it could help Java integration with external (binary) projects: https://openjdk.java.net/jeps/424 It's been a very long time coming. Even Wayland's progress was fast compared to the demand for this.

**GreenToad** · 24 March 2022, 06:46 PM

Originally posted by Weasel

What? UTF-16 is an atrocity. It is a pathetic encoding that complicates everything and doesn't provide ANY benefit compared to UTF-8 in most normal conditions. It makes code a fucking mess (especially when dealing with converting between it and UTF-8/ASCII) and on top of it, it's a waste of space as well, because it uses 2 bytes for each normal character and 4 bytes for surrogate pairs, rendering its only redeeming benefits almost nil compared to UTF-8.

Only a select few languages benefit from UTF-16 space-wise, but who gives a shit about them when the most important ones (like English) do not, and it in fact bloats text in other languages too?

Strings in Java are stored as one byte per char if the content fits in Latin-1; UTF-16 is used in all other cases. It's been like that by default since Java 9.

**Weasel** · 25 March 2022, 09:23 AM

Originally posted by GreenToad

Strings in Java are stored as one byte per char if the content fits in Latin-1; UTF-16 is used in all other cases. It's been like that by default since Java 9.

I don't know anything about Java, I was merely talking about UTF-16 itself being horrible (and yes, I'm mad that the Windows API, which is extremely popular, uses it extensively for almost everything, instead of UTF-8).

OpenJDK 18 Released With A Simple Web Server, UTF-8 By Default
