Patch Proposed For Removing BZIP2 Support From The Linux Kernel


  • #21
    I remember him mentioning that the variable is used as an array index, i.e. effectively as a pointer offset, which is problematic when the variable is not the size of a pointer. And yes, the compiler is free to cheat, I mean optimize, the code when signed integers are used, since overflow behavior is not defined for them.



    • #22
      Originally posted by cl333r
      If you watch the video he clearly mentions the issue with unsigned vs signed in the context of pointer arithmetic/assembly and how that maps to 64 vs 32 bit code.
      But the author of the video explains it pretty vaguely; he talks around it so much that you can't tell what the actual problem is unless you already know assembly.
      At 44:40 he says "..to have twice as many instructions (..) just because we used an unsigned integer when we didn't need that extra precision (...)"
      The problem here is that i1 and i2 are 32-bit variables, but they reside in the 64-bit registers rdi and rsi. If one of their values is 0xFFFFFFFF and we increment it, the compiler has to make sure the value wraps around to 0, because it is an unsigned integer. Since the register is actually 64-bit, no wraparound would occur on a plain increment: the value would be 0x100000000 instead. So the compiler uses the leal instruction, which performs the addition in 32-bit space, where 0xFFFFFFFF+1 is 0. Think of it as performing i1=(i1+1)&0xFFFFFFFF instead of i1++.
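
      To make that wraparound requirement concrete, here is a minimal standalone sketch (not bzip2 code) of the semantics the compiler has to preserve:
      Code:
      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          uint32_t i1 = 0xFFFFFFFFu;      /* maximum 32-bit value */
          i1++;                           /* defined behavior: wraps to 0 */
          printf("%u\n", (unsigned)i1);   /* prints 0 */

          /* In a 64-bit register a plain add would yield 0x100000000, so
             the compiler must mask the result back to 32 bits; this is
             what the leal form effectively computes: */
          uint64_t reg = 0xFFFFFFFFu;
          reg = (reg + 1) & 0xFFFFFFFFu;
          printf("%llu\n", (unsigned long long)reg); /* also prints 0 */
          return 0;
      }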

      Now if that were all, the only effect would be that the INC instructions get replaced by LEA, which would make no real difference.
      But x86 allows memory loads of the form *(base+reg+constant) to be encoded as a single instruction, and the compiler makes use of that by loading the values as *(block+i1+0), *(block+i1+1) and so on (without bothering to actually increment i1 and i2). Each of those can be done in a single instruction; what can't be done in a single instruction is *(block+((i1+1)&0xFFFFFFFF)). That's essentially what it boils down to.

      You can fix the problem by using signed integers (then the compiler is allowed to assume that overflow will never occur) or by using size_t (then the width of the variable matches the register's width). For the same width-matching reason, the problem also wouldn't occur on a 32-bit architecture.
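
      For illustration, here is a sketch of the three index choices applied to a mainGtU-style byte comparison (the function names are made up, not bzip2's):
      Code:
      #include <stdint.h>
      #include <stddef.h>

      /* uint32_t index: the 32-bit wraparound of i1+3 must be preserved,
         so the offset cannot simply be folded into the address. */
      int cmp_u32(const uint8_t *block, uint32_t i1, uint32_t i2)
      {
          return block[i1 + 3] != block[i2 + 3];
      }

      /* int32_t index: overflow is undefined, so the offset may be folded,
         at the cost of a cltq/movslq sign extension of the index. */
      int cmp_s32(const uint8_t *block, int32_t i1, int32_t i2)
      {
          return block[i1 + 3] != block[i2 + 3];
      }

      /* size_t index: matches the register width, so block[i1 + 3] maps
         directly onto a single base+index+displacement load. */
      int cmp_szt(const uint8_t *block, size_t i1, size_t i2)
      {
          return block[i1 + 3] != block[i2 + 3];
      }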
      Last edited by david-nk; 22 December 2020, 07:09 PM.



      • #23
        https://gitlab.com/federicomenaquint...ocksort.c#L347 <-- still an issue.

        I wonder who here is bored enough to find out what kind of speedup there is. hint hint.



        • #24
          Originally posted by cl333r
          I don't know assembly so I didn't understand why using signed instead of unsigned would be 2x more efficient:

          uint8_t *block;
          block[uint32_t] // much slower
          block[int32_t] // much faster .. why?
          The presenter could have done a better job of saying that this isn't actually about assembly, how computers work, or anything that makes sense, but about C's arbitrary rules of undefined behaviour: C defines unsigned overflow, while signed overflow is undefined. Because of that, compilers can promote int32_t to size_t (they may assume overflow won't happen) but not uint32_t to size_t, and it was really size_t (i.e. the native bit width) that gave the best assembly.
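
          To spell out what that promotion buys, here is a minimal sketch (hypothetical names, not from the video or from bzip2):
          Code:
          #include <stdint.h>

          /* Signed index: the compiler may assume i + 1 never overflows,
             sign-extend i once, and fold the +1 into the load's
             displacement, i.e. compile this as block[(int64_t)i + 1]. */
          uint8_t load_next_s(const uint8_t *block, int32_t i)
          {
              return block[i + 1];
          }

          /* Unsigned index: i + 1 must be 0 when i == 0xFFFFFFFF, so the
             32-bit wrap has to be materialized before the 64-bit address
             calculation; the +1 cannot become a displacement. */
          uint8_t load_next_u(const uint8_t *block, uint32_t i)
          {
              return block[i + 1];
          }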

          Originally posted by cl333r
          both index numbers wrap when they get to their maximum values.. it's just that uint32_t has more precision, so what's the deal?
          Now you see how, after optimization, it's not what it looks like.



          • #25
            Originally posted by Mangix
            https://gitlab.com/federicomenaquint...ocksort.c#L347 <-- still an issue.

            I wonder who here is bored enough to find out what kind of speedup there is. hint hint.
            A quick test indicated that using int32_t for the index is a fair amount slower than uint32_t, while size_t is fastest. This is unexpected, so I'm not trusting these results until I have dissected the assembly code. But since this is fairly relevant to me, I'll maybe do that tomorrow after work.
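
            For reference, a throwaway harness along these lines is enough for such a quick test (an illustrative sketch, not the actual test code; the index type of probe is the thing to swap):
            Code:
            #include <stdint.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            #define N (1u << 20)

            /* Function under test; rebuild with uint32_t, int32_t or size_t
               indices. noinline keeps the index range opaque to the
               optimizer, as it is in bzip2's mainGtU. */
            __attribute__((noinline))
            static uint32_t probe(const uint8_t *block, uint32_t i1, uint32_t i2)
            {
                return block[i1 + 3] ^ block[i2 + 3];
            }

            static volatile uint32_t sink; /* keep the result alive */

            int main(void)
            {
                uint8_t *block = malloc(N + 16);
                if (!block) return 1;
                for (size_t i = 0; i < N + 16; i++) block[i] = (uint8_t)rand();

                clock_t t0 = clock();
                uint32_t acc = 0;
                for (int rep = 0; rep < 200; rep++)
                    for (uint32_t i1 = 0, i2 = 7; i1 < N; i1++, i2++)
                        acc += probe(block, i1, i2);
                sink = acc;
                clock_t t1 = clock();

                printf("%.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
                free(block);
                return 0;
            }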



            • #26
              Probably needs -O2 or -O3. Who knows.



              • #27
                Originally posted by puleglot

                For decompressing a big chunk of data xz is faster. But if you decompress multiple small files, the result may be different. Here are tests I performed several years ago:

                Code:
                $ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;
                
                real 0m35.895s
                user 0m14.232s
                sys 0m14.121s
                
                $ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;
                
                real 0m44.342s
                user 0m16.842s
                sys 0m21.459s
                
                $ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \+
                
                real 0m10.096s
                user 0m9.000s
                sys 0m0.787s
                
                $ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \+
                
                real 0m7.846s
                user 0m7.108s
                sys 0m0.487s
                My test, on Fedora 33. Directories prearranged per type, with each test run inside its directory to minimize path length differences. In total 10843 man page files, about 102 MiB uncompressed, symlinks removed. Everything compressed with the respective command at --best, or where that option is missing, lz4 -12 and zstd -19. Only a single run each. I tried to use brotli, but it could not decompress multiple files to stdout.

                First, cat to get a baseline. Fedora uses gzip by default. I think bzip2/lzip/xz do not achieve enough space benefit over gzip to justify the increased CPU usage. lz4 is about as fast as cat, or maybe even faster because of less disk IO, but space usage is higher than gzip. And if you want the highest compression ratio, you can use zstd with a trained dictionary. But it seems the per-process CPU usage is then higher than without a dictionary (about 4s compared to 3s in the one-process-per-file runs).

                Code:
                # cat 10843 106984794
                /usr/bin/time find . -type f -exec cat -- {} \;
                2.30user 15.30system 0:18.01elapsed 97%CPU (0avgtext+0avgdata 3940maxresident)k
                0inputs+0outputs (0major+1252791minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec cat -- {} +
                0.02user 0.23system 0:00.26elapsed 95%CPU (0avgtext+0avgdata 4148maxresident)k
                0inputs+0outputs (0major+1753minor)pagefaults 0swaps
                
                # bzip2 10843 33274245
                /usr/bin/time find . -type f -exec bzip2 --decompress --keep --stdout -- {} \;
                7.78user 16.84system 0:24.95elapsed 98%CPU (0avgtext+0avgdata 4812maxresident)k
                0inputs+0outputs (0major+1500047minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec bzip2 --decompress --keep --stdout -- {} +
                4.18user 0.34system 0:04.58elapsed 98%CPU (0avgtext+0avgdata 5528maxresident)k
                0inputs+0outputs (0major+4143minor)pagefaults 0swaps
                
                # gzip 10843 35515781
                /usr/bin/time find . -type f -exec gzip --decompress --keep --stdout -- {} \;
                3.06user 15.55system 0:18.85elapsed 98%CPU (0avgtext+0avgdata 3904maxresident)k
                0inputs+0outputs (0major+1271597minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec gzip --decompress --keep --stdout -- {} +
                0.82user 0.20system 0:01.03elapsed 98%CPU (0avgtext+0avgdata 4104maxresident)k
                0inputs+0outputs (0major+1967minor)pagefaults 0swaps
                
                # lz4 10843 45559737
                /usr/bin/time find . -type f -exec lz4 --decompress --multiple --keep --stdout -- {} \;
                2.20user 15.92system 0:18.36elapsed 98%CPU (0avgtext+0avgdata 3972maxresident)k
                0inputs+0outputs (0major+1343780minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec lz4 --decompress --multiple --keep --stdout -- {} +
                0.12user 0.18system 0:00.32elapsed 97%CPU (0avgtext+0avgdata 4020maxresident)k
                0inputs+0outputs (0major+3585minor)pagefaults 0swaps
                
                # lzip 10843 33266051
                /usr/bin/time find . -type f -exec lzip --decompress --keep --stdout -- {} \;
                11.94user 19.08system 0:31.46elapsed 98%CPU (0avgtext+0avgdata 4408maxresident)k
                0inputs+0outputs (0major+1940034minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec lzip --decompress --keep --stdout -- {} +
                2.63user 0.28system 0:02.98elapsed 97%CPU (0avgtext+0avgdata 6424maxresident)k
                0inputs+0outputs (0major+4827minor)pagefaults 0swaps
                
                # xz 10843 33657540
                /usr/bin/time find . -type f -exec xz --decompress --keep --stdout -- {} \;
                6.96user 18.46system 0:25.78elapsed 98%CPU (0avgtext+0avgdata 4208maxresident)k
                0inputs+0outputs (0major+1627563minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec xz --decompress --keep --stdout -- {} +
                2.56user 0.28system 0:02.87elapsed 99%CPU (0avgtext+0avgdata 4560maxresident)k
                0inputs+0outputs (0major+2887minor)pagefaults 0swaps
                
                # zstd 10843 33395207
                /usr/bin/time find . -type f -exec zstd --decompress --keep --stdout -- {} \;
                3.06user 17.38system 0:20.71elapsed 98%CPU (0avgtext+0avgdata 3880maxresident)k
                0inputs+0outputs (0major+1507189minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec zstd --decompress --keep --stdout -- {} +
                0.33user 0.29system 0:00.64elapsed 97%CPU (0avgtext+0avgdata 4444maxresident)k
                0inputs+0outputs (0major+4507minor)pagefaults 0swaps
                
                # zstd_dict 10843 24593403
                /usr/bin/time find . -type f -exec zstd -D ../man.dict --decompress --keep --stdout -- {} \;
                3.97user 18.84system 0:23.15elapsed 98%CPU (0avgtext+0avgdata 4108maxresident)k
                0inputs+0outputs (0major+2135541minor)pagefaults 0swaps
                /usr/bin/time find . -type f -exec zstd -D ../man.dict --decompress --keep --stdout -- {} +
                0.29user 0.29system 0:00.61elapsed 96%CPU (0avgtext+0avgdata 4360maxresident)k
                0inputs+0outputs (0major+4708minor)pagefaults 0swaps



                • #28
                  Originally posted by Mangix
                  Hopefully this makes it through. bzip2 was abandoned for a long time.
                  For some bzip2 related laughs, https://youtu.be/yG1OZ69H_-o?t=2357
                  Last I checked, that issue is still present in the current bzip2 version.
                  You should check more often. bzip2 1.0.8, which still uses
                  Code:
                  Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block, UInt16* quadrant, UInt32 nblock, Int32* budget )
                  but, compiled with a contemporary gcc 10.2.1 at -m32, produces (per objdump -Mintel -d blocksort.o) the movzx instructions that Chandler was hoping for 4 years ago:
                  Code:
                  9f0: 0f b6 5c 11 02  movzx ebx,BYTE PTR [ecx+edx*1+0x2]
                  9f5: 38 5c 01 02     cmp   BYTE PTR [ecx+eax*1+0x2],bl
                  9f9: 75 e0           jne   9db <mainGtU+0x1b>
                  9fb: 0f b6 5c 11 03  movzx ebx,BYTE PTR [ecx+edx*1+0x3]
                  a00: 38 5c 01 03     cmp   BYTE PTR [ecx+eax*1+0x3],bl
                  a04: 75 d5           jne   9db <mainGtU+0x1b>
                  a06: 0f b6 5c 11 04  movzx ebx,BYTE PTR [ecx+edx*1+0x4]
                  a0b: 38 5c 01 04     cmp   BYTE PTR [ecx+eax*1+0x4],bl
                  a0f: 75 ca           jne   9db <mainGtU+0x1b>
                  Last edited by uxmkt; 23 December 2020, 04:27 PM. Reason: [code] section lined up



                  • #29
                    Originally posted by uxmkt
                    but, using a contemporary gcc 10.2.1 -m32, produces (objdump -Mintel -d blocksort.o) the movz instructions that Chandler was hoping for 4 years ago
                    On 32-bit, uint32_t is equivalent to size_t, so the problem being discussed never existed there in the first place. It's expected that the generated assembly is efficient.


                    After looking at the assembly gcc generates, it's obvious why the int32_t variant is slower. It decides not to use the strict-overflow optimization, which means that in addition to using lea, it also needs cltq/movslq for sign extension. The newer gcc 10 behaves the same way.

                    Here's the generated assembly for the arbitrarily chosen 4th iteration:
                    g++-8 uint32_t (100% performance):
                    Code:
                    leal 3(%rsi), %ecx
                    leal 3(%rdi), %eax
                    movzbl (%rdx,%rcx), %ecx
                    cmpb %cl, (%rdx,%rax)
                    jne .L107


                    g++-8 int32_t (75% performance):
                    Code:
                    leal 3(%rdi), %ecx
                    leal 3(%rsi), %eax
                    cltq
                    movslq %ecx, %rcx
                    movzbl (%rdx,%rax), %eax
                    cmpb %al, (%rdx,%rcx)
                    jne .L107


                    g++-8 size_t (133% performance):
                    Code:
                    movzbl 3(%rdx,%rsi), %eax
                    cmpb %al, 3(%rdx,%rdi)
                    jne .L107


                    I also tried clang-6, but none of the three variants is particularly fast, because it uses movb. At least in the case of int32_t, that makes it faster than gcc:
                    clang++-6 int32_t (99% performance):
                    Code:
                    movb 3(%rdx,%rcx), %al
                    cmpb %al, 3(%rdx,%rdi)
                    jne .LBB10_12

                    However, the fastest variant is actually when you put the code in a loop and let the compiler do the unrolling:
                    Code:
                    for (int i=0;i<12;i++)
                    {
                      c1 = block[i1]; c2 = block[i2];
                      if (c1 != c2) return (c1 > c2);
                      i1++; i2++;
                    }
                    g++-8 size_t, with loop (147% performance):
                    Code:
                    movzbl 3(%rdx,%rdi), %ecx
                    movzbl 3(%rdx,%rsi), %r8d
                    cmpb %r8b, %cl
                    jne .L269


                    At least this validates the conventional wisdom that you should use ptrdiff_t/size_t for indices and let the compiler do the loop unrolling when you are not sure. It also shows that compilers still do whatever they want, and you cannot trust them blindly in cases where every bit of performance matters.
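
                    Put together, a sketch of that fastest variant, size_t indices plus the loop (my naming, not bzip2's):
                    Code:
                    #include <stdint.h>
                    #include <stddef.h>
                    #include <stdbool.h>

                    /* Looped byte comparison with size_t indices, the
                       combination that measured fastest above; the compiler
                       unrolls the fixed 12 iterations itself. */
                    static bool greater_u(const uint8_t *block, size_t i1, size_t i2)
                    {
                        for (int i = 0; i < 12; i++) {
                            uint8_t c1 = block[i1], c2 = block[i2];
                            if (c1 != c2) return c1 > c2;
                            i1++; i2++;
                        }
                        return false; /* first 12 bytes equal */
                    }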



                    • #30
                      Sounds like a merge request should be submitted.

                      Note that your for loop uses the overflow optimization since you’re using int.

