I remember him mentioning that the variable is used as an array index aka pointer. Which is problematic when using a variable that is not the size of a pointer. And yes, the compiler is free to cheat I mean optimize the code when using signed integers as behavior is not defined with them.
Announcement
Collapse
No announcement yet.
Patch Proposed For Removing BZIP2 Support From The Linux Kernel
Collapse
X
-
Originally posted by cl333r View PostIf you watch the video he clearly mentions the issue with unsigned vs signed in the context of pointer arithmetic/assembly and how that maps to 64 vs 32 bit code.
But the author of video explains it pretty vaguely, in that he talks much about it to the point you don't understand what is actually the problem unless you know assembly.
At 44:40 he says "..to have twice as many instructions (..) just because we used an unsigned integer when we didn't need that extra precision (...)"
Now if that were all, the only effect would be that the INC instructions would be replaced by LEA, which would be no real difference.
But x86 allows for memory loads in the form of *(base+reg+constant) to be encoded as a single instruction and the compiler makes use of that by loading the values as *(block+i1+0), *(block+i1+1) and so on (without bothering to actually increment i1 and i2). This can be done in a single instruction, but what can't be done in a single instruction is *(block+((i1+1)&0xFFFFFFFF)). That's essentially what it boils down to.
You can fix the problem by using signed integers (then the compiler is allowed to assume that overflow will never occur) or by using size_t (then the width of the variable matches the register's width). Because of the second reason, the problem also wouldn't occur on a 32-bit architecture.Last edited by david-nk; 22 December 2020, 07:09 PM.
- Likes 3
Comment
-
https://gitlab.com/federicomenaquint...ocksort.c#L347 <-- still an issue.
I wonder who here is bored enough to find out what kind of speedup there is. hint hint.
Comment
-
Originally posted by cl333r View PostI don't know assembly so I didn't understand why using signed instead of unsigned would be 2x more efficient:
uint8_t8 *block;
block[uint32_t] // much slower
block[int32_t] // much faster .. why?
Originally posted by cl333r View Postboth index numbers wrap when they get to their maximum values.. it's just that uint32_t has more precision, so what's the deal?
Comment
-
Originally posted by Mangix View Posthttps://gitlab.com/federicomenaquint...ocksort.c#L347 <-- still an issue.
I wonder who here is bored enough to find out what kind of speedup there is. hint hint.
- Likes 1
Comment
-
Originally posted by puleglot View Post
For decompressing a big chunk of data xz is faster. But if you decompress multiple small files, the result may be different. Here are tests I performed several years ago:
Code:$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \; real 0m35.895s user 0m14.232s sys 0m14.121s $ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \; real 0m44.342s user 0m16.842s sys 0m21.459s $ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \+ real 0m10.096s user 0m9.000s sys 0m0.787s $ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \+ real 0m7.846s user 0m7.108s sys 0m0.487s
First cat to get baseline. Fedora uses gzip by default. I think bzip2/lzip/xz do not achieve enough space benefit over gzip compared to increased CPU resource usage. lz4 is about as fast as cat or maybe even faster because less disk IO, but space usage is higher than gzip. And if you want highest compression ratio, you can use zstd with trained dictionary. But it seems per fork CPU usage is higher than w/o dictionary (about 4s compared to 3s).
Code:# cat 10843 106984794 /usr/bin/time find . -type f -exec cat -- {} \; 2.30user 15.30system 0:18.01elapsed 97%CPU (0avgtext+0avgdata 3940maxresident)k 0inputs+0outputs (0major+1252791minor)pagefaults 0swaps /usr/bin/time find . -type f -exec cat -- {} + 0.02user 0.23system 0:00.26elapsed 95%CPU (0avgtext+0avgdata 4148maxresident)k 0inputs+0outputs (0major+1753minor)pagefaults 0swaps # bzip2 10843 33274245 /usr/bin/time find . -type f -exec bzip2 --decompress --keep --stdout -- {} \; 7.78user 16.84system 0:24.95elapsed 98%CPU (0avgtext+0avgdata 4812maxresident)k 0inputs+0outputs (0major+1500047minor)pagefaults 0swaps /usr/bin/time find . -type f -exec bzip2 --decompress --keep --stdout -- {} + 4.18user 0.34system 0:04.58elapsed 98%CPU (0avgtext+0avgdata 5528maxresident)k 0inputs+0outputs (0major+4143minor)pagefaults 0swaps # gzip 10843 35515781 /usr/bin/time find . -type f -exec gzip --decompress --keep --stdout -- {} \; 3.06user 15.55system 0:18.85elapsed 98%CPU (0avgtext+0avgdata 3904maxresident)k 0inputs+0outputs (0major+1271597minor)pagefaults 0swaps /usr/bin/time find . -type f -exec gzip --decompress --keep --stdout -- {} + 0.82user 0.20system 0:01.03elapsed 98%CPU (0avgtext+0avgdata 4104maxresident)k 0inputs+0outputs (0major+1967minor)pagefaults 0swaps # lz4 10843 45559737 /usr/bin/time find . -type f -exec lz4 --decompress --multiple --keep --stdout -- {} \; 2.20user 15.92system 0:18.36elapsed 98%CPU (0avgtext+0avgdata 3972maxresident)k 0inputs+0outputs (0major+1343780minor)pagefaults 0swaps /usr/bin/time find . -type f -exec lz4 --decompress --multiple --keep --stdout -- {} + 0.12user 0.18system 0:00.32elapsed 97%CPU (0avgtext+0avgdata 4020maxresident)k 0inputs+0outputs (0major+3585minor)pagefaults 0swaps # lzip 10843 33266051 /usr/bin/time find . -type f -exec lzip --decompress --keep --stdout -- {} \; 11.94user 19.08system 0:31.46elapsed 98%CPU (0avgtext+0avgdata 4408maxresident)k 0inputs+0outputs (0major+1940034minor)pagefaults 0swaps /usr/bin/time find . -type f -exec lzip --decompress --keep --stdout -- {} + 2.63user 0.28system 0:02.98elapsed 97%CPU (0avgtext+0avgdata 6424maxresident)k 0inputs+0outputs (0major+4827minor)pagefaults 0swaps # xz 10843 33657540 /usr/bin/time find . -type f -exec xz --decompress --keep --stdout -- {} \; 6.96user 18.46system 0:25.78elapsed 98%CPU (0avgtext+0avgdata 4208maxresident)k 0inputs+0outputs (0major+1627563minor)pagefaults 0swaps /usr/bin/time find . -type f -exec xz --decompress --keep --stdout -- {} + 2.56user 0.28system 0:02.87elapsed 99%CPU (0avgtext+0avgdata 4560maxresident)k 0inputs+0outputs (0major+2887minor)pagefaults 0swaps # zstd 10843 33395207 /usr/bin/time find . -type f -exec zstd --decompress --keep --stdout -- {} \; 3.06user 17.38system 0:20.71elapsed 98%CPU (0avgtext+0avgdata 3880maxresident)k 0inputs+0outputs (0major+1507189minor)pagefaults 0swaps /usr/bin/time find . -type f -exec zstd --decompress --keep --stdout -- {} + 0.33user 0.29system 0:00.64elapsed 97%CPU (0avgtext+0avgdata 4444maxresident)k 0inputs+0outputs (0major+4507minor)pagefaults 0swaps # zstd_dict 10843 24593403 /usr/bin/time find . -type f -exec zstd -D ../man.dict --decompress --keep --stdout -- {} \; 3.97user 18.84system 0:23.15elapsed 98%CPU (0avgtext+0avgdata 4108maxresident)k 0inputs+0outputs (0major+2135541minor)pagefaults 0swaps /usr/bin/time find . -type f -exec zstd -D ../man.dict --decompress --keep --stdout -- {} + 0.29user 0.29system 0:00.61elapsed 96%CPU (0avgtext+0avgdata 4360maxresident)k 0inputs+0outputs (0major+4708minor)pagefaults 0swaps
Comment
-
Originally posted by Mangix View PostHopefully this makes it through. bzip2 was abandoned for a long time.
For some bzip2 related laughs, https://youtu.be/yG1OZ69H_-o?t=2357
Last I checked, that issue is still present in the current bzip2 version.
Code:Bool mainGtU ( UInt32 i1, UInt32 i2, UChar* block, UInt16* quadrant, UInt32 nblock, Int32* budget )
Code:9f0: 0f b6 5c 11 02 movzx ebx,BYTE PTR [ecx+edx*1+0x2] 9f5: 38 5c 01 02___ cmp BYTE PTR [ecx+eax*1+0x2],bl 9f9: 75 e0_________ jne 9db <mainGtU+0x1b> 9fb: 0f b6 5c 11 03 movzx ebx,BYTE PTR [ecx+edx*1+0x3] a00: 38 5c 01 03___ cmp BYTE PTR [ecx+eax*1+0x3],bl a04: 75 d5_________ jne 9db <mainGtU+0x1b> a06: 0f b6 5c 11 04 movzx ebx,BYTE PTR [ecx+edx*1+0x4] a0b: 38 5c 01 04___ cmp BYTE PTR [ecx+eax*1+0x4],bl a0f: 75 ca_________ jne 9db <mainGtU+0x1b>
Comment
-
Originally posted by uxmkt View Postbut, using a contemporary gcc 10.2.1 -m32, produces (objdump -Mintel -d blocksort.o) the movz instructions that Chandler was hoping for 4 years ago
After looking at the assembly gcc generates, it's obvious why the int32_t variant is slower. It decides not to use the strict overflow optimization which means in addition to using lea, it also needs cltq/movslq for sign extension. Looks like the newer gcc 10 also behaves the same.
Here's the generated assembly for the arbitrarily chosen 4th iteration:
g++-8 uint32_t (100% performance):
leal 3(%rsi), %ecx
leal 3(%rdi), %eax
movzbl (%rdx,%rcx), %ecx
cmpb %cl, (%rdx,%rax)
jne .L107
g++-8 int32_t (75% performance):
leal 3(%rdi), %ecx
leal 3(%rsi), %eax
cltq
movslq %ecx, %rcx
movzbl (%rdx,%rax), %eax
cmpb %al, (%rdx,%rcx)
jne .L107
g++-8 size_t (133% performance):
movzbl 3(%rdx,%rsi), %eax
cmpb %al, 3(%rdx,%rdi)
jne .L107
I also tried clang-6, but none of the 3 variants are particularly fast because it uses movb. But at least in the case of int32_t, this means it's faster than gcc:
clang++-6 int32_t (99% performance):
movb 3(%rdx,%rcx), %al
cmpb %al, 3(%rdx,%rdi)
jne .LBB10_12
However, the fastest variant is actually when you put the code in a loop and let the compiler do the unrolling:
Code:for (int i=0;i<12;i++) { c1 = block[i1]; c2 = block[i2]; if (c1 != c2) return (c1 > c2); i1++; i2++; }
movzbl 3(%rdx,%rdi), %ecx
movzbl 3(%rdx,%rsi), %r8d
cmpb %r8b, %cl
jne .L269
At least this validates the conventional wisdom that you should use ptrdiff_t/size_t for indices and let the compiler do loop unrolling when you are not sure. It also shows that compilers still do whatever they want and you cannot trust them blindly in cases where every bit of performance matters.
- Likes 2
Comment
Comment