Announcement

**birdie** · 07 June 2019, 08:25 AM

Originally posted by oleid View Post

It highly depends on your data. At work, we compress log files containing CAN data using bzip2. It's slow, granted, but as it's live data and the rate isn't too high, bzip2 is fast enough. Usually. The most important point, however, is, that the resulting files are smaller than for any other algo we tested. Even xz/lzma. If we need faster compression, we usually use zstandard.

So bzip2 _has_ it's niche and it's great that it's getting more traction again.

I even thought about trying to porting it to rust myself, so I'll see if I can contribute, if time permits.

Wow, six likes, yet bzip2 decompression speed is simply atrocious vs. LZMA2 based compressors. Oh, and its compression speed is also quite bad. And it's single-threaded.

So, maybe you're getting a 5% better compression ratio but those compressed files are a PITA to work with (e.g. grep).

**oleid** · 07 June 2019, 10:08 AM

Originally posted by birdie View Post

So, maybe you're getting a 5% better compression ratio but those compressed files are a PITA to work with (e.g. grep).

Well, it's good for data logging. In space restricted areas. If you need to work with the data afterwards, there is still zstandard.

**uxmkt** · 07 June 2019, 11:50 AM

Originally posted by birdie View Post

Wow, six likes, yet bzip2 decompression speed is simply atrocious vs. LZMA2 based compressors. Oh, and its compression speed is also quite bad. And it's single-threaded.

MT versions exist (pbzip2). But the real deal, as you said: bzip2 - the algorithm - falls right in the black hole of the compression universe: it's neither in the vincinity of "contemporary fast" (lz4/zstd..) nor in the vincinity of "contemporary well" (ratios >= 3.5) nor in the vincinity of "compatible with 30-year old ancient systems" (gzip/ncompress/pkzip).

**starshipeleven** · 07 June 2019, 02:04 PM

Originally posted by jjkk View Post

What I am trying to tell is that having decades old linux, bash, tar, gcc, binutils, coreutils, sed, awk, etc., a dozen or so of binary packages in total occupying less than 100 megabytes of space ... And no, you can not reproduce Rust without extending this bare minimum by a binary build of itself.

I still don't see why this is even bad. Rust is basically forcing you to use its own build system because they think it is better.

Are current build systems great? No they are not.

Shellscript is hell, that's why with the better build systems like Meson you are usually writing in the build system language, not in shell, and you won't be using cryptic console commands but high-level and much more readable string manipulation functions.

**starshipeleven** · 07 June 2019, 02:07 PM

Originally posted by monraaf

"Legacy" and "Rust" are antonyms. If you want legacy compatibility, you don't use Rust.

Compatibility is about interfaces. Rust lets you create C-like interfaces that can be used by C/C++ applications.

As long as he uses the same API it will work fine.

**DrYak** · 11 June 2019, 12:55 PM

Originally posted by starshipeleven View Post

Shellscript is hell, that's why with the better build systems like Meson you are usually writing in the build system language, not in shell, and you won't be using cryptic console commands but high-level and much more readable string manipulation functions.

But then you can't just build the thing from scratch because you need to have support for that language.
Requirement to build an automake/autoconf package? just a shell (M4 macros are only used upstream in the development. Not the final product).
Requirement for a CMake/Meson project? Have actual CMake/Meson that works. Now hope that *these* in turn can be built from scratch (spoilert alert: they can... because they fell back on shell script. Tada! :-P )

(of course I also understand the point of CMake/Meson for upstream developpers: Autoconf requires you to master their M4 macro language. CMake is just a bunch of simple text files).

Originally posted by clavko View Post

Everyone talking about dependencies of Rust... What are those? Glibc? Rustc is a compiler built on LLVM infra.

My (old) view on all these "modern language" is that they lack a standard library and a few more or less standardized frameworks.
- with a C/C++ project, you'd be using your OS' c library, C++ standard library, maybe boost if you want fancy tech, and a couple of libraries eventually. You'll have like maximum 6-7 "libBLA-devel" package that needs to be installed before building.
- with a Rust/Go/Node.JS project, half-way through your build, something will need to connect to the internet to download the specific commit 0xdeadbeef from github.com/foo/bar because that one implements some obscure function some project needs. This in a pile of a hundreds or so similar "fetch the commit from github" dependencies for every single function. (and gods forbid what happens if the guy who wrote that "left align string with space" function decides to shut down his repo - Spoiler Alert: Half of the Node.JS Planet goes down)

Disclaimer: I once almost lost my sanity while helping some cluster-oriented specific distro for scientific software packages.

Originally posted by dremon_nl View Post

and Internet connection is only required for 3rd-party crates if your project uses them. In the recent Rust release it's possible to set up a custom crate repository (e.g. on the intranet).

Still, in my book, there's a difference in software that requires maybe 6-7 other "libBLAH-devel" package, and software that needs to download half of github (no matter if from microsoft azure or from local cache) just to barely function.

Originally posted by clavko View Post

[USER="17799"]And then you're in the same mess, because you're developing real world software.

it might come to you as a surprise, but not every single piece of "real world software" needs a local copy of half of github to work.
There's a difference between being "real world" (and going beyond a simple "hello, world" sample) and needing to jump on every latest bit of hype.

Originally posted by clavko View Post

Well, probably not the same, because in C++ you're on your own handling versioning.

It's not necessarily the *compiler's* job to handle dependencies. It might be the *packaging system's* job. both RPM and DEB do a sufficiently decent job at it given the number of distro that have been successful.

Originally posted by clavko View Post

This issue is not about the language, is about reinventing the wheel vs reusing.

And sometime you just need a car, and ask for a car, instead of messing with either re-inventing or re-using 5 wheels. Again, my impression of using standard libraries and clear framework and pulling every single bit out of github.

**DrYak** · 11 June 2019, 01:04 PM

Originally posted by hotaru View Post

in my experience, anything without multi threaded decompression is obsolete outside of uses where I'm stuck with single threaded decompressors. most of the compressed data I have is xz or lz4,

Depends on the compressor. In the specific case of LZ4, it decompresses pretty close to RAM bandwidth. Throwing lots of core would simply congest the bandwidth between your CPU and your RAM sticks.

Also depends the framework in which you're compressing.
- Compressing a long linear stream of data? Yes indeed, very poor parallelizing opportunities, unless using a specific block-based algo (the B from Bzip2), or a special block-modified variant of the others (Resyncable gzip, etc.) but usually you're trading that specifically for better compression (due to larger dictionnary).
- Compressing a file system ? Most of the implementations compress data in chunks (e.g.: BTRFS compresses independent chunks of 128k) no matter what the algo, it is inherently parallelizable as each chunk is independent.

(And also Zstd has a separate different table of settings when in chunk mode. There's a special different setting when asking for level -19 on 128kb chunks - with the dictionnary size still limited to 128kb, obviously, but the other parameters (entropy, search exhaustiveness) tuned up).

Originally posted by atomsymbol

In other words, the number of cases where any of bzip2 / lbzip2 / pbzip2 is able to surpass all three of xz / zstd / brotli is extremely small.

Extremely small:
Basically, the "long string of almost nearly the same log entries" corner case that oleid is hitting with his logs.
And genome read sequences (obviously).

Originally posted by birdie View Post

bzip2 decompression speed is simply atrocious

bzip2 relies on a BWT - a reversible sort. It means that you must run the same core operation (the BWT) in both the compressor and the decompressor, only in reverse direction. each side runs a sort.

As opposed as most LZ-based dictionnary approach, which must manage a whole dictionary and associate look-up system at compression time, but only look-back stuff from the already decompressed stream at decompression time.
They are highly assymetric compression schemes.

Originally posted by birdie View Post

vs. LZMA2 based compressors.

Careful what you're comparing with. the LZ-part (dictionnary) is the *only* assymetric part in modern compressors. The entropy part in modern compressor IS USUALLY NOT.
(usually you have the compressor and decompressor side running their entropy engines in sync)
LZMA is about the limit where you get faster decompression.
Once you move into deeper into machine learning territory (paq compressors etc.) running the ml-entropy engine at decompression is nearly as expensive than at compression and it's the part that dominate the speed.

Originally posted by birdie View Post

And it's single-threaded.

the classic bzip2 implementation happens to be single-threaded. As bzip2 is block-based, there are no technical constrains forcing you to do so (except for memory constrains - you need to process the blocks somewhere). There are parallel implementation en vogue in specific workloads that bzip2 algo catters to (e.g.: very popular in bioinformatics).

Originally posted by oleid View Post

You nailed it. This logfile format with CAN data has similar patterns all over the place.

It might be, for fun, interesting to see how Zstd would fare when you have pre-trained a dictionnary over a large corpus of log files.
Might not be useful in your specific use-case (the embed device might be too memory-constrained), but it would be interesting to see how it reacts

Originally posted by oleid View Post

Once took the original bwt code and did a transformation of a whole file. Basically used it as a preprocessor for other compression algos. The resulting compressed files were some 10% smaller than the un-preprocessed files for XZ.

The thing is that the output of BWT is very peculiar (long swaths of almost sorted strings of symbols - the "almost" part being enough left over information to revert the transform... it's a *reversible* sort, after all). XZ isn't necessarily the best geared toward it. (Maybe if you leave also the "move-to-front" part of it? Then XZ basically works as a glorified RLE which might be better than the old "modified RLE" used in Bzip2)

Originally posted by oleid View Post

But the whole file bwt was unbearable slow xD

Because you need to manipulate the whole block at the same time (as opposed to LZ where you can accelerate stuff by using some hash look-up table that fits as much as possible in cache and then only look-up the relevant part (or best candidate) of your dictionnary).

You need a special kind of architecture to handle this kind of workload (some bioinformatics cluster ? :-P )

Originally posted by hotaru View Post

all the tests I've done have been with the highest level for Zstandard.

That is weird. Normally that's restricted to the above mentionned cases (log files, genomic sequencing data, etc.)

Originally posted by hotaru View Post

a compiled Linux kernel, an initrd image, a few dozen compiled Arch packages,

okay, these are already compressed stuff and can't change a thing (except maybe very aggressive compressors such as ZX and even moreso PAQ can squeeze a tiny bit of entropy bits out, specially if you used GZ compression instead in these binaries of the standard XZ or the fast ZStd.

Originally posted by hotaru View Post

while it is possible that the Linux kernel source code,

kernel.org switched from bz2 to xz for their "highly compressed" alternative because xz both compresses better and decompresses faster.
That's litterally one of the example use case of xz over bzip2.
There's not enough of the "almost but not quite" similarity where bzip2 excel (unless Linus got kidnapped by a bunch of SJW and some idiot took over in the meantime).

Originally posted by hotaru View Post

and a bunch of large text files in various languages all happen to be specific cases where BW excels or where Zstandard performs poorly,

So, basically it all boils down to your large collection of text files and where you got them from.

If they are mostly varied text (like the wikipedia corpus used in compression competitions), the no way, these are already proven to compress better with LZMA-based compressors (or even PAQ if you're open to leave normal everyday compressor and go for more exotic ones).
Something is wrong with your test if you see better performance from bzip2.

If they are more or less alike but not quite (logs, long list of translations, etc.) they might be the corner cases that compresses better than XZ, and if that part of your collection is large enough, it could compensate the dozens of MB lost in the kernel source part and dominate the final result of compression of your collection.
Then Bzip2 performs as expected. Which is an argument to always test compression on what you're most likely to compress, and don't forget to test solution with known corner cases - see oleid's log examples.

Originally posted by hotaru View Post

AFAIK, neither of those do multi threaded decompression, and decompression speed is a lot more important than compression speed for most use cases.

LZ4 beast the crap out of everything in terms of decompression speeds as it works close to memory bandwidth limit.

Zstd is the next best thing, taking lot of the optimisations that make LZ4 fast, and throwing a special type of entropy (tANS - table based assymetric numeric system. A weird type of range-encoding that has been turned on its head around, and can work as a special kind of "entropy table of huffman+remainder" that can reach Shannon limit while retaining huffman speed).

It completely avoids the "slow entropy must be run on both ends".

Zstd is both the fastest, and also the best you can reach before entering machine-learning territory (such as LZMA's (Relatively) fast Markov Model - that's where the MA comes from - or PAQ's horrendously slow neural nets).

Zstd just beats everything else, unless you need better (and slower) compression from XZ's LZMA or PAQ and co.

Yann Collet has single handedly revolutionnized the compression landscape (well him and Jarek Duda, the insane guy who wrote the ANS papers - plural, because it's such a revolutionnary stuff that it needed two papers to explain).

Originally posted by birdie View Post

but given bzip2 pathetic dictionary (900KB) it only applies to small documents (less than ~20MB).

There's no such thing as a "dictionnary" in Bzip2.
It's not LZ based. it's BWT based.
900KB refers to the size of block upon which BWT is performed. The result is very efficiently "sort-of-RLE" encoded, no dictionary involved at all, and instead of entropy it uses a "move to front scheme" to improve the RLE.

**oleid** · 11 June 2019, 01:32 PM