Libzip 1.8 Released With Support For Zstd Compressed ZIP Files


  • #11
    > there are multiple active implementations in Linux.

    Looks like the quite old Info-ZIP one is the dominant one, though. Yes, there is 7z, but its CLI is completely incompatible with the (Info-ZIP) zip/unzip pair. Apparently there is supposed to be a pre-release of Info-ZIP dating from 2015 rather than 2009, but I have been unable to find it. It also looks like FreeBSD internally has a zip/unzip clone based on libarchive that should be portable, but I haven't found a Linux port.

    I wonder if there is a re-implementation of the Info-ZIP zip/unzip pair based on libzip that would bring in the nice new features.



    • #12
      Originally posted by caligula View Post

      I guess you're joking? Zip is also horrible for streaming access. Have to wait until the end of the file to do anything. People should just let this bastard format die. Is there some company still developing Zip?
      Are tools ignoring the spec? There's supposed to be a Tar/RAR-esque local file header before each compressed file in the stream, containing a duplicate copy of everything needed to unpack the data.

      Code:
   4.3.7 Local file header:

      local file header signature     4 bytes (0x04034b50)
      version needed to extract       2 bytes
      general purpose bit flag        2 bytes
      compression method              2 bytes
      last mod file time              2 bytes
      last mod file date              2 bytes
      crc-32                          4 bytes
      compressed size                 4 bytes
      uncompressed size               4 bytes
      file name length                2 bytes
      extra field length              2 bytes

      file name                       (variable size)
      extra field                     (variable size)
      I have plans for a "detect as much corruption as possible in arbitrary files" tool where, for Zip files, the intent is to check for mismatches between local file headers and central directory records to catch corruption in stuff not protected by the CRCs.
      Last edited by ssokolow; 20 June 2021, 03:25 AM.
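A rough sketch of what such a check could look like, using Python's zipfile to walk the central directory and parsing each local header by hand (field layout from the spec excerpt above; Zip64 and the data-descriptor case, where the local crc/sizes are legitimately zero, are ignored here):

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"

def check_local_headers(fp):
    """Compare central-directory records against their local file headers.

    A sketch only: Zip64 and the data-descriptor case (where the local
    crc/sizes are legitimately zero) are not handled.
    """
    problems = []
    with zipfile.ZipFile(fp) as zf:
        for info in zf.infolist():
            fp.seek(info.header_offset)
            raw = fp.read(30)  # fixed-size part of the local header
            (sig, _ver, _flags, method, _t, _d,
             crc, _csize, _usize, nlen, _elen) = struct.unpack("<4s5H3I2H", raw)
            name = fp.read(nlen).decode("utf-8", "replace")
            if sig != LOCAL_SIG:
                problems.append((info.filename, "bad local header signature"))
            elif name != info.filename:
                problems.append((info.filename, "file name mismatch"))
            elif method != info.compress_type or crc != info.CRC:
                problems.append((info.filename, "method/CRC mismatch"))
    return problems

# Sanity check on a freshly written, uncorrupted archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", "hello world")
print(check_local_headers(buf))  # []
```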



      • #13
        Originally posted by zxy_thf View Post
        and contains a directory at the end of the file, making file-level random access possible.
        android apk files are just zips and access data randomly. what are you talking about?



        • #14
          Originally posted by arun54321 View Post

          android apk files are just zips and access data randomly. what are you talking about?
          I think you misread "possible" as "impossible".

          What you're replying to is saying that, because Zip files have a table of contents at the end, you don't have to do a linear scan through the whole archive to find the one file you're looking for. Just seek to the end of the archive, read the ToC, and then seek directly to the file you need.
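That end-of-archive lookup is easy to see with Python's zipfile, which parses the central directory when the archive is opened and then seeks straight to the requested member (the member names here are made up):

```python
import io
import zipfile

# Build a small archive in memory (made-up member names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "first file")
    zf.writestr("b.txt", "second file")

# Opening for reading parses only the central directory at the end of
# the archive, then read() seeks directly to the member's local header.
with zipfile.ZipFile(buf) as zf:
    data = zf.read("b.txt")

print(data)  # b'second file'
```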



          • #15
            Originally posted by peterdk View Post
            Then only with large text files there is probably a use for ZSTD/LZMA.
            Not quite accurate for Zstd in particular:
            Zstd supports using a pre-trained external dictionary.
            If you have a large collection of very small files, you can pre-train Zstd on them, and then compress each file individually using that dictionary.
            Each file ends up much smaller, leveraging what was learnt across files (instead of needing to restart from a blank state like gz, bzip2, xz, etc.),
            yet each file can still be decompressed individually.

            The question then is more:
            when will libzip support a separate pre-trained dictionary?
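The idea can be sketched with the standard library's zlib, whose preset-dictionary feature is the same principle that Zstd's trained dictionaries generalize (zstd builds the dictionary automatically with `zstd --train`; the dictionary and records below are made up):

```python
import zlib

# A hand-built "dictionary" of byte strings common to all records;
# Zstd's trained dictionaries are the same idea, built automatically.
zdict = b'{"user": "", "status": "active", "role": "member"}'

records = [
    b'{"user": "alice", "status": "active", "role": "member"}',
    b'{"user": "bob", "status": "active", "role": "member"}',
]

for rec in records:
    # Each record is compressed independently, but back-references into
    # the shared dictionary keep the output small.
    c = zlib.compressobj(zdict=zdict)
    packed = c.compress(rec) + c.flush()
    plain = zlib.compress(rec)  # no dictionary, for size comparison

    d = zlib.decompressobj(zdict=zdict)
    assert d.decompress(packed) == rec
    print(len(packed), len(plain))  # dictionary output is typically smaller
```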

            Originally posted by tildearrow View Post
            Currently we have tar, but is often paired with some compression algorithm, which makes the format linear (have to read the whole archive to read just one file).
            This depends on the actual compression format. That's indeed literally the case with plain old gzip.
            But there are other formats:
            - Bzip2 is block-based, each block being BW-sorted and compressed separately. You can decompress any block individually thanks to the design of the algorithm.
            - There are numerous derivatives of gzip's Deflate that compress input in completely independent chunks, some of which produce a stream that is compatible with gzip itself. The advantages are two-fold: it's possible to decompress any random block, and each chunk can be compressed in an independent thread. Bgzip and pigz (by none other than the author of zlib) are examples that are popular in the bioinformatics field exactly for these two advantages.
            - Xz also has a block mode as an option, and this option is automatically turned on if you use multi-threading (each thread compresses its own local block).
            (And technically, Zstd could support blocks too, but currently the preferred method for random access is different, as I've mentioned above.)

            The limitation is that most formats popular on Linux, like tar and cpio, don't have an easy lookup table telling you where to jump to find the block you're looking for.

            In the specific field of bioinformatics, this is solved by having an external index that points to blocks and positions within the compressed stream.
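The independent-chunk trick is easy to demonstrate with plain gzip members, which is essentially what pigz and bgzip emit per block (a sketch with made-up data):

```python
import gzip

chunks = [b"block one\n", b"block two\n", b"block three\n"]

# Compress each chunk as an independent gzip member (what pigz/bgzip
# do per block, each potentially in its own thread).
members = [gzip.compress(c) for c in chunks]

# Concatenated members still form one valid gzip stream...
whole = b"".join(members)
assert gzip.decompress(whole) == b"".join(chunks)

# ...and any single member can be decompressed on its own, given its
# offset and length (which is what the external index stores).
assert gzip.decompress(members[1]) == chunks[1]
```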

            Originally posted by tildearrow View Post
            I wish something like ZIP existed for Linux... (and Unix in general) {...} The only ZIP equivalents I found are:
            The direct equivalent would be to gzip each file individually, and optionally tar or cpio them afterward if you need to pack them into a single file.

            That solves the "each individual file" part of the question.
            (And, for the record, that's how man pages and several other data files are stored in most Linux distributions: each file is individually compressed.)

            If you then use an archive to group the files together, you again have the problem of not knowing where to jump in the file to find them easily.
            Then again, this is traditionally solved with external indexes
            (or by binary searching in an easily resynced format).
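A minimal sketch of that gzip-then-tar approach, with hypothetical file names:

```python
import gzip
import io
import tarfile

files = {"a.txt": b"alpha contents", "b.txt": b"beta contents"}

# Gzip each file individually, then pack the .gz results with tar.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        gz = gzip.compress(data)
        info = tarfile.TarInfo(name + ".gz")
        info.size = len(gz)
        tar.addfile(info, io.BytesIO(gz))

# Any member can be decompressed on its own once located in the
# archive (the tar itself still has to be scanned to find it).
buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    member = tar.extractfile("b.txt.gz").read()
assert gzip.decompress(member) == files["b.txt"]
```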

            Originally posted by tildearrow View Post
            It was its intended use back then (tape archive), but tell me. Who uses tapes for backup that often?
            Ever heard of hierarchical storage systems?
            (Think BCache, BCacheFS, lvm-cache, except that instead of flash vs. spinning rust, these tend to be hard disks vs. tape.)

            Every bioinformatics center I've worked with tends to use such a system as a dirt-cheap way to store gobs of old files by evicting them to tape.

            Originally posted by ssokolow View Post
            I have plans for a "detect as much corruption as possible in arbitrary files" tool where, for Zip files, the intent is to check for mismatches between local file headers and central directory records to catch corruption in stuff not protected by the CRCs.
            ZIPFIX used to work this way.



            • #16
              Originally posted by callegar View Post
              > there are multiple active implementations in Linux.

              Looks like the quite old Info-ZIP one is the dominant one, though. Yes, there is 7z, but it's CLI is completely incompatible with the (Info-ZIP) zip/unzip pair. Apparently, there should be a pre-release of Info-ZIP going back only to 2015 rather than 2009, but I have been unable to find it. [...]
              If it may help anyone:
              - ftp://ftp.info-zip.org/pub/infozip/beta/
              - http://antinode.info/ftp/info-zip/
              - Other places?
              Last edited by Nth_man; 20 June 2021, 03:50 PM.



              • #17
                Originally posted by caligula View Post
                I guess you're joking? Zip is also horrible for streaming access. Have to wait until the end of the file to do anything. People should just let this bastard format die.
                I guess you're joking? Hammers are horrible for driving in screws. Have to hit them much harder and more times. People should just let this bastard tool die.

                IOW: Don't make stupid decisions and then blame someone else for your mistakes. The zip format is superb for a huge number of scenarios. Throwing a tantrum because it isn't the perfect option for *every* case reflects far more on you than it does the archive format / algorithms. Especially when you're wrong to begin with! You can stream from a zip file just fine.

