Comments
Smerity • 2018-01-25
Agreed that if you have the control (and potentially the time depending on the algorithm), type specific compression is the way to go. Having said that, zstd beats Snappy handily for text ^_^
On enwik8 (100MB of XML-encoded Wikipedia articles, mostly just text), zstd gets you to ~36MB, Snappy gets you to ~58MB, while gzip will also get you to ~36MB. If you turn up the compression dials on zstd, you can get down to 27MB, though compression takes 52 seconds on my laptop instead of 2. Decompression takes ~0.3 seconds at either compression level.
wolf550e • 2018-01-25
ppmd will get you better compression on text (better than zstd and even better than lzma), but it is slow to compress and decompress. Use `7z a -m0=PPMd demo.7z demo.txt`
hokkos • 2018-01-25
Why not use an XML/XSD-specific compression format like EXI for that?
https://www.w3.org/TR/exi-primer/
cldellow • 2018-01-25
It really is mostly just text. It's not _quite_
... \[30 kb of text\] ... but almost.
dorfsmay • 2018-01-25
It really depends. I tend to use a specialized compression tool if I need to compress once and send/decompress often, but use zstd when I compress/decompress a lot. In my experience, if you have a fixed, small amount of time (single-digit minutes or less), zstd is the one that will compress to the smallest size. I even often pick `-3` as it is typically a lot faster than `-4` and higher levels, for not a huge difference in resulting size.
In my experience, if compression time is not a factor, lzip is the best for text (non-random letters and numbers). I recently had to redistribute the data from Python NLTK internally and tried compressing/decompressing it with different tools; these were my results (I picked lzip again):
tool                    compress time   size      decompress time
gzip -9                 10 m            503 MiB   31 s
zstd -19                29 m            360 MiB   29 s
7za a -si               26 m            348 MiB   s
lzip -9                 78 m            310 MiB   50 s
lrzip -z -L 9 (ZPAQ)    125 m           253 MiB   95 m
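A rough way to reproduce this kind of comparison is to time each tool on the same corpus and record the output sizes. Below is a minimal Python sketch along those lines; the corpus path, tool selection, and flags are illustrative, not the exact setup used above.

```python
# A rough timing harness for comparing general-purpose compressors on one
# corpus. The corpus path is hypothetical and the tool list/flags are only
# illustrative of the kind of run described above.
import os
import subprocess
import time

CORPUS = "corpus.tar"  # hypothetical input file

TOOLS = {
    "gzip -9":  (["gzip", "-9", "-k", CORPUS], ".gz"),
    "zstd -19": (["zstd", "-19", "-k", CORPUS], ".zst"),
    "lzip -9":  (["lzip", "-9", "-k", CORPUS], ".lz"),
}

for name, (cmd, suffix) in TOOLS.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)      # each tool writes CORPUS + suffix
    elapsed = time.perf_counter() - start
    size = os.path.getsize(CORPUS + suffix)
    print(f"{name:10s} {elapsed:8.1f} s {size / 2**20:10.1f} MiB")
```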
petre • 2018-01-25
I did some tests myself on a 22MB SQL file and it turns out:
* 7za -m0=PPMd produced the smallest file while being faster than bzip2
* bzip2 turned out to be way faster than both lzip (684%) and xz (644%) and produced a smaller file
* xz is marginally faster than lzip; compressed sizes are about the same, with the xz file being a tad smaller
* without any switches 7za produces an archive a bit bigger than xz and lzip in about the same amount of time
* gzip and zstd produce about the same compressed size, only zstd is a lot faster (517%) than gzip
The 7z file was produced using the -m0=PPMd switch. For the other files no command line switches were supplied. Here are the file sizes:
23668150  file.sql
 3899477  file.sql.7z
 4149962  file.sql.bz2
 5954982  file.sql.gz
 4540628  file.sql.lz
 4506720  file.sql.xz
 5961291  file.sql.zst
dorfsmay • 2018-01-25
When going for the smallest size, it'd be interesting to see your comparison using the command-line switches for best compression (they make a big difference, both in terms of time and size).
Was bzip2 slightly or considerably slower than zstd?
petre • 2018-01-25
Bzip2 is slower than gzip, so yes, it's also considerably slower than zstd. Yet zstd -19 produced a bigger file (4.3M) in about the same amount of time.
If I remember correctly: zstd = 0.2s, gzip = 0.8s, 7zip (PPMd) = 2.1s, bzip2 = 2.7s, lzip, xz, 7zip (LZMA) = 15..16s. This is CPU time from memory, so it might not be fully accurate.
I'd say zstd and gzip are better suited for general use, while bzip2 and 7zip (PPMd) are better suited for high compression of text files.
paladin314159 • 2018-01-25
We've also had great success using zstd with dictionary training. We dump a lot of JSON data into Kafka, most of which has a similar schema, and training a dictionary easily gave a 2-3x reduction in size over lz4.
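As a concrete illustration, zstd's dictionary training can be driven from Python via the python-zstandard bindings. This is a minimal sketch with made-up sample messages standing in for the Kafka data described above:

```python
# A minimal sketch of zstd dictionary training, assuming many small,
# similarly-shaped JSON messages (hypothetical samples, not real data).
import json
import zstandard as zstd

samples = [
    json.dumps({"user_id": i, "event": "click", "ts": 1516838400 + i}).encode()
    for i in range(10_000)
]

# Train a small shared dictionary on the samples, then reuse it per message.
dictionary = zstd.train_dictionary(16 * 1024, samples)
compressor = zstd.ZstdCompressor(dict_data=dictionary)
decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

msg = samples[0]
compressed = compressor.compress(msg)
assert decompressor.decompress(compressed) == msg
print(f"{len(msg)} bytes -> {len(compressed)} bytes with a shared dictionary")
```

The win comes from the dictionary capturing the shared schema, so each small message no longer pays for that redundancy on its own.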
tinco • 2018-01-25
Is there a type-specific algorithm for data that mostly consists of close numbers? I figure if I send only the deltas, it would be just a sequence of small, close numbers that would easily be compressed by standard compression libraries.
An example of close-number sequences is simple graphs. Your CPU temperature is 78 degrees; most likely it'll be 78, 79 or 77 the next tick, so the values stay close and the deltas will usually be 0s and 1s.
amaranth • 2018-01-25
Compression via next-symbol prediction seems to be what you'd be looking for. That's what the PAQ compression schemes focus on, although they're very slow and definitely overkill for non-archival purposes. You'd probably just want to write out that data as deltas manually and have the reader know a delta format is being used. So I guess the answer is actually delta encoding, because that's a compression algorithm too.
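A minimal sketch of that manual delta approach, using a made-up, slowly drifting series and zlib standing in for whatever general-purpose compressor sits downstream:

```python
# Delta-encode a slowly varying series before handing it to a generic
# compressor; the temperature-like data here is synthetic.
import random
import struct
import zlib

random.seed(0)
temps = [78]
for _ in range(9_999):
    temps.append(temps[-1] + random.choice([-1, 0, 1]))  # drifts slowly

# First value as-is, then only the difference to the previous sample.
deltas = [temps[0]] + [b - a for a, b in zip(temps, temps[1:])]

raw = struct.pack(f"{len(temps)}h", *temps)        # 16-bit samples, untouched
encoded = struct.pack(f"{len(deltas)}h", *deltas)  # first value + deltas

print(len(zlib.compress(raw)), "bytes compressed without deltas")
print(len(zlib.compress(encoded)), "bytes compressed with deltas")
# The reader reverses the transform with a running sum (itertools.accumulate).
```

As the comment above says, the reader has to know the delta format is in use; the transform itself is trivial to undo.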
felixhandte • 2018-01-25
Good point. While it won't beat a hand-optimized algorithm for a specific use case, compression with a dictionary (like zstd supports/encourages) is sort of a partial specialization of the algorithm to the type of data you're compressing.
beagle3 • 2018-01-25
Provided that the compression stems from repetition of blocks. If you have a file of 16-bit integers, each exactly 1 or 2 greater than the previous one, you will have 128K with no repetition that zstd will be unable to compress; however, if you transpose the bytes (all high-order bytes, followed by all low-order bytes), the compression will be very significant; similarly if you replace the numbers with their differences (sketched below).
The correct term from information theory is that it approximates a "universal" compressor with respect to observable Markov or FSMX sources.
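A small sketch of that example with made-up data, using zlib as the stand-in compressor, to show the effect of byte transposition and of differencing on such a sequence:

```python
# 64K little-endian 16-bit integers, each 1 or 2 larger than the last
# (synthetic data): compare the raw layout, a byte-transposed layout
# (high-order bytes first, then low-order bytes), and plain differences.
import random
import struct
import zlib

random.seed(0)
values, cur = [], 0
for _ in range(65_536):
    cur += random.choice([1, 2])
    values.append(cur & 0xFFFF)  # wrap to stay within 16 bits

raw = struct.pack(f"<{len(values)}H", *values)  # 128 KiB, little-endian

low_bytes = raw[0::2]             # low-order bytes, change every sample
high_bytes = raw[1::2]            # high-order bytes, change rarely
transposed = high_bytes + low_bytes

deltas = [values[0]] + [(b - a) & 0xFFFF for a, b in zip(values, values[1:])]
differenced = struct.pack(f"<{len(deltas)}H", *deltas)

for name, data in (("raw", raw), ("transposed", transposed), ("differenced", differenced)):
    print(f"{name:12s} {len(zlib.compress(data))} bytes after zlib")
```

The raw interleaved layout gives the compressor little to work with, the transposed layout at least makes the slow-moving high bytes trivially compressible, and the differenced stream collapses almost entirely.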
hackcasual • 2018-01-25
zstd is in snappy's domain, as a low-overhead way of reducing bandwidth usage. You've got devs writing some service that talks in JSON or protobuf; just rub a little zstd on it, and bingo, your bandwidth is reduced.
halayli • 2018-01-25
I'd pick zstd over snappy for text.