The Breakthrough
We use xz -9 as our baseline throughout these benchmarks. It is one of the strongest widely deployed general-purpose compressors and a common reference point for maximum-ratio compression.
In our previous HDFS benchmark, Smallest.zip compressed a 1.5 GB HDFS log file to 63.2 MB, 30.6% smaller than xz -9. That was already a strong result on one of the hardest log files we've tested.
With our V4 token detection system, we've more than doubled that advantage: V4 now comes in 63.6% smaller than xz -9, an improvement of 33 percentage points.
Results
| Metric | Before (V3) | After (V4) | Improvement |
|---|---|---|---|
| Compressed size | 63.2 MB | 46.5 MB | 26% smaller |
| vs xz -9 | -30.6% | -63.6% | +33.0 pp |
| Compression time | 335s | 56.5s | 6x faster |
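The Improvement column follows directly from the two measurement columns. As a quick check, here is a small Python sketch of that arithmetic. The helper name `percent_smaller` is ours for illustration, and the inputs are simply the figures from the table.

```python
def percent_smaller(new, old):
    """How much smaller `new` is than `old`, as a percentage of `old`."""
    return (old - new) / old * 100


# Figures copied from the results table.
v3_size_mb, v4_size_mb = 63.2, 46.5
v3_vs_xz_pct, v4_vs_xz_pct = 30.6, 63.6
v3_seconds, v4_seconds = 335.0, 56.5

size_gain_pct = percent_smaller(v4_size_mb, v3_size_mb)  # roughly 26.4
gap_gain_pp = v4_vs_xz_pct - v3_vs_xz_pct                # 33 percentage points
speedup = v3_seconds / v4_seconds                        # roughly 5.9

print(f"{size_gain_pct:.1f}% smaller, +{gap_gain_pp:.1f} pp, {speedup:.1f}x faster")
```

The size row is a relative change (a percentage of the V3 size), while the xz row is a difference between two percentages, which is why it is quoted in percentage points.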
What Changed
The V4 system introduces token detection — automatically identifying and encoding repeated structural patterns in log data. HDFS logs are full of these: block IDs, DataNode addresses, replication events, and status messages that follow predictable formats.
By recognizing these tokens before compression, V4 improves both ratio and speed on this benchmark: the output is 26% smaller than V3's and compression runs about 6x faster. It's not a tradeoff; both dimensions improved at once.
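To make the general idea concrete, here is a minimal Python sketch of template-based token extraction. It is not the V4 implementation, whose internals are not described here. The regular expression, the placeholder byte, the function names, and the two sample lines are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for two variable fields: block IDs and IPv4[:port].
TOKEN_RE = re.compile(r"blk_-?\d+|\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?")
PLACEHOLDER = "\x00"  # assumed never to occur in the log text itself


def encode(lines):
    """Split lines into a template table, per-line template ids, and token values."""
    templates = {}     # template text -> integer id, in first-seen order
    template_ids = []  # one id per input line
    values = []        # every extracted token, in order of appearance
    for line in lines:
        values.extend(TOKEN_RE.findall(line))
        template = TOKEN_RE.sub(PLACEHOLDER, line)
        template_ids.append(templates.setdefault(template, len(templates)))
    return list(templates), template_ids, values


def decode(template_list, template_ids, values):
    """Rebuild the original lines by refilling placeholders in order."""
    remaining = iter(values)
    lines = []
    for tid in template_ids:
        parts = template_list[tid].split(PLACEHOLDER)
        line = parts[0]
        for part in parts[1:]:
            line += next(remaining) + part
        lines.append(line)
    return lines


# Made-up sample lines in the general shape of HDFS DataNode messages.
sample = [
    "Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010",
    "Receiving block blk_7503483334202473044 src: /10.251.215.16:55695 dest: /10.251.215.16:50010",
]
template_list, template_ids, values = encode(sample)
print(len(template_list), template_ids, len(values))
restored = decode(template_list, template_ids, values)
```

Both sample lines collapse to a single shared template, so the repeated structure is stored once and only the variable fields remain per line. In a scheme like this, the three streams would then be handed to a conventional compressor, which tends to do better on homogeneous data than on interleaved text.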
The Numbers in Context
- 1.5 GB → 46.5 MB: a 97% reduction from the original file
- 63.6% smaller than xz -9, up from 30.6% in V3
- 6x faster: 56.5 seconds instead of 335 seconds
For the first time, Smallest.zip produces output more than 60% smaller than xz -9 on a file where traditional compressors already perform well. This is the largest single-benchmark improvement we've shipped.
Try It Yourself
Upload your own files at smallest.zip and see the difference. Every account starts with free credits.
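If you want an xz -9 style baseline for your own file to compare against, Python's standard `lzma` module exposes the same preset levels. The sketch below is one way to measure it. The function name `xz9_size` and the chunk size are our own choices, and the result should be close to, though not necessarily byte-identical with, what the `xz` command-line tool produces.

```python
import lzma
import sys


def xz9_size(path, chunk_size=1 << 20):
    """Return the size in bytes of `path` compressed as .xz at preset 9."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ, preset=9)
    total = 0
    with open(path, "rb") as f:
        # Stream the file in chunks so large logs do not need to fit in memory.
        while chunk := f.read(chunk_size):
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total


if __name__ == "__main__" and len(sys.argv) > 1:
    print(f"{sys.argv[1]}: {xz9_size(sys.argv[1])} bytes at xz preset 9")
```

Preset 9 uses a large dictionary, so expect it to be slow and memory-hungry on multi-gigabyte files.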