Back to Blog

HDFS Logs: From 31% to 64% Smaller Than xz — Our V4 Breakthrough

Our V4 token detection system doubled the compression advantage on 1.5GB HDFS logs — now 63.6% smaller than xz -9, and 6x faster.

The Breakthrough

We use xz -9 as our baseline throughout these benchmarks — it's widely regarded as the strongest general-purpose compressor available and the standard benchmark for maximum compression.

In our previous HDFS benchmark, Smallest.zip compressed a 1.5GB HDFS log file to 63.2MB — 30.6% smaller than xz -9. That was already a strong result on one of the hardest log files we've tested.

With our V4 token detection system, we've more than doubled that advantage — now hitting 63.6% smaller than xz -9, a +33 percentage point improvement.

Results

Metric Before (V3) After (V4) Improvement
Compressed size 63.2 MB 46.5 MB 26% smaller
vs xz -9 -30.6% -63.6% +33.0 pp
Compression time 335s 56.5s 6x faster

What Changed

The V4 system introduces token detection — automatically identifying and encoding repeated structural patterns in log data. HDFS logs are full of these: block IDs, DataNode addresses, replication events, and status messages that follow predictable formats.

By recognizing these tokens before compression, V4 achieves dramatically better compression and dramatically faster processing. It's not a tradeoff — both dimensions improved simultaneously.

The Numbers in Context

  • 1.5GB → 46.5MB — a 97% reduction from the original file
  • 63.6% smaller than xz -9 — up from 30.6% in V3
  • 6x faster — 56 seconds instead of 335 seconds

For the first time, Smallest.zip is beating xz by more than 60% on a file where traditional compressors already perform well. This is the largest single-benchmark improvement we've shipped.

Try It Yourself

Upload your own files at smallest.zip and see the difference. Every account starts with free credits.