Expanding Beyond FASTQ
4BIN started with FASTQ — and it's still the best FASTQ compressor we've tested. But sequencing pipelines produce far more than raw reads. Variant calls, alignments, nanopore signals, and spatial transcriptomics all generate massive files that need to be stored, transferred, and archived.
Today we're sharing compression benchmarks for four new genomic data types: VCF, BAM, POD5, and spatial transcriptomics.
Results
| Data Type | Raw Size | SOTA Size | Our Size | Headline Metric |
|---|---|---|---|---|
| VCF (chr22, 1.1M variants x 2504 samples) | 10.4 GB | 161 MB (BCF) | 27 MB (VCZ v5) | 385x |
| BAM (NA12878 chr22, 219K reads) | 21.8 MB | ~20 MB (CRAM) | 3.35 MB (BCZ v3) | 84.7% savings |
| POD5 (200 reads, 24.9M samples) | 49.8 MB | 22.1 MB (VBZ) | 8.4 MB (P5Z lossy-1) | 2.69 bits/sample |
| Spatial (1M Xenium transcripts) | 68.4 MB | 20.6 MB (gzip) | 3.24 MB (STZ v3) | 4.73% of raw |
For every data type, our compressor produces a smaller file than the current state-of-the-art tool for that format.
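The last column of the table mixes four different metrics, so it helps to see how each one follows from the sizes. The sketch below recomputes them from the table values, assuming decimal units (1 GB = 1000 MB); the variable names are ours, chosen for illustration.

```python
# Sizes copied from the results table; decimal units assumed (1 GB = 1000 MB).
MB = 1_000_000

vcf_raw, vcf_ours = 10_400 * MB, 27 * MB
bam_raw, bam_ours = 21.8 * MB, 3.35 * MB
pod5_ours, pod5_samples = 8.4 * MB, 24.9e6
spatial_raw, spatial_ours = 68.4 * MB, 3.24 * MB

vcf_reduction = vcf_raw / vcf_ours                    # "385x": raw size / compressed size
bam_savings = 1 - bam_ours / bam_raw                  # "savings": share of raw bytes removed
pod5_bits_per_sample = pod5_ours * 8 / pod5_samples   # compressed bits per signal sample
spatial_fraction = spatial_ours / spatial_raw         # compressed size as a share of raw

print(f"VCF reduction:        {vcf_reduction:.0f}x")
print(f"BAM savings:          {bam_savings:.1%}")
print(f"POD5 bits per sample: {pod5_bits_per_sample:.2f}")
print(f"Spatial, share of raw: {spatial_fraction:.2%}")
```

Because the table sizes are themselves rounded, the recomputed values can differ slightly in the last digit from the published figures.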
What These File Types Are
VCF: Variant Call Format. The standard output of variant calling pipelines, listing every position where a sample differs from the reference genome. Multi-sample VCFs, such as those from population studies with thousands of samples, can reach tens of gigabytes per chromosome.
BAM — Binary Alignment Map. The aligned read data after mapping FASTQ reads to a reference genome. BAM files are typically 2-5x the size of the raw FASTQ and are often retained alongside or instead of the original reads.
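POD5: Oxford Nanopore's raw signal format. Each read contains thousands to hundreds of thousands of electrical signal measurements; the benchmark file above averages about 124,500 samples per read (24.9M samples across 200 reads). Long-read nanopore sequencing is growing rapidly, and POD5 files accumulate fast.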
Spatial transcriptomics — Technologies like 10x Xenium map gene expression to physical locations in tissue. The output is millions of transcript records with spatial coordinates — a new and fast-growing data type.
How We Beat SOTA
Each file type required a different compression strategy:
VCF → VCZ v5: Column-oriented encoding of genotype matrices. The 1000 Genomes chr22 file (1.1M variants across 2504 samples) compresses from 10.4 GB to just 27 MB — a 385x reduction. BCF, the standard binary VCF format, only gets to 161 MB.
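To make "column-oriented" concrete, here is a minimal sketch of the general idea, not of VCZ's actual format: positions and genotypes go into separate streams, positions are delta-encoded, and each stream is compressed on its own. The synthetic data, the function names, and the use of zlib as the back-end are all assumptions made for illustration.

```python
import random
import zlib

def make_records(n_variants=2000, n_samples=100, seed=0):
    """Synthetic stand-in for a multi-sample VCF: (position, genotype row) pairs."""
    rng = random.Random(seed)
    pos, records = 0, []
    for _ in range(n_variants):
        pos += rng.randint(1, 500)
        freq = rng.random() ** 4  # skew toward rare variants
        row = [int(rng.random() < freq) + int(rng.random() < freq)
               for _ in range(n_samples)]  # alt-allele count: 0, 1, or 2
        records.append((pos, row))
    return records

def encode_rows(records):
    """Row-oriented baseline: one VCF-like text line per variant, then zlib."""
    lines = [f"{p}\t" + "\t".join(("0|0", "0|1", "1|1")[g] for g in row)
             for p, row in records]
    return zlib.compress("\n".join(lines).encode(), 9)

def encode_columns(records):
    """Column-oriented: separate position and genotype streams (records must be non-empty)."""
    positions = [p for p, _ in records]
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    pos_stream = b"".join(d.to_bytes(4, "little") for d in deltas)
    gt_stream = bytes(g for _, row in records for g in row)  # one byte per genotype
    return zlib.compress(pos_stream, 9), zlib.compress(gt_stream, 9)

def decode_columns(pos_blob, gt_blob, n_samples):
    """Invert encode_columns: prefix-sum the deltas and re-slice the genotype stream."""
    raw = zlib.decompress(pos_blob)
    positions, total = [], 0
    for i in range(0, len(raw), 4):
        total += int.from_bytes(raw[i:i + 4], "little")
        positions.append(total)
    gts = zlib.decompress(gt_blob)
    rows = [list(gts[i:i + n_samples]) for i in range(0, len(gts), n_samples)]
    return list(zip(positions, rows))

records = make_records()
pos_blob, gt_blob = encode_columns(records)
assert decode_columns(pos_blob, gt_blob, n_samples=100) == records  # lossless
print("row-oriented bytes:   ", len(encode_rows(records)))
print("column-oriented bytes:", len(pos_blob) + len(gt_blob))
```

How large the gap is depends heavily on the data; a real genotype codec would also bit-pack the calls and exploit similarity between samples, which this sketch does not attempt.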
BAM → BCZ v3: Separating alignment fields into independent streams with type-specific encoding. A 21.8 MB BAM compresses to 3.35 MB — 84.7% savings over the raw BAM format.
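The stream-separation idea can be sketched as follows. This is an illustrative toy, not the BCZ format: the `Read` type, the choice of fields, and the zlib back-end are assumptions, and the input is assumed to be coordinate-sorted so that position deltas are non-negative.

```python
import struct
import zlib
from typing import NamedTuple

class Read(NamedTuple):
    pos: int   # leftmost mapping position (coordinate-sorted input assumed)
    flag: int  # SAM flag, fits in 16 bits
    mapq: int  # mapping quality, 0-255
    seq: str   # read bases

def encode_streams(reads):
    """Split records into per-field streams and compress each one independently."""
    n = len(reads)
    deltas, prev = [], 0
    for r in reads:
        deltas.append(r.pos - prev)  # small numbers when reads are sorted
        prev = r.pos
    streams = {
        "pos": struct.pack(f"<{n}I", *deltas),
        "flag": struct.pack(f"<{n}H", *(r.flag for r in reads)),
        "mapq": bytes(r.mapq for r in reads),
        "len": struct.pack(f"<{n}I", *(len(r.seq) for r in reads)),
        "seq": "".join(r.seq for r in reads).encode("ascii"),
    }
    return {name: zlib.compress(data, 9) for name, data in streams.items()}

def decode_streams(blobs):
    """Rebuild the original records from the per-field streams."""
    raw = {name: zlib.decompress(blob) for name, blob in blobs.items()}
    n = len(raw["mapq"])  # one mapq byte per read
    deltas = struct.unpack(f"<{n}I", raw["pos"])
    flags = struct.unpack(f"<{n}H", raw["flag"])
    lengths = struct.unpack(f"<{n}I", raw["len"])
    bases = raw["seq"].decode("ascii")
    reads, pos, offset = [], 0, 0
    for delta, flag, mapq, length in zip(deltas, flags, raw["mapq"], lengths):
        pos += delta
        reads.append(Read(pos, flag, mapq, bases[offset:offset + length]))
        offset += length
    return reads

reads = [Read(100, 99, 60, "ACGTAC"), Read(100, 147, 60, "GGATTA"), Read(250, 0, 30, "TTGACA")]
assert decode_streams(encode_streams(reads)) == reads
```

Keeping each field in its own stream means the compressor sees runs of similar values (all flags together, all bases together) instead of interleaved record bytes, and each stream can get an encoder suited to its type.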
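POD5 → P5Z lossy-1: Signal-aware lossy compression of nanopore current measurements at 2.69 bits per sample. The current SOTA (VBZ, the lossless codec used natively in POD5) produces 22.1 MB for the 49.8 MB test file; P5Z lossy-1 reaches 8.4 MB. Note that this compares a lossy mode against a lossless baseline.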
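For readers unfamiliar with lossy signal coding, the sketch below shows one simple form of it: uniform quantization followed by delta encoding and a general-purpose compressor. It is not the P5Z algorithm; the `step` parameter, the function names, and the synthetic signal are hypothetical and chosen only to show how a bounded error is traded for fewer bits per sample.

```python
import math
import random
import struct
import zlib

def compress_signal(samples, step=4):
    """Quantize to multiples of `step`, delta-encode, then zlib the int32 deltas."""
    q = [round(s / step) for s in samples]
    deltas = [b - a for a, b in zip([0] + q, q)]  # first delta is q[0] itself
    return zlib.compress(struct.pack(f"<{len(deltas)}i", *deltas), 9)

def decompress_signal(blob, step=4):
    """Prefix-sum the deltas and scale back to the original units."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    out, level = [], 0
    for d in deltas:
        level += d
        out.append(level * step)
    return out

# Synthetic stand-in for a nanopore current trace: slow drift plus noise.
rng = random.Random(0)
signal = [int(500 + 80 * math.sin(i / 40) + rng.gauss(0, 3)) for i in range(10_000)]

blob = compress_signal(signal)
restored = decompress_signal(blob)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
bits_per_sample = len(blob) * 8 / len(signal)

assert max_error <= 2  # uniform quantization error is at most step / 2
print(f"max error: {max_error}, bits per sample: {bits_per_sample:.2f}")
```

A larger `step` lowers the bit rate and raises the maximum error; whether a given error level is acceptable depends on its effect on basecalling accuracy, which this sketch does not measure.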
Spatial → STZ v3: Coordinate quantization and transcript deduplication for spatial transcriptomics data. A 68.4 MB Xenium dataset compresses to 3.24 MB — 4.73% of original size. Standard gzip only reaches 20.6 MB.
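As a rough illustration of coordinate quantization and deduplication, the sketch below snaps coordinates to a grid, replaces gene names with integer IDs, and collapses records that become identical into a single row with a count. This is one possible reading of those two terms, not the STZ format; the record layout `(gene, x_um, y_um)`, the `grid_um` parameter, and the function names are assumptions.

```python
from collections import Counter

def quantize(value_um, grid_um=0.5):
    """Snap a coordinate in micrometres to the nearest grid index."""
    return round(value_um / grid_um)

def compact(transcripts, grid_um=0.5):
    """Return (gene table, sorted rows of (gene_id, qx, qy, count))."""
    genes = sorted({gene for gene, _, _ in transcripts})
    gene_id = {gene: i for i, gene in enumerate(genes)}
    counts = Counter(
        (gene_id[gene], quantize(x, grid_um), quantize(y, grid_um))
        for gene, x, y in transcripts
    )
    rows = sorted((gid, qx, qy, n) for (gid, qx, qy), n in counts.items())
    return genes, rows

def expand(genes, rows, grid_um=0.5):
    """Approximate inverse: coordinates come back snapped to the grid."""
    return [(genes[gid], qx * grid_um, qy * grid_um)
            for gid, qx, qy, n in rows for _ in range(n)]

transcripts = [("ACTB", 10.02, 5.01), ("ACTB", 10.04, 4.99), ("GAPDH", 3.3, 7.8)]
genes, rows = compact(transcripts)
print(genes)
print(rows)
print(expand(genes, rows))
```

The two nearby ACTB records collapse into one row with a count of 2. The trade-off is a positional error of up to half the grid spacing per axis and the loss of the original record order, so the grid size has to be chosen against the resolution the downstream analysis needs.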
Combined With FASTQ Results
Adding our FASTQ benchmarks for the complete picture across all supported formats:
| Format | Best Ratio | vs SOTA |
|---|---|---|
| FASTQ Amplicon | 4.47–4.77% | 1.11–1.19x better than PetaGene |
| FASTQ scRNA | 5.87% | 8.5% |
| FASTQ RNA-seq | 7.29% | 8.5% |
| FASTQ cfDNA | 9.47% | 10.5% |
| FASTQ WES | 10.27% | 11.5% |
| VCF | 385x reduction | 5.96x better than BCF |
| BAM | 84.7% savings | — |
| POD5 | 2.69 bits/sample | 2.63x better than VBZ |
| Spatial | 4.73% | 6.36x better than gzip |
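The "x better" factors for the new formats are the SOTA output size divided by ours. A quick check using the sizes from the results table (the dictionary layout is just for illustration):

```python
# (SOTA size, our size) in MB, copied from the results table.
sizes_mb = {
    "VCF": (161, 27),         # BCF vs VCZ v5
    "POD5": (22.1, 8.4),      # VBZ vs P5Z lossy-1
    "Spatial": (20.6, 3.24),  # gzip vs STZ v3
}

# Improvement factor = SOTA compressed size / our compressed size.
factors = {name: sota / ours for name, (sota, ours) in sizes_mb.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x smaller than SOTA")
```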
Storage Savings vs Gzip
How our compression compares to standard gzip across every genomic data type:
| Data Type | Our Size (% of Raw) | Ours / SOTA Tool | Potential Storage Savings vs Gzip |
|---|---|---|---|
| scRNA FASTQ | 5.9% | v4 / zDUR | ~80% |
| WES FASTQ | 10.3% | v4 / SPRING | ~65% |
| BAM (Aligned) | 15.3% | BCZ / CRAM 3.1 | ~75% |
| POD5 (Signal) | 17.0% | P5Z / VBZ | ~60% |
| Spatial Matrix | 4.7% | STZ / Zarr+zstd | ~85% |
| Joint VCF | 0.3% | VCZ / GSC | ~98% |
Joint VCF stands out — compressing to just 0.3% of raw size with 98% savings over gzip. Spatial matrices and scRNA FASTQ also show massive gains, with 85% and 80% savings respectively.
What This Means
A genomics lab doesn't just produce FASTQ files. A typical workflow generates FASTQ, then BAM, then VCF — and increasingly POD5 and spatial data. Compressing only the FASTQ leaves most of the storage bill untouched.
With support for FASTQ, BAM, VCF, POD5, and spatial transcriptomics data, 4BIN can compress your entire sequencing data lifecycle, from raw reads to final variant calls.
Try It
If you're storing VCF, BAM, POD5, or spatial transcriptomics data at scale, reach out and we'll run our compressors on your actual data. No commitment, no risk to your files.