Beyond FASTQ: 4BIN Compression for VCF, BAM, POD5, and Spatial Genomics

We expanded our genomics compression beyond FASTQ to four new file types. VCF achieves a 385x reduction, BAM shrinks by 84.7%, POD5 reaches 2.69 bits per sample (bps), and spatial transcriptomics data compresses to 4.73% of its original size.

Expanding Beyond FASTQ

4BIN started with FASTQ — and it's still the best FASTQ compressor we've tested. But sequencing pipelines produce far more than raw reads. Variant calls, alignments, nanopore signals, and spatial transcriptomics all generate massive files that need to be stored, transferred, and archived.

Today we're sharing compression benchmarks for four new genomic data types: VCF, BAM, POD5, and spatial transcriptomics.

Results

| Data Type | Raw Size | SOTA | Ours | vs Raw |
|---|---|---|---|---|
| VCF (chr22, 1.1M variants x 2504 samples) | 10.4 GB | 161 MB (BCF) | 27 MB (VCZ v5) | 385x |
| BAM (NA12878 chr22, 219K reads) | 21.8 MB | ~20 GB CRAM | 3.35 MB (BCZ v3) | 84.7% savings |
| POD5 (200 reads, 24.9M samples) | 49.8 MB | 22.1 MB (VBZ) | 8.4 MB (P5Z lossy-1) | 2.69 bps |
| Spatial (1M Xenium transcripts) | 68.4 MB | 20.6 MB (gzip) | 3.24 MB (STZ v3) | 4.73% |

For every file type, our output is smaller than that of the current state-of-the-art compressor for the format.
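The "vs Raw" column mixes several metrics (reduction factor, percent savings, percent of raw size, and bits per sample), so it can help to see how each one is derived. The sketch below recomputes them from the rounded sizes in the table above. The function names are ours, chosen for illustration, and 10.4 GB is assumed to mean 10,400 MB.

```python
# Sizes in MB, copied from the results table (10.4 GB is taken as 10,400 MB).
# The BAM row carries no SOTA size here because the table's CRAM figure is approximate.
RESULTS = {
    "VCF": {"raw": 10_400.0, "sota": 161.0, "ours": 27.0},
    "BAM": {"raw": 21.8, "sota": None, "ours": 3.35},
    "POD5": {"raw": 49.8, "sota": 22.1, "ours": 8.4},
    "Spatial": {"raw": 68.4, "sota": 20.6, "ours": 3.24},
}
POD5_SAMPLES = 24.9e6  # number of signal samples in the POD5 test file


def reduction_factor(before_mb, after_mb):
    """How many times smaller the output is, e.g. 385 means 385x."""
    return before_mb / after_mb


def percent_of_raw(raw_mb, compressed_mb):
    """Compressed size as a percentage of the raw size."""
    return 100.0 * compressed_mb / raw_mb


def percent_savings(raw_mb, compressed_mb):
    """Percentage of the raw size that is no longer stored."""
    return 100.0 - percent_of_raw(raw_mb, compressed_mb)


def bits_per_sample(compressed_mb, n_samples):
    """Average number of compressed bits spent on each signal sample."""
    return compressed_mb * 1e6 * 8 / n_samples


for name, row in RESULTS.items():
    line = (
        f"{name}: {reduction_factor(row['raw'], row['ours']):.1f}x vs raw, "
        f"{percent_of_raw(row['raw'], row['ours']):.2f}% of raw, "
        f"{percent_savings(row['raw'], row['ours']):.1f}% savings"
    )
    if row["sota"] is not None:
        line += f", {reduction_factor(row['sota'], row['ours']):.2f}x smaller than SOTA"
    print(line)

print(f"POD5: {bits_per_sample(RESULTS['POD5']['ours'], POD5_SAMPLES):.2f} bits per sample")
```

Because the table sizes are rounded, the recomputed values can differ from the published figures in the last digit.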

What These File Types Are

  • VCF — Variant Call Format. The standard output of variant calling pipelines, listing every position where a sample differs from the reference genome. Multi-sample VCFs (like population studies with thousands of samples) can reach tens of gigabytes per chromosome.

  • BAM — Binary Alignment Map. The aligned read data after mapping FASTQ reads to a reference genome. BAM files are typically 2-5x the size of the raw FASTQ and are often retained alongside or instead of the original reads.

  • POD5 — Oxford Nanopore's raw signal format. Each read contains thousands of electrical signal measurements. Long-read nanopore sequencing is growing rapidly, and POD5 files accumulate fast.

  • Spatial transcriptomics — Technologies like 10x Xenium map gene expression to physical locations in tissue. The output is millions of transcript records with spatial coordinates — a new and fast-growing data type.

How We Beat SOTA

Each file type required a different compression strategy:

  • VCF → VCZ v5: Column-oriented encoding of genotype matrices. The 1000 Genomes chr22 file (1.1M variants across 2504 samples) compresses from 10.4 GB to just 27 MB — a 385x reduction. BCF, the standard binary VCF format, only gets to 161 MB.

  • BAM → BCZ v3: Separating alignment fields into independent streams with type-specific encoding. A 21.8 MB BAM compresses to 3.35 MB — 84.7% savings over the raw BAM format.

  • POD5 → P5Z lossy-1: Signal-aware lossy compression of nanopore current measurements at 2.69 bits per sample. The current SOTA (VBZ, used natively in POD5) achieves 22.1 MB — we reach 8.4 MB.

  • Spatial → STZ v3: Coordinate quantization and transcript deduplication for spatial transcriptomics data. A 68.4 MB Xenium dataset compresses to 3.24 MB — 4.73% of original size. Standard gzip only reaches 20.6 MB.
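To make two of these ideas concrete, here is a minimal Python sketch of field-stream separation with delta coding (the general idea behind the VCF and BAM bullets) and uniform coordinate quantization (as in the spatial bullet). It is not the VCZ, BCZ, or STZ format, whose internals this post does not describe. The three-field record, the zlib back end, and the 0.25 quantization step are all assumptions made for illustration.

```python
import struct
import zlib
from typing import Dict, List, Tuple

# Hypothetical toy record: (position, flag, mapq). Real BAM records carry many more fields.
Record = Tuple[int, int, int]


def delta_encode(values: List[int]) -> List[int]:
    """Replace each value with its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def to_streams(records: List[Record]) -> Dict[str, bytes]:
    """Split position-sorted records into one byte stream per field."""
    n = len(records)
    positions = [r[0] for r in records]
    flags = [r[1] for r in records]
    mapqs = [r[2] for r in records]
    return {
        "pos_delta": struct.pack(f"<{n}I", *delta_encode(positions)),
        "flag": struct.pack(f"<{n}H", *flags),
        "mapq": bytes(mapqs),  # assumes each mapq fits in one byte
    }


def from_streams(streams: Dict[str, bytes]) -> List[Record]:
    """Rebuild the original records from the per-field streams."""
    n = len(streams["mapq"])
    positions = delta_decode(list(struct.unpack(f"<{n}I", streams["pos_delta"])))
    flags = struct.unpack(f"<{n}H", streams["flag"])
    return list(zip(positions, flags, streams["mapq"]))


def quantize(coords: List[float], step: float) -> List[int]:
    """Lossy step: snap each coordinate to the nearest multiple of `step`."""
    return [round(c / step) for c in coords]


def dequantize(cells: List[int], step: float) -> List[float]:
    """Map quantized cells back to approximate coordinates."""
    return [q * step for q in cells]


# Synthetic, position-sorted records (not real alignment data).
records = [
    (16_050_000 + 37 * i + i % 5, (99, 147, 83, 163)[i % 4], 60 if i % 10 else 0)
    for i in range(10_000)
]
row_wise = b"".join(struct.pack("<IHB", *r) for r in records)
streams = to_streams(records)
print("row-wise zlib bytes: ", len(zlib.compress(row_wise, 9)))
print("per-field zlib bytes:", sum(len(zlib.compress(s, 9)) for s in streams.values()))

coords, step = [1234.567, 1234.601, 89.25], 0.25
restored = dequantize(quantize(coords, step), step)
print("max quantization error:", max(abs(a - b) for a, b in zip(coords, restored)))
```

The stream split is lossless and round-trips exactly, while quantization discards detail and bounds the error at half the step size. How much either one helps depends heavily on the data and on the entropy coder behind it.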

Combined With FASTQ Results

Adding our FASTQ benchmarks for the complete picture across all supported formats:

| Format | Best Ratio | vs SOTA |
|---|---|---|
| FASTQ Amplicon | 4.47–4.77% | 1.11–1.19x better than PetaGene |
| FASTQ scRNA | 5.87% | 8.5% |
| FASTQ RNA-seq | 7.29% | 8.5% |
| FASTQ cfDNA | 9.47% | 10.5% |
| FASTQ WES | 10.27% | 11.5% |
| VCF | 385x reduction | 5.96x better than BCF |
| BAM | 84.7% savings | |
| POD5 | 2.69 bps | 2.63x better than VBZ |
| Spatial | 4.73% | 6.36x better than gzip |

Storage Savings vs Gzip

How our compression compares to standard gzip across every genomic data type:

| Data Type | % of Raw | SOTA Format/Tool | Potential Storage Savings vs Gzip |
|---|---|---|---|
| scRNA FASTQ | 5.8% | v4 / zDUR | ~80% |
| WES FASTQ | 10.3% | v4 / SPRING | ~65% |
| BAM (Aligned) | 15.0% | CRAM 3.1 / BCZ | ~75% |
| POD5 (Signal) | 17.0% | P5Z / VBZ | ~60% |
| Spatial Matrix | 4.7% | STZ / Zarr+zstd | ~85% |
| Joint VCF | 0.3% | VCZ / GSC | ~98% |

Joint VCF stands out, compressing to just 0.3% of raw size, a saving of about 98% over gzip. Spatial matrices and scRNA FASTQ also show large gains, with savings of about 85% and 80% respectively.

What This Means

A genomics lab doesn't just produce FASTQ files. A typical workflow generates FASTQ, then BAM, then VCF — and increasingly POD5 and spatial data. Compressing only the FASTQ leaves most of the storage bill untouched.

With support for all five major genomic file types, 4BIN can compress your entire sequencing data lifecycle — from raw reads to final variant calls.

Try It

If you're storing VCF, BAM, POD5, or spatial transcriptomics data at scale, reach out and we'll run our compressors on your actual data. No commitment, no risk to your files.