March 05, 2026 · 4 min read · By Smallest.zip Team

Beyond FASTQ: 4BIN Compression for VCF, BAM, POD5, and Spatial Genomics

We have extended 4BIN beyond FASTQ to four more file types. In our benchmarks, VCF shrinks 385x, BAM is 84.7% smaller than raw, POD5 reaches 2.69 bits per sample (bps), and spatial transcriptomics data compresses to 4.73% of its original size.

Compression Genomics DNA VCF BAM POD5 Spatial Benchmarks

Expanding Beyond FASTQ

4BIN started with FASTQ — and it's still the best FASTQ compressor we've tested. But sequencing pipelines produce far more than raw reads. Variant calls, alignments, nanopore signals, and spatial transcriptomics all generate massive files that need to be stored, transferred, and archived.

Today we're sharing compression benchmarks for four new genomic data types: VCF, BAM, POD5, and spatial transcriptomics.

Results

Data Type	Raw Size	SOTA	Ours	vs Raw
VCF (chr22, 1.1M variants x 2504 samples)	10.4 GB	161 MB (BCF)	27 MB	385x
BAM (NA12878 chr22, 219K reads)	21.8 MB	~20 MB (CRAM)	3.35 MB	84.7% savings
POD5 (200 reads, 24.9M samples)	49.8 MB	22.1 MB (VBZ)	8.4 MB	2.69 bps
Spatial (1M Xenium transcripts)	68.4 MB	20.6 MB (gzip)	3.24 MB	4.73%

For every file type, our output is smaller than that of the current state-of-the-art compressor for that format.
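The "vs Raw" column mixes three kinds of metric: a reduction factor, a percent saving, and a per-sample bit rate. The sketch below shows one way to reproduce each from the sizes in the table. It assumes decimal units (1 MB = 10^6 bytes) and that "bps" means bits per sample; the function names are illustrative and not part of 4BIN. Because the table sizes are rounded, the results agree with the published figures only to within rounding.

```python
# Sizes are copied from the results table above.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
MB = 10**6
GB = 10**9


def reduction_factor(raw_bytes, compressed_bytes):
    """How many times smaller the compressed file is (385 means 385x)."""
    return raw_bytes / compressed_bytes


def percent_of_raw(raw_bytes, compressed_bytes):
    """Compressed size as a percentage of the raw size."""
    return 100.0 * compressed_bytes / raw_bytes


def percent_savings(raw_bytes, compressed_bytes):
    """Percentage of the raw size that compression removes."""
    return 100.0 - percent_of_raw(raw_bytes, compressed_bytes)


def bits_per_sample(compressed_bytes, n_samples):
    """Average number of compressed bits spent on each signal sample."""
    return 8.0 * compressed_bytes / n_samples


vcf_factor = reduction_factor(10.4 * GB, 27 * MB)
bam_savings = percent_savings(21.8 * MB, 3.35 * MB)
pod5_bps = bits_per_sample(8.4 * MB, 24.9e6)  # 24.9M signal samples
spatial_pct = percent_of_raw(68.4 * MB, 3.24 * MB)

print(f"VCF:     {vcf_factor:.1f}x reduction")
print(f"BAM:     {bam_savings:.1f}% savings")
print(f"POD5:    {pod5_bps:.2f} bits per sample")
print(f"Spatial: {spatial_pct:.2f}% of raw")
```

The three metrics are not directly comparable: a 385x reduction corresponds to about 0.26% of raw, while 84.7% savings corresponds to about 15.3% of raw.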

What These File Types Are

VCF — Variant Call Format. The standard output of variant calling pipelines, listing every position where a sample differs from the reference genome. Multi-sample VCFs (like population studies with thousands of samples) can reach tens of gigabytes per chromosome.
BAM — Binary Alignment Map. The aligned read data after mapping FASTQ reads to a reference genome. BAM files are typically 2-5x the size of the raw FASTQ and are often retained alongside or instead of the original reads.
POD5 — Oxford Nanopore's raw signal format. Each read contains thousands of electrical signal measurements. Long-read nanopore sequencing is growing rapidly, and POD5 files accumulate fast.
Spatial transcriptomics — Technologies like 10x Xenium map gene expression to physical locations in tissue. The output is millions of transcript records with spatial coordinates — a new and fast-growing data type.

Combined With FASTQ Results

Adding our FASTQ benchmarks for the complete picture across all supported formats:

Format	Best Ratio	vs SOTA
FASTQ Amplicon	4.47–4.77%	1.11–1.19x better than PetaGene
FASTQ scRNA	5.87%	8.5%
FASTQ RNA-seq	7.29%	8.5%
FASTQ cfDNA	9.47%	10.5%
FASTQ WES	10.27%	11.5%
VCF	385x reduction	5.96x better than BCF
BAM	84.7% savings	—
POD5	2.69 bps	2.63x better than VBZ
Spatial	4.73%	6.36x better than gzip
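The "x better" figures for VCF, POD5, and spatial data can be checked against the sizes in the results table. This is a minimal sketch that assumes "x better" means the SOTA output size divided by ours; the helper name is illustrative. BAM is left out because the table gives no factor for it.

```python
# (SOTA size in MB, our size in MB), copied from the results table.
sizes_mb = {
    "VCF vs BCF": (161.0, 27.0),
    "POD5 vs VBZ": (22.1, 8.4),
    "Spatial vs gzip": (20.6, 3.24),
}


def times_better(sota_mb, ours_mb):
    """Assumed definition: how many times smaller our output is than SOTA's."""
    return sota_mb / ours_mb


factors = {name: times_better(sota, ours) for name, (sota, ours) in sizes_mb.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```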

Storage Savings vs Gzip

How our compression compares to standard gzip across every genomic data type:

Data Type	% of Raw	SOTA Format/Tool	Potential Storage Savings vs Gzip
scRNA FASTQ	5.9%	v4 / zDUR	~80%
WES FASTQ	10.3%	v4 / SPRING	~65%
BAM (Aligned)	15.3%	CRAM 3.1 / BCZ	~75%
POD5 (Signal)	17.0%	P5Z / VBZ	~60%
Spatial Matrix	4.7%	STZ / Zarr+zstd	~85%
Joint VCF	0.3%	VCZ / GSC	~98%

Joint VCF stands out: it compresses to about 0.3% of raw size, roughly 98% smaller than the gzip output. Spatial matrices and scRNA FASTQ follow, with savings of about 85% and 80% over gzip, respectively.
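A "savings vs gzip" figure relates two compressed sizes, not a compressed size and the raw size. The sketch below assumes savings are defined as 1 minus (our size / gzip size). It checks the spatial row against the gzip size given in the results table, then works backwards to the gzip ratio each row implies. The function names are illustrative.

```python
def savings_vs_baseline(ours_mb, baseline_mb):
    """Percent by which our output is smaller than the baseline output."""
    return 100.0 * (1.0 - ours_mb / baseline_mb)


def implied_baseline_percent_of_raw(ours_percent_of_raw, savings_percent):
    """Baseline size as a percent of raw, implied by our ratio and the savings."""
    return ours_percent_of_raw / (1.0 - savings_percent / 100.0)


# Spatial: 3.24 MB (ours) vs 20.6 MB (gzip), both from the results table.
spatial_savings = savings_vs_baseline(3.24, 20.6)

# Working backwards from the rounded figures in the gzip comparison table.
vcf_gzip_pct = implied_baseline_percent_of_raw(0.3, 98.0)
spatial_gzip_pct = implied_baseline_percent_of_raw(4.7, 85.0)

print(f"Spatial savings vs gzip: {spatial_savings:.1f}%")
print(f"Implied gzip size, joint VCF: {vcf_gzip_pct:.1f}% of raw")
print(f"Implied gzip size, spatial:   {spatial_gzip_pct:.1f}% of raw")
```

For spatial data the computed savings come out close to the table's ~85%, and the implied gzip ratio is close to the 20.6 MB out of 68.4 MB (about 30%) reported in the results table.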

What This Means

A genomics lab doesn't just produce FASTQ files. A typical workflow generates FASTQ, then BAM, then VCF, and increasingly POD5 and spatial data as well. Compressing only the FASTQ leaves much of the storage bill untouched.

With support for five genomic data types (FASTQ, BAM, VCF, POD5, and spatial transcriptomics), 4BIN can compress your entire sequencing data lifecycle, from raw reads and signals to final variant calls.

Try It

If you're storing VCF, BAM, POD5, or spatial transcriptomics data at scale, reach out and we'll run our compressors on your actual data. No commitment, no risk to your files.