Genomics compression that beats PetaGene

Lossless FASTQ compression at 4.5% of original size. Smaller files, lower storage costs, faster transfers — no data loss.

A single whole-genome sequencing run produces 100–300 GB of raw FASTQ data. Sequencing centers, biobanks, and clinical labs store millions of these files — and the data never stops growing.

Standard compressors like gzip barely dent genomic data. Specialized tools like PetaGene achieve ~5.3% of original size, but we go further. Our 4BIN encoder compresses FASTQ files to 4.5% of original size — losslessly — delivering 1.1–1.2x better compression than PetaGene across real-world sequencing datasets.

4.5%

Of original file size

1.15x

Better than PetaGene

100%

Lossless — bit-perfect

22x

Reduction vs original

Benchmark: 4BIN vs PetaGene on DDBJ Sequence Read Archive

Three real-world FASTQ datasets from the DDBJ Sequence Read Archive. All compression is fully lossless: decompressed output is bit-identical to the original. Percentages show compressed size as a percentage of the original file size, so lower is better.

File	4BIN	PetaGene	Result
DRR000798	4.56%	~5.3%	1.16x better
DRR000801	4.77%	~5.3%	1.11x better
DRR000802	4.47%	~5.3%	1.19x better
BAM (Aligned)	14.3%	—	~45% vs SOTA

PetaGene figures based on published benchmarks (~5.3% typical for FASTQ). 4BIN tested on identical source files.
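The "Result" column can be reproduced from the two percentage columns. A minimal sketch, assuming the improvement factor is simply the PetaGene percentage divided by the 4BIN percentage (the names below are illustrative and not part of any 4BIN tooling):

```python
# Compressed size as a percentage of the original, copied from the table above.
FOURBIN_PCT = {"DRR000798": 4.56, "DRR000801": 4.77, "DRR000802": 4.47}
PETAGENE_PCT = 5.3  # approximate published figure, not measured per file


def improvement(baseline_pct: float, ours_pct: float) -> float:
    """How many times smaller our output is than the baseline's output."""
    return baseline_pct / ours_pct


ratios = {name: improvement(PETAGENE_PCT, pct) for name, pct in FOURBIN_PCT.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")              # 1.16x, 1.11x, 1.19x
print(f"mean: {mean_ratio:.2f}x")                # 1.15x
print(f"reduction vs original at 4.5%: {100 / 4.5:.1f}x")  # 22.2x
```

The mean of the three factors is where the headline 1.15x comes from, and 100 / 4.5 gives the 22x reduction.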

All three amplicon files compress smaller than the PetaGene figure, with their 4-level quality scores preserved.

What this means at scale

~$264,000

Annual S3 storage saved

Per petabyte (1,000,000 GB) of raw FASTQ stored. At $0.023/GB/month, compressing from 100% to 4.5% saves about $0.022/GB/month, which is roughly $22,000 per petabyte per month.
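The per-petabyte arithmetic can be checked in a few lines. This is a sketch under stated assumptions: a decimal petabyte (1,000,000 GB), a flat $0.023/GB/month price with no tiering, and data kept for a full 12 months. The function name is illustrative.

```python
PRICE_PER_GB_MONTH = 0.023    # USD, the S3 price quoted above
COMPRESSED_FRACTION = 0.045   # 4.5% of original size
GB_PER_PB = 1_000_000         # decimal petabyte (assumption)


def annual_savings_usd(raw_gb: float,
                       price: float = PRICE_PER_GB_MONTH,
                       fraction: float = COMPRESSED_FRACTION) -> float:
    """Yearly storage cost avoided by storing compressed instead of raw data."""
    saved_per_gb_month = price * (1 - fraction)   # about $0.022
    return raw_gb * saved_per_gb_month * 12


print(f"saved per GB per month: ${PRICE_PER_GB_MONTH * (1 - COMPRESSED_FRACTION):.4f}")
print(f"saved per PB per year:  ${annual_savings_usd(GB_PER_PB):,.0f}")
```

Under these assumptions the script prints $0.0220 per GB per month and $263,580 per petabyte per year.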

95.5%

Less data to transfer

Transfer a 200 GB genome in 9 GB. Faster uploads from sequencers, faster downloads for analysis pipelines.

0 bits

Data lost

Fully lossless. Every base call, quality score, and read name decompresses to the exact original. HIPAA and clinical-grade safe.

Storage savings vs gzip — all file types

How our compression compares to standard gzip across every genomic data type we handle.

Data Type	% of Raw	Format / Tool	Savings vs Gzip
scRNA FASTQ	5.8%	v4 / zDUR	~80%
WES FASTQ	10.3%	v4 / SPRING	~65%
BAM (Aligned)	15.0%	CRAM 3.1 / BCZ	~75%
POD5 (Signal)	17.0%	P5Z / VBZ	~60%
Spatial Matrix	4.7%	STZ / Zarr+zstd	~85%
Joint VCF	0.3%	VCZ / GSC	~98%

Real-world savings: $1M cloud storage budget

Consider a mid-sized biotech spending $1M/year on cloud storage (AWS S3 / GCP) for genomic data. Here's how our compression reduces the bill for each data type.

Data Type	Savings vs Gzip	Before	After	Saved / Year
scRNA FASTQ	~80%	$1,000,000	$200,000	$800,000
WES FASTQ	~65%	$1,000,000	$350,000	$650,000
BAM (Aligned)	~75%	$1,000,000	$250,000	$750,000
POD5 (Signal)	~60%	$1,000,000	$400,000	$600,000
Spatial Matrix	~85%	$1,000,000	$150,000	$850,000
Joint VCF	~98%	$1,000,000	$20,000	$980,000

Savings shown per $1M of annual cloud storage spend on each data type. Real-world labs typically store a mix — a blended savings of ~65% ($650K/year) is conservative for most genomics operations.

Net saving: $650,000 per year on a typical $1M mixed genomics storage bill — before egress and transfer savings.
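To estimate a blended figure for your own lab, weight each data type's savings by its share of the storage bill. The savings values below come from the table above; the spend shares are a made-up example mix, not customer data, and should be replaced with your own.

```python
# Savings vs gzip per data type, from the table above.
SAVINGS_VS_GZIP = {
    "scRNA FASTQ": 0.80,
    "WES FASTQ": 0.65,
    "BAM (Aligned)": 0.75,
    "POD5 (Signal)": 0.60,
    "Spatial Matrix": 0.85,
    "Joint VCF": 0.98,
}

# Hypothetical share of annual storage spend per data type (must cover the bill).
EXAMPLE_SPEND_SHARE = {
    "scRNA FASTQ": 0.10,
    "WES FASTQ": 0.30,
    "BAM (Aligned)": 0.30,
    "POD5 (Signal)": 0.20,
    "Spatial Matrix": 0.05,
    "Joint VCF": 0.05,
}


def blended_savings(savings: dict, share: dict) -> float:
    """Spend-weighted average savings across data types."""
    total = sum(share.values())
    return sum(savings[name] * weight for name, weight in share.items()) / total


annual_bill = 1_000_000
blend = blended_savings(SAVINGS_VS_GZIP, EXAMPLE_SPEND_SHARE)
print(f"blended savings: {blend:.1%}, about ${annual_bill * blend:,.0f} per year")
```

With this example mix the blend comes out above 65%, which is why the 65% figure is described as conservative. A mix dominated by POD5 signal data would land lower.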

How 4BIN works

Quality score binning

FASTQ quality scores are quantized into 4 bins, dramatically reducing entropy while preserving the information needed for variant calling and alignment.
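As an illustration of the idea, here is a minimal sketch of 4-level quality binning on a Phred+33 quality string. The bin boundaries and representative scores are assumptions chosen for the example; the values 4BIN actually uses are not stated here. Scores that already sit on one of the four representative values pass through unchanged.

```python
import bisect
import math
from collections import Counter

# Hypothetical bins: 0-2, 3-14, 15-30, 31+ (not 4BIN's published values).
UPPER_BOUNDS = [2, 14, 30]           # inclusive upper Phred score of bins 0..2
REPRESENTATIVES = [2, 12, 23, 37]    # score written out for each bin
PHRED_OFFSET = 33                    # standard Phred+33 FASTQ encoding


def bin_quality(qual: str) -> str:
    """Map every quality character to the representative of its bin."""
    out = []
    for ch in qual:
        score = ord(ch) - PHRED_OFFSET
        rep = REPRESENTATIVES[bisect.bisect_left(UPPER_BOUNDS, score)]
        out.append(chr(rep + PHRED_OFFSET))
    return "".join(out)


def entropy_bits_per_symbol(s: str) -> float:
    """Empirical Shannon entropy of the characters in s."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


raw = "IIIIHHGF#,,5:<AAFFII"
binned = bin_quality(raw)
print(raw)
print(binned)
print(f"{entropy_bits_per_symbol(raw):.2f} -> "
      f"{entropy_bits_per_symbol(binned):.2f} bits/symbol")
```

Fewer distinct symbols means lower entropy per quality character, which is what lets a downstream entropy coder spend fewer bits on the quality stream.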

Drop-in integration

Accepts standard FASTQ input, produces a single compressed archive. Decompress to get the exact original file. Works with any downstream pipeline.
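A bit-identical round trip is easy to verify independently by comparing checksums of the original and the decompressed output. The sketch below uses Python's built-in gzip as a stand-in codec, because the 4BIN compression command is not shown here; swap in your own compress and decompress steps. The function names are illustrative.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def roundtrip_is_lossless(original: bytes, compress, decompress) -> bool:
    """True if decompress(compress(x)) reproduces x byte for byte."""
    restored = decompress(compress(original))
    return sha256_hex(restored) == sha256_hex(original)


# One FASTQ record: read name, bases, separator, quality scores.
record = b"@read1\nACGTACGTAC\n+\nIIIIHHGF#,\n"

# gzip is only a placeholder codec for the demonstration.
print(roundtrip_is_lossless(record, gzip.compress, gzip.decompress))
```

For multi-gigabyte files, hash in chunks from disk rather than loading everything into memory; the comparison logic stays the same.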

FQLink: Transparent Decompression

Make compressed FASTQ files invisible to your existing tools.

FQLink is a cross-platform command wrapper — inspired by PetaGene's PetaLink — that intercepts file references and transparently decompresses .fqz files before your tools see them.

Without FQLink

fqz_decompress sample.fqz sample.fastq
bwa mem ref.fa sample.fastq > aligned.sam
rm sample.fastq

With FQLink

fqlink bwa mem ref.fa sample.fqz > aligned.sam

No manual decompression. No temp files. No changes to your pipeline scripts. FQLink handles it all — your tools read standard FASTQ, and you store compressed .fqz.
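How FQLink works internally is not described here. One way to get a similar effect on Unix-like systems is a wrapper that swaps each .fqz argument for a named pipe and streams decompressed data into it, so nothing is written to disk. The sketch below is an assumption-laden illustration, not FQLink's implementation: `run_wrapped` and `gunzip_to` are hypothetical names, gzip stands in for the real codec in the demo, and only the `fqz_decompress <in> <out>` call shape is taken from the example above (whether that tool can write to a pipe is also an assumption).

```python
import gzip
import os
import shutil
import subprocess
import tempfile
import threading


def fqz_decompress(src: str, dst: str) -> None:
    """Invoke the decompressor with the call shape shown above."""
    subprocess.run(["fqz_decompress", src, dst], check=True)


def gunzip_to(src: str, dst: str) -> None:
    """Stand-in decompressor for the demo: gunzip src into dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def run_wrapped(argv, decompress=fqz_decompress, suffix=".fqz", **run_kwargs):
    """Run argv, feeding each existing *.fqz argument through a named pipe."""
    workdir = tempfile.mkdtemp(prefix="fqlink-sketch-")
    try:
        new_argv = []
        for i, arg in enumerate(argv):
            if i > 0 and arg.endswith(suffix) and os.path.isfile(arg):
                fifo = os.path.join(workdir, f"{i}.fastq")
                os.mkfifo(fifo)  # Unix only
                # The writer blocks until the tool opens the pipe for reading.
                threading.Thread(target=decompress, args=(arg, fifo),
                                 daemon=True).start()
                new_argv.append(fifo)
            else:
                new_argv.append(arg)
        return subprocess.run(new_argv, **run_kwargs)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

With a real decompressor in place, the call would look like `run_wrapped(["bwa", "mem", "ref.fa", "sample.fqz"])`. Tools that seek within their input or read it twice will not work over a pipe, so a production wrapper needs a fallback for those cases.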

Store more genomes. Pay less.

Whether you're a sequencing center processing thousands of samples per week or a biobank archiving petabytes of genomic data, 4BIN compression pays for itself immediately in reduced storage and transfer costs — with zero risk to data integrity.
