Genomics compression that beats PetaGene

Lossless FASTQ compression at 4.5% of original size. Smaller files, lower storage costs, faster transfers — no data loss.

A single whole-genome sequencing run produces 100–300 GB of raw FASTQ data. Sequencing centers, biobanks, and clinical labs store millions of these files — and the data never stops growing.

Standard compressors like gzip barely dent genomic data. Specialized tools like PetaGene achieve ~5.3% of original size, but we go further. Our 4BIN encoder compresses FASTQ files to 4.5% of original size — losslessly — delivering 1.1–1.2x better compression than PetaGene across real-world sequencing datasets.

4.5%
Of original file size
1.15x
Better than PetaGene
100%
Lossless — bit-perfect
22x
Reduction vs original

Benchmark: 4BIN vs PetaGene on DDBJ Sequence Read Archive

Three real-world FASTQ datasets from the DDBJ Sequence Read Archive. All compression is fully lossless — decompressed output is bit-identical to the original. Percentages show compressed size as a fraction of the original file.

File | 4BIN | PetaGene | Result
DRR000798 | 4.56% | ~5.3% | 1.16x better
DRR000801 | 4.77% | ~5.3% | 1.11x better
DRR000802 | 4.47% | ~5.3% | 1.19x better
BAM (Aligned) | 14.3% | ~45% | vs SOTA

PetaGene figures are based on published benchmarks (~5.3% is typical for FASTQ). 4BIN was tested on identical source files.
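The "Result" figures for the three FASTQ runs follow from dividing the PetaGene percentage by the 4BIN percentage. A minimal sketch of that arithmetic, assuming the ~5.3% typical figure applies to every file (the helper names are illustrative, not part of any 4BIN tooling):

```python
# Compressed size as a percentage of the original, taken from the table above.
FOURBIN_PCT = {"DRR000798": 4.56, "DRR000801": 4.77, "DRR000802": 4.47}
PETAGENE_PCT = 5.3  # typical published figure quoted above, not a per-file measurement


def improvement(ours_pct: float, baseline_pct: float) -> float:
    """How many times smaller our output is than the baseline's output."""
    return baseline_pct / ours_pct


def reduction_factor(pct_of_original: float) -> float:
    """Overall reduction versus the uncompressed file, e.g. 4.5% -> about 22x."""
    return 100.0 / pct_of_original


for run, pct in FOURBIN_PCT.items():
    print(f"{run}: {improvement(pct, PETAGENE_PCT):.2f}x better, "
          f"{reduction_factor(pct):.1f}x smaller than the original")
```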

All amplicon files beat PetaGene with 4-level quality preserved.

What this means at scale

~$264,000
Annual S3 storage saved
Per petabyte of raw FASTQ stored. At $0.023/GB/month, compressing from 100% to 4.5% saves about $0.022/GB/month, or roughly $22,000 per month per petabyte (1,000,000 GB).
95.5%
Less data to transfer
Transfer a 200 GB genome in 9 GB. Faster uploads from sequencers, faster downloads for analysis pipelines.
0 bits
Data lost
Fully lossless. Every base call, quality score, and read name decompresses to the exact original. HIPAA and clinical-grade safe.
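The storage and transfer arithmetic can be reproduced from the quoted rate. This is a sketch under stated assumptions: flat per-GB pricing with no tiering, and a decimal petabyte of 1,000,000 GB. The function names are illustrative.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # USD, the rate quoted above
COMPRESSED_FRACTION = 0.045    # 4.5% of original size
GB_PER_PB = 1_000_000          # assumption: decimal petabyte


def monthly_saving_per_gb(price: float = S3_PRICE_PER_GB_MONTH,
                          fraction: float = COMPRESSED_FRACTION) -> float:
    """USD saved per month for each GB of raw data that is stored compressed."""
    return price * (1.0 - fraction)


def annual_saving(raw_gb: float) -> float:
    """USD saved per year when raw_gb of raw data is stored compressed."""
    return raw_gb * monthly_saving_per_gb() * 12


def compressed_size_gb(raw_gb: float, fraction: float = COMPRESSED_FRACTION) -> float:
    """Size in GB that actually has to be stored or transferred."""
    return raw_gb * fraction


print(f"${monthly_saving_per_gb():.3f} saved per GB per month")
print(f"${annual_saving(GB_PER_PB):,.0f} saved per petabyte per year")
print(f"{compressed_size_gb(200):.0f} GB to transfer for a 200 GB genome")
```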

Storage savings vs gzip — all file types

How our compression compares to standard gzip across every genomic data type we handle.

Data Type | % of Raw | Format / Tool | Savings vs Gzip
scRNA FASTQ | 5.8% | v4 / zDUR | ~80%
WES FASTQ | 10.3% | v4 / SPRING | ~65%
BAM (Aligned) | 15.0% | CRAM 3.1 / BCZ | ~75%
POD5 (Signal) | 17.0% | P5Z / VBZ | ~60%
Spatial Matrix | 4.7% | STZ / Zarr+zstd | ~85%
Joint VCF | 0.3% | VCZ / GSC | ~98%

Real-world savings: $1M cloud storage budget

Consider a mid-sized biotech spending $1M/year on cloud storage (AWS S3 / GCP) for genomic data. Here's how our compression reduces the bill for each data type.

Data Type | Savings vs Gzip | Before | After | Saved / Year
scRNA FASTQ | ~80% | $1,000,000 | $200,000 | $800,000
WES FASTQ | ~65% | $1,000,000 | $350,000 | $650,000
BAM (Aligned) | ~75% | $1,000,000 | $250,000 | $750,000
POD5 (Signal) | ~60% | $1,000,000 | $400,000 | $600,000
Spatial Matrix | ~85% | $1,000,000 | $150,000 | $850,000
Joint VCF | ~98% | $1,000,000 | $20,000 | $980,000

Savings are shown per $1M of annual cloud storage spend on each data type, relative to gzip-compressed storage. Real-world labs typically store a mix of data types; blended savings of ~65% ($650K/year) is a conservative estimate for most genomics operations.

Net saving: $650,000 per year on a typical $1M mixed genomics storage bill — before egress and transfer savings.
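For a lab with its own mix of data types, the blended figure is a spend-weighted average of the per-type savings. The sketch below reproduces the table rows and then blends a hypothetical $1M mix; the mix is an assumption made up for illustration, not a measured customer profile.

```python
# Savings versus gzip, in whole percent, from the table above.
SAVINGS_PCT = {
    "scRNA FASTQ": 80,
    "WES FASTQ": 65,
    "BAM (Aligned)": 75,
    "POD5 (Signal)": 60,
    "Spatial Matrix": 85,
    "Joint VCF": 98,
}


def cost_after(before_usd: int, savings_pct: int) -> int:
    """Annual bill after compression, using integer math to avoid float drift."""
    return before_usd * (100 - savings_pct) // 100


def blended_savings_pct(spend_by_type: dict) -> float:
    """Spend-weighted average saving, in percent, across a mix of data types."""
    total = sum(spend_by_type.values())
    saved = sum(spend * SAVINGS_PCT[name] / 100 for name, spend in spend_by_type.items())
    return 100 * saved / total


# Hypothetical $1M mix, for illustration only; real mixes vary by lab.
example_mix = {"WES FASTQ": 500_000, "BAM (Aligned)": 300_000, "POD5 (Signal)": 200_000}

for name, pct in SAVINGS_PCT.items():
    print(f"{name}: ${cost_after(1_000_000, pct):,} after compression")
print(f"Blended saving for the example mix: {blended_savings_pct(example_mix):.1f}%")
```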

How 4BIN works

01 Quality score binning

FASTQ quality scores are quantized into 4 bins, dramatically reducing entropy while preserving the information needed for variant calling and alignment.
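To make the idea concrete, here is a minimal sketch of 4-level quality binning. The bin boundaries and representative scores are hypothetical, chosen only for illustration; the actual values used by 4BIN are not stated here. It assumes Phred+33 encoded quality strings.

```python
# Hypothetical bins: (lowest Phred score in the bin, representative score).
# These boundaries are illustrative only; they are not 4BIN's actual values.
BINS = [(0, 2), (10, 12), (20, 23), (30, 37)]
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+ ASCII encoding


def bin_score(q: int) -> int:
    """Map one Phred score to the representative score of its bin."""
    rep = BINS[0][1]
    for low, representative in BINS:
        if q >= low:
            rep = representative
    return rep


def bin_quality(qual: str) -> str:
    """Quantize a FASTQ quality string so it uses at most four symbols."""
    return "".join(chr(bin_score(ord(c) - PHRED_OFFSET) + PHRED_OFFSET) for c in qual)


raw = "!+5?I"            # Phred 0, 10, 20, 30, 40
binned = bin_quality(raw)
print(binned)            # "#-8FF"
print(len(set(binned)))  # 4 distinct symbols
```

With only four possible symbols per position, the quality stream carries at most 2 bits per score before any further modeling. In this sketch, a quality string that already uses only the four representative values passes through unchanged.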

02 Drop-in integration

Accepts standard FASTQ input, produces a single compressed archive. Decompress to get the exact original file. Works with any downstream pipeline.
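One way to confirm a bit-identical round trip is to compare checksums of the original and the decompressed output. The sketch below uses Python's built-in gzip as a stand-in codec, since the 4BIN command-line interface is not described here; the `compress` and `decompress` callables are placeholders for whichever tool is being checked.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def round_trip_is_lossless(original: bytes, compress, decompress) -> bool:
    """True if decompress(compress(original)) matches the original byte for byte."""
    restored = decompress(compress(original))
    return sha256_hex(restored) == sha256_hex(original)


# A single FASTQ record: read name, bases, separator, quality scores.
record = b"@read1\nGATTACA\n+\nFFFFFFF\n"
print(round_trip_is_lossless(record, gzip.compress, gzip.decompress))  # True
```

For multi-gigabyte files the same check is usually done by streaming each file through the hash in chunks rather than loading it into memory.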

FQLink: Transparent Decompression

Make compressed FASTQ files invisible to your existing tools.

FQLink is a cross-platform command wrapper — inspired by PetaGene's PetaLink — that intercepts file references and transparently decompresses .fqz files before your tools see them.

No manual decompression. No temp files. No changes to your pipeline scripts. FQLink handles it all — your tools read standard FASTQ, and you store compressed .fqz.
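The general wrapper pattern can be sketched as follows. This is not FQLink's implementation: it is a POSIX-only illustration that swaps each `.fqz` argument for a named pipe (FIFO) and streams decoded bytes into it, so no decompressed copy is written to disk. gzip stands in for the real `.fqz` decoder, and `decode_to` and `run_transparently` are hypothetical names.

```python
import gzip
import os
import shutil
import subprocess
import tempfile
import threading


def decode_to(src_path: str, dst_path: str) -> None:
    """Stream decoded bytes into dst_path. gzip is a stand-in decoder here."""
    try:
        with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    except BrokenPipeError:
        pass  # the tool stopped reading early (for example `head`)


def run_transparently(argv: list) -> subprocess.CompletedProcess:
    """Run argv, replacing each existing *.fqz argument with a FIFO of decoded data."""
    with tempfile.TemporaryDirectory() as tmp:
        new_argv, feeders = [], []
        for i, arg in enumerate(argv):
            if arg.endswith(".fqz") and os.path.isfile(arg):
                fifo = os.path.join(tmp, f"arg{i}.fastq")
                os.mkfifo(fifo)  # POSIX only; other platforms need another mechanism
                feeder = threading.Thread(target=decode_to, args=(arg, fifo), daemon=True)
                feeder.start()
                feeders.append(feeder)
                new_argv.append(fifo)
            else:
                new_argv.append(arg)
        result = subprocess.run(new_argv, capture_output=True)
        for feeder in feeders:
            feeder.join(timeout=5)  # do not hang if the tool never opened the FIFO
        return result
```

A call such as `run_transparently(["wc", "-l", "sample.fqz"])` would then let `wc` read plain FASTQ text while only the compressed file exists on disk. Tools that need to seek within their input cannot read from a FIFO, so a production wrapper would need a different strategy for those.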

Store more genomes. Pay less.

Whether you're a sequencing center processing thousands of samples per week or a biobank archiving petabytes of genomic data, 4BIN compression pays for itself immediately in reduced storage and transfer costs — with zero risk to data integrity.