Genomics compression that beats PetaGene
Lossless FASTQ compression at 4.5% of original size. Smaller files, lower storage costs, faster transfers — no data loss.
A single whole-genome sequencing run produces 100–300 GB of raw FASTQ data. Sequencing centers, biobanks, and clinical labs store millions of these files — and the data never stops growing.
Standard compressors like gzip barely dent genomic data. Specialized tools like PetaGene achieve ~5.3% of original size, but we go further. Our 4BIN encoder compresses FASTQ files to 4.5% of original size — losslessly — delivering 1.1–1.2x better compression than PetaGene across real-world sequencing datasets.
Benchmark: 4BIN vs PetaGene on DDBJ Sequence Read Archive
We benchmarked three real-world FASTQ datasets from the DDBJ Sequence Read Archive. All compression is fully lossless: decompressed output is bit-identical to the original. Percentages show compressed size as a percentage of the original file size, so lower is better.
| File | 4BIN | PetaGene | Result |
|---|---|---|---|
| DRR000798 | 4.56% | ~5.3% | 1.16x better |
| DRR000801 | 4.77% | ~5.3% | 1.11x better |
| DRR000802 | 4.47% | ~5.3% | 1.19x better |
| BAM (Aligned) | 14.3% | — | ~45% vs SOTA |
PetaGene figures are based on published benchmarks (~5.3% typical for FASTQ). 4BIN figures were measured on the source files listed above.
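The "Result" column is the PetaGene percentage divided by the 4BIN percentage. Here is a minimal sketch of that arithmetic using the FASTQ figures from the table. The function name `improvement_factor` is ours, chosen for illustration.

```python
# Compressed size as a percentage of the original file, from the table above.
PETAGENE_PCT = 5.3  # approximate published figure, not measured per file

FOURBIN_PCT = {
    "DRR000798": 4.56,
    "DRR000801": 4.77,
    "DRR000802": 4.47,
}


def improvement_factor(baseline_pct: float, ours_pct: float) -> float:
    """Return how many times smaller our output is than the baseline's."""
    return baseline_pct / ours_pct


# Round to two decimals, matching the precision shown in the table.
factors = {
    name: round(improvement_factor(PETAGENE_PCT, pct), 2)
    for name, pct in FOURBIN_PCT.items()
}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x better")
```

Because the PetaGene figure is approximate, the factors should be read as approximate too.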
What this means at scale
Storage savings vs gzip — all file types
How our compression compares to standard gzip across every genomic data type we handle.
| Data Type | Compressed Size (% of Raw) | Format / Tool | Savings vs Gzip |
|---|---|---|---|
| scRNA FASTQ | 5.8% | v4 / zDUR | ~80% |
| WES FASTQ | 10.3% | v4 / SPRING | ~65% |
| BAM (Aligned) | 15.0% | CRAM 3.1 / BCZ | ~75% |
| POD5 (Signal) | 17.0% | P5Z / VBZ | ~60% |
| Spatial Matrix | 4.7% | STZ / Zarr+zstd | ~85% |
| Joint VCF | 0.3% | VCZ / GSC | ~98% |
Real-world savings: $1M cloud storage budget
Consider a mid-sized biotech spending $1M/year on cloud storage (AWS S3 / GCP) for genomic data. Here's how our compression reduces the bill for each data type.
| Data Type | Savings vs Gzip | Before | After | Saved / Year |
|---|---|---|---|---|
| scRNA FASTQ | ~80% | $1,000,000 | $200,000 | $800,000 |
| WES FASTQ | ~65% | $1,000,000 | $350,000 | $650,000 |
| BAM (Aligned) | ~75% | $1,000,000 | $250,000 | $750,000 |
| POD5 (Signal) | ~60% | $1,000,000 | $400,000 | $600,000 |
| Spatial Matrix | ~85% | $1,000,000 | $150,000 | $850,000 |
| Joint VCF | ~98% | $1,000,000 | $20,000 | $980,000 |
Savings shown per $1M of annual cloud storage spend on each data type. Real-world labs typically store a mix — a blended savings of ~65% ($650K/year) is conservative for most genomics operations.
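The "After" and "Saved / Year" columns follow directly from the savings percentages. The sketch below reproduces them and shows how a blended figure can be computed for a mix of data types. The mix in `EXAMPLE_MIX` is hypothetical and only illustrates the calculation; it does not describe any real lab.

```python
BUDGET = 1_000_000  # annual cloud storage spend in USD, per the scenario above

# Approximate savings vs gzip in whole percent, from the table above.
SAVINGS_PCT = {
    "scRNA FASTQ": 80,
    "WES FASTQ": 65,
    "BAM (Aligned)": 75,
    "POD5 (Signal)": 60,
    "Spatial Matrix": 85,
    "Joint VCF": 98,
}


def annual_cost_after(budget: int, savings_pct: int) -> int:
    """Annual cost once the given percentage has been saved."""
    return budget * (100 - savings_pct) // 100


def blended_savings_pct(mix_pct: dict) -> float:
    """Spend-weighted average savings; shares are percentages summing to 100."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("spend shares must sum to 100")
    return sum(share * SAVINGS_PCT[name] for name, share in mix_pct.items()) / 100


costs = {name: annual_cost_after(BUDGET, pct) for name, pct in SAVINGS_PCT.items()}
saved = {name: BUDGET - cost for name, cost in costs.items()}

# HYPOTHETICAL spend mix, for illustration only.
EXAMPLE_MIX = {"WES FASTQ": 50, "BAM (Aligned)": 30, "POD5 (Signal)": 20}
blended = blended_savings_pct(EXAMPLE_MIX)

for name in SAVINGS_PCT:
    print(f"{name}: after ${costs[name]:,}, saved ${saved[name]:,}")
print(f"Blended savings for the example mix: {blended:.0f}%")
```

Replace the shares in `EXAMPLE_MIX` with your own spend breakdown to estimate a blended figure for your data.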
How 4BIN works
Quality score binning
FASTQ quality scores are quantized into 4 bins, dramatically reducing entropy while preserving the information needed for variant calling and alignment.
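To make the idea concrete, here is a rough sketch of 4-level quality binning in general. The bin edges and representative values are assumptions picked for illustration, since 4BIN's actual boundaries are not given here. The sketch covers only the binning step and does not show how the encoder restores the exact original file.

```python
import math
from bisect import bisect_left
from collections import Counter

PHRED_OFFSET = 33  # standard Phred+33 FASTQ quality encoding

# ASSUMPTION: illustrative bin edges and representatives, not 4BIN's real ones.
UPPER_BOUNDS = (2, 14, 30)         # inclusive upper edge of the first three bins
REPRESENTATIVES = (2, 12, 23, 37)  # one output score per bin


def bin_quality(qual: str) -> str:
    """Map each quality character to the representative score of its bin."""
    out = []
    for ch in qual:
        score = ord(ch) - PHRED_OFFSET
        rep = REPRESENTATIVES[bisect_left(UPPER_BOUNDS, score)]
        out.append(chr(rep + PHRED_OFFSET))
    return "".join(out)


def entropy_bits(s: str) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


original = "IHGF@5-#"           # eight distinct quality characters
binned = bin_quality(original)  # at most four distinct characters remain

print(binned)
print(f"entropy before: {entropy_bits(original):.2f} bits/symbol")
print(f"entropy after:  {entropy_bits(binned):.2f} bits/symbol")
```

Fewer distinct symbols means lower entropy per quality character, which is what lets a downstream entropy coder shrink the quality stream.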
Drop-in integration
Accepts standard FASTQ input, produces a single compressed archive. Decompress to get the exact original file. Works with any downstream pipeline.
FQLink: Transparent Decompression
Make compressed FASTQ files invisible to your existing tools.
FQLink is a cross-platform command wrapper, inspired by PetaGene's PetaLink, that intercepts file references and transparently decompresses .fqz files before your tools see them.
Without FQLink:
fqz_decompress sample.fqz sample.fastq
bwa mem ref.fa sample.fastq > aligned.sam
rm sample.fastq
With FQLink:
fqlink bwa mem ref.fa sample.fqz > aligned.sam
No manual decompression. No temp files. No changes to your pipeline scripts.
FQLink handles it all — your tools read standard FASTQ, and you store compressed .fqz.
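For readers curious about the general pattern, here is a simplified sketch of a command wrapper of this kind. It is not FQLink's implementation. The helper names `rewrite_args` and `run_wrapped` are hypothetical, the `fqz_decompress <input> <output>` call is taken from the manual example above, and the sketch uses a temporary directory purely to stay short, which FQLink itself is described as avoiding.

```python
import os
import subprocess
import tempfile

DECOMPRESSOR = "fqz_decompress"  # CLI name as used in the manual example above


def rewrite_args(args, workdir):
    """Swap each .fqz argument for a .fastq path inside workdir.

    Returns (new_args, jobs), where jobs is a list of (source, destination)
    pairs that still need to be decompressed.
    """
    new_args, jobs = [], []
    for arg in args:
        if arg.endswith(".fqz"):
            base = os.path.basename(arg)[: -len(".fqz")]
            # Prefix with an index so equal basenames from different dirs differ.
            dst = os.path.join(workdir, f"{len(jobs)}_{base}.fastq")
            jobs.append((arg, dst))
            new_args.append(dst)
        else:
            new_args.append(arg)
    return new_args, jobs


def run_wrapped(args):
    """Decompress referenced .fqz files, run the command, then clean up."""
    # ASSUMPTION: a temp directory is a simplification made for this sketch.
    with tempfile.TemporaryDirectory() as workdir:
        new_args, jobs = rewrite_args(args, workdir)
        for src, dst in jobs:
            subprocess.run([DECOMPRESSOR, src, dst], check=True)
        # stdout is inherited, so shell redirection like "> aligned.sam" works.
        return subprocess.run(new_args).returncode
```

Calling `run_wrapped(["bwa", "mem", "ref.fa", "sample.fqz"])` would decompress `sample.fqz`, run `bwa mem` against the resulting FASTQ, and remove the intermediate file when the command exits.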
Store more genomes. Pay less.
Whether you're a sequencing center processing thousands of samples per week or a biobank archiving petabytes of genomic data, 4BIN compression pays for itself immediately in reduced storage and transfer costs — with zero risk to data integrity.